How can I use text mining to extract a number and a word from a document and store both in a dataset?
I would like to use text mining to pull a number and a word out of a document and store each one as its own attribute in a dataset.
The idea is to take a document (similar to a "bill of sale"), read it, and process it so that I end up with a simple ExampleSet such as:
Access key | Product code | Product name
xxxxx | yyyyy | XPTO
Do you have any ideas, or is there a solution in another topic that I haven't found? It would help a lot.
Thanks. Best,
G.
Best Answers
kayman (Member, Posts: 662, Unicorn)
I think Generate Attributes in combination with regex is a good candidate, as long as your content is clearly distinguishable.
If, for instance, your access key is always 8 digits long, you could create a function that checks whether your 'base attribute' contains an isolated 8-digit pattern and, if so, takes the match and stores it in your new access key attribute. If there is no match, nothing is added.
Repeat this for each of your new attributes.
You will always need recognizable patterns; otherwise it will never work.
If you need some support with the regex, you can always share some examples. Happy to help with that.
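As a rough illustration of the "isolated 8-digit pattern" idea, here is a minimal sketch in plain Python rather than in a RapidMiner operator. The sample lines, the `extract_access_key` helper and the 8-digit assumption are all hypothetical; the real pattern depends on what the documents actually look like.

```python
import re

# Assumption: the access key is an isolated run of exactly 8 digits,
# i.e. not preceded or followed by another digit.
ACCESS_KEY = re.compile(r"(?<!\d)\d{8}(?!\d)")


def extract_access_key(text):
    """Return the first isolated 8-digit run in text, or None if there is none."""
    match = ACCESS_KEY.search(text)
    return match.group(0) if match else None


# Hypothetical document lines, invented for illustration only.
lines = [
    "Access key: 12345678 Product XPTO",
    "No key on this line",
    "Serial 1234567890 is too long",
]

# One row per line: the base text plus the new access_key attribute.
# Lines without a match keep an empty (None) value.
rows = [{"text": line, "access_key": extract_access_key(line)} for line in lines]

for row in rows:
    print(row)
```

The lookbehind and lookahead are what make the match "isolated": a longer digit run such as a 10-digit serial number is skipped instead of being cut down to its first 8 digits. The same regular expression should carry over to other regex engines that support lookarounds, though that would need checking against the tool in use.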
kayman (Member, Posts: 662, Unicorn)
Sure, no promises, but happy to help.
Answers
I would really appreciate your support with the regex. I am sharing a process along with the document I need to apply it to.
About the patterns: I have three kinds of documents. There are PDFs (with one pattern), scanned documents (images on which I need to do the same thing: read, identify, split into an ExampleSet, etc., with a different pattern), and another set of scanned documents. I will need to build a separate process for each kind because the patterns differ.
Attached are the process and a Notepad file with the XML.
Thanks again.
It looks as if you start with a PDF that you convert to a text file, so it might be better to start with the PDF table extractor extension (available on the marketplace: https://marketplace.www.kenlockard.com/UpdateServer/faces/product_details.xhtml?productId=rmx_pdf_table_extraction).
This may reduce the complexity a lot, as you seem to have quite a few columns originally. Combining a few techniques may work out better.
Attached is an example that extracts the Access Key and stores it as a new attribute.
Can I send you a private message? Then I could share an image of the document's structure with you. If you have time, of course, that would be wonderful. Let me know if this is feasible.