Extract e-mail addresses out of a PDF
marcel_hanselma
Member, Posts: 3, Learner I
Hello dear RapidMiner community,
I have a PDF full of addresses (name, street, phone number, e-mail). I want to extract only the e-mail addresses and store them one per line in an Excel or CSV file. What is the best approach? (I am really a RapidMiner newbie.)
Greetings, Marcel
Best Answer
lionelderkrikor
Moderator, RapidMiner Certified Analyst, Member, Posts: 1,195, Unicorn
Hi @marcel_hanselma,
Although your PDF is a scan and is not nicely formatted, it is workable: we can extract the e-mail addresses. I used the "Read Document" operator, as mentioned by Jacob.
I used a Python script to search for, extract and display the e-mail addresses, because this is very easy in Python.
(With RapidMiner's native operators, I was unable to extract ALL the occurrences: I was only able to find and extract the first one.)
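For readers who cannot open the attached process, here is a minimal sketch of the general idea in plain Python. It is not the script from the attachment: the regular expression is a deliberately simple assumption that will miss some unusual addresses, and the function names `extract_emails` and `write_emails_csv` are hypothetical. It assumes the text of the PDF has already been read into a string.

```python
import csv
import re

# Assumption: a deliberately simple pattern, not the one used in the
# attached process. It covers common addresses, not every valid one.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return all e-mail-like matches in text, in order, without duplicates."""
    seen = set()
    emails = []
    for match in EMAIL_PATTERN.findall(text):
        key = match.lower()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails


def write_emails_csv(emails, path):
    """Write one e-mail address per row, with a header row, to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["email"])
        for email in emails:
            writer.writerow([email])


if __name__ == "__main__":
    # Made-up sample text standing in for the extracted PDF content.
    sample = (
        "Anna Muster, Hauptstr. 1, 030-123456, anna.muster@example.com\n"
        "Bob Beispiel, Nebenweg 2, 040-654321, bob@example.org."
    )
    for address in extract_emails(sample):
        print(address)
```

Calling `write_emails_csv(extract_emails(text), "emails.csv")` would then give the one-address-per-line file asked for in the question. Because `findall` returns every match rather than only the first, this avoids the first-occurrence limitation mentioned above.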
To run the process in the attached file, you will need:
- to install Python on your machine (you can install it via Anaconda)
- to install the Python Scripting extension from the Marketplace. Don't forget to set, in the RapidMiner settings, the path where your python.exe file is installed.
Hope this helps,
Regards,
Lionel
PS: Given that there are more than 1,700 e-mail addresses in your document, the process is not instantaneous: you have to wait around 2 minutes...
Answers
Can you provide your PDF file so that we can see how to extract the e-mail addresses?
You can send it via private message if it is not confidential...
Regards,
Lionel
It worked flawlessly. :-)