How to extract contents by topic from a document?

I am trying to extract information from resumes. I used pdfminer for the text extraction, but I need to extract the contents of a resume according to its section titles.

For example, I list my educational details under the title EDUCATIONAL BACKGROUND, so I need to extract the content section by section.

Is it possible to extract like that?

What will be the process behind that?

Is it possible to approach this as a segmentation problem?

Topic semantic-segmentation information-extraction deep-learning nlp machine-learning

Category Data Science


Here is a list of tools you can look into:

  1. https://tika.apache.org/
  2. https://jsoup.org/
  3. https://poi.apache.org/

This article is a neat read that details the steps. The author was doing something similar to what you are trying to do.

https://towardsdatascience.com/how-to-build-a-resume-parsing-tool-ae19c062e377
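For the "extract by title" part specifically, a rule-based split is often a reasonable first attempt before trying a learned segmentation model. The following is only a minimal sketch, assuming the resume text has already been extracted (for example with pdfminer) and that each section title sits on its own line. The title list `SECTION_TITLES` and the function `split_by_title` are hypothetical names made up for this illustration, not part of any library.

```python
import re

# Hypothetical list of section titles; extend it to match the resumes you see.
SECTION_TITLES = [
    "EDUCATIONAL BACKGROUND",
    "WORK EXPERIENCE",
    "SKILLS",
    "PROJECTS",
]


def split_by_title(text, titles=SECTION_TITLES):
    """Map each known title to the text that follows it, up to the next title."""
    # A heading is a line containing only a known title, optionally followed by a colon.
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(t) for t in titles) + r")[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        # The section body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).upper()] = text[start:end].strip()
    return sections


sample = """John Doe
john@example.com

EDUCATIONAL BACKGROUND
B.Sc. Computer Science, 2015

Work Experience:
Data analyst, 2015-2018

SKILLS
Python, SQL
"""

sections = split_by_title(sample)
print(sections["EDUCATIONAL BACKGROUND"])
```

Text that appears before the first recognized title (the name and email in the sample) is dropped by this sketch. Titles that are missing from the list are not detected and their content is folded into the preceding section, so the list needs tuning against real resumes.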


pyresparser is useful for extracting information from resumes. I believe this should work in your case.

You can find more details here: https://pypi.org/project/pyresparser/
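A rough usage sketch, based on the example shown on the project page: it assumes pyresparser and the spaCy and NLTK data it depends on are installed. The wrapper `parse_resume` is a hypothetical helper added for this illustration. Note that the library returns a dictionary of extracted fields rather than the raw text under each section title, so check whether those fields cover what you need.

```python
def parse_resume(path):
    """Return the fields pyresparser extracts from the resume at `path`."""
    # Imported here so this sketch still loads when pyresparser is not installed.
    from pyresparser import ResumeParser

    return ResumeParser(path).get_extracted_data()


if __name__ == "__main__":
    import sys

    # Usage: python parse_resume.py /path/to/resume.pdf
    if len(sys.argv) > 1:
        for field, value in parse_resume(sys.argv[1]).items():
            print(f"{field}: {value}")
```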

Let me know if it works!
