How to extract important phrases (which may contain company names) from a resume?

I have thousands of CVs/resumes. We want to build a parser that can extract company names from them.

So far we have tried:

  1. Maintaining a list of words that are common in company names (e.g. Org, Ltd, Limited, Technologies) and using them to identify probable companies. But this list is limited, and many companies are missed.

  2. Using the HTML of the CV to give a higher score to probable companies that carry certain formatting (such as bold or italics).
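The two heuristics above could be combined roughly as follows. This is only a minimal sketch using Python's standard `html.parser`; the cue-word set, the tag set, and the weights 1.0 and 0.5 are illustrative assumptions, not values from our system.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue words; a real list would be much longer.
COMPANY_CUES = {"org", "ltd", "limited", "technologies", "inc", "corp"}
EMPHASIS_TAGS = {"b", "strong", "i", "em"}


class EmphasisCollector(HTMLParser):
    """Collects text chunks and whether each sits inside an emphasis tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of emphasis tags currently open
        self.chunks = []  # list of (text, is_emphasized)

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((text, self.depth > 0))


def score_chunks(html):
    """Return (text, score) pairs, highest score first."""
    parser = EmphasisCollector()
    parser.feed(html)
    parser.close()
    scored = []
    for text, emphasized in parser.chunks:
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        score = 0.0
        if words & COMPANY_CUES:
            score += 1.0  # cue-word evidence (assumed weight)
        if emphasized:
            score += 0.5  # formatting evidence (assumed weight)
        scored.append((text, score))
    # sorted() is stable, so ties keep document order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Scoring this way still inherits the weakness of the cue-word list: a bold job title scores as high as a bold company name that has no cue word.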

A CV is not only text; we always have some structural information along with it, so there should be better ways to extract this information, perhaps by training a model that predicts the companies mentioned in the resume. We are open to any better approaches or suggestions we could incorporate into our system for better accuracy. The precision so far is really bad (less than 45%).

We have already segmented the work experience section of each CV, so we can extract the segment containing work experience with very high precision.

We also have a comprehensive list of companies (millions of entries), although it contains duplicates and needs a significant amount of cleaning. But we do have a lot of data.
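One possible first pass at cleaning such a list is to group entries by a normalized key. The sketch below assumes that case, punctuation, accents, and trailing legal suffixes do not distinguish companies; the suffix set and the function names `normalize_name` and `group_duplicates` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical set of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {
    "ltd", "limited", "inc", "incorporated", "llc",
    "corp", "corporation", "pvt", "private", "co", "company",
}


def normalize_name(name):
    """Lowercase, strip accents and punctuation, drop trailing legal suffixes."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Keep at least one token so a name like "Limited" does not become empty.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_duplicates(names):
    """Map each normalized key to the raw names that share it."""
    groups = defaultdict(list)
    for name in names:
        key = normalize_name(name)
        if key:
            groups[key].append(name)
    return dict(groups)
```

Exact matching on the key will not catch misspellings or abbreviations, so it would only be a first step before any fuzzier comparison.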

Edit

Other approaches we are trying: we predict the important phrases in the text using n-grams and mark them as probable companies, then search our company corpus for a match. How useful is this technique? Are there any better approaches?
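To make the n-gram lookup concrete, here is a rough sketch. It assumes the company corpus has been loaded into a set of lowercase, space-joined names, and it prefers the longest match so that "Acme Technologies" wins over "Acme". The function names and the `max_n=4` limit are illustrative.

```python
import re


def ngrams(tokens, max_n):
    """Yield (start, end, tokens[start:end]) for n-grams, longest first."""
    for n in range(max_n, 0, -1):
        for start in range(len(tokens) - n + 1):
            yield start, start + n, tokens[start:start + n]


def find_known_companies(text, known_companies, max_n=4):
    """Return n-grams of `text` found in `known_companies`, in text order.

    `known_companies` is assumed to be a set of lowercase, space-joined names.
    """
    tokens = re.findall(r"[A-Za-z0-9&]+", text)
    taken = set()  # token positions already covered by a longer match
    matches = []
    for start, end, gram in ngrams(tokens, max_n):
        if any(i in taken for i in range(start, end)):
            continue
        if " ".join(gram).lower() in known_companies:
            matches.append((start, " ".join(gram)))
            taken.update(range(start, end))
    return [name for _, name in sorted(matches)]
```

A set lookup keeps each check cheap even with millions of names, but it only finds exact matches, so misspelled or abbreviated company names would still be missed.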

Topic parsing nlp data-mining machine-learning


Have you tried the XML package in R? In a similar question on Stack Overflow, the most upvoted answer suggests some packages for extracting text from HTML.

Here: https://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page

Further instructions are here: https://stackoverflow.com/questions/1844829/how-can-i-read-and-parse-the-contents-of-a-webpage-in-r


Sounds like you want named entity recognition. There are a variety of approaches to NER, and plenty of implementations, like the Stanford NER package.
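If you end up training your own tagger rather than using an off-the-shelf one, NER is commonly framed as labeling each token with a BIO tag. The sketch below only prepares labels and per-token features for such a sequence labeler; it is not the Stanford NER API, and the feature names and the `B-ORG`/`I-ORG` label scheme are assumptions for illustration.

```python
def bio_labels(tokens, company_spans):
    """Build BIO tags from (start, end) token spans, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in company_spans:
        labels[start] = "B-ORG"
        for i in range(start + 1, end):
            labels[i] = "I-ORG"
    return labels


def token_features(tokens, i):
    """Simple per-token features of the kind a sequence labeler might use."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
```

Since the question mentions HTML structure, flags such as "token is bold" could be added to the same feature dictionary.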

After you find named entities, determining what the named entity refers to is called concept normalization.
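As a very rough illustration of that normalization step, an extracted mention can be mapped to the closest entry in a list of canonical company names. This sketch uses Python's standard `difflib`; the function name `link_mention` and the 0.8 cutoff are assumptions, and a linear scan like this would likely be too slow for millions of names without some blocking or indexing first.

```python
import difflib


def link_mention(mention, canonical_names, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is similar enough."""
    # Compare case-insensitively but return the original spelling.
    lookup = {name.lower(): name for name in canonical_names}
    close = difflib.get_close_matches(
        mention.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[close[0]] if close else None
```

Returning `None` below the cutoff keeps unknown employers from being forced onto an unrelated canonical name.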
