Extract company names/job titles from free text

I have a complete Hadoop platform (HDFS, MapReduce, Hive, Pig, HBase, etc.) along with Python, R, and Java. All the data sets are large.

The data set A, describing the jobs of people working in a company, is composed of the following fields:

  • Id Person: a unique alphanumeric identifier per person.
  • Start Date: the date the person started the position, in ISO format.
  • End Date: the date the person left the position, in ISO format. If no date is given, the position is the current one.

  • Job Title: a text field containing both the job title and the company name. The text is free-form, non-standardized, in French and/or English, and may contain typos. Examples: "Director Big Data Analytics with Google", "Commercial Manager at [missing text]", "Manager at googole", ...

My question is: how can I create a feature that makes it easy to extract and process the company name from the Job Title field?

Thank you in advance



I think what you want is to extract company names from the "Job Title" field. In natural language processing, this task is called Named Entity Recognition (NER). You can try the Stanford Named Entity Recognizer (NER) [http://nlp.stanford.edu/software/CRF-NER.shtml]. Stanford NER performs very well on English text, and there are interfaces to it for many programming languages:

UIMA: Florian Laws made a Stanford NER UIMA annotator using a modified version of Stanford NER, which is available on his homepage. [Old version.]

Perl: Kieren Diment has written Text-NLP-Stanford-EntityExtract, a Perl module that provides an interface to Stanford NER running as a server.

Ruby: tiendung has written a Ruby Binding for the Stanford POS tagger and Named Entity Recognizer.

Python: Dat Hoang wrote pyner, a Python interface to Stanford NER. [Old version.] NLTK (2.0+) contains an interface to Stanford NER written by Nitin Madnani: documentation (note: set the character encoding or you get ASCII by default!), code, on Github.

F#/C#/.NET: Sergey Tihon has ported Stanford NER to F# (and other .NET languages, such as C#), using IKVM. See also pages on: GitHub and NuGet.

PHP: PHP-Stanford-NLP. Supports POS Tagger, NER, Parser. By Anthony Gentile (agentile).
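As a rough illustration of the Python route, here is a minimal sketch, assuming an NLTK version that still ships `nltk.tag.StanfordNERTagger` and a local Stanford NER download. The function names, `model_path`, and `jar_path` are placeholders of mine, not part of any library. The first helper only post-processes `(token, tag)` pairs, which is the shape the tagger returns, so it can be tried without Java or NLTK installed.

```python
from itertools import groupby


def organizations_from_tags(tagged_tokens):
    """Merge runs of consecutive ORGANIZATION tokens into company-name strings.

    `tagged_tokens` is a list of (token, tag) pairs, the shape returned by
    the tagger's `tag()` method.
    """
    names = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "ORGANIZATION":
            names.append(" ".join(token for token, _ in group))
    return names


def tag_job_title(job_title, model_path, jar_path):
    """Tag one job title with Stanford NER through NLTK.

    `model_path` and `jar_path` are hypothetical paths to a downloaded
    Stanford NER model file and jar; this call needs Java and NLTK.
    """
    # Imported lazily so the helper above works without NLTK installed.
    from nltk.tag import StanfordNERTagger

    tagger = StanfordNERTagger(model_path, path_to_jar=jar_path, encoding="utf-8")
    # Naive whitespace tokenization; a real tokenizer would handle punctuation.
    return tagger.tag(job_title.split())


# Hand-written tags for illustration only, not real tagger output.
example = [("Director", "O"), ("at", "O"),
           ("Goldman", "ORGANIZATION"), ("Sachs", "ORGANIZATION")]
print(organizations_from_tags(example))
```

Expect the pretrained English models to struggle with French titles and with misspelled names such as "googole", so treat the output as a starting point.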

If you are not satisfied with the performance of Stanford NER, you can also train your own model to extract company names, using training data built by crawling company names from popular sites that list them, such as LinkedIn, Facebook, or Glassdoor.
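Before training a model, a crawled list of company names can already be used as a simple lookup. The sketch below is not a trained NER model. It is a plain gazetteer match using Python's standard `difflib`, and it assumes the company usually follows a separator word such as "at", "with", or "chez". The company list, the separator set, and the 0.8 similarity cutoff are illustrative choices of mine.

```python
import difflib
import re

# Hypothetical gazetteer; in practice this would be the crawled company list.
KNOWN_COMPANIES = ["Google", "Facebook", "LinkedIn", "Glassdoor"]

# Assumed separator words between title and company (English and French).
_SEPARATOR = re.compile(r"\s+(?:at|with|chez|@)\s+", flags=re.IGNORECASE)


def split_title_and_company(job_title):
    """Split on the last separator; return (title, raw_company or None)."""
    text = job_title.strip()
    matches = list(_SEPARATOR.finditer(text))
    if not matches:
        return text, None
    last = matches[-1]
    return text[:last.start()].strip(), text[last.end():].strip()


def normalize_company(raw_company, known=KNOWN_COMPANIES, cutoff=0.8):
    """Map a possibly misspelled name to the closest known company, or None."""
    if not raw_company:
        return None
    by_lowercase = {name.lower(): name for name in known}
    close = difflib.get_close_matches(
        raw_company.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[close[0]] if close else None


for text in ["Director Big Data Analytics with Google", "Manager at googole"]:
    title, raw = split_title_and_company(text)
    print(title, "|", normalize_company(raw))
```

The two functions are independent per row, so they could be applied to the full data set from a Hive or Pig Python UDF, although comparing each row against a very large company list this way would be slow without some form of indexing.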
