Restrict Date parser in certain cases

Sorry if the title wasn't self-explanatory. Here is a detailed version.

I created a date parser to parse dates from resumes. The ultimate goal is to find how many years of work experience a candidate has, based on the resume. The parser catches dates in formats such as:

  1. MM/DD/YY - MM/DD/YY
  2. MM/DD/YYYY - MM/DD/YYYY
  3. Apr 09 - Jul 11
  4. 03/09 - 07/11
  5. 2007 - 2010 etc.

The parser first extracts all the dates and then converts each range into the form Apr 2009 - Jul 2011, regardless of the format used in the resume. The converted ranges are stored in a list.
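The conversion step could look roughly like the sketch below. It is not the actual parser: the format list, the helper names `parse_date` and `normalize_range`, and the choice to map a bare year to January are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Candidate formats, tried in order (an assumed list, not the real parser's).
FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%b %y", "%b %Y", "%m/%y", "%m/%Y", "%Y"]

def parse_date(text):
    """Return a datetime for the first format that matches, else raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

def normalize_range(text):
    """Convert 'start - end' in any supported format to the 'Apr 2009 - Jul 2011' style."""
    start, end = re.split(r"\s*-\s*", text.strip())
    return f"{parse_date(start):%b %Y} - {parse_date(end):%b %Y}"

print(normalize_range("Apr 09 - Jul 11"))
print(normalize_range("03/09 - 07/11"))
```

A year-only range such as 2007 - 2010 has no month, so this sketch falls back to January for both ends; the real parser may treat that case differently.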

The next step, after converting all the dates, is to sort them by the last four characters of each element, i.e. the end year. For example, if the parsed dates are

[Apr 2009 - Jul 2011, Jan 2014 - Oct 2018, Feb 2013 - Jun 2014, Nov 2018 - Aug 2021, Mar 2010 - Sep 2012, Jan 2005 - Mar 2008]

The last four characters of Apr 2009 - Jul 2011 are 2011; the whole list is sorted in descending order by this criterion. The list then becomes:

[Nov 2018 - Aug 2021, Jan 2014 - Oct 2018, Feb 2013 - Jun 2014, Mar 2010 - Sep 2012, Apr 2009 - Jul 2011, Jan 2005 - Mar 2008]
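As a minimal sketch, the sort described above can be written with a key that reads the last four characters of each string. The variable names are mine, and the ranges are assumed to be stored as plain strings.

```python
parsed = [
    "Apr 2009 - Jul 2011",
    "Jan 2014 - Oct 2018",
    "Feb 2013 - Jun 2014",
    "Nov 2018 - Aug 2021",
    "Mar 2010 - Sep 2012",
    "Jan 2005 - Mar 2008",
]

# Sort by the end year (last four characters), newest first.
sorted_ranges = sorted(parsed, key=lambda r: int(r[-4:]), reverse=True)
print(sorted_ranges)
```

Because only the year is compared, two ranges ending in the same year keep their original relative order.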

The total number of years and months is then calculated by summing the durations of the ranges. I also ignore overlapping dates. In the list above, Mar 2010 - Sep 2012 and Apr 2009 - Jul 2011 have overlapping months, so only Mar 2010 - Sep 2012 is considered and Apr 2009 - Jul 2011 is discarded.
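The exact overlap rule is not spelled out here, so the following is only one possible reading: walk the sorted list and keep a range only if it does not overlap any range already kept. The function names and the month-counting convention (end month minus start month, not inclusive) are assumptions.

```python
from datetime import datetime

def parse_range(text):
    """Split 'Apr 2009 - Jul 2011' into a (start, end) pair of datetimes."""
    start, end = text.split(" - ")
    return datetime.strptime(start, "%b %Y"), datetime.strptime(end, "%b %Y")

def months_between(start, end):
    """Whole months from start to end (assumed non-inclusive count)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def total_experience(sorted_ranges):
    """Return ((years, months), kept) after dropping ranges that overlap a kept one."""
    kept = []
    for text in sorted_ranges:
        start, end = parse_range(text)
        overlaps = any(
            start <= k_end and k_start <= end
            for k_start, k_end in map(parse_range, kept)
        )
        if not overlaps:
            kept.append(text)
    total = sum(months_between(*parse_range(t)) for t in kept)
    return divmod(total, 12), kept

ranges = [
    "Nov 2018 - Aug 2021", "Jan 2014 - Oct 2018", "Feb 2013 - Jun 2014",
    "Mar 2010 - Sep 2012", "Apr 2009 - Jul 2011", "Jan 2005 - Mar 2008",
]
(years, months), kept = total_experience(ranges)
print(years, months, kept)
```

Under this reading, Feb 2013 - Jun 2014 is dropped as well, since it overlaps Jan 2014 - Oct 2018. It also shows why a spurious range that sorts early can push a genuine one out of the total.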

Now, the issue I am facing: people sometimes mention the software they have used in their CVs. (Un)fortunately, some software carries a year as its version, for example SQL Server 2005 2008 or Windows Server 2010 - 2011.

What's happening is that my date parser catches these years too. Luckily, these dates are sometimes discarded because they overlap with others. Unfortunately, at other times they are kept, and the actual dates that should be considered are discarded because they overlap with the versions.
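The behaviour can be reproduced with a naive year-range pattern. The regex and the sample text below are stand-ins I made up for illustration (including the employer name), not the actual parser or a real resume.

```python
import re

# Naive pattern: two four-digit years separated by a hyphen (assumed, for illustration).
YEAR_RANGE = re.compile(r"\b((?:19|20)\d{2})\s*-\s*((?:19|20)\d{2})\b")

resume_text = (
    "Software Engineer, Acme Corp, 2007 - 2010. "
    "Tools: Windows Server 2010 - 2011, SQL Server 2005 2008."
)

matches = YEAR_RANGE.findall(resume_text)
print(matches)
```

The pattern has no notion of context, so the employment period and the Windows Server version years come back as equally valid ranges.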

I can't think of any generalized way (at least not yet) to avoid catching the year-as-version pattern. Can someone help me out with this? How can I prevent my parser from catching such patterns?

TIA.

Topic stanford-nlp text-mining nlp python machine-learning

Category Data Science
