How to classify named entities of the same type?

I am doing a project where I am extracting date/time entities from text. I'm using a rule-based system to extract the temporal expressions and ground them to an actual date/time.

The second part of the problem I hope to solve is to label the role of each entity discovered. For example, consider the following text: "Leaving at 2pm and back at 4pm". I correctly identified 2pm and 4pm as date/time entities. However, I'm unable to say whether each entity is "start-time", "end-time", or neither.

The question is how do I do this?

I'm new to NLP and ML. Here is an idea I have; please tell me if I'm going in the right direction:

The plan is to train a logistic regression (or naive bayes?) classifier using the following features:

  1. The average of the word embedding for each word within a window of the date/time phrase.
  2. The POS tags for each word within a window of the date/time phrase?? (I'm not sure how to pass these in to a logistic regression classifier, but it's just a thought.)
  3. The word shapes of the words in the temporal expression??

I'm a little confused as to where to start and would really appreciate some pointers on how to select my features and what classifier would be appropriate.

I'm also open to suggestions on learning resources. There are a lot of NER resources online, but not many on how to "role classify" the entities found.

Topic named-entity-recognition nlp machine-learning

Category Data Science


You should apply the same plan that you used to extract times to categorize those times:

  1. Start with a rule-based system
  2. Then try a machine learning approach
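As a rough illustration of the rule-based first step, here is a minimal sketch that labels a time expression by looking for cue words in the few tokens before it. The cue lists, the `classify_role` helper, and the simple time regex are all assumptions made up for this example; a real system would reuse the spans from your existing extractor and need a much larger, tuned set of rules.

```python
import re

# Hypothetical cue words (assumption): extend and tune these for your data.
START_CUES = {"leaving", "leave", "from", "starting", "start", "depart", "departing"}
END_CUES = {"back", "until", "till", "to", "ending", "end", "return", "returning"}


def classify_role(text, span_start, window=3):
    """Label a time expression by the cue words in the few tokens before it."""
    # Keep only the last `window` alphabetic tokens that precede the time span.
    preceding = re.findall(r"[a-z]+", text[:span_start].lower())[-window:]
    # Scan from the nearest token outwards so that the closest cue wins.
    for token in reversed(preceding):
        if token in START_CUES:
            return "start-time"
        if token in END_CUES:
            return "end-time"
    return "neither"


text = "Leaving at 2pm and back at 4pm"
# Toy time pattern standing in for the real rule-based extractor.
for match in re.finditer(r"\d{1,2}(?::\d{2})?\s?(?:am|pm)", text):
    print(match.group(), classify_role(text, match.start()))
```

Rules like these are cheap to write and easy to debug, and the cases they get wrong tell you which contextual features a later classifier will need.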

To build a machine learning model, you'll need a collection of text in which each time expression is labeled as "start-time", "end-time", or "neither". You can first try traditional algorithms like logistic regression or naive Bayes. Given that this is a relatively nuanced problem, since you are doing conditional classification, you might have to build a more complex system that uses contextual information, such as a conditional random field (CRF).
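For the traditional classifiers, the contextual information has to be turned into features first. The sketch below shows one possible way to do that: it collects the words in a small window around the time expression into a dictionary. The `context_features` helper, the whitespace tokenization, and the window size are assumptions for illustration, not a recommended feature set.

```python
def context_features(tokens, start, end, window=2):
    """Build features for the time expression occupying tokens[start:end]."""
    features = {}
    for offset in range(1, window + 1):
        left = start - offset        # index of a token before the expression
        right = end - 1 + offset     # index of a token after the expression
        features[f"word[-{offset}]"] = tokens[left].lower() if left >= 0 else "<pad>"
        features[f"word[+{offset}]"] = tokens[right].lower() if right < len(tokens) else "<pad>"
    return features


# Naive whitespace tokenization (assumption); use your own tokenizer in practice.
tokens = "Leaving at 2pm and back at 4pm".split()
print(context_features(tokens, 2, 3))  # features for "2pm"
print(context_features(tokens, 6, 7))  # features for "4pm"
```

Categorical dictionaries like these can be one-hot encoded (for example with scikit-learn's `DictVectorizer`) before being passed to logistic regression or naive Bayes, and POS tags can be added to the same dictionary in the same way.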


You might be interested in resources built around TimeML. I think there are some corpora and specific parsers specialized in extracting the time details of events. I don't remember any specifics, but I tried to google "timeml extract time" and found a few related resources that might give you at least some inspiration about how people have dealt with similar problems.

If you don't find anything that suits your needs, in general the best approach would be to train a custom NER model using your own annotated dataset with "start-time", "end-time", and "neither" labels.
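To make the annotated dataset concrete, here is a small sketch that converts labeled token spans into per-token tags. It assumes the common BIO tagging scheme, which many NER toolkits accept in some form; the `to_bio` helper and the span format are hypothetical and only meant to show what the training data could look like.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


tokens = "Leaving at 2 pm and back at 4 pm".split()
# Hand-written annotations (assumption): token ranges with their role labels.
spans = [(2, 4, "start-time"), (7, 9, "end-time")]
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(token, tag)
```

With data in this shape, the role label is predicted together with the entity boundary, so the model can use the whole sentence as context.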
