Convert natural language text to structured data

I'm developing a bot to help users identify apparel. The problem is to convert natural language text to structured data (a list of apparel items) and query the store's inventory to find the closest match for each item.

For example, consider the following user input to the bot.

"I would like to order regular fit blue jeans with hip size 32 inches"

and the desired output will be the following

[
  {
    "quantity": 1,
    "size": "32 inches",
    "category": "jeans",
    "attributes":[
      {"colour": "blue"},
      {"fit": "regular fit"}
    ]
  }
]

I've attempted to solve the problem by splitting it into two parts.

Part 1: Named entity recognition using conditional random fields (CRF). I've used the approach discussed here to tag individual tokens, and I'm able to extract entities such as apparel type, apparel size, and attributes.

Example output (representation) of the tagger:

I would like to order regular fit   blue   jeans     with hip size   32 inches
|                     |   |         |   |  |   |   |     |           |   |       |
+----------------------   +---------+   +--+   +---+     +-----------+   +-------+
OTHERS                    FIT           COLOR  CATEGORY  ATTR_TYPE       SIZE

Part 2: Rule-based grammar. Assuming a query from a user will always be a combination of the defined entities (type, color, fit, etc.), I've written rules that capture the sequence of tags and their respective tokens and transform them into the required format.

Following are a few examples of commonly occurring sequences:

OTHERS ~ FIT ~ COLOR ~ CATEGORY ~ ATTR_TYPE ~ SIZE ~ OTHERS
OTHERS ~ CATEGORY ~ OTHERS ~ COLOR ~ FIT ~ OTHERS ~ SIZE
OTHERS ~ COLOR ~ CATEGORY ~ FIT ~ SIZE ~ OTHERS ~ ATT_STYLE
QTY ~ COLOR ~ ATT_MATERIAL ~ CATEGORY ~ OTHERS
COLOR ~ FIT ~ ATT_STYLE ~ CATEGORY

I've made some assumptions and mined frequently occurring sequences to write these rules.
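For illustration, a rule of this kind can be sketched in Python roughly as follows. This is a simplified sketch under assumptions, not the actual implementation: `tagged`, `KNOWN_SEQUENCES`, `TAG_TO_ATTRIBUTE` and `apply_rules` are hypothetical names, the tagger output is hard-coded from the first example, the accepted sequences are only modelled on the mined ones, and quantity is simply defaulted to 1.

```python
# Hypothetical tagger output for the first example, as (text span, tag) pairs.
tagged = [
    ("I would like to order", "OTHERS"),
    ("regular fit", "FIT"),
    ("blue", "COLOR"),
    ("jeans", "CATEGORY"),
    ("with hip size", "ATTR_TYPE"),
    ("32 inches", "SIZE"),
]

# Tag sequences the rules accept (illustrative, modelled on the mined ones).
KNOWN_SEQUENCES = {
    ("OTHERS", "FIT", "COLOR", "CATEGORY", "ATTR_TYPE", "SIZE"),
    ("COLOR", "FIT", "ATT_STYLE", "CATEGORY"),
}

# Assumed mapping from attribute tags to output keys.
TAG_TO_ATTRIBUTE = {
    "COLOR": "colour",
    "FIT": "fit",
    "ATT_STYLE": "style",
    "ATT_MATERIAL": "material",
}


def apply_rules(tagged_spans):
    """Return one structured item, or None if the tag sequence is unknown."""
    sequence = tuple(tag for _, tag in tagged_spans)
    if sequence not in KNOWN_SEQUENCES:
        return None  # unseen pattern: a new rule would have to be written
    item = {"quantity": 1, "size": None, "category": None, "attributes": []}
    for text, tag in tagged_spans:
        if tag == "SIZE":
            item["size"] = text
        elif tag == "CATEGORY":
            item["category"] = text
        elif tag in TAG_TO_ATTRIBUTE:
            item["attributes"].append({TAG_TO_ATTRIBUTE[tag]: text})
    return item


item = apply_rules(tagged)
# Same entities in a different order: no rule matches.
reordered = apply_rules(list(reversed(tagged)))
print(item)
print(reordered)
```

The same entities in a different order fall through to None, so every new phrasing needs another hand-written sequence.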

The second part is not scalable and becomes a bottleneck. I cannot keep adding rules for capturing additional data points or handling new patterns that the system has not seen.

I'm looking for a generalized solution/data pipeline that can extract entities (relational) from natural language and convert them to structured data.

I would appreciate any ideas.

More examples to help understand the problem better:

Example 1:

"find jeans with black color, slim fit and size 28"

find a   jeans     with     black color,     slim fit    and       size 28
|    |   |   |     |  |     |         |      |      |    | |       |     | 
+----+   +---+     +--+     +---------+      +------+    +-+       +-----+
OTHERS   CATEGORY  OTHERS   COLOR            FIT         OTHERS    SIZE

[
  {
    "quantity": 1,
    "size": "28",
    "category": "jeans",
    "attributes":[
      {"colour": "black"},
      {"fit": "slim fit"}
    ]
  }
]

Example 2:

"I would like to find a white shirt, slim fit, XL with long sleeve, one maroon silk tie, and a black color regular fit flat front trousers"

I would like to find a  white   shirt,     slim fit, XL    with       long sleeve. 
|                    |  |   |   |   |      |      |  ||    |  |       |         |  
+--------------------+  +---+   +---+      +------+  ++    +--+       +---------+  
OTHERS                  COLOR   CATEGORY   FIT       SIZE  OTHERS     ATT_STYLE

one   maroon    silk         tie       and a 
| |   |    |    |  |         | |       |   | 
+-+   +----+    +--+         +-+       +---+ 
QTY   COLOR     ATT_MATERIAL CATEGORY  OTHERS

black color  regular fit   flat front   trousers
|         |  |         |   |        |   |      |
+---------+  +---------+   +--------+   +------+
COLOR        FIT           ATT_STYLE    CATEGORY

[
  {
    "quantity": 1,
    "size": "XL",
    "category": "shirt",
    "attributes":[
      {"colour": "white"},
      {"fit": "slim fit"},
      {"sleeve_length": "long sleeve"}
    ]
  },
  {
    "quantity": 1,
    "size": "STANDARD",
    "category": "tie",
    "attributes": [
      {"colour": "maroon"},
      {"material": "silk"}
    ]
  },
  {
    "quantity": 1,
    "size": null,
    "category": "trousers",
    "attributes":[
      {"fit": "regular fit"},
      {"style": "flat front"},
      {"colour": "black"}
    ]
  }
]

Edit 1: I'm trying to parse the sequence of entities and transform it into structured data using rules. The current rule-based system has limitations: maintaining the rules requires skilled experts, and the rules have to be manually crafted and extended all the time. Is there a way to overcome these limitations using ML, for example by replacing the rule-based parser with an ML-based parser?



  1. Prepare an Excel sheet with columns such as sentence, color, material, fit, style, sleeve_length, etc. Treat it as training data.

  2. Use spaCy (https://spacy.io/usage/spacy-101#training) to train a model: feed it each sentence along with its tags and the character spans of the tagged tokens, something like [{'red color shirt': [['color', (0, 3)], ['type', (10, 15)]]}].

  3. Follow the spaCy tutorial to learn how to create and train a model, then train one on this data.

  4. You can also use spaCy's fast rule-based matching (https://spacy.io/usage/rule-based-matching) on a simple English model by adding patterns to it.

  5. I would start with point 4: once the rule-based model is ready, you can run it on a batch of data to produce the Excel data mentioned in point 1 for training.
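Points 1 and 2 could be connected with a small helper like the one below. It is only a sketch: the row layout, the `rows` data and the `to_training_example` function are made up for illustration, and the (text, {"entities": [(start, end, label)]}) shape with end-exclusive character offsets is the one I understand spaCy's NER training examples to use, so check it against the spaCy version you install.

```python
# Hypothetical rows, as they might be read from the spreadsheet in point 1.
rows = [
    {"sentence": "red color shirt", "color": "red", "type": "shirt"},
    {"sentence": "one maroon silk tie", "color": "maroon",
     "material": "silk", "type": "tie"},
]


def to_training_example(row):
    """Turn one row into (text, {"entities": [(start, end, label), ...]})."""
    text = row["sentence"]
    entities = []
    for label, value in row.items():
        if label == "sentence" or not value:
            continue  # skip the sentence column and empty cells
        # str.find returns the first occurrence only; a value that also
        # appears inside another word would need smarter matching.
        start = text.find(value)
        if start == -1:
            continue  # value not present verbatim: skip rather than guess
        entities.append((start, start + len(value), label))
    entities.sort()  # order spans by their position in the sentence
    return text, {"entities": entities}


examples = [to_training_example(row) for row in rows]
for text, annotations in examples:
    print(text, annotations)
```

The same function can be run over the output of the rule-based matcher from point 4 to bootstrap the training sheet.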
