How to extract and classify data from a column in Excel?

I have a column in an Excel sheet where each cell contains several pieces of data separated by || delimiters. The pieces can be classified into classes such as Entity, IFSC code, transaction reference ID, etc.

A single cell looks like this:

EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5

Not every cell has the same number of classes or even the same type of classes. Another example:

COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5

I tried extracting this information using regular expressions and was able to get the ref-ids or IFSC codes extracted as a single list. But I need to break each cell into multiple cells, each holding one piece of information. If a cell does not have data for some class, that class should remain blank.

I also tried named entity recognition, but the same problem arises: I get a list of entities as output, not the per-cell breakdown.

Please help me identify what kind of problem this is. Is it text classification? And what would be the approach to solve it?

Topic text preprocessing named-entity-recognition classification python

Category Data Science


A simple yet powerful solution could look like this:

  1. Split and clean the data in your Excel sheet based on your delimiters
  2. Ensure that each piece of data is correctly mapped to its corresponding header in the sheet
  3. Store it in a suitable data structure, e.g. a dataframe or a 2D list
  4. Perform intent classification using a tool like RASA-NLU, where your classes (Entity, IFSC code, transaction reference ID) are the intents
  5. Map each piece of data to the intent RASA assigns to it and store the final results in a CSV file
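Steps 1 to 3 can be sketched in plain Python as below. This is only a minimal sketch, not a definitive implementation: the helper names `split_cell` and `to_rows` are made up for illustration, and the two cells are the examples from the question. It does not cover the RASA step.

```python
DELIMITER = "||"

def split_cell(cell):
    """Split one cell on the delimiter and trim whitespace around each piece."""
    return [part.strip() for part in cell.split(DELIMITER)]

def to_rows(cells):
    """Split every cell and pad short rows with blanks so all rows have equal width."""
    rows = [split_cell(c) for c in cells]
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]

# The two example cells from the question.
cells = [
    "EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S"
    "||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5",
    "COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C"
    "||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5",
]

rows = to_rows(cells)  # a 2D list: one inner list per original cell
```

The resulting 2D list has one row per original cell, and its position-based columns still need to be mapped to classes in the later steps.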

Note: You can read more about the RASA framework in its documentation.

Thanks !!


You need to perform a few preprocessing steps.

  1. Convert your Excel file into some sort of text file (CSV is probably the easiest)
  2. Manipulate the file with Python, either by reading the file directly or with libraries such as the csv module or pandas.
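The two steps above might look roughly like this with the standard csv module. It is a sketch under assumptions: the column name `narration` and the helper `read_fields` are hypothetical, the sample rows are shortened versions of the cells in the question, and an in-memory `io.StringIO` stands in for the exported file (with a real file you would pass `open(path, newline="")` instead).

```python
import csv
import io

# Stand-in for a CSV file exported from Excel.
# The header "narration" is a hypothetical column name.
exported = io.StringIO(
    "narration\n"
    "EFT INCOMING||0141201||NHFI0141201||UTR\n"
    "COMM/CHARGES/FEES||00.00||BLPY\n"
)

def read_fields(handle, column):
    """Yield each cell of the given column as a list of trimmed fields."""
    for record in csv.DictReader(handle):
        yield [field.strip() for field in record[column].split("||")]

fields = list(read_fields(exported, "narration"))

# Write one field per output cell so the result opens as separate cells in Excel.
out = io.StringIO()
csv.writer(out).writerows(fields)
```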

Last word of advice: Regular expressions are wonderful, but I think you might be using them for the wrong task. I strongly recommend that you take a programmatic approach.
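One possible reading of a programmatic approach is to split each cell on the delimiter first and only then decide, token by token, which class it belongs to, leaving a class blank when no token matches. The sketch below is illustrative only. The function `classify_cell`, the output keys, and both patterns are assumptions: the IFSC pattern assumes four letters, a zero, and six alphanumeric characters, and the reference pattern assumes the literal prefix `TRN REF NO:` seen in the examples. Small patterns are still used here, but only to test one already-isolated token at a time.

```python
import re

# Assumed patterns; adjust them to the real data.
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")    # 4 letters, a zero, 6 alphanumerics
TRN_REF_RE = re.compile(r"TRN REF NO:\s*(\S+)")  # value after the literal prefix

def classify_cell(cell):
    """Return one record per cell; classes with no matching token stay blank."""
    record = {"ifsc": "", "trn_ref": "", "other": []}
    for token in (t.strip() for t in cell.split("||")):
        ref = TRN_REF_RE.fullmatch(token)
        if ref:
            record["trn_ref"] = ref.group(1)
        elif IFSC_RE.fullmatch(token):
            record["ifsc"] = token
        elif token:
            # Anything unrecognised is kept so no information is lost.
            record["other"].append(token)
    return record

first = classify_cell(
    "EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S"
    "||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5"
)
second = classify_cell(
    "COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C"
    "||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5"
)
```

With these assumed rules, the first cell yields an IFSC code and a reference ID, while the second yields only a reference ID and leaves the IFSC field blank. Further classes can be added as extra branches once their formats are known.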
