Natural language processing

I am new to NLP. I converted my JSON file to CSV in a Jupyter notebook, and I normalised the data before converting it, so I now have a data frame. I am unsure how to pre-process the data using techniques such as tokenisation and lemmatisation. How do I apply tokenisation to the whole dataset? Using the split() function gives me an error.

Topic nltk deep-learning pandas nlp machine-learning

Category Data Science


Ok, something like this should work:

import json
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Load the raw JSON: a dict mapping dataset names to lists of (sentence, label) pairs
with open('data_full.json', 'r') as f:
    rawdata = json.load(f)

for dataset, instances in rawdata.items():
    for instance in instances:
        sentence, label = instance[0], instance[1]
        # Split the sentence into word tokens
        tokens = word_tokenize(sentence)
        print('in', dataset, ':', '|'.join(tokens), '; label:', label)
        # Reduce each token to its dictionary form (lemma)
        lemmas = [lemmatizer.lemmatize(token) for token in tokens]
        print('          lemmas =', '|'.join(lemmas))

Note: you will probably need to install a few resources for nltk, e.g. nltk.download('punkt') for the tokenizer and nltk.download('wordnet') for the lemmatizer; follow the instructions given in the error messages.
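About the split() error: since you already have the data in a pandas DataFrame, calling split() directly on a column fails because a Series has no split() method; you have to go through the .str accessor or apply() so the function runs on each string individually. A minimal sketch, assuming a hypothetical column named 'text':

```python
import pandas as pd

# Hypothetical example data; substitute your own DataFrame here
df = pd.DataFrame({'text': ['The cats are sitting', 'Dogs were running']})

# df['text'].split() raises AttributeError: a Series has no split() method.
# Use the .str accessor to call str.split on every row instead:
df['tokens'] = df['text'].str.split()

# Or apply any per-string tokenizer, e.g. nltk's word_tokenize:
# df['tokens'] = df['text'].apply(word_tokenize)

print(df['tokens'].tolist())
# → [['The', 'cats', 'are', 'sitting'], ['Dogs', 'were', 'running']]
```

The same apply() pattern works for lemmatization: map each row's token list through the lemmatizer in a list comprehension.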
