Unable to resolve TypeError when using tokenizer.tokenize from NLTK

I want to tokenize text data but cannot proceed because of a type error, and I am not sure how to fix it. For context, the columns 'Resolution code', 'Resolution Note', 'Description', and 'Short description' all contain English text. Here is the code I have written:

#Removal of Stop words:

    from nltk.tokenize import sent_tokenize, word_tokenize 
    from nltk.corpus import stopwords 
    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')        
    stop_words = set(stopwords.words('english'))

    tokenizer = RegexpTokenizer(r'\w+') 
    dfclean_imp_netc=pd.DataFrame() 
    for column in ['Resolution code','Resolution Note','Description','Shortdescription']:
        dfimpnetc[column] = dfimpnetc[column].apply(tokenizer.tokenize)        
    for column in ['Resolution code','Resolution Note','Description','Short description']:
        dfclean_imp_netc[column] = dfimpnetc[column].apply(lambda vec: [word for word in vec if word not in stop_words]) 
    dfimpnetc['Resolution Note'] = dfclean_imp_netc['Resolution Note'] 
    dfimpnetc['Description'] = dfclean_imp_netc['Description'] 
    dfimpnetc['Short description'] = dfclean_imp_netc['Short description'] 
    dfimpnetc['Resolution code'] = dfclean_imp_netc['Resolution code']  

My error output is attached below:



I agree with S van Balen that it's not clear where, or whether, you actually load the data. Even if you loaded it earlier, initializing a new DataFrame under the same variable name would overwrite it.

Anyway, assuming the rows and columns of dfclean_imp_netc have indeed been filled with values, I think the issue is that you initialize the frame as dfclean_imp_netc but then apply the tokenizer to a different variable, dfimpnetc. You need to move the assignments to dfimpnetc in front of the for loops, as shown in the snippet below.

Please note also that the two for loops are not assigning values to the same variable: the first loop updates dfimpnetc but the second updates dfclean_imp_netc. I get the sense you want to be updating the same variable in both cases. One more detail: your first loop spells the column 'Shortdescription' while the second spells it 'Short description'; the snippet below uses 'Short description' throughout.

    import pandas as pd
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    stop_words = set(stopwords.words('english'))

    dfclean_imp_netc = pd.DataFrame()

    # Assign the columns to dfimpnetc before the loops run.
    dfimpnetc['Resolution Note'] = dfclean_imp_netc['Resolution Note']
    dfimpnetc['Description'] = dfclean_imp_netc['Description']
    dfimpnetc['Short description'] = dfclean_imp_netc['Short description']
    dfimpnetc['Resolution code'] = dfclean_imp_netc['Resolution code']

    # Tokenize each column, then filter its stop words into dfclean_imp_netc.
    for column in ['Resolution code', 'Resolution Note', 'Description', 'Short description']:
        dfimpnetc[column] = dfimpnetc[column].apply(tokenizer.tokenize)
    for column in ['Resolution code', 'Resolution Note', 'Description', 'Short description']:
        dfclean_imp_netc[column] = dfimpnetc[column].apply(
            lambda vec: [word for word in vec if word not in stop_words])
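Separately, if you also want to lemmatize, something like the snippet below works. Two caveats: WhitespaceTokenizer expects raw strings, so this has to run while the columns still hold plain text (i.e. before the tokenization step above), and the lemmatizer here is assumed to be NLTK's WordNetLemmatizer.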

    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    lemmatizer = nltk.stem.WordNetLemmatizer()  # requires nltk.download('wordnet') once

    # Lemmatize every word in each (still untokenized) text column.
    for column in ['Resolution code', 'Resolution Note', 'Description', 'Short description']:
        dfimpnetc[column] = dfimpnetc[column].apply(
            lambda x: [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(x)])

Try the above code.
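For reference, here is a minimal, self-contained sketch of the same tokenize-and-filter pipeline on toy data; the column name and sample sentences are invented for illustration:

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer

    # nltk.download('stopwords')  # run once if the corpus is missing

    tokenizer = RegexpTokenizer(r'\w+')
    stop_words = set(stopwords.words('english'))

    df = pd.DataFrame({'Description': ['The server is down again',
                                       'User cannot log in to the portal']})

    # Split each cell into word tokens, then drop stop words
    # (lower-cased for the comparison, since the stop-word list is lower case).
    df['Description'] = df['Description'].apply(tokenizer.tokenize)
    df['Description'] = df['Description'].apply(
        lambda toks: [w for w in toks if w.lower() not in stop_words])

    print(df['Description'].tolist())

One thing to keep in mind: tokenizer.tokenize raises a TypeError when a cell holds a non-string value such as NaN, so if the error persists it is worth inspecting your columns with df[column].map(type).value_counts() first.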

Mark as correct if this helps ;)
