Unable to resolve Type error using Tokenizer.tokenize from NLTK

I want to tokenize text data but cannot proceed because of a type error, and I do not know how to rectify it. For context, the columns 'Resolution code', 'Resolution Note', 'Description' and 'Shortdescription' all contain English text. Here is the code I have written:

# Removal of stop words
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
dfclean_imp_netc = pd.DataFrame()
…
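The type error usually comes from calling tokenize on a whole column, or on NaN cells, instead of on individual strings. Below is a minimal sketch under that assumption; the frame df and its single 'Description' column are hypothetical stand-ins for the real data.

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# nltk.download('stopwords') may be needed once
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

# hypothetical frame standing in for the real data
df = pd.DataFrame({'Description': ['Server restarted, issue resolved.', None]})

# cast to str so NaN/float cells do not raise a TypeError, then tokenize row by row
dfclean_imp_netc = pd.DataFrame()
dfclean_imp_netc['Description'] = (
    df['Description']
    .astype(str)
    .apply(tokenizer.tokenize)
    .apply(lambda toks: [t for t in toks if t.lower() not in stop_words])
)
print(dfclean_imp_netc['Description'].iloc[0])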
Category: Data Science

How to find possible subjects for given verb in everyday object domain

I am asking for tools (possibly in NLTK) or papers that address the following. Example input: Vase (Subject1), put (verb). Answer I am looking for: flower, water. Is there a tool that can output the subjects (objects) that can be associated with a given verb? (I was going through VerbNet but didn't find anything.)
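Not an answer to the selectional-preference question itself, but NLTK does ship a VerbNet reader, so the thematic roles and frames listed for a verb can at least be inspected. A rough sketch; the exact class IDs depend on the VerbNet version bundled with nltk_data.

import nltk
from nltk.corpus import verbnet as vn

# nltk.download('verbnet') may be needed once
class_ids = vn.classids(lemma='put')   # e.g. ['put-9.1', ...]
print(class_ids)

# print the roles, members and frames of the first class
print(vn.pprint(vn.vnclass(class_ids[0])))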
Topic: nltk nlp
Category: Data Science

Trying to compress text with NLP

For a university project, I need to send Spanish text via SMS. Since these have a cost, I am trying to compress the text in a rather inefficient way. It consists of first generating a permutation of two-character codes drawn from several alphabets (Finnish, Cyrillic, etc.), each of which I assign to a word longer than two characters (so that substituting it counts as compression). Then I take each word in a sentence and assign it its associated …
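A minimal sketch of the substitution scheme described above, with a hypothetical hand-made code table; real code generation over Finnish/Cyrillic character pairs would replace the toy dictionary.

# toy codebook: each word longer than two characters maps to a two-character code
codebook = {'hola': 'аб', 'mundo': 'вг', 'mensaje': 'де'}
decodebook = {code: word for word, code in codebook.items()}

def compress(sentence):
    # replace every known word with its two-character code, leave the rest untouched
    return ' '.join(codebook.get(w, w) for w in sentence.split())

def decompress(sentence):
    return ' '.join(decodebook.get(w, w) for w in sentence.split())

sms = compress('hola mundo este es un mensaje')
print(sms, '->', decompress(sms))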
Category: Data Science

Looking for a generalized (extended) lemmatizer

Whenever I lemmatize a compound word in English or German, I obtain a result that ignores the compound structure, e.g. for 'sidekicks' the NLTK WordNet lemmatizer returns 'sidekick', and for 'Eisenbahnfahrer' the NLTK German Snowball stemmer returns 'eisenbahnfahr'. What I need, however, is something that would extract the primary components of compound words: ['side', 'kick'] and, especially, ['eisen', 'bahn', 'fahr'] (or 'fahren', or whatever form the last item takes). I am especially interested in segmenting compound …
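For reference, the behaviour described above can be reproduced directly in NLTK; the German Snowball component is a stemmer rather than a lemmatizer, which is why it returns the stem 'eisenbahnfahr'.

from nltk.stem import WordNetLemmatizer, SnowballStemmer

# nltk.download('wordnet') may be needed once
print(WordNetLemmatizer().lemmatize('sidekicks'))         # 'sidekick'
print(SnowballStemmer('german').stem('Eisenbahnfahrer'))  # 'eisenbahnfahr'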
Topic: nltk nlp
Category: Data Science

Using VADER Sentiment Analysis makes distributions overlap: how to improve my model

I use VADER Sentiment Analysis on a "customer reviews" dataset. VADER breaks down feelings of satisfaction and dissatisfaction into negative, neutral and positive components. Plotting the distributions, I see that those of satisfied and dissatisfied customers overlap quite a bit. I would like to know whether I can improve my model: I imagined training it only on the "non-overlapping" dataset values; is this a correct method? Is there another method? I am a beginner, please be cool, thanks again for your …
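For context, this is roughly how the VADER scores are produced. The thresholds on the compound score (±0.05 here) are the commonly cited defaults rather than something taken from the question, and they could be tuned against the overlapping region.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon') may be needed once
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The delivery was late but the support team was wonderful")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

label = ('positive' if scores['compound'] >= 0.05
         else 'negative' if scores['compound'] <= -0.05
         else 'neutral')
print(label)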
Category: Data Science

Which Pointers from WordNet are Used for Synset in NLTK

I'm trying to create a custom parser for WordNet and hit a roadblock. I see that there are tons of different pointer_symbols, and many of them seem almost like synonyms, but not exactly. I'm trying to extract the synonyms for each word and I'm not sure which pointers should be considered. I also couldn't find out through NLTK which pointer_symbols it uses for this task. Any hints on what I should use?
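As far as I can tell (worth verifying against the NLTK source), NLTK does not derive synonyms from pointer symbols at all: the lemmas that share a synset are the synonym set, while pointer symbols surface as relation methods such as hypernyms() or antonyms(). A small sketch:

from nltk.corpus import wordnet as wn

# nltk.download('wordnet') may be needed once
for synset in wn.synsets('car'):
    # synonyms come from synset membership, not from pointers
    print(synset.name(), synset.lemma_names())

# pointer-based relations are exposed as separate methods on synsets/lemmas
print(wn.synsets('car')[0].hypernyms())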
Topic: nltk nlp
Category: Data Science

Stemming/lemmatization for German words

I have a huge dataset of German words and their frequency in a text corpus (so words like "der", "die", "das" have a very high frequency, whereas terminology-like words have a very low frequency). Different forms of the same word, such as plural or 3rd-person forms, do appear, but there is no guarantee that this happens for every word. I tried using spacy.load('de_core_news_sm') but it says it can't find the model. Other, older posts don't mention anything reliable in …
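Two hedged starting points: the spacy error usually just means the model was never downloaded, and NLTK's German Snowball stemmer works with no extra model at all, at the cost of producing stems rather than lemmas.

# download the model once from the shell:
#   python -m spacy download de_core_news_sm
import spacy
nlp = spacy.load('de_core_news_sm')
print([tok.lemma_ for tok in nlp('Die Kinder spielten im Garten')])

# model-free fallback: stemming only
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('german')
print(stemmer.stem('Kinder'), stemmer.stem('spielten'))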
Category: Data Science

How to find syntactic dependencies in text using unsupervised method and context information?

I know there are ready-made libraries for finding syntactic dependencies, and besides supervised methods I have studied some unsupervised dependency parsing approaches that use POS tags and other mathematical and statistical techniques to solve the problem. I am working on a challenge to find out whether there is any way to discover syntactic dependencies in an unsupervised manner, using only the co-occurrence of words with each other and their context information. For example, is there any way …
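Not a dependency parser, but a minimal sketch of the kind of co-occurrence statistics such unsupervised approaches typically start from; the window size and the toy corpus are arbitrary choices here.

from collections import Counter

sentences = [['the', 'dog', 'chased', 'the', 'cat'],
             ['the', 'cat', 'ate', 'the', 'fish']]

window = 2
cooc = Counter()
for sent in sentences:
    for i, word in enumerate(sent):
        # count every (word, context word) pair within the window to the right
        for context in sent[i + 1:i + 1 + window]:
            cooc[(word, context)] += 1

print(cooc.most_common(5))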
Category: Data Science

Natural language processing

I am new to NLP. I converted my JSON file to CSV in a Jupyter notebook, and I am unsure how to proceed with pre-processing my data using techniques such as tokenization, lemmatization, etc. I normalised the data before converting it to CSV, so now I have a data frame. How do I apply tokenisation to the whole dataset? Using the split() function gives me an error.
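A minimal sketch under the assumption that the data frame is called df and the text sits in a hypothetical column named 'text': str.split only works on a single string, so applying a tokenizer row by row (after casting to str) is usually what removes the error.

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# nltk.download('punkt') and nltk.download('wordnet') may be needed once
df = pd.DataFrame({'text': ['Cats are running fast', 'Dogs barked loudly']})  # toy data

lemmatizer = WordNetLemmatizer()
df['tokens'] = df['text'].astype(str).apply(word_tokenize)
df['lemmas'] = df['tokens'].apply(lambda toks: [lemmatizer.lemmatize(t.lower()) for t in toks])
print(df)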
Category: Data Science

Elbow method for cosine distance

I have clustered vectors by cosine distance using the NLTK clusterer. If I understand correctly, the Y axis for the elbow method under Euclidean distance would be the sum of every (squared) distance between the centroid of a cluster and the vectors that belong to that cluster. My question is: would it be the same for clusters built with cosine distance? EDIT: OK, so I tried the sum of squares with cosine distance and it seems to return the same values... here's my code: EDIT2: My …
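A sketch of the analogous elbow quantity under cosine distance, using NLTK's k-means clusterer. Summing the squared cosine distances to the assigned means is one reasonable choice rather than a canonical definition, and the random vectors below are placeholders.

import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

rng = np.random.default_rng(0)
vectors = [rng.normal(size=10) for _ in range(50)]  # toy data

inertia = []
for k in range(2, 8):
    clusterer = KMeansClusterer(k, distance=cosine_distance, repeats=5, avoid_empty_clusters=True)
    assignments = clusterer.cluster(vectors, assign_clusters=True)
    means = clusterer.means()
    # elbow quantity: sum of squared cosine distances from each vector to its centroid
    inertia.append(sum(cosine_distance(v, means[c]) ** 2 for v, c in zip(vectors, assignments)))

print(inertia)  # plot against k and look for the elbow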
Category: Data Science

How can I find synonyms and antonyms for a word?

I found some code online where I can feed in a word and find both synonyms and antonyms for this word. The code below does just that.

import nltk
from nltk.corpus import wordnet  # import WordNet from NLTK

syn = list()
ant = list()
for synset in wordnet.synsets("fake"):
    for lemma in synset.lemmas():
        syn.append(lemma.name())    # add the synonyms
        if lemma.antonyms():        # when antonyms are available, add them to the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))

My question is, how …
Topic: nltk nlp python
Category: Data Science

Converting to lowercase while creating dataset for NER using spacy

I am trying to make a custom entity model for an NER application using spacy. In several NLP projects I have converted all the data to lowercase and then applied various ML techniques. Should I also convert the data to lowercase for NER, and why would that be necessary? Is it mandatory, and will it adversely affect the accuracy of the model if I do not convert to lowercase?
Category: Data Science

Need help to increase classification accuracy for classified ads posting

I have to predict the category under which an ad was posted using the provided data, but I cannot get my model above 74% accuracy and I am not sure what I am missing. What I have done so far:
- Cleaned the text using re & nltk
- Used a stemmer
- Used CountVectorizer & TfidfTransformer
- Used MultinomialNB, LinearSVC & RandomForestClassifier
Following is my code:

import json
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from …
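A minimal sketch of the bag-of-words pipeline described above; the toy ads and labels are hypothetical, and the real cleaning/stemming step from the question would slot in before vectorisation.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_text = ['selling used bike in good condition',
          'two bedroom apartment for rent',
          'iphone 11 barely used, original box']   # toy ads
y = ['vehicles', 'housing', 'electronics']          # toy categories

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

pipeline.fit(X_text, y)
print(pipeline.predict(['renting a studio flat downtown']))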
Category: Data Science

nltk.corpus for data science related words?

From job descriptions I scraped from the internet, I went through all the NLP pre-processing steps and got to the point where I have:

freq = nltk.FreqDist(lemmatized_list)
most_freq_words = freq.most_common(100)

which outputs:

[('data', 179), ('experience', 86), ('work', 78), ('business', 71), ('team', 59), ('learn', 56), ('model', 49), ('skills', 47), ('science', 41), ('use', 41), ('build', 39), ('machine', 37), ('ability', 36), .....]

and so on. My problem is that I do not want to consider words like "experience" and "work", and only want to consider keywords related to data science. …
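I am not aware of an nltk.corpus of data-science vocabulary, so one pragmatic option is a hand-made whitelist (or, inversely, an extended stop-list). A sketch with a hypothetical keyword set and a stand-in token list:

import nltk

lemmatized_list = ['data', 'experience', 'work', 'model', 'machine', 'data', 'model']  # stand-in tokens

# hypothetical domain whitelist; in practice curated by hand or taken from a glossary
ds_keywords = {'data', 'model', 'machine', 'learning', 'python', 'statistics', 'regression'}

freq = nltk.FreqDist(w for w in lemmatized_list if w in ds_keywords)
print(freq.most_common(10))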
Topic: nltk nlp python
Category: Data Science

Compare Books using book categories list NLP

I have a database of books. Each book has a list of categories that describe its genre/topics (I use Python models). Most of the time the categories in the list consist of 1 to 3 words. Examples of book category lists: ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'], ["Children's stories", 'Christian life'], ['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'], ['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', …
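One simple way to compare two books is set overlap of their category lists, e.g. Jaccard similarity; a sketch using two of the example lists above, as a baseline rather than the required method:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

book1 = ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life']
book2 = ['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life']

print(jaccard(book1, book2))  # share of categories the two books have in common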
Category: Data Science

How do I split contents in a text that would include two or more different themes (context) in NLP?

For example, given the text: "The airlines have affected by Corona since march 2020 a crime has been detected in Noia village this morning", the output should be the two themes separated: "The airline companies have affected by Corona since march 2020" and "a crime has been detected in Noia village this morning". The text has no breaks. I know it is not a one-click solution, but if anyone knows a methodology or techniques for solving such a problem, please point me to resources.
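NLTK's TextTiling tokenizer is one off-the-shelf topic-segmentation method worth trying on longer inputs; it needs paragraph-sized text and will not work on a single short sentence like the example. A sketch with a synthetic two-topic document standing in for real data:

from nltk.tokenize import TextTilingTokenizer

# nltk.download('stopwords') may be needed once
tt = TextTilingTokenizer()

# synthetic long document: one block about airlines, one about a local crime
para1 = "Airlines have been hit hard by Corona since March 2020 and flights were cancelled. " * 15
para2 = "Police reported that a crime was detected in the village early this morning. " * 15
long_text = para1 + "\n\n" + para2

segments = tt.tokenize(long_text)
print(len(segments), 'topic segment(s) found')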
Category: Data Science

find bigrams in pandas

I have a DataFrame with 4 columns: 'Headline', 'Body_ID', 'Stance', 'articleBody', with 'Headline' and 'articleBody' containing cleaned and tokenized words. I want to find bi-grams using nltk and have this so far:

bigram_measures = nltk.collocations.BigramAssocMeasures()
articleBody_biGram_finder = df_2['articleBody'].apply(lambda x: BigramCollocationFinder.from_words(x))

I'm having trouble with the last step of applying articleBody_biGram_finder with bigram_measures. I've tried multiple iterations of lambda with list comprehensions but am getting nowhere. My most recent attempts:

df_2['articleBody_scored'] = score_ngrams(bigram_measures.raw_freq) for item in articleBody_biGram_finder
df_2['articleBody_scored'] = articleBody_biGram_finder.apply(lambda …
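A minimal sketch of the missing step, assuming df_2['articleBody'] holds lists of tokens (the frame and column names are taken from the question, the toy rows are not): each BigramCollocationFinder exposes score_ngrams, so it can be applied row by row.

import nltk
import pandas as pd
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# toy stand-in for df_2
df_2 = pd.DataFrame({'articleBody': [['the', 'stock', 'market', 'fell', 'the', 'stock', 'rose'],
                                     ['officials', 'denied', 'the', 'report']]})

finders = df_2['articleBody'].apply(BigramCollocationFinder.from_words)

# score each row's bigrams by raw frequency: a list of ((w1, w2), score) pairs per cell
df_2['articleBody_scored'] = finders.apply(lambda finder: finder.score_ngrams(bigram_measures.raw_freq))
print(df_2['articleBody_scored'].iloc[0])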
Category: Data Science

train NER using NLTK with custom corpora (non-English): must I use StanfordNER?

I have searched for how to customise NER corpora for training a model with the NLTK library in Python, but all of the answers point to NLTK book chapter 7, and honestly that leaves me confused about how to train on a corpus with the correct flow when the data set is structured like this:

Eddy          N     B-PER
Bonte         N     I-PER
is            V     O
woordvoerder  N     O
van           Prep  O
diezelfde     Pron  O
Hogeschool    N     B-ORG
.             Punc  O

I have some questions: I found so …
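One way to load data in exactly that three-column layout is NLTK's ConllCorpusReader, which yields IOB-tagged sentences that can then feed whatever tagger/chunker gets trained; the file name and directory below are hypothetical.

from nltk.corpus.reader.conll import ConllCorpusReader

# assumes a file ner_train.txt in the current directory with lines like:
#   Eddy N B-PER
#   Bonte N I-PER
#   ...
reader = ConllCorpusReader('.', 'ner_train.txt', columntypes=('words', 'pos', 'chunk'))

for sent in reader.iob_sents():
    print(sent)  # list of (token, POS, IOB-tag) triples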
Category: Data Science

What is "Interpolated Absolute Discounting" smoothing method

I'm asked to implement "Interpolated Absolute Discounting" for a bigram language model over a text. First, I don't know exactly what it is; I guess it is an interpolation between different n-gram orders (unigram, bigram, …) whose parameters need to be learned. Second, which probability distribution implements this technique in the nltk package? Moreover, I must learn the parameters from a corpus. How can I do that?
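As I understand it, absolute discounting subtracts a fixed discount $d$ from every observed bigram count and hands the freed probability mass to a lower-order distribution, roughly $$ P(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \, P(w_i) $$ Recent NLTK versions ship an implementation in nltk.lm; a hedged sketch (availability of the class depends on the NLTK version installed):

from nltk.lm import AbsoluteDiscountingInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['a', 'cat', 'ran']]  # toy tokenized sentences

train_data, vocab = padded_everygram_pipeline(2, corpus)  # order-2 (bigram) pipeline
lm = AbsoluteDiscountingInterpolated(2, discount=0.75)
lm.fit(train_data, vocab)

print(lm.score('cat', ['the']))  # P(cat | the) under interpolated absolute discounting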
Topic: nltk
Category: Data Science

Weighting of words in lexicon based sentiment analysis

I have a question regarding my current project. I am trying to do a lexicon-based sentiment analysis on my data, where I calculate the sentiment score as follows: $$ Score = \frac{\sum_{i}{word_i}}{\lvert words \rvert} $$ According to the score, the text is then classified as either negative or positive. But I have also calculated, for every word in the article, its salience and frequency, and I would like to know whether it is possible to use them in my …
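One possible extension, purely as a sketch and not an established formula, is to weight each word's lexicon polarity by its salience (or frequency) and normalise by the total weight, i.e. $$ Score = \frac{\sum_{i} s_i \cdot polarity_i}{\sum_{i} s_i} $$ where $s_i$ is the salience of word $i$. All the numbers below are hypothetical.

# toy lexicon polarities and per-word salience values (both hypothetical)
polarity = {'good': 1.0, 'bad': -1.0, 'terrible': -1.0, 'service': 0.0}
salience = {'good': 0.2, 'bad': 0.9, 'terrible': 0.95, 'service': 0.1}

words = ['good', 'service', 'terrible']

total_weight = sum(salience.get(w, 0.0) for w in words)
weighted_sum = sum(salience.get(w, 0.0) * polarity.get(w, 0.0) for w in words)
score = weighted_sum / total_weight if total_weight else 0.0
print(score)  # negative here because the highly salient word 'terrible' dominates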
Category: Data Science
