Unable to resolve Type error using Tokenizer.tokenize from NLTK

I want to tokenize text data but cannot proceed because of a type error, and I do not know how to rectify it. For context, the columns 'Resolution code', 'Resolution Note', 'Description' and 'Shortdescription' all contain English text. Here is the code I have written:

# Removal of stop words
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
dfclean_imp_netc = pd.DataFrame()
…
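The type error usually comes from calling tokenize on a whole column, or on NaN cells, instead of on individual strings. Below is a minimal sketch under that assumption; the frame df and its single 'Description' column are hypothetical stand-ins for the real data.

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# nltk.download('stopwords') may be needed once
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

# hypothetical frame standing in for the real data
df = pd.DataFrame({'Description': ['Server restarted, issue resolved.', None]})

# cast to str so NaN/float cells do not raise a TypeError, then tokenize row by row
dfclean_imp_netc = pd.DataFrame()
dfclean_imp_netc['Description'] = (
    df['Description']
    .astype(str)
    .apply(tokenizer.tokenize)
    .apply(lambda toks: [t for t in toks if t.lower() not in stop_words])
)
print(dfclean_imp_netc['Description'].iloc[0])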
Category: Data Science

How to find possible subjects for given verb in everyday object domain

I am asking for tools (possibly in NLTK) or papers that address the following. Example input: Vase (Subject1), put (verb). Answer I am looking for: flower, water. Is there a tool that can output the subjects (objects) that can be associated with a given verb? (I was going through VerbNet but didn't find anything.)
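Not an answer to the selectional-preference question itself, but NLTK does ship a VerbNet reader, so the thematic roles and frames listed for a verb can at least be inspected. A rough sketch; the exact class IDs depend on the VerbNet version bundled with nltk_data.

import nltk
from nltk.corpus import verbnet as vn

# nltk.download('verbnet') may be needed once
class_ids = vn.classids(lemma='put')   # e.g. ['put-9.1', ...]
print(class_ids)

# print the roles, members and frames of the first class
print(vn.pprint(vn.vnclass(class_ids[0])))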
Topic: nltk nlp
Category: Data Science

Trying to compress text with NLP

For a university project, I need to send Spanish text via SMS. Since these have a cost, I am trying to compress the text in a rather inefficient way. It consists of first generating a permutation of two-character codes drawn from several alphabets (Finnish, Cyrillic, etc.), each of which I assign to a word longer than two characters (so that substituting it counts as compression). Then I take each word in a sentence and assign it its associated …
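A minimal sketch of the substitution scheme described above, with a hypothetical hand-made code table; real code generation over Finnish/Cyrillic character pairs would replace the toy dictionary.

# toy codebook: each word longer than two characters maps to a two-character code
codebook = {'hola': 'аб', 'mundo': 'вг', 'mensaje': 'де'}
decodebook = {code: word for word, code in codebook.items()}

def compress(sentence):
    # replace every known word with its two-character code, leave the rest untouched
    return ' '.join(codebook.get(w, w) for w in sentence.split())

def decompress(sentence):
    return ' '.join(decodebook.get(w, w) for w in sentence.split())

sms = compress('hola mundo este es un mensaje')
print(sms, '->', decompress(sms))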
Category: Data Science

Looking for a generalized (extended) lemmatizer

Whenever I lemmatize a compound word in English or German, I obtain a result that ignores the compound structure, e.g. for 'sidekicks' the NLTK WordNet lemmatizer returns 'sidekick', and for 'Eisenbahnfahrer' the NLTK German Snowball stemmer returns 'eisenbahnfahr'. What I need, however, is something that would extract the primary components of compound words: ['side', 'kick'] and, especially, ['eisen', 'bahn', 'fahr'] (or 'fahren', or whatever form the last item takes). I am especially interested in segmenting compound …
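For reference, the behaviour described above can be reproduced directly in NLTK; the German Snowball component is a stemmer rather than a lemmatizer, which is why it returns the stem 'eisenbahnfahr'.

from nltk.stem import WordNetLemmatizer, SnowballStemmer

# nltk.download('wordnet') may be needed once
print(WordNetLemmatizer().lemmatize('sidekicks'))         # 'sidekick'
print(SnowballStemmer('german').stem('Eisenbahnfahrer'))  # 'eisenbahnfahr'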
Topic: nltk nlp
Category: Data Science

Using VADER Sentiment Analysis makes distributions overlap: how to improve my model

I use VADER Sentiment Analysis on a "customer reviews" dataset. VADER breaks down feelings of satisfaction and dissatisfaction into negative, neutral and positive components. Plotting the distributions, I see that those of satisfied and dissatisfied customers overlap quite a bit. I would like to know whether I can improve my model: I imagined training it only on the "non-overlapping" dataset values; is this a correct method? Is there another method? I am a beginner, please be cool, thanks again for your …
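For context, this is roughly how the VADER scores are produced. The thresholds on the compound score (±0.05 here) are the commonly cited defaults rather than something taken from the question, and they could be tuned against the overlapping region.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon') may be needed once
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The delivery was late but the support team was wonderful")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

label = ('positive' if scores['compound'] >= 0.05
         else 'negative' if scores['compound'] <= -0.05
         else 'neutral')
print(label)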
Category: Data Science

Which Pointers from WordNet are Used for Synset in NLTK

I'm trying to create a custom parser for WordNet and hit a roadblock. I see that there are tons of different pointer_symbols, and many of them seem almost like synonyms, but not exactly. I'm trying to extract the synonyms for each word and I'm not sure which pointers should be considered. I also couldn't find out through NLTK which pointer_symbols it uses for this task. Any hints on what I should use?
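As far as I can tell (worth verifying against the NLTK source), NLTK does not derive synonyms from pointer symbols at all: the lemmas that share a synset are the synonym set, while pointer symbols surface as relation methods such as hypernyms() or antonyms(). A small sketch:

from nltk.corpus import wordnet as wn

# nltk.download('wordnet') may be needed once
for synset in wn.synsets('car'):
    # synonyms come from synset membership, not from pointers
    print(synset.name(), synset.lemma_names())

# pointer-based relations are exposed as separate methods on synsets/lemmas
print(wn.synsets('car')[0].hypernyms())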
Topic: nltk nlp
Category: Data Science

Stemming/lemmatization for German words

I have a huge dataset of German words and their frequency in a text corpus (so words like "der", "die", "das" have a very high frequency, whereas terminology-like words have a very low frequency). Different forms of the same word, such as plural or 3rd-person forms, do appear, but there is no guarantee that this happens for every word. I tried using spacy.load('de_core_news_sm') but it says it can't find the model. Other, older posts don't mention anything reliable in …
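Two hedged starting points: the spacy error usually just means the model was never downloaded, and NLTK's German Snowball stemmer works with no extra model at all, at the cost of producing stems rather than lemmas.

# download the model once from the shell:
#   python -m spacy download de_core_news_sm
import spacy
nlp = spacy.load('de_core_news_sm')
print([tok.lemma_ for tok in nlp('Die Kinder spielten im Garten')])

# model-free fallback: stemming only
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('german')
print(stemmer.stem('Kinder'), stemmer.stem('spielten'))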
Category: Data Science

How to find syntactic dependencies in text using unsupervised method and context information?

I know there are ready-made libraries for finding syntactic dependencies, and besides supervised methods I have studied some unsupervised dependency parsing approaches that use POS tags and other mathematical and statistical techniques to solve the problem. I am working on a challenge to find out whether there is any way to discover syntactic dependencies in an unsupervised manner, using only the co-occurrence of words with each other and their context information. For example, is there any way …
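Not a dependency parser, but a minimal sketch of the kind of co-occurrence statistics such unsupervised approaches typically start from; the window size and the toy corpus are arbitrary choices here.

from collections import Counter

sentences = [['the', 'dog', 'chased', 'the', 'cat'],
             ['the', 'cat', 'ate', 'the', 'fish']]

window = 2
cooc = Counter()
for sent in sentences:
    for i, word in enumerate(sent):
        # count every (word, context word) pair within the window to the right
        for context in sent[i + 1:i + 1 + window]:
            cooc[(word, context)] += 1

print(cooc.most_common(5))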
Category: Data Science

Natural language processing

I am new to NLP. I converted my JSON file to CSV in a Jupyter notebook, and I am unsure how to proceed with pre-processing my data using techniques such as tokenization, lemmatization, etc. I normalised the data before converting it to CSV, so now I have a data frame. How do I apply tokenisation to the whole dataset? Using the split() function gives me an error.
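A minimal sketch under the assumption that the data frame is called df and the text sits in a hypothetical column named 'text': str.split only works on a single string, so applying a tokenizer row by row (after casting to str) is usually what removes the error.

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# nltk.download('punkt') and nltk.download('wordnet') may be needed once
df = pd.DataFrame({'text': ['Cats are running fast', 'Dogs barked loudly']})  # toy data

lemmatizer = WordNetLemmatizer()
df['tokens'] = df['text'].astype(str).apply(word_tokenize)
df['lemmas'] = df['tokens'].apply(lambda toks: [lemmatizer.lemmatize(t.lower()) for t in toks])
print(df)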
Category: Data Science

Elbow method for cosine distance

I have clustered vectors by cosine distance using the NLTK clusterer. If I understand correctly, the Y axis for the elbow method under Euclidean distance would be the sum of every (squared) distance between the centroid of a cluster and the vectors that belong to that cluster. My question is: would it be the same for clusters built with cosine distance? EDIT: OK, so I tried the sum of squares with cosine distance and it seems to return the same values... here's my code: EDIT2: My …
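A sketch of the analogous elbow quantity under cosine distance, using NLTK's k-means clusterer. Summing the squared cosine distances to the assigned means is one reasonable choice rather than a canonical definition, and the random vectors below are placeholders.

import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

rng = np.random.default_rng(0)
vectors = [rng.normal(size=10) for _ in range(50)]  # toy data

inertia = []
for k in range(2, 8):
    clusterer = KMeansClusterer(k, distance=cosine_distance, repeats=5, avoid_empty_clusters=True)
    assignments = clusterer.cluster(vectors, assign_clusters=True)
    means = clusterer.means()
    # elbow quantity: sum of squared cosine distances from each vector to its centroid
    inertia.append(sum(cosine_distance(v, means[c]) ** 2 for v, c in zip(vectors, assignments)))

print(inertia)  # plot against k and look for the elbow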
Category: Data Science

How can I find synonyms and antonyms for a word?

I found some code online where I can feed in a word and find both synonyms and antonyms for this word. The code below does just that.

import nltk
from nltk.corpus import wordnet  # import WordNet from NLTK

syn = list()
ant = list()
for synset in wordnet.synsets("fake"):
    for lemma in synset.lemmas():
        syn.append(lemma.name())    # add the synonyms
        if lemma.antonyms():        # when antonyms are available, add them to the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))

My question is, how …
Topic: nltk nlp python
Category: Data Science

Converting to lowercase while creating dataset for NER using spacy

I am trying to make a custom entity model for an NER application using spacy. In several NLP projects I have converted all the data to lowercase and then applied various ML techniques. Should I also convert the data to lowercase for NER, and why would that be necessary? Is it mandatory, and will it adversely affect the accuracy of the model if I do not convert to lowercase?
Category: Data Science

Need help to increase classification accuracy for classified ads posting

I have to predict the category under which an ad was posted using the provided data, but I cannot get my model above 74% accuracy and I am not sure what I am missing. What I have done so far:
- Cleaned the text using re & nltk
- Used a stemmer
- Used CountVectorizer & TfidfTransformer
- Used MultinomialNB, LinearSVC & RandomForestClassifier
Following is my code:

import json
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from …
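A minimal sketch of the bag-of-words pipeline described above; the toy ads and labels are hypothetical, and the real cleaning/stemming step from the question would slot in before vectorisation.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_text = ['selling used bike in good condition',
          'two bedroom apartment for rent',
          'iphone 11 barely used, original box']   # toy ads
y = ['vehicles', 'housing', 'electronics']          # toy categories

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

pipeline.fit(X_text, y)
print(pipeline.predict(['renting a studio flat downtown']))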
Category: Data Science

nltk.corpus for data science related words?

From job descriptions I scraped from the internet, I went through all the NLP pre-processing steps and got to the point where I have:

freq = nltk.FreqDist(lemmatized_list)
most_freq_words = freq.most_common(100)

which outputs:

[('data', 179), ('experience', 86), ('work', 78), ('business', 71), ('team', 59), ('learn', 56), ('model', 49), ('skills', 47), ('science', 41), ('use', 41), ('build', 39), ('machine', 37), ('ability', 36), .....]

and so on. My problem is that I do not want to consider words like "experience" and "work", and only want to consider keywords related to data science. …
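I am not aware of an nltk.corpus of data-science vocabulary, so one pragmatic option is a hand-made whitelist (or, inversely, an extended stop-list). A sketch with a hypothetical keyword set and a stand-in token list:

import nltk

lemmatized_list = ['data', 'experience', 'work', 'model', 'machine', 'data', 'model']  # stand-in tokens

# hypothetical domain whitelist; in practice curated by hand or taken from a glossary
ds_keywords = {'data', 'model', 'machine', 'learning', 'python', 'statistics', 'regression'}

freq = nltk.FreqDist(w for w in lemmatized_list if w in ds_keywords)
print(freq.most_common(10))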
Topic: nltk nlp python
Category: Data Science

Compare Books using book categories list NLP

I have a database of books. Each book has a list of categories that describe its genre/topics (I use Python models). Most of the time the categories in the list consist of 1 to 3 words. Examples of book category lists: ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'], ["Children's stories", 'Christian life'], ['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'], ['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', …
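One simple way to compare two books is set overlap of their category lists, e.g. Jaccard similarity; a sketch using two of the example lists above, as a baseline rather than the required method:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

book1 = ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life']
book2 = ['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life']

print(jaccard(book1, book2))  # share of categories the two books have in common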
Category: Data Science

How do I split contents in a text that would include two or more different themes (context) in NLP?

For example, given the text: "The airlines have affected by Corona since march 2020 a crime has been detected in Noia village this morning", the output should be the two themes separated: "The airline companies have affected by Corona since march 2020" and "a crime has been detected in Noia village this morning". The text has no breaks. I know it is not a one-click solution, but if anyone knows a methodology or techniques for solving such a problem, please point me to resources.
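NLTK's TextTiling tokenizer is one off-the-shelf topic-segmentation method worth trying on longer inputs; it needs paragraph-sized text and will not work on a single short sentence like the example. A sketch with a synthetic two-topic document standing in for real data:

from nltk.tokenize import TextTilingTokenizer

# nltk.download('stopwords') may be needed once
tt = TextTilingTokenizer()

# synthetic long document: one block about airlines, one about a local crime
para1 = "Airlines have been hit hard by Corona since March 2020 and flights were cancelled. " * 15
para2 = "Police reported that a crime was detected in the village early this morning. " * 15
long_text = para1 + "\n\n" + para2

segments = tt.tokenize(long_text)
print(len(segments), 'topic segment(s) found')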
Category: Data Science

find bigrams in pandas

I have a DataFrame with 4 columns: 'Headline', 'Body_ID', 'Stance', 'articleBody', with 'Headline' and 'articleBody' containing cleaned and tokenized words. I want to find bi-grams using nltk and have this so far:

bigram_measures = nltk.collocations.BigramAssocMeasures()
articleBody_biGram_finder = df_2['articleBody'].apply(lambda x: BigramCollocationFinder.from_words(x))

I'm having trouble with the last step of applying articleBody_biGram_finder with bigram_measures. I've tried multiple iterations of lambda with list comprehensions but am getting nowhere. My most recent attempts:

df_2['articleBody_scored'] = score_ngrams(bigram_measures.raw_freq) for item in articleBody_biGram_finder
df_2['articleBody_scored'] = articleBody_biGram_finder.apply(lambda …
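A minimal sketch of the missing step, assuming df_2['articleBody'] holds lists of tokens (the frame and column names are taken from the question, the toy rows are not): each BigramCollocationFinder exposes score_ngrams, so it can be applied row by row.

import nltk
import pandas as pd
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# toy stand-in for df_2
df_2 = pd.DataFrame({'articleBody': [['the', 'stock', 'market', 'fell', 'the', 'stock', 'rose'],
                                     ['officials', 'denied', 'the', 'report']]})

finders = df_2['articleBody'].apply(BigramCollocationFinder.from_words)

# score each row's bigrams by raw frequency: a list of ((w1, w2), score) pairs per cell
df_2['articleBody_scored'] = finders.apply(lambda finder: finder.score_ngrams(bigram_measures.raw_freq))
print(df_2['articleBody_scored'].iloc[0])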
Category: Data Science

train NER using NLTK with custom corpora (non-English): must I use StanfordNER?

I have searched for how to customise NER corpora for training a model with the NLTK library in Python, but all of the answers point to NLTK book chapter 7, and honestly that leaves me confused about how to train on a corpus with the correct flow when the data set is structured like this:

Eddy          N     B-PER
Bonte         N     I-PER
is            V     O
woordvoerder  N     O
van           Prep  O
diezelfde     Pron  O
Hogeschool    N     B-ORG
.             Punc  O

I have some questions: I found so …
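One way to load data in exactly that three-column layout is NLTK's ConllCorpusReader, which yields IOB-tagged sentences that can then feed whatever tagger/chunker gets trained; the file name and directory below are hypothetical.

from nltk.corpus.reader.conll import ConllCorpusReader

# assumes a file ner_train.txt in the current directory with lines like:
#   Eddy N B-PER
#   Bonte N I-PER
#   ...
reader = ConllCorpusReader('.', 'ner_train.txt', columntypes=('words', 'pos', 'chunk'))

for sent in reader.iob_sents():
    print(sent)  # list of (token, POS, IOB-tag) triples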
Category: Data Science

What is "Interpolated Absolute Discounting" smoothing method

I'm asked to implement "Interpolated Absolute Discounting" for a bigram language model over a text. First, I don't know exactly what it is; I guess it is an interpolation between different n-gram orders (unigram, bigram, …) whose parameters need to be learned. Second, which probability distribution implements this technique in the nltk package? Moreover, I must learn the parameters from a corpus. How can I do that?
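As I understand it, absolute discounting subtracts a fixed discount $d$ from every observed bigram count and hands the freed probability mass to a lower-order distribution, roughly $$ P(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \, P(w_i) $$ Recent NLTK versions ship an implementation in nltk.lm; a hedged sketch (availability of the class depends on the NLTK version installed):

from nltk.lm import AbsoluteDiscountingInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['a', 'cat', 'ran']]  # toy tokenized sentences

train_data, vocab = padded_everygram_pipeline(2, corpus)  # order-2 (bigram) pipeline
lm = AbsoluteDiscountingInterpolated(2, discount=0.75)
lm.fit(train_data, vocab)

print(lm.score('cat', ['the']))  # P(cat | the) under interpolated absolute discounting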
Topic: nltk
Category: Data Science

Weighting of words in lexicon based sentiment analysis

I have a question regarding my current project. I am trying to do a lexicon-based sentiment analysis on my data, where I calculate the sentiment score as follows: $$ Score = \frac{\sum_{i}{word_i}}{\lvert words \rvert} $$ According to the score, the text is then classified as either negative or positive. But I have also calculated, for every word in the article, its salience and frequency, and I would like to know whether it is possible to use them in my …
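One possible extension, purely as a sketch and not an established formula, is to weight each word's lexicon polarity by its salience (or frequency) and normalise by the total weight, i.e. $$ Score = \frac{\sum_{i} s_i \cdot polarity_i}{\sum_{i} s_i} $$ where $s_i$ is the salience of word $i$. All the numbers below are hypothetical.

# toy lexicon polarities and per-word salience values (both hypothetical)
polarity = {'good': 1.0, 'bad': -1.0, 'terrible': -1.0, 'service': 0.0}
salience = {'good': 0.2, 'bad': 0.9, 'terrible': 0.95, 'service': 0.1}

words = ['good', 'service', 'terrible']

total_weight = sum(salience.get(w, 0.0) for w in words)
weighted_sum = sum(salience.get(w, 0.0) * polarity.get(w, 0.0) for w in words)
score = weighted_sum / total_weight if total_weight else 0.0
print(score)  # negative here because the highly salient word 'terrible' dominates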
Category: Data Science
