In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures', the authors mention: There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not describe how to calculate or derive these measures. …
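Since the review does not spell out formulas, here is a minimal sketch of two common instantiations of these measures (illustrative choices, not necessarily the ones the review's authors intend): pointwise mutual information for lexical cohesion/phraseness, and a domain-vs-background relative frequency ratio for termhood.

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts, n_unigrams, n_bigrams):
    """Phraseness: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    w1, w2 = bigram
    p_joint = bigram_counts[bigram] / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    return math.log2(p_joint / (p_w1 * p_w2))

def termhood(term, domain_counts, background_counts):
    """Termhood: how over-represented a term is in the domain vs. a background corpus."""
    p_domain = domain_counts[term] / sum(domain_counts.values())
    p_background = background_counts.get(term, 1) / sum(background_counts.values())
    return p_domain / p_background

# Tiny toy corpus just to show the mechanics.
tokens = "the index at the back of the book lists terms from the back of the book".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(pmi(("of", "the"), unigrams, bigrams, len(tokens), len(tokens) - 1))
```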
I have a use case where I have text data entered by an approver while approving a loan. I have to make some inferences as to what the reasons for approval could be, using NLP. How should I go about it? The text is in a non-English language. Can clustering of the text help? Is it possible to cluster non-English text using Python libraries?
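Yes, clustering works on non-English text as long as the representation is language-aware. A minimal sketch, assuming a multilingual sentence-embedding model is acceptable (the model name below is just one publicly available option) and that the approver notes are in a Python list:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder approver notes; any language the model supports works here.
comments = [
    "income verified, low risk profile",
    "strong repayment history",
    "collateral value covers the loan",
    "co-signer with stable salary",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(comments)

# Number of clusters is a guess to be tuned (e.g., via silhouette score).
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
for comment, label in zip(comments, kmeans.labels_):
    print(label, comment)
```

Inspecting the most frequent terms or the most central comments per cluster then gives candidate "reasons for approval".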
For something we are working on, we were looking for a simple way to compare review/feedback data for a question (for which there are multiple responses from multiple people) along the following lines: What are the common things (things defined as phrases/sentences) they are saying (with some way to quantify the commonality, if possible)? The point is to identify what seem to be areas of agreement in their reviews. What are the things that are not common (basically, what are those one-off sentences/phrases …
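One simple way to quantify this, sketched below under the assumption that the responses are split into sentences: embed each sentence, cluster the embeddings, and read cluster size as "commonality" (large clusters = areas of agreement, singleton clusters = one-off statements). The model name is just one public option and the distance threshold is a tunable assumption.

```python
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

sentences = [
    "The onboarding was confusing",
    "Onboarding steps were hard to follow",
    "Support replied quickly",
    "The mobile app crashed once",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences, normalize_embeddings=True)

# Smaller distance_threshold = stricter grouping; tune on real responses.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0).fit(embeddings)

sizes = Counter(clustering.labels_)
for sentence, label in zip(sentences, clustering.labels_):
    tag = "common" if sizes[label] > 1 else "one-off"
    print(tag, f"(cluster size {sizes[label]}):", sentence)
```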
I'm given a large amount of documents upon which I should perform various kinds of analysis. Since the documents are to be used as a foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build a graph would be to use models such as USE to first find text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …
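A minimal sketch of that construction, assuming USE via tensorflow_hub (as suggested above) and an arbitrary similarity threshold that would need tuning on the real corpus:

```python
import networkx as nx
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first document text", "second document text", "an unrelated note"]  # placeholder corpus

# Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(docs).numpy()

sim = cosine_similarity(embeddings)
threshold = 0.6  # assumption: tune per corpus

# One node per document, one weighted edge per pair above the threshold.
g = nx.Graph()
g.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] >= threshold:
            g.add_edge(i, j, weight=float(sim[i, j]))
```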
I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim LDA:

```python
from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop words, lemmatization, etc.
docs = get_docs()
phrases = Phrases(docs)
bigram = Phraser(phrases)
docs = [bigram[d] for d in docs]
```

Phrases has min_count=5 and threshold=10. I don't quite understand how they interact; they seem related. Anyway, I see threshold taking values in different tutorials ranging from 1 to 1000, described as important in determining the number of …
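The two parameters interact through the phrase-scoring function: min_count discards rare word pairs outright and is also subtracted inside the score, while threshold is the cut-off the score must exceed for a pair to be joined into a bigram. A rough sketch of gensim's default ("original") scorer, written out by hand with hypothetical counts to show the interaction:

```python
def default_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Mirrors the formula documented for gensim's default scorer:
    # score = (count(a,b) - min_count) * vocab_size / (count(a) * count(b))
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: pair seen 50 times, each word 200 times, 10k-word vocabulary.
score = default_score(count_a=200, count_b=200, count_ab=50, vocab_size=10_000, min_count=5)
print(score, score > 10)  # 11.25, so this pair passes threshold=10
```

So raising threshold keeps only strongly collocated pairs and shrinks the number of bigrams, while raising min_count both prunes rare pairs and lowers every remaining score.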
I'm using a dataset containing about 1.5M documents. Each document comes with some keywords describing its topics (thus multi-labelled). Each document belongs to some authors (not just one author per document). I want to find out the topics each author is interested in by looking at the documents they write. I'm currently looking at an LDA variation (Labeled LDA, proposed by D. Ramage: https://www.aclweb.org/anthology/D/D09/D09-1026.pdf). I'm using all the documents in my dataset to train a model and using the model to …
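Before committing to Labeled LDA, a simple baseline to sanity-check the idea is sketched below: train a plain LDA model, then average each author's document-topic distributions into an author profile. The variable names and tiny data are placeholders.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder data: tokenized documents and an author -> document-ids mapping.
docs_tokens = [["loan", "risk", "credit"], ["neural", "network", "training"]]
author_to_doc_ids = {"author_a": [0], "author_b": [0, 1]}

dictionary = Dictionary(docs_tokens)
corpus = [dictionary.doc2bow(doc) for doc in docs_tokens]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

# Dense document-topic matrix.
doc_topics = np.zeros((len(corpus), lda.num_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topics[i, topic_id] = prob

# Author profile = mean of the author's document-topic distributions.
author_topics = {a: doc_topics[ids].mean(axis=0) for a, ids in author_to_doc_ids.items()}
```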
I'm working on a large corpus of French daily newspapers from the 19th century that have been digitized, where the data are in the form of raw OCR text files (one text file per day). In terms of size, one year of issues is around 350,000 words long. What I'm trying to achieve is to detect the different articles that form a newspaper issue, knowing that an article can be two or three lines long or very much …
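One possible first pass, sketched below with NLTK's TextTiling (a lexical-cohesion-based segmenter), under the assumptions that the OCR output keeps blank lines between blocks and that the window parameters get tuned per title; OCR noise and multi-column layouts will still need extra handling.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")

# Hypothetical file name: the raw OCR text of one issue.
with open("issue_1870_01_15.txt", encoding="utf-8") as f:
    text = f.read()

# w (pseudo-sentence size) and k (block comparison size) are guesses to tune.
tt = TextTilingTokenizer(w=40, k=10, stopwords=stopwords.words("french"))
segments = tt.tokenize(text)  # list of pseudo-article blocks
print(len(segments))
```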
In my head I have an idea of what this architecture should look like, or at least how it should behave, but I am having trouble implementing it. So let me describe the problem, and if anyone has an idea on how to actually implement it, let me know. Or tell me if I am over-thinking a solution. I am trying to classify accounts into one of two groups, good and bad. I have multiple text documents per account. What I want to do …
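One simple way to get from many documents per account to a single prediction per account, sketched with placeholder data: vectorize each document, average the document vectors within an account, and train one classifier on the account-level vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: documents, the account each one belongs to, and per-account labels.
docs = ["late payments and overdrafts", "steady salary deposits", "new savings account opened"]
doc_account = [0, 1, 1]      # account index for each document
account_labels = [1, 0]      # 1 = bad, 0 = good (one label per account)

X_docs = TfidfVectorizer().fit_transform(docs).toarray()

# Average the document vectors for each account.
X_accounts = np.vstack([
    X_docs[np.array(doc_account) == a].mean(axis=0)
    for a in sorted(set(doc_account))
])

clf = LogisticRegression().fit(X_accounts, account_labels)
```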
I have bag-of-words data for a set of documents. The data has 3 columns: {document number, word number, count of the word in the document}. I am supposed to generate frequent item-sets of a particular size. I thought that I would make a list of all the words that appear in each document, create a table of these lists, and then generate frequent item-sets using Mlxtend or Orange. However, this approach does not seem to be efficient.
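A minimal sketch of the mlxtend route, assuming the three columns are in a DataFrame named df; for "a particular size" the max_len argument plus a length filter does the job. Whether this is efficient enough depends on the vocabulary size and on min_support.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Placeholder rows in the {doc_id, word_id, count} layout described above.
df = pd.DataFrame({"doc_id": [1, 1, 2, 2, 2],
                   "word_id": [10, 42, 10, 42, 99],
                   "count":  [3, 1, 2, 5, 1]})

# One transaction per document: the set of words it contains.
transactions = df.groupby("doc_id")["word_id"].apply(list).tolist()

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

itemsets = apriori(onehot, min_support=0.5, use_colnames=True, max_len=2)
itemsets = itemsets[itemsets["itemsets"].apply(len) == 2]  # keep only the desired size
```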
I have a set of bibliometric data (references). I want to extract the author names, the title, and the name of the conference/journal from it. Since the referencing style used by different papers varies, I am interested in knowing whether there are any pre-existing tools to do this. I am happy to provide examples if needed :)
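One commonly used tool for exactly this is GROBID, which exposes a citation-parsing REST service. A rough sketch of calling a locally running server is below; the port and form field name are assumptions based on GROBID's service documentation, so double-check them against your version.

```python
import requests

reference = "J. Smith and A. Doe, 'A study of things', Proc. of EMNLP, 2019."

resp = requests.post(
    "http://localhost:8070/api/processCitation",   # assumed default GROBID endpoint/port
    data={"citations": reference},                 # assumed form field name
)
print(resp.text)  # TEI XML with author, title, and venue elements to parse out
```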
Hi! I want to train a model that predicts the sentiment of news headlines. I've got multiple unordered news headlines per day, but one sentiment score per day. What is a convenient solution to overcome the not-1:1 issue? I could: concatenate all headlines into one string, but that feels a bit wrong, as an LSTM or CNN will use cross-sentence word relations that don't exist; or predict one score per headline (1:1) and take the average in the application. But that might …
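A third option that avoids both problems is to aggregate before predicting: represent each headline, average the headline representations within a day, and regress the day-level vector against the daily score. A minimal sketch with placeholder data (TF-IDF stands in for whatever encoder you end up using):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

headlines = ["markets rally on earnings", "storm disrupts supply chains", "new product launch praised"]
headline_day = [0, 0, 1]   # which day each headline belongs to
day_scores = [0.4, 0.7]    # one sentiment score per day

X = TfidfVectorizer().fit_transform(headlines).toarray()

# Average the headline vectors within each day.
X_day = np.vstack([
    X[np.array(headline_day) == d].mean(axis=0)
    for d in sorted(set(headline_day))
])

model = Ridge().fit(X_day, day_scores)
```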
I am experimenting with the Orange data mining tool. When I use the 'Corpus' widget from the text mining section it gives me the error: [![Corpus widget error][1]][1] I have tried many things, but am still unable to resolve this issue. [1]: https://i.stack.imgur.com/X4tKu.png Besides that, in text mining it does not show the Import Documents option in Orange v3.25. I just want to know the options for importing text files into Orange.
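As a workaround while the widget error is being sorted out, the Orange3-text add-on can also build a corpus from a plain script or a Python Script widget; the class path below matches recent add-on versions but may differ in older ones.

```python
# Assumes the Orange3-text add-on is installed alongside Orange.
from orangecontrib.text.corpus import Corpus

corpus = Corpus.from_file("my_texts.tab")   # hypothetical file exported from your data
print(len(corpus))
```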
I am quite familiar with Association Rule mining, but I need to use it to associate ACROSS two market baskets instead of finding support WITHIN a market basket. Imagine customers come to Store A and buy a certain number of products. The same customers go to Store B and buy another set of products. I want to associate between the two stores and not within a store. So I want to make "A --> B" statements like "Customers that …
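One way to get such cross-store rules with standard tooling, sketched with placeholder baskets: prefix every product with its store, merge each customer's two baskets into one transaction, mine rules as usual with mlxtend, and then keep only the rules whose antecedents are Store A items and whose consequents are Store B items.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Placeholder baskets, keyed by customer id.
store_a = {"c1": ["milk", "bread"], "c2": ["milk"]}
store_b = {"c1": ["soap"], "c2": ["soap", "shampoo"]}

# One merged transaction per customer, with store-prefixed items.
transactions = [[f"A_{p}" for p in store_a[c]] + [f"B_{p}" for p in store_b[c]]
                for c in store_a]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)

# Keep only "Store A items --> Store B items" rules.
cross = rules[
    rules["antecedents"].apply(lambda s: all(i.startswith("A_") for i in s)) &
    rules["consequents"].apply(lambda s: all(i.startswith("B_") for i in s))
]
```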
I have twitter text as an Excel file: every row is one tweet. How do I view this corpus in Orange3? I don't understand why I can't simply see this corpus. As you can see in the image below, the channel is red and there's nothing in the Corpus Viewer, while the Data Table shows some data.
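If the Corpus widget refuses the .xlsx directly, one workaround sketch (file names are placeholders) is to convert the spreadsheet to a tab-separated file with pandas and load that instead, then make sure the column holding the tweets is selected as the used text feature in the widget.

```python
import pandas as pd

tweets = pd.read_excel("tweets.xlsx")              # hypothetical input file, one tweet per row
tweets.to_csv("tweets.tab", sep="\t", index=False) # plain tab-separated file the Corpus widget can open
```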
For the purposes of quite a big project I am doing text mining on some documents. My steps are quite common: lowercasing everything, tokenization, stop-list/stop-word removal, lemmatization, stemming, and some other steps like removing symbols. Then I prepare a bag of words, build a document-term frequency matrix, and classify into 3 classes with SVM and Naive Bayes. But the accuracy I get is not very high (50-60%). I think that may be because in the array of words after all the steps …
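Before debugging the word array, it can help to compare against a compact scikit-learn baseline; if TF-IDF with uni- and bi-grams plus a linear SVM also sits at 50-60%, the problem is more likely the data or labels than the preprocessing. A minimal sketch with placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["first placeholder document", "second placeholder document", "third placeholder document"]
labels = [0, 1, 2]

baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=1),
    LinearSVC(),
)
baseline.fit(texts, labels)  # with real data, evaluate via cross_val_score instead
```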
I am working on a data entry task with approximately 6000 entries to go over. The source comes in the form of a string and can look something like this:

Air Canada B737 FFS

From this I can extract the following information:

Company: Air Canada
Model: B737
Technology: FFS

For my initial plan of attack, I iterated over the source strings using regular expressions to extract as many keywords as possible; the problem is there are so many different Companies, …
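Since the companies and technologies come from finite (even if large) vocabularies, a lookup-first approach often beats pure regex. A sketch with tiny placeholder lists: match the company by the longest known prefix, match the technology against a controlled list, and treat the remainder as the model.

```python
import re

# Placeholder vocabularies; in practice these would be your full reference lists.
COMPANIES = ["Air Canada", "Air France", "Lufthansa"]
TECHNOLOGIES = ["FFS", "FTD"]

def parse_entry(source: str) -> dict:
    result = {"Company": None, "Model": None, "Technology": None}
    rest = source.strip()

    # Longest known company name that prefixes the string wins.
    for company in sorted(COMPANIES, key=len, reverse=True):
        if rest.startswith(company):
            result["Company"] = company
            rest = rest[len(company):].strip()
            break

    # Technology from the controlled vocabulary, anywhere in the remainder.
    tech_match = re.search(r"\b(" + "|".join(TECHNOLOGIES) + r")\b", rest)
    if tech_match:
        result["Technology"] = tech_match.group(1)
        rest = (rest[:tech_match.start()] + rest[tech_match.end():]).strip()

    # Whatever is left is taken as the model.
    result["Model"] = rest or None
    return result

print(parse_entry("Air Canada B737 FFS"))
```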
My corpus contains several posts having text about several companies, i.e. each post contains information about several companies. I want to cluster the information based on a few company names that I can specify. Clustering should be based on some similarity measure such as Euclidean distance or cosine similarity. Which algorithm should I use, given the company names I can specify, and which similarity measure should I use?
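One pragmatic sketch: first pull out the sentences that mention each specified company, then group those snippets with TF-IDF vectors and k-means (TF-IDF rows are L2-normalised, so Euclidean k-means behaves much like cosine grouping). The posts and company names below are placeholders.

```python
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["Acme posted record revenue. Globex cut staff.",
         "Globex revenue fell. Acme hired engineers."]
companies = ["Acme", "Globex"]

# Collect the sentences that mention each specified company.
snippets, snippet_company = [], []
for post in posts:
    for sentence in re.split(r"(?<=[.!?])\s+", post):
        for company in companies:
            if company.lower() in sentence.lower():
                snippets.append(sentence)
                snippet_company.append(company)

X = TfidfVectorizer().fit_transform(snippets)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

for company, snippet, label in zip(snippet_company, snippets, labels):
    print(company, label, snippet)
```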
I have a data frame with an Audio Transcript column from customer care phone conversations. I have created one list with words and sentences: words = ["rain", "buy new house", "tornado"]. What I need to do is create a column in the data frame which checks for these words in the text column row by row, and if a word is present, update the column with the word and its frequency. For example, for the first row text "I was going to buy new house …
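A minimal pandas sketch, assuming the transcripts live in a column named "transcript": count each listed word/phrase per row and write the non-zero matches into a new column as "term:count" pairs.

```python
import pandas as pd

words = ["rain", "buy new house", "tornado"]

# Placeholder data frame; in practice this is your transcript column.
df = pd.DataFrame({"transcript": [
    "I was going to buy new house but the rain came, so buy new house later",
    "no storm damage reported",
]})

def count_terms(text: str) -> str:
    text = text.lower()
    counts = {w: text.count(w) for w in words if text.count(w) > 0}
    return ", ".join(f"{w}:{c}" for w, c in counts.items())

df["matches"] = df["transcript"].apply(count_terms)
print(df)
```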
Is there an algorithm or NN to match two documents? One is a claim description (e.g. a CV or product offer) and the other is a requirements description (e.g. a vacancy description or RFP). They are not similar, so basically it's not document similarity per se. Which embedding is better to use on the document corpus (Doc2Vec, Word2Vec, or just TF-IDF, etc.), and what kind of further NN architecture would work to basically produce a matching score vector/matrix as output on how …
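Before building a custom NN, a useful baseline is an embedding-based matching matrix: embed each requirement line and each claim sentence, take pairwise cosine similarities, and aggregate (e.g., best-matching claim per requirement). The sketch below uses one public sentence-embedding model as a stand-in; Doc2Vec or TF-IDF vectors could be dropped in the same way.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder requirement lines and claim sentences.
requirements = ["5+ years of Python", "experience with NLP pipelines"]
claims = ["Built NLP pipelines in spaCy", "Seven years of Python development", "Team lead for 3 people"]

model = SentenceTransformer("all-MiniLM-L6-v2")
sim = cosine_similarity(model.encode(requirements), model.encode(claims))  # rows: requirements, cols: claims

coverage = sim.max(axis=1)           # best-matching claim for each requirement
overall_score = float(coverage.mean())
print(sim, overall_score)
```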
Given a vocabulary and strings such as

```python
v = {'sales', 'units', 'parts', 'operators', 'revenue'}
s1 = 'total of 1138 units, repaired 7710 parts, sales increased 588 (+34), decreasing of operator 413 (-14)'
s2 = 'part 7710 (repaired), units are 1138, revenue 1212, operators variation is -14, salles increment +34 (588 total)'
```

I have to associate each key of v with the corresponding number from s1 and s2 (for sales and operators I need the variations (numbers with a sign in …
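A rough heuristic sketch (a starting point, not a full solution): fuzzy-match each vocabulary key against the word tokens of a string, so variants like "operator" or "salles" still anchor correctly, then take the number nearest to that anchor, restricted to signed numbers for the keys that need a variation.

```python
import re
from difflib import SequenceMatcher

def extract(vocab, text, signed_keys=("sales", "operators")):
    tokens = list(re.finditer(r"[+-]?\d+|\w+", text))
    words = [m for m in tokens if m.group().isalpha()]
    out = {}
    for key in vocab:
        # Anchor the key on the most similar word token (handles plural/typo variants).
        ratio, anchor = max(
            ((SequenceMatcher(None, key, m.group().lower()).ratio(), m) for m in words),
            key=lambda pair: pair[0],
        )
        if ratio < 0.6:
            continue  # key not mentioned closely enough in this string
        # Pick the nearest number; require a sign for keys that denote variations.
        pattern = r"[+-]\d+" if key in signed_keys else r"\d+"
        numbers = [m for m in tokens if re.fullmatch(pattern, m.group())]
        if numbers:
            out[key] = min(numbers, key=lambda m: abs(m.start() - anchor.start())).group()
    return out

s1 = 'total of 1138 units, repaired 7710 parts, sales increased 588 (+34), decreasing of operator 413 (-14)'
print(extract({'sales', 'units', 'parts', 'operators', 'revenue'}, s1))
```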