In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures', the authors mention: There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not describe how to calculate or derive these measures. …
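Since the review does not spell out formulas, here is a minimal sketch of two common instantiations of these measures (illustrative choices, not necessarily the ones the review's authors intend): pointwise mutual information for lexical cohesion/phraseness, and a domain-vs-background relative frequency ratio for termhood.

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts, n_unigrams, n_bigrams):
    """Phraseness: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    w1, w2 = bigram
    p_joint = bigram_counts[bigram] / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    return math.log2(p_joint / (p_w1 * p_w2))

def termhood(term, domain_counts, background_counts):
    """Termhood: how over-represented a term is in the domain vs. a background corpus."""
    p_domain = domain_counts[term] / sum(domain_counts.values())
    p_background = background_counts.get(term, 1) / sum(background_counts.values())
    return p_domain / p_background

# Tiny toy corpus just to show the mechanics.
tokens = "the index at the back of the book lists terms from the back of the book".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(pmi(("of", "the"), unigrams, bigrams, len(tokens), len(tokens) - 1))
```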
I have a use case where I have text data entered by an approver while approving a loan. I have to make some inferences as to what the reasons for approval could be, using NLP. How should I go about it? The text is in a non-English language. Can clustering of the text help? Is it possible to cluster non-English text using Python libraries?
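Yes, clustering works on non-English text as long as the representation is language-aware. A minimal sketch, assuming a multilingual sentence-embedding model is acceptable (the model name below is just one publicly available option) and that the approver notes are in a Python list:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder approver notes; any language the model supports works here.
comments = [
    "income verified, low risk profile",
    "strong repayment history",
    "collateral value covers the loan",
    "co-signer with stable salary",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(comments)

# Number of clusters is a guess to be tuned (e.g., via silhouette score).
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
for comment, label in zip(comments, kmeans.labels_):
    print(label, comment)
```

Inspecting the most frequent terms or the most central comments per cluster then gives candidate "reasons for approval".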
For something we are working on, we were looking for a simple way to compare review/feedback data for a question (for which there are multiple responses from multiple people) along the following lines: What are the common things (things defined as phrases/sentences) they are saying (with some way to quantify the commonality, if possible)? The point is to identify what seem to be areas of agreement in their reviews. What are the things that are not common (basically, what are those one-off sentences/phrases …
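One simple way to quantify this, sketched below under the assumption that the responses are split into sentences: embed each sentence, cluster the embeddings, and read cluster size as "commonality" (large clusters = areas of agreement, singleton clusters = one-off statements). The model name is just one public option and the distance threshold is a tunable assumption.

```python
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

sentences = [
    "The onboarding was confusing",
    "Onboarding steps were hard to follow",
    "Support replied quickly",
    "The mobile app crashed once",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences, normalize_embeddings=True)

# Smaller distance_threshold = stricter grouping; tune on real responses.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0).fit(embeddings)

sizes = Counter(clustering.labels_)
for sentence, label in zip(sentences, clustering.labels_):
    tag = "common" if sizes[label] > 1 else "one-off"
    print(tag, f"(cluster size {sizes[label]}):", sentence)
```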
I'm given a large amount of documents upon which I should perform various kinds of analysis. Since the documents are to be used as a foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build a graph would be to use models such as USE to first find text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …
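A minimal sketch of that construction, assuming USE via tensorflow_hub (as suggested above) and an arbitrary similarity threshold that would need tuning on the real corpus:

```python
import networkx as nx
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first document text", "second document text", "an unrelated note"]  # placeholder corpus

# Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(docs).numpy()

sim = cosine_similarity(embeddings)
threshold = 0.6  # assumption: tune per corpus

# One node per document, one weighted edge per pair above the threshold.
g = nx.Graph()
g.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] >= threshold:
            g.add_edge(i, j, weight=float(sim[i, j]))
```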
I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim LDA:

```python
from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop words, lemmatization, etc.
docs = get_docs()
phrases = Phrases(docs)
bigram = Phraser(phrases)
docs = [bigram[d] for d in docs]
```

Phrases has min_count=5 and threshold=10. I don't quite understand how they interact; they seem related. Anyway, I see threshold taking values in different tutorials ranging from 1 to 1000, described as important in determining the number of …
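The two parameters interact through the phrase-scoring function: min_count discards rare word pairs outright and is also subtracted inside the score, while threshold is the cut-off the score must exceed for a pair to be joined into a bigram. A rough sketch of gensim's default ("original") scorer, written out by hand with hypothetical counts to show the interaction:

```python
def default_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Mirrors the formula documented for gensim's default scorer:
    # score = (count(a,b) - min_count) * vocab_size / (count(a) * count(b))
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: pair seen 50 times, each word 200 times, 10k-word vocabulary.
score = default_score(count_a=200, count_b=200, count_ab=50, vocab_size=10_000, min_count=5)
print(score, score > 10)  # 11.25, so this pair passes threshold=10
```

So raising threshold keeps only strongly collocated pairs and shrinks the number of bigrams, while raising min_count both prunes rare pairs and lowers every remaining score.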
I'm using a dataset containing about 1.5M documents. Each document comes with some keywords describing its topics (thus multi-labelled). Each document belongs to some authors (not just one author per document). I want to find out the topics each author is interested in by looking at the documents they write. I'm currently looking at an LDA variation (Labeled LDA, proposed by D. Ramage: https://www.aclweb.org/anthology/D/D09/D09-1026.pdf). I'm using all the documents in my dataset to train a model and using the model to …
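Before committing to Labeled LDA, a simple baseline to sanity-check the idea is sketched below: train a plain LDA model, then average each author's document-topic distributions into an author profile. The variable names and tiny data are placeholders.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder data: tokenized documents and an author -> document-ids mapping.
docs_tokens = [["loan", "risk", "credit"], ["neural", "network", "training"]]
author_to_doc_ids = {"author_a": [0], "author_b": [0, 1]}

dictionary = Dictionary(docs_tokens)
corpus = [dictionary.doc2bow(doc) for doc in docs_tokens]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

# Dense document-topic matrix.
doc_topics = np.zeros((len(corpus), lda.num_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topics[i, topic_id] = prob

# Author profile = mean of the author's document-topic distributions.
author_topics = {a: doc_topics[ids].mean(axis=0) for a, ids in author_to_doc_ids.items()}
```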
I'm working on a large corpus of French daily newspapers from the 19th century that have been digitized, where the data are in the form of raw OCR text files (one text file per day). In terms of size, one year of issues is around 350,000 words long. What I'm trying to achieve is to detect the different articles that form a newspaper issue, knowing that an article can be two or three lines long or very much …
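One possible first pass, sketched below with NLTK's TextTiling (a lexical-cohesion-based segmenter), under the assumptions that the OCR output keeps blank lines between blocks and that the window parameters get tuned per title; OCR noise and multi-column layouts will still need extra handling.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")

# Hypothetical file name: the raw OCR text of one issue.
with open("issue_1870_01_15.txt", encoding="utf-8") as f:
    text = f.read()

# w (pseudo-sentence size) and k (block comparison size) are guesses to tune.
tt = TextTilingTokenizer(w=40, k=10, stopwords=stopwords.words("french"))
segments = tt.tokenize(text)  # list of pseudo-article blocks
print(len(segments))
```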
In my head I have an idea of what this architecture should look like, or at least how it should behave, but I am having trouble implementing it. So let me describe the problem, and if anyone has an idea on how to actually implement it, let me know. Or tell me if I am over-thinking a solution. I am trying to classify accounts into one of two groups, good and bad. I have multiple text documents per account. What I want to do …
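One simple way to get from many documents per account to a single prediction per account, sketched with placeholder data: vectorize each document, average the document vectors within an account, and train one classifier on the account-level vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: documents, the account each one belongs to, and per-account labels.
docs = ["late payments and overdrafts", "steady salary deposits", "new savings account opened"]
doc_account = [0, 1, 1]      # account index for each document
account_labels = [1, 0]      # 1 = bad, 0 = good (one label per account)

X_docs = TfidfVectorizer().fit_transform(docs).toarray()

# Average the document vectors for each account.
X_accounts = np.vstack([
    X_docs[np.array(doc_account) == a].mean(axis=0)
    for a in sorted(set(doc_account))
])

clf = LogisticRegression().fit(X_accounts, account_labels)
```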
I have bag-of-words data for a set of documents. The data has 3 columns: {document number, word number, count of the word in the document}. I am supposed to generate frequent item-sets of a particular size. I thought that I would make a list of all the words that appear in each document, create a table of these lists, and then generate frequent item-sets using Mlxtend or Orange. However, this approach does not seem to be efficient.
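A minimal sketch of the mlxtend route, assuming the three columns are in a DataFrame named df; for "a particular size" the max_len argument plus a length filter does the job. Whether this is efficient enough depends on the vocabulary size and on min_support.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Placeholder rows in the {doc_id, word_id, count} layout described above.
df = pd.DataFrame({"doc_id": [1, 1, 2, 2, 2],
                   "word_id": [10, 42, 10, 42, 99],
                   "count":  [3, 1, 2, 5, 1]})

# One transaction per document: the set of words it contains.
transactions = df.groupby("doc_id")["word_id"].apply(list).tolist()

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

itemsets = apriori(onehot, min_support=0.5, use_colnames=True, max_len=2)
itemsets = itemsets[itemsets["itemsets"].apply(len) == 2]  # keep only the desired size
```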
I have a set of bibliometric data (references). I want to extract the author names, the title, and the name of the conference/journal from it. Since the referencing style used by different papers varies, I am interested in knowing whether there are any pre-existing tools to do this. I am happy to provide examples if needed :)
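One commonly used tool for exactly this is GROBID, which exposes a citation-parsing REST service. A rough sketch of calling a locally running server is below; the port and form field name are assumptions based on GROBID's service documentation, so double-check them against your version.

```python
import requests

reference = "J. Smith and A. Doe, 'A study of things', Proc. of EMNLP, 2019."

resp = requests.post(
    "http://localhost:8070/api/processCitation",   # assumed default GROBID endpoint/port
    data={"citations": reference},                 # assumed form field name
)
print(resp.text)  # TEI XML with author, title, and venue elements to parse out
```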
Hi! I want to train a model that predicts the sentiment of news headlines. I've got multiple unordered news headlines per day, but one sentiment score per day. What is a convenient solution to overcome the not-1:1 issue? I could: concatenate all headlines into one string, but that feels a bit wrong, as an LSTM or CNN will use cross-sentence word relations that don't exist; or predict one score per headline (1:1) and take the average in the application. But that might …
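A third option that avoids both problems is to aggregate before predicting: represent each headline, average the headline representations within a day, and regress the day-level vector against the daily score. A minimal sketch with placeholder data (TF-IDF stands in for whatever encoder you end up using):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

headlines = ["markets rally on earnings", "storm disrupts supply chains", "new product launch praised"]
headline_day = [0, 0, 1]   # which day each headline belongs to
day_scores = [0.4, 0.7]    # one sentiment score per day

X = TfidfVectorizer().fit_transform(headlines).toarray()

# Average the headline vectors within each day.
X_day = np.vstack([
    X[np.array(headline_day) == d].mean(axis=0)
    for d in sorted(set(headline_day))
])

model = Ridge().fit(X_day, day_scores)
```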
I am experimenting with the Orange data mining tool. When I use the 'Corpus' widget from the text mining section it gives me the error: [![Corpus widget error][1]][1] I have tried many things, but am still unable to resolve this issue. [1]: https://i.stack.imgur.com/X4tKu.png Besides that, in text mining it does not show the Import Documents option in Orange v3.25. I just want to know the options for importing text files into Orange.
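As a workaround while the widget error is being sorted out, the Orange3-text add-on can also build a corpus from a plain script or a Python Script widget; the class path below matches recent add-on versions but may differ in older ones.

```python
# Assumes the Orange3-text add-on is installed alongside Orange.
from orangecontrib.text.corpus import Corpus

corpus = Corpus.from_file("my_texts.tab")   # hypothetical file exported from your data
print(len(corpus))
```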
I am quite familiar with Association Rule mining, but I need to use it to associate ACROSS two market baskets instead of finding support WITHIN a market basket. Imagine customers come to Store A and buy a certain number of products. The same customers go to Store B and buy another set of products. I want to associate between the two stores and not within a store. So I want to make "A --> B" statements like "Customers that …
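One way to get such cross-store rules with standard tooling, sketched with placeholder baskets: prefix every product with its store, merge each customer's two baskets into one transaction, mine rules as usual with mlxtend, and then keep only the rules whose antecedents are Store A items and whose consequents are Store B items.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Placeholder baskets, keyed by customer id.
store_a = {"c1": ["milk", "bread"], "c2": ["milk"]}
store_b = {"c1": ["soap"], "c2": ["soap", "shampoo"]}

# One merged transaction per customer, with store-prefixed items.
transactions = [[f"A_{p}" for p in store_a[c]] + [f"B_{p}" for p in store_b[c]]
                for c in store_a]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)

# Keep only "Store A items --> Store B items" rules.
cross = rules[
    rules["antecedents"].apply(lambda s: all(i.startswith("A_") for i in s)) &
    rules["consequents"].apply(lambda s: all(i.startswith("B_") for i in s))
]
```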
I have twitter text as an Excel file: every row is one tweet. How do I view this corpus in Orange3? I don't understand why I can't simply see this corpus. As you can see in the image below, the channel is red and there's nothing in the Corpus Viewer, while the Data Table shows some data.
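If the Corpus widget refuses the .xlsx directly, one workaround sketch (file names are placeholders) is to convert the spreadsheet to a tab-separated file with pandas and load that instead, then make sure the column holding the tweets is selected as the used text feature in the widget.

```python
import pandas as pd

tweets = pd.read_excel("tweets.xlsx")              # hypothetical input file, one tweet per row
tweets.to_csv("tweets.tab", sep="\t", index=False) # plain tab-separated file the Corpus widget can open
```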
For the purposes of quite a big project I am doing text mining on some documents. My steps are quite common: lowercasing everything, tokenization, stop-list/stop-word removal, lemmatization, stemming, and some other steps like removing symbols. Then I prepare a bag of words, build a document-term frequency matrix, and classify into 3 classes with SVM and Naive Bayes. But the accuracy I get is not very high (50-60%). I think that may be because in the array of words after all the steps …
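Before debugging the word array, it can help to compare against a compact scikit-learn baseline; if TF-IDF with uni- and bi-grams plus a linear SVM also sits at 50-60%, the problem is more likely the data or labels than the preprocessing. A minimal sketch with placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["first placeholder document", "second placeholder document", "third placeholder document"]
labels = [0, 1, 2]

baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=1),
    LinearSVC(),
)
baseline.fit(texts, labels)  # with real data, evaluate via cross_val_score instead
```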
I am working on a data entry task with approximately 6000 entries to go over. The source comes in the form of a string and can look something like this:

Air Canada B737 FFS

From this I can extract the following information:

Company: Air Canada
Model: B737
Technology: FFS

For my initial plan of attack, I iterated over the source strings using regular expressions to extract as many keywords as possible; the problem is there are so many different Companies, …
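Since the companies and technologies come from finite (even if large) vocabularies, a lookup-first approach often beats pure regex. A sketch with tiny placeholder lists: match the company by the longest known prefix, match the technology against a controlled list, and treat the remainder as the model.

```python
import re

# Placeholder vocabularies; in practice these would be your full reference lists.
COMPANIES = ["Air Canada", "Air France", "Lufthansa"]
TECHNOLOGIES = ["FFS", "FTD"]

def parse_entry(source: str) -> dict:
    result = {"Company": None, "Model": None, "Technology": None}
    rest = source.strip()

    # Longest known company name that prefixes the string wins.
    for company in sorted(COMPANIES, key=len, reverse=True):
        if rest.startswith(company):
            result["Company"] = company
            rest = rest[len(company):].strip()
            break

    # Technology from the controlled vocabulary, anywhere in the remainder.
    tech_match = re.search(r"\b(" + "|".join(TECHNOLOGIES) + r")\b", rest)
    if tech_match:
        result["Technology"] = tech_match.group(1)
        rest = (rest[:tech_match.start()] + rest[tech_match.end():]).strip()

    # Whatever is left is taken as the model.
    result["Model"] = rest or None
    return result

print(parse_entry("Air Canada B737 FFS"))
```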
My corpus contains several posts having text about several companies, i.e. each post contains information about several companies. I want to cluster the information based on a few company names that I can specify. Clustering should be based on some similarity measure such as Euclidean distance or cosine similarity. Which algorithm should I use, given the company names I can specify, and which similarity measure should I use?
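One pragmatic sketch: first pull out the sentences that mention each specified company, then group those snippets with TF-IDF vectors and k-means (TF-IDF rows are L2-normalised, so Euclidean k-means behaves much like cosine grouping). The posts and company names below are placeholders.

```python
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["Acme posted record revenue. Globex cut staff.",
         "Globex revenue fell. Acme hired engineers."]
companies = ["Acme", "Globex"]

# Collect the sentences that mention each specified company.
snippets, snippet_company = [], []
for post in posts:
    for sentence in re.split(r"(?<=[.!?])\s+", post):
        for company in companies:
            if company.lower() in sentence.lower():
                snippets.append(sentence)
                snippet_company.append(company)

X = TfidfVectorizer().fit_transform(snippets)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

for company, snippet, label in zip(snippet_company, snippets, labels):
    print(company, label, snippet)
```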
I have a data frame with an Audio Transcript column from customer care phone conversations. I have created one list with words and sentences: words = ["rain", "buy new house", "tornado"]. What I need to do is create a column in the data frame which checks for these words in the text column row by row, and if a word is present, update the column with the word and its frequency. For example, for the first row text "I was going to buy new house …
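A minimal pandas sketch, assuming the transcripts live in a column named "transcript": count each listed word/phrase per row and write the non-zero matches into a new column as "term:count" pairs.

```python
import pandas as pd

words = ["rain", "buy new house", "tornado"]

# Placeholder data frame; in practice this is your transcript column.
df = pd.DataFrame({"transcript": [
    "I was going to buy new house but the rain came, so buy new house later",
    "no storm damage reported",
]})

def count_terms(text: str) -> str:
    text = text.lower()
    counts = {w: text.count(w) for w in words if text.count(w) > 0}
    return ", ".join(f"{w}:{c}" for w, c in counts.items())

df["matches"] = df["transcript"].apply(count_terms)
print(df)
```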
Is there an algorithm or NN to match two documents? One is a claim description (e.g. a CV or product offer) and the other is a requirements description (e.g. a vacancy description or RFP). They are not similar, so basically it's not document similarity per se. Which embedding is better to use on the document corpus (Doc2Vec, Word2Vec, or just TF-IDF, etc.), and what kind of further NN architecture would work to basically produce a matching score vector/matrix as output on how …
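Before building a custom NN, a useful baseline is an embedding-based matching matrix: embed each requirement line and each claim sentence, take pairwise cosine similarities, and aggregate (e.g., best-matching claim per requirement). The sketch below uses one public sentence-embedding model as a stand-in; Doc2Vec or TF-IDF vectors could be dropped in the same way.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder requirement lines and claim sentences.
requirements = ["5+ years of Python", "experience with NLP pipelines"]
claims = ["Built NLP pipelines in spaCy", "Seven years of Python development", "Team lead for 3 people"]

model = SentenceTransformer("all-MiniLM-L6-v2")
sim = cosine_similarity(model.encode(requirements), model.encode(claims))  # rows: requirements, cols: claims

coverage = sim.max(axis=1)           # best-matching claim for each requirement
overall_score = float(coverage.mean())
print(sim, overall_score)
```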
Given a vocabulary and strings such as

```python
v = {'sales', 'units', 'parts', 'operators', 'revenue'}
s1 = 'total of 1138 units, repaired 7710 parts, sales increased 588 (+34), decreasing of operator 413 (-14)'
s2 = 'part 7710 (repaired), units are 1138, revenue 1212, operators variation is -14, salles increment +34 (588 total)'
```

I have to associate each key of v with the corresponding number from s1 and s2 (for sales and operators I need the variations (numbers with a sign in …
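A rough heuristic sketch (a starting point, not a full solution): fuzzy-match each vocabulary key against the word tokens of a string, so variants like "operator" or "salles" still anchor correctly, then take the number nearest to that anchor, restricted to signed numbers for the keys that need a variation.

```python
import re
from difflib import SequenceMatcher

def extract(vocab, text, signed_keys=("sales", "operators")):
    tokens = list(re.finditer(r"[+-]?\d+|\w+", text))
    words = [m for m in tokens if m.group().isalpha()]
    out = {}
    for key in vocab:
        # Anchor the key on the most similar word token (handles plural/typo variants).
        ratio, anchor = max(
            ((SequenceMatcher(None, key, m.group().lower()).ratio(), m) for m in words),
            key=lambda pair: pair[0],
        )
        if ratio < 0.6:
            continue  # key not mentioned closely enough in this string
        # Pick the nearest number; require a sign for keys that denote variations.
        pattern = r"[+-]\d+" if key in signed_keys else r"\d+"
        numbers = [m for m in tokens if re.fullmatch(pattern, m.group())]
        if numbers:
            out[key] = min(numbers, key=lambda m: abs(m.start() - anchor.start())).group()
    return out

s1 = 'total of 1138 units, repaired 7710 parts, sales increased 588 (+34), decreasing of operator 413 (-14)'
print(extract({'sales', 'units', 'parts', 'operators', 'revenue'}, s1))
```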