Error using the Corpus widget in Orange v3.25.0 text mining, and no Import Documents option?

I am experimenting with the Orange data mining tool. When I use the 'Corpus' widget from the text mining section, it gives me the error shown here: [![Corpus widget error][1]][1] I have tried many things but am still unable to resolve this issue. Besides that, the text mining section does not show an Import Documents option in Orange v3.25. I just want to know the options for importing text files into Orange.

[1]: https://i.stack.imgur.com/X4tKu.png
Category: Data Science
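If the Import Documents widget is unavailable in this version, one possible workaround is to gather the loose .txt files into a single tabular file, since the Corpus widget can generally open CSV-style tables with a text column. A minimal standard-library Python sketch; the `documents/` folder and column names are assumptions:

```python
import csv
from pathlib import Path

# Hedged workaround sketch: collect every .txt file in a folder into one
# CSV with "name" and "text" columns, which the Corpus widget can open.
# Folder layout and column names here are illustrative assumptions.
def build_corpus_csv(folder, out_path):
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "text"])
        for path in sorted(Path(folder).glob("*.txt")):
            writer.writerow([path.stem, path.read_text(encoding="utf-8")])

# build_corpus_csv("documents", "corpus.csv")  # then open corpus.csv in Orange
```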

Cloud-based visual tool to perform NLP on text corpora

I have some text corpora to share with non-programming clients (~50K documents, ~100M tokens) who would like to perform operations like regex searches, collocations, named-entity recognition, and word clustering. AntConc is a nice tool and can do some of these things, but it comes with severe size limitations and crashes on these corpora even on powerful machines. What cloud-based tools with a web interface would you recommend for this kind of task? Is there an open-source tool or a cloud service …
Topic: corpus nlp tools
Category: Data Science
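For the regex-search and collocation operations mentioned above, a minimal standard-library Python sketch may clarify what any candidate tool needs to support; the two-document corpus below is invented:

```python
import re
from collections import Counter

# Minimal sketch of two of the requested corpus operations, using only
# the Python standard library; the sample corpus is made up.
corpus = [
    "the model reads the corpus and counts tokens",
    "the corpus is large and the model is small",
]

# 1) Regex search: find every token starting with "corp" in each document.
pattern = re.compile(r"\bcorp\w*")
hits = [m.group() for doc in corpus for m in pattern.finditer(doc)]

# 2) Collocations: count word pairs that co-occur within a 2-token window.
window = 2
pairs = Counter()
for doc in corpus:
    tokens = doc.split()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1

print(hits)                      # every "corp..." match
print(pairs[("the", "corpus")])  # co-occurrence count for one pair
```

At 100M tokens this naive approach would need an indexed backend, which is exactly why a purpose-built tool is worth finding.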

Email corpus with image signatures

I am building a system to classify inline image attachments as part of the message body or signature. To train the system, I am looking for an email corpus with the raw MIME email messages from people at different companies (i.e., not all Enron employees), hopefully with a lot of image signatures. Do you know of any email corpus like this?
Category: Data Science

Can I average the BERT embeddings of multiple instances of the same word to get one vector representation of the word?

In the project I'm working on right now I would like to get one embedding for every unique lemma in a corpus. Could I get this by averaging the embeddings of every instance of a lemma? For example, say that there were 500 tokens of the lemma "walk" - regardless of conjugation - could I then add/average/concatenate these 500 embeddings together to get one embedding accurately representing all of them? If this would work, which operation should I use on …
Category: Data Science
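A minimal sketch of the averaging idea, using made-up 3-dimensional vectors in place of real 768-dimensional BERT outputs:

```python
# Hedged sketch: average the contextual embeddings of all instances of a
# lemma into one vector. The tiny 3-d vectors below stand in for real
# BERT outputs; values and lemma are illustrative.
instance_embeddings = [
    [0.2, 0.4, 0.6],   # "walk" in sentence 1
    [0.4, 0.6, 0.8],   # "walked" in sentence 2
    [0.6, 0.8, 1.0],   # "walking" in sentence 3
]

n = len(instance_embeddings)
# Element-wise mean across instances gives one vector per lemma.
lemma_vector = [sum(dim) / n for dim in zip(*instance_embeddings)]
print(lemma_vector)
```

Averaging (unlike concatenation) keeps the dimensionality fixed regardless of how many instances a lemma has, which is why it is the usual choice for this kind of pooling.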

How to program conditional statements for this problem in R

Situation: I'm trying to program the following in R. Task: I am trying to select words that appear as nouns in my dataset more often than they do as adjectives, verbs, adverbs, etc. I have all these counts, and below is an example of one instance of what I am trying to do. Imagine the information below is in a dataframe. I do not want to select this lemma (ability), because it appears most often as a VERB; …
Topic: corpus r
Category: Data Science
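The selection logic can be sketched as follows (in Python rather than R, but the condition translates directly; the POS counts are hypothetical):

```python
# Hedged sketch: keep a lemma only if its NOUN count exceeds its count
# under every other part of speech. The counts below are invented.
pos_counts = {
    "ability":  {"NOUN": 10, "VERB": 25, "ADJ": 2},
    "research": {"NOUN": 40, "VERB": 12, "ADJ": 1},
}

noun_dominant = [
    lemma
    for lemma, counts in pos_counts.items()
    if counts["NOUN"] > max(v for pos, v in counts.items() if pos != "NOUN")
]
print(noun_dominant)  # "ability" is excluded because VERB dominates
```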

Building a Model for Time Series Data in R (no forecasting)

Problem: I had planned to use a linear regression model to model time series data in retrospect (i.e., no forecasting). However, I am wondering if this is the best option, having come across a few posts (e.g., https://www.quora.com/Is-regression-analysis-legitimate-for-time-series-data) suggesting that regression analysis might not be legitimate for time series data. Preliminary plotting also shows a concave shape in the data, but this would still be a regression model, I think. Question: Would anyone have any good sources to link to …
Category: Data Science

Jargon extraction in a text

I have a big text corpus (documentation from a company) and I want to extract the terms that are specific to that area/business. I can do that using TF or TF-IDF and use word frequency as a guide, which isn't always reliable. I also want to do this for single, shorter sentences, but I think that is already more difficult. I was also thinking of using Wikipedia articles to train a model and then applying it to my …
Topic: corpus nlp python
Category: Data Science
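A minimal TF-IDF sketch of the frequency-based approach described above; the two toy documents are invented:

```python
import math
from collections import Counter

# Hedged sketch: score terms by TF-IDF so that words concentrated in the
# technical document outrank words common to all documents. Both
# "documents" here are made up; real corpora would be far larger.
docs = [
    "the flange coupling torque spec is in the flange manual",
    "the meeting notes cover the schedule and the budget",
]
tokenized = [d.split() for d in docs]
df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
n_docs = len(docs)

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

scores = tfidf(tokenized[0])
# Jargon unique to the technical document scores highest; words like
# "the" score zero because they appear in every document.
top = max(scores, key=scores.get)
print(top, round(scores[top], 3))
```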

How to find the probability of a word to belong to a dataset of text

I have two text datasets, one from people who have a certain medical condition and another from random patients. I want to figure out which words are more likely to show up in the dataset for that medical condition. My original thought was to use a chi-squared test, but it seems like I can't run the test for each word, since the tokens are "categories" and a single word is a "value" of the "categorical" variable. For example, if …
Category: Data Science
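One common workaround is a per-word 2x2 contingency table: for each word, compare its count against all other tokens in the two corpora, rather than one big test over the whole vocabulary. A sketch with made-up counts (the word and the token totals are illustrative):

```python
# Hedged sketch: 2x2 chi-squared statistic for one word, computed by
# hand from invented counts. Rows: condition corpus vs. control corpus;
# columns: occurrences of the word vs. all other tokens. Repeat per word.
word_a, total_a = 30, 10_000    # word count / total tokens, condition corpus
word_b, total_b = 5, 12_000     # same for the control corpus

table = [
    [word_a, total_a - word_a],
    [word_b, total_b - word_b],
]
grand = total_a + total_b
row_totals = [total_a, total_b]
col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (table[i][j] - expected) ** 2 / expected
print(round(chi2, 2))  # large values suggest the word is condition-associated
```

With many words, the resulting p-values would need a multiple-comparisons correction (e.g., Bonferroni or FDR).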

Simple Question: How to Load corpora (.txt files) in R

Context: I have 47 .txt files which are all, technically, small corpora. They have been lowercased and full stops have been removed. Question: Would anyone know how to load these in R, or be able to point me to a resource that explains how? I wish to lemmatise them and extract collocations (nearest neighbours) of a term within a window size of 3, plus their counts per corpus and proportion (count of collocate / total tokens in corpus). Any resources/tips …
Category: Data Science
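For comparison, the same pipeline can be sketched in Python rather than R (the folder name and target term are assumptions): load each .txt corpus, count collocates of a target term within a 3-token window, and report counts plus proportions per corpus.

```python
from pathlib import Path
from collections import Counter

# Hedged sketch: for every .txt corpus in a folder, count the tokens
# that appear within +/- `window` positions of a target term, and report
# (count, count / total tokens in that corpus) per collocate.
def collocates(folder, target, window=3):
    results = {}
    for path in Path(folder).glob("*.txt"):
        tokens = path.read_text(encoding="utf-8").split()
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(tokens[lo:i] + tokens[i + 1 : hi])
        total = len(tokens)
        results[path.name] = {w: (c, c / total) for w, c in counts.items()}
    return results

# results = collocates("corpora", "walk")  # folder and term are illustrative
```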

How to build a speech corpus for continuous speech?

I am developing a speech corpus from audiobooks. I was able to collect recordings with their transcriptions, but without timing information. I am therefore looking for recent techniques to automatically build segments suited to the development of ASR systems for large-vocabulary continuous speech. Thanks in advance.
Category: Data Science

Suggestions for improvement? Time series of variation in relative frequency of emotion-related words in academic psychology over time

First time plotting and interpreting time series data, and I have used a line plot for ease of use. I am aware this is incredibly basic, but any input/recommendations would be much appreciated (e.g., is anything unclear?). My main concern is whether I have adequately displayed the data and whether I can do anything useful to improve it (e.g., a moving average)? Additionally, whether I have interpreted this time series data appropriately: "The relative frequency of affect-related tokens (counts per 10,000 …
Category: Data Science
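The moving average mentioned above can be sketched in a few lines of Python; the yearly relative frequencies below are invented:

```python
# Hedged sketch: smooth a series of relative frequencies with a simple
# k-point moving average. The values are made up for illustration.
freqs = [12.0, 15.0, 11.0, 18.0, 14.0, 20.0, 16.0]

def moving_average(series, k=3):
    # Average each run of k consecutive points; the smoothed series is
    # shorter than the input by k - 1 points.
    return [sum(series[i : i + k]) / k for i in range(len(series) - k + 1)]

print(moving_average(freqs))
```

Plotting the smoothed series on top of the raw line plot is a common way to make the underlying trend easier to read without hiding the original data.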

Does building a corpus make sense on a documentation project?

I have zero experience in data science or machine learning. Because of this, I am not able to determine whether building a corpus applies to the problem I am trying to solve. I am trying to build a reference site for cloud technologies such as AWS and Google Cloud. I was able to build structured data and identify primary entities within a single ecosystem using standard web scraping and SQL queries. But I wanted to have the ability to …
Category: Data Science

Build a corpus for machine translation

I want to train an LSTM with attention for translation between French and a "rare" language. I say rare because it is an African language with little digital content, and especially few databases in a seq-to-seq-like format. I found a dataset somewhere, but in terms of quality, both the French and the native-language sentences were awfully wrong. When I used this dataset, of course my translations were damn funny … So I decided to do some web scraping to …
Category: Data Science

Text classification with Word2Vec on a larger corpus

I am working on a small project and I would like to use the word2vec technique as a text representation method. I need to classify patents, but I have only a few of them labelled, and to increase the performance of my ML model I would like to increase the corpus/vocabulary of my model by using a large number of patents. The question is, once I have trained my word embeddings, how do I use this larger corpus with my …
Category: Data Science
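The usual recipe here is: train word2vec on the large unlabeled patent corpus, then represent each labelled patent as the average of its word vectors and feed those vectors to any classifier. A sketch with a tiny stand-in embedding table (the words and vectors are invented; a real model would come from training on the big corpus):

```python
# Hedged sketch: turn a document into one vector by averaging the
# embeddings of its known words. The 2-d table below stands in for a
# word2vec model trained on the large unlabeled patent corpus.
embeddings = {
    "battery": [0.9, 0.1],
    "anode":   [0.8, 0.2],
    "gene":    [0.1, 0.9],
}

def doc_vector(text, dim=2):
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    if not vecs:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(d) / len(vecs) for d in zip(*vecs)]

print(doc_vector("battery anode housing"))  # "housing" is out-of-vocabulary
```

Because the embeddings are trained without labels, the small labelled set is only needed for the final classifier, which is the point of the approach.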

Tools/tutorials for compiling corpora for NLP experiments?

I have a couple of NLP ideas I want to try out (mostly for my own learning). While I have the Python/TensorFlow background for running the actual training and prediction tasks, I don't have much experience in processing large amounts of text data and whatever pipelines are involved. Are there any tutorials on how to gather data and label it for a large(ish) NLP experiment? For example: BERT was originally trained on all of English Wikipedia. How do you …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.