I am experimenting with the Orange data mining tool. When I use the 'Corpus' widget from the text mining section it gives me the error: [![Corpus widget error][1]][1] I have tried many things, but am still unable to resolve this issue. [1]: https://i.stack.imgur.com/X4tKu.png Besides that, the text mining section does not show an 'Import Documents' option in Orange v3.25. I just want to know the options for importing text files into Orange.
I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) for the purpose of training a language model on it. Is there anything out-of-the-box that I could use? I tried to look for the SciBERT corpus but cannot find how to access it. Thanks!
I have some text corpora to share with non-programming clients (~50K documents, ~100M tokens) who would like to perform operations like regex searches, collocations, named-entity recognition, and word clustering. The tool AntConc is nice and can do some of these things, but it comes with severe size limitations and crashes on these corpora even on powerful machines. What cloud-based tools with a web interface would you recommend for this kind of task? Is there an open-source tool or a cloud service …
I am building a system to classify inline image attachments as part of the message body or the signature. To train the system, I am looking for an email corpus with raw MIME email messages from people at different companies (i.e., not all Enron employees), hopefully with a lot of image signatures. Do you know of any email corpus like this?
In the project I'm working on right now I would like to get one embedding for every unique lemma in a corpus. Could I get this by averaging the embeddings of every instance of a lemma? For example, say that there were 500 tokens of the lemma "walk" - regardless of conjugation - could I then add/average/concatenate these 500 embeddings together to get one embedding accurately representing all of them? If this would work, which operation should I use on …
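A minimal sketch of the averaging option, assuming the per-occurrence vectors (e.g., from a contextual model) have already been collected per lemma; the variable names and the 768-dimensional random vectors below are stand-ins:

```python
import numpy as np

# Hypothetical input: a dict mapping each lemma to the list of per-token
# embeddings (one vector per occurrence in the corpus).
token_vectors_by_lemma = {
    "walk": [np.random.rand(768) for _ in range(500)],  # stand-in for real vectors
}

# One vector per lemma: element-wise mean over all occurrences.
# Averaging keeps the dimensionality fixed, unlike concatenation,
# whose size would grow with the number of occurrences.
lemma_embeddings = {
    lemma: np.mean(np.stack(vectors), axis=0)
    for lemma, vectors in token_vectors_by_lemma.items()
}

print(lemma_embeddings["walk"].shape)  # (768,)
```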
I didn't get the meaning of, or the difference between, sparse and dense corpora in this sentence: "the reason is that Skip-gram works better over sparse corpora like Twitter and NIPS, while CBOW works better over dense corpora".
Situation: I'm trying to program the following in R. Task: I am trying to select words that appear as nouns in my dataset more often than they do as adjectives, verbs, adverbs, etc. I have all these counts, and below is an example of one instance of what I am trying to do. Imagine the information below is in a dataframe. I do not want to select this lemma (ability), because it appears most often as a VERB; …
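The question asks for R (where the same filtering could be done with tidyr::pivot_wider plus dplyr::filter); purely to illustrate the selection logic, here is a minimal pandas sketch with made-up counts and column names:

```python
import pandas as pd

# Made-up long-format counts: one row per (lemma, POS) pair.
df = pd.DataFrame({
    "lemma": ["ability", "ability", "run", "run", "run"],
    "pos":   ["NOUN",    "VERB",    "NOUN", "VERB", "ADJ"],
    "count": [120,       340,       900,    410,    15],
})

# One row per lemma, one column per POS.
wide = df.pivot_table(index="lemma", columns="pos", values="count", fill_value=0)

# Keep only lemmas whose NOUN count exceeds the count of every other POS.
noun_dominant = wide[wide["NOUN"] > wide.drop(columns="NOUN").max(axis=1)]
print(noun_dominant.index.tolist())  # ['run']; 'ability' drops out (mostly VERB)
```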
Problem: I had planned to use a linear regression model to model time series data retrospectively (i.e., no forecasting). However, I am wondering if this is the best option, having come across a few posts (e.g., https://www.quora.com/Is-regression-analysis-legitimate-for-time-series-data) suggesting that regression analysis might not be legitimate for time series data. Preliminary plotting also shows a concave shape in the data, but this would still be a regression model, I think. Question: Would anyone have any good sources to link to …
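For concreteness, a concave trend can still sit inside a plain regression by adding a squared time term, and autocorrelated residuals (the usual objection to OLS on time series) can be handled with HAC (Newey-West) standard errors. A minimal statsmodels sketch on entirely made-up data:

```python
import numpy as np
import statsmodels.api as sm

# Made-up monthly series with a concave (inverted-U) trend plus noise.
rng = np.random.default_rng(0)
t = np.arange(120)
y = 5 + 0.8 * t - 0.006 * t**2 + rng.normal(0, 2, size=t.size)

# Quadratic time trend: y ~ 1 + t + t^2.
X = sm.add_constant(np.column_stack([t, t**2]))

# HAC (Newey-West) standard errors guard against autocorrelated residuals,
# one of the usual objections to plain OLS on time series.
model = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 12})
print(model.summary())
```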
I have a big text corpus (documentation from a company) and I want to extract the terms that are specific to that area/business. I can do that using TF or TF-IDF and guide myself by the frequency of the words, which isn't always reliable. I want to also do that for single, shorter sentences, but I think this is already more difficult. I was also thinking of using Wikipedia articles to train a model and then apply it to my …
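A minimal sketch of the TF-IDF route with scikit-learn, assuming the documentation has already been split into one string per document (the example documents and parameter choices here are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus: in practice, one string per document of the company docs.
documents = [
    "the ingestion pipeline writes parquet files to the staging bucket",
    "the staging bucket is compacted nightly by the ingestion service",
    "invoices are exported as parquet from the billing service",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
tfidf = vectorizer.fit_transform(documents)

# Rank terms by their mean TF-IDF weight across documents as a rough
# indicator of domain-specific vocabulary.
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names_out())
print(terms[np.argsort(mean_weights)[::-1][:10]])
```

Comparing these frequencies against a general reference corpus (e.g., Wikipedia counts), as suggested at the end of the question, is a common way to filter out terms that are merely frequent rather than domain-specific.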
I have two text datasets, one with people that have a certain medical condition and another of random patients. I want to figure out which words are more likely to show up in the dataset with that medical condition. My original thought was to use a chi-squared test, but it seems like I can't run the test for each word, since the tokens are "categories" and a single word is a "value" of the "categorical" variable. For example, if …
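One common framing (an assumption about what is wanted here, not the only option) is a separate 2×2 contingency table per word: occurrences of that word versus all other tokens, in the condition corpus versus the control corpus. A sketch with SciPy and made-up counts:

```python
from scipy.stats import chi2_contingency

# Made-up counts: total tokens in each corpus and how often one word appears.
total_condition, total_control = 250_000, 400_000
word_in_condition, word_in_control = 180, 95   # e.g. occurrences of "fatigue"

# 2x2 table: rows = corpora, columns = (this word, every other token).
table = [
    [word_in_condition, total_condition - word_in_condition],
    [word_in_control,   total_control   - word_in_control],
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4g}")
# Repeating this per word means many tests, so a multiple-comparison
# correction (e.g. Bonferroni or Benjamini-Hochberg) is advisable.
```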
Context: I have 47 .txt files which are all technically small corpora. They have been lowercased and full stops have been removed. Question: Would anyone know how, or be able to point me to a resource that explains how, to load these in R? I wish to lemmatise them and extract collocations (nearest neighbours) around a term with a window size of 3, plus their counts per corpus and their proportion (count of collocate / total tokens in corpus). Any resources/ tips …
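The question is about R (quanteda or udpipe would be natural choices there); purely to make the counting concrete, here is a minimal Python sketch of window-3 collocate counts and proportions per file, leaving lemmatisation aside and assuming the files live in a folder called corpora/:

```python
from collections import Counter
from pathlib import Path

corpus_dir = Path("corpora")   # hypothetical location of the 47 .txt files
target = "climate"             # the term whose collocates we want
window = 3                     # tokens to the left and to the right

for path in sorted(corpus_dir.glob("*.txt")):
    tokens = path.read_text(encoding="utf-8").split()  # files are already lowercased
    total = len(tokens)
    collocates = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            collocates.update(span)
    for word, count in collocates.most_common(10):
        # count and proportion (count / total tokens) per corpus
        print(path.name, word, count, count / total)
```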
Can I pass a selected word (via a click) in a Word Cloud to the RegExp filter in a Corpus Viewer? Is: [screenshot] Desired: [screenshot] Thanks for any help.
I am developing a speech corpus from audiobooks. I was able to collect recordings with their transcriptions, but without timing information. Therefore I am looking for recent techniques to automatically build segments suited to the development of ASR systems for large-vocabulary continuous speech. Thanks in advance.
First time plotting and interpreting time series data, and I have used a line plot for ease of use. I am aware this is incredibly basic, but any input/recommendations would be much appreciated (e.g., is anything unclear?). My main concern is whether I have adequately displayed the data and whether I can do anything useful to improve it (e.g., a moving average)? Additionally, whether I have interpreted this time series data appropriately: "The relative frequency of affect-related tokens (counts per 10,000 …
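If a smoother is wanted, a centred rolling mean over the relative frequencies is a common first step. A pandas/matplotlib sketch with placeholder data (the series, dates, and window length are all made up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder series: relative frequency of affect-related tokens per month.
idx = pd.date_range("2015-01", periods=72, freq="MS")
rel_freq = pd.Series(50 + 10 * np.sin(np.arange(72) / 6) + np.random.normal(0, 3, 72),
                     index=idx)

ax = rel_freq.plot(alpha=0.4, label="monthly relative frequency")
# A centred 12-month rolling mean smooths month-to-month noise without shifting the trend.
rel_freq.rolling(window=12, center=True).mean().plot(ax=ax, label="12-month rolling mean")
ax.set_ylabel("affect tokens per 10,000 tokens")
ax.legend()
plt.show()
```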
I have zero experience in data science or machine learning. Because of this, I am not able to determine whether building a corpus applies to the problem I am trying to solve. I am trying to build a reference site for cloud technologies such as AWS and Google Cloud. I was able to build structured data and identify primary entities within a single ecosystem using standard web scraping and SQL queries. But I wanted to have the ability to …
I want to train an LSTM with attention for translation between French and a "rare" language. I say rare because it is an African language with little digital content, and especially few datasets in a seq-to-seq-like format. I found a dataset somewhere, but in terms of quality, both the French and the native-language sentences were awfully wrong. When I used this dataset, of course my translations were damn funny ... So I decided to do some web scraping to …
I am working on a small project and I would like to use the word2vec technique as a text representation method. I need to classify patents, but I have only a few of them labelled, and to increase the performance of my ML model I would like to increase the corpus/vocabulary of my model by using a large number of patents. The question is, once I have trained my word embeddings, how do I use this larger corpus with my …
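One common pattern (a sketch of one approach, not the only one) is to train word2vec on the large unlabeled patent corpus and then represent each labelled patent as the average of its word vectors before feeding it to a classifier. A minimal gensim 4.x sketch with stand-in data:

```python
import numpy as np
from gensim.models import Word2Vec

# Stand-ins: in practice `unlabeled_patents` is the big corpus and
# `labeled_patents` the small annotated set, both tokenized into word lists.
unlabeled_patents = [
    ["a", "semiconductor", "substrate", "is", "etched", "with", "plasma"],
    ["the", "plasma", "etching", "step", "removes", "the", "oxide", "layer"],
]
labeled_patents = [["plasma", "etching", "of", "oxide"]]

# 1) Train word2vec on the large unlabeled corpus only.
w2v = Word2Vec(sentences=unlabeled_patents, vector_size=100, window=5,
               min_count=1, workers=4, epochs=10)

# 2) Turn each labelled patent into one feature vector by averaging the
#    vectors of its in-vocabulary words; feed these into any classifier.
def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.vstack([doc_vector(doc, w2v) for doc in labeled_patents])
print(features.shape)  # (n_labeled_patents, 100)
```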
I have a couple of NLP ideas I want to try out (mostly for my own learning) - while I have the python/tensorflow background for running the actual training and prediction tasks, I don't have much experience in processing large amounts of text data and whatever pipelines are involved. Are there any tutorials on how to gather data and label it for a larg(ish) NLP experiment? For example: BERT was originally trained on all of English Wikipedia. How do you …
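As one concrete starting point for the Wikipedia case (an assumption about tooling, not how BERT's authors actually did it): the Hugging Face datasets library hosts preprocessed Wikipedia snapshots, which sidesteps most of the dump-parsing pipeline. A rough sketch; the snapshot name "20220301.en" is one published config and may need updating, and if streaming is not available for it, drop streaming=True and accept the full download:

```python
from datasets import load_dataset

# Preprocessed English Wikipedia snapshot from the Hugging Face hub.
# streaming=True iterates without downloading everything up front.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

# Peek at a few articles; each record has at least "title" and "text".
for i, article in enumerate(wiki):
    print(article["title"])
    print(article["text"][:200], "...")
    if i == 2:
        break
```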
I can see many Wikipedia dumps out there. I am looking for a Wikipedia-derived corpus in which every line is one sentence, without any Wikipedia meta tags.
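If nothing ready-made turns up, one common recipe (this is an assumption about the setup, not a pointer to an existing corpus) is to strip the markup with WikiExtractor first and then split the resulting plain text into sentences, one per line. A sketch of the second step with NLTK, assuming WikiExtractor has already written plain-text files under extracted/:

```python
from pathlib import Path
import nltk

nltk.download("punkt")  # sentence-splitter models
from nltk.tokenize import sent_tokenize

# WikiExtractor output: files like extracted/AA/wiki_00, mostly plain text,
# one paragraph per line, with leftover <doc> tags around each article.
with open("wiki_sentences.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("extracted").rglob("wiki_*")):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("<"):  # skip blanks and <doc>/</doc> tags
                continue
            for sentence in sent_tokenize(line):
                out.write(sentence + "\n")
```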