I am experimenting with the Orange data mining tool. When I use the 'Corpus' widget from the text mining section it gives me the error: [![Corpus widget error][1]][1] I have tried many things, but am still unable to resolve this issue. [1]: https://i.stack.imgur.com/X4tKu.png Besides that, the text mining section does not show an 'Import Documents' option in Orange v3.25. I just want to know the options for importing text files into Orange.
I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) for the purpose of training a language model on it. Is there anything out-of-the-box that I could use? I tried to look for the SciBERT corpus but cannot find how to access it. Thanks!
I have some text corpora to share with non-programming clients (~50K documents, ~100M tokens) who would like to perform operations like regex searches, collocations, named-entity recognition, and word clustering. The tool AntConc is nice and can do some of these things, but it comes with severe size limitations and crashes on these corpora even on powerful machines. What cloud-based tools with a web interface would you recommend for this kind of task? Is there an open-source tool or a cloud service …
I am building a system to classify inline image attachments as part of the message body or the signature. To train the system, I am looking for an email corpus with raw MIME email messages from people at different companies (i.e., not all Enron employees), hopefully with a lot of image signatures. Do you know of any email corpus like this?
In the project I'm working on right now I would like to get one embedding for every unique lemma in a corpus. Could I get this by averaging the embeddings of every instance of a lemma? For example, say that there were 500 tokens of the lemma "walk" - regardless of conjugation - could I then add/average/concatenate these 500 embeddings together to get one embedding accurately representing all of them? If this would work, which operation should I use on …
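A minimal sketch of the averaging option, assuming the per-occurrence vectors (e.g., from a contextual model) have already been collected per lemma; the variable names and the 768-dimensional random vectors below are stand-ins:

```python
import numpy as np

# Hypothetical input: a dict mapping each lemma to the list of per-token
# embeddings (one vector per occurrence in the corpus).
token_vectors_by_lemma = {
    "walk": [np.random.rand(768) for _ in range(500)],  # stand-in for real vectors
}

# One vector per lemma: element-wise mean over all occurrences.
# Averaging keeps the dimensionality fixed, unlike concatenation,
# whose size would grow with the number of occurrences.
lemma_embeddings = {
    lemma: np.mean(np.stack(vectors), axis=0)
    for lemma, vectors in token_vectors_by_lemma.items()
}

print(lemma_embeddings["walk"].shape)  # (768,)
```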
I didn't get the meaning of, or the difference between, sparse and dense corpora in this sentence: "the reason is that Skip-gram works better over sparse corpora like Twitter and NIPS, while CBOW works better over dense corpora".
Situation: I'm trying to program the following in R. Task: I am trying to select words that appear as nouns in my dataset more often than they do as adjectives, verbs, adverbs, etc. I have all these counts, and below is an example of one instance of what I am trying to do. Imagine the information below is in a dataframe. I do not want to select this lemma (ability), because it appears most often as a VERB; …
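The question asks for R (where the same filtering could be done with tidyr::pivot_wider plus dplyr::filter); purely to illustrate the selection logic, here is a minimal pandas sketch with made-up counts and column names:

```python
import pandas as pd

# Made-up long-format counts: one row per (lemma, POS) pair.
df = pd.DataFrame({
    "lemma": ["ability", "ability", "run", "run", "run"],
    "pos":   ["NOUN",    "VERB",    "NOUN", "VERB", "ADJ"],
    "count": [120,       340,       900,    410,    15],
})

# One row per lemma, one column per POS.
wide = df.pivot_table(index="lemma", columns="pos", values="count", fill_value=0)

# Keep only lemmas whose NOUN count exceeds the count of every other POS.
noun_dominant = wide[wide["NOUN"] > wide.drop(columns="NOUN").max(axis=1)]
print(noun_dominant.index.tolist())  # ['run']; 'ability' drops out (mostly VERB)
```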
Problem: I had planned to use a linear regression model to model time series data retrospectively (i.e., no forecasting). However, I am wondering if this is the best option, having come across a few posts (e.g., https://www.quora.com/Is-regression-analysis-legitimate-for-time-series-data) suggesting that regression analysis might not be legitimate for time series data. Preliminary plotting also shows a concave shape in the data, but this would still be a regression model, I think. Question: Would anyone have any good sources to link to …
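For concreteness, a concave trend can still sit inside a plain regression by adding a squared time term, and autocorrelated residuals (the usual objection to OLS on time series) can be handled with HAC (Newey-West) standard errors. A minimal statsmodels sketch on entirely made-up data:

```python
import numpy as np
import statsmodels.api as sm

# Made-up monthly series with a concave (inverted-U) trend plus noise.
rng = np.random.default_rng(0)
t = np.arange(120)
y = 5 + 0.8 * t - 0.006 * t**2 + rng.normal(0, 2, size=t.size)

# Quadratic time trend: y ~ 1 + t + t^2.
X = sm.add_constant(np.column_stack([t, t**2]))

# HAC (Newey-West) standard errors guard against autocorrelated residuals,
# one of the usual objections to plain OLS on time series.
model = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 12})
print(model.summary())
```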
I have a big text corpus (documentation from a company) and I want to extract the terms that are specific to that area/business. I can do that using TF or TF-IDF and guide myself by the frequency of the words, which isn't always reliable. I want to also do that for single, shorter sentences, but I think this is already more difficult. I was also thinking of using Wikipedia articles to train a model and then apply it to my …
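A minimal sketch of the TF-IDF route with scikit-learn, assuming the documentation has already been split into one string per document (the example documents and parameter choices here are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus: in practice, one string per document of the company docs.
documents = [
    "the ingestion pipeline writes parquet files to the staging bucket",
    "the staging bucket is compacted nightly by the ingestion service",
    "invoices are exported as parquet from the billing service",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
tfidf = vectorizer.fit_transform(documents)

# Rank terms by their mean TF-IDF weight across documents as a rough
# indicator of domain-specific vocabulary.
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names_out())
print(terms[np.argsort(mean_weights)[::-1][:10]])
```

Comparing these frequencies against a general reference corpus (e.g., Wikipedia counts), as suggested at the end of the question, is a common way to filter out terms that are merely frequent rather than domain-specific.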
I have two text datasets, one with people that have a certain medical condition and another of random patients. I want to figure out which words are more likely to show up in the dataset with that medical condition. My original thought was to use a chi-squared test, but it seems like I can't run the test for each word, since the tokens are "categories" and a single word is a "value" of the "categorical" variable. For example, if …
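One common framing (an assumption about what is wanted here, not the only option) is a separate 2×2 contingency table per word: occurrences of that word versus all other tokens, in the condition corpus versus the control corpus. A sketch with SciPy and made-up counts:

```python
from scipy.stats import chi2_contingency

# Made-up counts: total tokens in each corpus and how often one word appears.
total_condition, total_control = 250_000, 400_000
word_in_condition, word_in_control = 180, 95   # e.g. occurrences of "fatigue"

# 2x2 table: rows = corpora, columns = (this word, every other token).
table = [
    [word_in_condition, total_condition - word_in_condition],
    [word_in_control,   total_control   - word_in_control],
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4g}")
# Repeating this per word means many tests, so a multiple-comparison
# correction (e.g. Bonferroni or Benjamini-Hochberg) is advisable.
```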
Context: I have 47 .txt files which are all technically small corpora. They have been lowercased and full stops have been removed. Question: Would anyone know how, or be able to point me to a resource that explains how, to load these in R? I wish to lemmatise them and extract collocations (nearest neighbours) around a term with a window size of 3, plus their counts per corpus and their proportion (count of collocate / total tokens in corpus). Any resources/ tips …
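The question is about R (quanteda or udpipe would be natural choices there); purely to make the counting concrete, here is a minimal Python sketch of window-3 collocate counts and proportions per file, leaving lemmatisation aside and assuming the files live in a folder called corpora/:

```python
from collections import Counter
from pathlib import Path

corpus_dir = Path("corpora")   # hypothetical location of the 47 .txt files
target = "climate"             # the term whose collocates we want
window = 3                     # tokens to the left and to the right

for path in sorted(corpus_dir.glob("*.txt")):
    tokens = path.read_text(encoding="utf-8").split()  # files are already lowercased
    total = len(tokens)
    collocates = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            collocates.update(span)
    for word, count in collocates.most_common(10):
        # count and proportion (count / total tokens) per corpus
        print(path.name, word, count, count / total)
```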
Can I pass a selected word (via a click) in a Word Cloud to the RegExp filter in a Corpus Viewer? Is: [screenshot] Desired: [screenshot] Thanks for any help.
I am developing a speech corpus from audiobooks. I was able to collect recordings with their transcriptions, but without timing information. Therefore I am looking for recent techniques to automatically build segments suited to the development of ASR systems for large-vocabulary continuous speech. Thanks in advance.
First time plotting and interpreting time series data, and I have used a line plot for ease of use. I am aware this is incredibly basic, but any input/recommendations would be much appreciated (e.g., is anything unclear?). My main concern is whether I have adequately displayed the data and whether I can do anything useful to improve it (e.g., a moving average)? Additionally, whether I have interpreted this time series data appropriately: "The relative frequency of affect-related tokens (counts per 10,000 …
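If a smoother is wanted, a centred rolling mean over the relative frequencies is a common first step. A pandas/matplotlib sketch with placeholder data (the series, dates, and window length are all made up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder series: relative frequency of affect-related tokens per month.
idx = pd.date_range("2015-01", periods=72, freq="MS")
rel_freq = pd.Series(50 + 10 * np.sin(np.arange(72) / 6) + np.random.normal(0, 3, 72),
                     index=idx)

ax = rel_freq.plot(alpha=0.4, label="monthly relative frequency")
# A centred 12-month rolling mean smooths month-to-month noise without shifting the trend.
rel_freq.rolling(window=12, center=True).mean().plot(ax=ax, label="12-month rolling mean")
ax.set_ylabel("affect tokens per 10,000 tokens")
ax.legend()
plt.show()
```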
I have zero experience in data science or machine learning. Because of this, I am not able to determine whether building a corpus applies to the problem I am trying to solve. I am trying to build a reference site for cloud technologies such as AWS and Google Cloud. I was able to build structured data and identify primary entities within a single ecosystem using standard web scraping and SQL queries. But I wanted to have the ability to …
I want to train an LSTM with attention for translation between French and a "rare" language. I say rare because it is an African language with little digital content, and especially few datasets in a seq-to-seq-like format. I found a dataset somewhere, but in terms of quality, both the French and the native-language sentences were awfully wrong. When I used this dataset, of course my translations were damn funny ... So I decided to do some web scraping to …
I am working on a small project and I would like to use the word2vec technique as a text representation method. I need to classify patents, but I have only a few of them labelled, and to increase the performance of my ML model I would like to increase the corpus/vocabulary of my model by using a large number of patents. The question is, once I have trained my word embeddings, how do I use this larger corpus with my …
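One common pattern (a sketch of one approach, not the only one) is to train word2vec on the large unlabeled patent corpus and then represent each labelled patent as the average of its word vectors before feeding it to a classifier. A minimal gensim 4.x sketch with stand-in data:

```python
import numpy as np
from gensim.models import Word2Vec

# Stand-ins: in practice `unlabeled_patents` is the big corpus and
# `labeled_patents` the small annotated set, both tokenized into word lists.
unlabeled_patents = [
    ["a", "semiconductor", "substrate", "is", "etched", "with", "plasma"],
    ["the", "plasma", "etching", "step", "removes", "the", "oxide", "layer"],
]
labeled_patents = [["plasma", "etching", "of", "oxide"]]

# 1) Train word2vec on the large unlabeled corpus only.
w2v = Word2Vec(sentences=unlabeled_patents, vector_size=100, window=5,
               min_count=1, workers=4, epochs=10)

# 2) Turn each labelled patent into one feature vector by averaging the
#    vectors of its in-vocabulary words; feed these into any classifier.
def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.vstack([doc_vector(doc, w2v) for doc in labeled_patents])
print(features.shape)  # (n_labeled_patents, 100)
```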
I have a couple of NLP ideas I want to try out (mostly for my own learning) - while I have the python/tensorflow background for running the actual training and prediction tasks, I don't have much experience in processing large amounts of text data and whatever pipelines are involved. Are there any tutorials on how to gather data and label it for a larg(ish) NLP experiment? For example: BERT was originally trained on all of English Wikipedia. How do you …
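As one concrete starting point for the Wikipedia case (an assumption about tooling, not how BERT's authors actually did it): the Hugging Face datasets library hosts preprocessed Wikipedia snapshots, which sidesteps most of the dump-parsing pipeline. A rough sketch; the snapshot name "20220301.en" is one published config and may need updating, and if streaming is not available for it, drop streaming=True and accept the full download:

```python
from datasets import load_dataset

# Preprocessed English Wikipedia snapshot from the Hugging Face hub.
# streaming=True iterates without downloading everything up front.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

# Peek at a few articles; each record has at least "title" and "text".
for i, article in enumerate(wiki):
    print(article["title"])
    print(article["text"][:200], "...")
    if i == 2:
        break
```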
I can see many Wikipedia dumps out there. I am looking for a Wikipedia-derived corpus in which every line is one sentence, without any Wikipedia meta tags.
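If nothing ready-made turns up, one common recipe (this is an assumption about the setup, not a pointer to an existing corpus) is to strip the markup with WikiExtractor first and then split the resulting plain text into sentences, one per line. A sketch of the second step with NLTK, assuming WikiExtractor has already written plain-text files under extracted/:

```python
from pathlib import Path
import nltk

nltk.download("punkt")  # sentence-splitter models
from nltk.tokenize import sent_tokenize

# WikiExtractor output: files like extracted/AA/wiki_00, mostly plain text,
# one paragraph per line, with leftover <doc> tags around each article.
with open("wiki_sentences.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("extracted").rglob("wiki_*")):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("<"):  # skip blanks and <doc>/</doc> tags
                continue
            for sentence in sent_tokenize(line):
                out.write(sentence + "\n")
```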