I am considering the idea of stitching together a slide deck based on text input. E.g., given "An all-hands presentation with business updates, project timelines, and financial report charts", the output could be a deck with slides corresponding to Title, List, Calendar, Pie Chart, and Conclusion. I have preexisting slides that are mostly categorized by their "form", ranging from the very general, like List, to the more specific, like Decision Tree or Venn Diagram. Am I on the right track that this sounds …
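A minimal sketch of one way to do the matching: embed both the description fragments and the slide forms, and pick the nearest form by cosine similarity. This assumes the sentence-transformers package; the model name and the example fragment split are placeholder choices, not from the question.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    forms = ["Title", "List", "Calendar", "Pie Chart",
             "Decision Tree", "Venn Diagram", "Conclusion"]
    fragments = ["business updates", "project timelines",
                 "financial report charts"]

    # Cosine similarity between every fragment and every slide form
    scores = util.cos_sim(model.encode(fragments, convert_to_tensor=True),
                          model.encode(forms, convert_to_tensor=True))
    for frag, row in zip(fragments, scores):
        print(frag, "->", forms[int(row.argmax())])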
I have two databases with around 60,000 samples each. Both have the same features (the same column names), which represent attributes of particular things as free text or as categories (encoded as numbers). Each sample within a database is assumed to refer to a distinct thing, but some objects are represented in both databases, with somewhat different values in the same-named columns (e.g. different free-text descriptions, or classification under another category). The aim is to train a machine learning model …
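This is essentially a record-linkage setup. A sketch of the pairwise framing, where each training example is a pair of records (one per database) with simple similarity features and a same-entity label; the column names and the tiny toy pairs are hypothetical stand-ins:

    import numpy as np
    from difflib import SequenceMatcher
    from sklearn.ensemble import RandomForestClassifier

    def pair_features(rec_a, rec_b):
        """Similarity features for one candidate pair of records."""
        text_sim = SequenceMatcher(None, rec_a["description"],
                                   rec_b["description"]).ratio()
        same_cat = float(rec_a["category"] == rec_b["category"])
        return [text_sim, same_cat]

    # Toy labeled pairs: (record from db 1, record from db 2, same entity?)
    labeled_pairs = [
        ({"description": "red sports car", "category": 3},
         {"description": "red sport car", "category": 3}, 1),
        ({"description": "red sports car", "category": 3},
         {"description": "office chair", "category": 7}, 0),
    ]

    X = np.array([pair_features(a, b) for a, b, _ in labeled_pairs])
    y = np.array([label for _, _, label in labeled_pairs])

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)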
I'm trying to put together a script that classifies comments as either adequate or inadequate. I put a question up here earlier with all my code, but I think I've isolated the problem down to the setup of the model, so I deleted that one; hopefully this is more streamlined and easier to follow. The example I'm trying to follow is the classic IMDB one, where the comments are either positive or negative, but again, in my instance, adequate …
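For reference, a minimal model setup in the spirit of the Keras IMDB example; the vocabulary size and sequence length are placeholder choices. The key points for a binary task like adequate/inadequate are the single sigmoid output unit and the binary_crossentropy loss:

    import tensorflow as tf

    vocab_size, max_len = 10000, 200   # placeholder choices

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, 16),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # adequate vs. inadequate
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()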
I am trying to find a model, or an approach, to classify text into a topic category together with positive or negative feedback. For example, we have three columns:

Review: Camera's not good, battery backup is not very good. Ok ok product, camera's not very good and battery backup is not very good.
Rating: 2
Topic: ['Camera (Neutral)', 'Battery (Neutral)']

My whole dataset is like the above, and Topic is not a standard one; the Topic value is based …
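One way to frame this is as multi-label classification, treating each (topic, polarity) tag as its own label. A sketch with scikit-learn; the toy rows are made up to mirror the question's format:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    reviews = [
        "Camera's not good, battery backup is not very good.",
        "Great battery, lasts two days.",
    ]
    topics = [["Camera (Neutral)", "Battery (Neutral)"], ["Battery (Positive)"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(topics)   # one binary column per (topic, polarity) tag

    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    clf.fit(reviews, Y)
    print(mlb.inverse_transform(clf.predict(["battery is great"])))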
I'm looking for something similar to this: https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py. But instead of positive and negative examples, I have positive examples and a bunch of unlabeled data that will contain some positive examples but is mostly negative. I'm planning on using this in a pipeline that transforms the text data into vectors and then feeds them into a classifier, using https://pulearn.github.io/pulearn/doc/pulearn/. The issue is that I'm not sure of the best way to build the preprocessing stage where I transform the raw text data into …
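One plausible shape for that preprocessing stage: a TfidfVectorizer whose output is densified and handed to a PU classifier. This assumes pulearn's ElkanotoPuClassifier with unlabeled examples marked -1 and a base estimator that exposes predict_proba, as I read the pulearn documentation; the texts and labels below are placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from pulearn import ElkanotoPuClassifier

    texts = ["good product works great"] * 8 + ["assorted unlabeled text"] * 12
    y = np.array([1] * 8 + [-1] * 12)   # 1 = positive, -1 = unlabeled

    X = TfidfVectorizer().fit_transform(texts).toarray()   # dense, to be safe

    pu = ElkanotoPuClassifier(estimator=LogisticRegression(), hold_out_ratio=0.2)
    pu.fit(X, y)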
I have a dataset of sentences from news articles which I need to classify by sentiment. For that goal, I'm planning to use a model that was fine-tuned on datasets of a different kind, for example comments from forums, reviews, and tweets. However, news articles are presumably quite different from such data, as they are usually more neutral. I understand that the correct way to approach this issue would be to train a model on my own labeled dataset; however …
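Before committing to labeling, a quick way to sanity-check how badly the domain shift bites is to run an off-the-shelf fine-tuned checkpoint over a handful of news sentences and inspect the confidences. The model name below is just one publicly available sentiment checkpoint, not a recommendation from the question:

    from transformers import pipeline

    clf = pipeline("sentiment-analysis",
                   model="distilbert-base-uncased-finetuned-sst-2-english")

    news = ["The central bank left interest rates unchanged on Tuesday."]
    print(clf(news))   # check how often neutral wording gets a confident label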
I have created a dataset automatically and wanted to check my interpretation of the amount of noise using a confidence interval. I selected a random sample, manually annotated it, and found that 98% of the labels were correct. Based on these values, I then calculated a 99% confidence interval, which gave a lower bound of 0.9614 and an upper bound of 0.9949. Does this mean that the noise in the overall dataset is between the lower and upper …
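For reproducibility, this style of calculation is a binomial proportion confidence interval. A sketch with statsmodels; the sample size n=500 is hypothetical, since the question doesn't state it, so plug in the actual counts:

    from statsmodels.stats.proportion import proportion_confint

    n = 500                   # manually checked sample size (hypothetical)
    correct = int(0.98 * n)   # 98% of the checked labels were correct

    low, high = proportion_confint(correct, n, alpha=0.01, method="wilson")
    print(f"99% CI for label accuracy: [{low:.4f}, {high:.4f}]")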
I would like to create word embeddings that take context into account, so that the vector for the word Jaguar [animal] would differ from that for Jaguar [car brand]. As you know, word2vec gives only one representation per word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple approach: taking the average of the vector for the word and the vector for a category word, for example like this. Now I would …
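Making the averaging idea explicit: the "contextual" vector is the mean of the word vector and a category-word vector. This assumes gensim's downloader; the GloVe model name is one convenient pretrained choice, not the only option:

    import numpy as np
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-100")   # pretrained word vectors

    def contextual_vector(word, category):
        """Average of the word vector and a category-word vector."""
        return (kv[word] + kv[category]) / 2.0

    jaguar_animal = contextual_vector("jaguar", "animal")
    jaguar_car = contextual_vector("jaguar", "car")
    # The two senses now get different vectors:
    print(np.linalg.norm(jaguar_animal - jaguar_car))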
I want to build a classification model to match customers and products. I have a description of each product, a description of each customer, and the label: customer *i* bought / did not buy product *j*. Each sample/row is a pair (customer, product), so Feature 1 is the customer's description, Feature 2 is the product's description, and the target variable is y = 1 if the customer bought the product, y = 0 otherwise. The goal is to predict for new arriving products …
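A sketch of the feature layout this implies: vectorize the two descriptions with separate vectorizers and concatenate them, so the model sees (customer features | product features) per pair. The toy data below is invented to illustrate the shape of the problem:

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    cust = ["young urban professional", "young urban professional",
            "retired gardener", "retired gardener"]
    prod = ["noise-cancelling headphones", "rose fertilizer",
            "rose fertilizer", "noise-cancelling headphones"]
    y = [1, 0, 1, 0]   # bought / did not buy

    v_c, v_p = TfidfVectorizer(), TfidfVectorizer()
    X = hstack([v_c.fit_transform(cust), v_p.fit_transform(prod)])

    clf = LogisticRegression().fit(X, y)

    # Scoring a newly arriving product against an existing customer:
    x_new = hstack([v_c.transform(["retired gardener"]),
                    v_p.transform(["garden shears"])])
    print(clf.predict_proba(x_new))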
I want to classify 500-character-long text samples as to whether they look like natural language, using a character-level RNN. I'm unsure as to the best way to feed the input to the RNN. Here are two approaches I've thought of:

1. Provide the whole 500 characters (one per time step) to the RNN, and predict a binary class, $\{0,1\}$.
2. Provide shorter overlapping segments (e.g. 10 characters) and predict the next (e.g. 11th) character. Convert this to classification by taking the …
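For concreteness, a minimal Keras sketch of the first approach: all 500 characters go in, one per time step, and a single binary label comes out. The alphabet size and layer widths are placeholder choices:

    import tensorflow as tf

    n_chars, seq_len = 128, 500   # e.g. ASCII alphabet, fixed-length samples

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len,)),
        tf.keras.layers.Embedding(n_chars, 32),          # char id -> vector
        tf.keras.layers.LSTM(64),                        # reads the 500 steps
        tf.keras.layers.Dense(1, activation="sigmoid"),  # natural vs. not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()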
I want to use Transformer-XL for a text classification task, but I don't know what architecture to put on top of it for classification. I use dense layers with softmax activation on the output of the Transformer-XL model to get logits, but this doesn't seem right: when training, I see that accuracy is very low. Output of my model: My training step:
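One plausible head, sketched under the assumption of the Hugging Face transfo-xl-wt103 checkpoint (available in older transformers releases): mean-pool the hidden states and apply a single linear layer, leaving the softmax to the loss rather than putting it in the model:

    import torch
    from transformers import TransfoXLModel, TransfoXLTokenizer

    tok = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
    backbone = TransfoXLModel.from_pretrained("transfo-xl-wt103")
    head = torch.nn.Linear(backbone.config.d_model, 2)   # 2 classes

    inputs = tok("an example sentence to classify", return_tensors="pt")
    hidden = backbone(**inputs).last_hidden_state        # (1, seq_len, d_model)
    logits = head(hidden.mean(dim=1))                    # mean-pool, then project

    # CrossEntropyLoss applies log-softmax itself, so the head emits raw logits.
    loss = torch.nn.CrossEntropyLoss()(logits, torch.tensor([1]))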
I have 200 unique *.txt files in each folder; each file is the initial text of a lawsuit, and the folders separate the lawsuits by legal area of public advocacy. I would like to create training data to predict the legal area of new lawsuits. Last year I tried PHP-ML, but it consumes too much memory, so I would like to migrate to Python. I started the code by loading each text file into a JSON-like structure, but I don't know the next steps: import …
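Possible next steps with scikit-learn: load_files reads a one-folder-per-class layout directly, so the JSON-like intermediate structure isn't needed. The path below is a placeholder for your folder of legal areas:

    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Each subfolder of "lawsuits/" is one legal area (one class).
    data = load_files("lawsuits/", encoding="utf-8", decode_error="ignore")

    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(data.data, data.target)

    pred = clf.predict(["text of a new lawsuit..."])
    print(data.target_names[pred[0]])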
I have a set of 150 documents with their assigned binary class. I also have 1000 unlabeled documents. Each document is about the length of a journal paper. Each class has 15 associated keywords. I want to be able to predict the assigned class of the documents using this information. Does anyone have any ideas of how I could approach this problem?
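One option is scikit-learn's self-training wrapper, which uses the 150 labeled documents to bootstrap labels for the 1000 unlabeled ones (marked -1); the per-class keyword lists could additionally be appended to each document's text as a crude prior. The documents below are toy stand-ins:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    # Toy stand-ins: in your case, the 150 labeled and 1000 unlabeled docs.
    labeled_docs = ["gene expression study", "market earnings report"] * 5
    labels = [0, 1] * 5
    unlabeled_docs = ["quarterly revenue numbers", "protein folding results"] * 3

    docs = labeled_docs + unlabeled_docs
    y = np.array(labels + [-1] * len(unlabeled_docs))   # -1 marks unlabeled

    X = TfidfVectorizer().fit_transform(docs)
    clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
    clf.fit(X, y)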
I am very new to NLP. I am doing a text segmentation task, and to evaluate my model I need to calculate Pk and WindowDiff scores. My question is: what is the ideal value for the window size (k) for the Pk score, since different window sizes give different results? I am using the function nltk.metrics.segmentation.pk. Thanks.
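For what it's worth, nltk's pk already implements the conventional choice when k is omitted: half of the average segment length in the reference segmentation. A short example (boundary strings invented for illustration):

    from nltk.metrics.segmentation import pk, windowdiff

    ref = "0100100100"   # '1' marks a boundary after that position
    hyp = "0100010010"

    print(pk(ref, hyp))       # k defaults to half the mean reference segment length
    print(pk(ref, hyp, k=2))  # or fix k explicitly when comparing models
    print(windowdiff(ref, hyp, 2))   # WindowDiff requires k as an argument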
I was working on a text classification problem where I currently have around 40-45 different labels. The input is a text sentence with a keyword. For example, "This phone is the most durable in the market" is the input sentence, the output label is X, and all the inputs with label X have "durable" as their keyword. What would be a good model to fit this? I tried a basic SVM and Random Forest, but to no …
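One thing worth trying before changing the model family: give the classifier the keyword as its own feature channel alongside the sentence, rather than relying on the sentence text alone. A sketch with toy rows included only to make the snippet self-contained:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    df = pd.DataFrame({
        "sentence": ["This phone is the most durable in the market",
                     "Battery drains overnight"],
        "keyword": ["durable", "battery"],
        "label": ["X", "Y"],
    })

    features = ColumnTransformer([
        ("sent", TfidfVectorizer(), "sentence"),   # sentence channel
        ("kw", TfidfVectorizer(), "keyword"),      # keyword channel
    ])
    clf = make_pipeline(features, LinearSVC())
    clf.fit(df[["sentence", "keyword"]], df["label"])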
I am trying to use the pycld2 package to detect multiple languages in text. This package provides Python bindings for Compact Language Detect 2 (CLD2). This is the example I am testing out:

    import pycld2 as cld2

    text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: …
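Completing the example (assuming text is defined as above): cld2.detect returns per-language details, and passing returnVectors=True additionally gives the byte spans assigned to each language, which is what mixed-language text needs:

    isReliable, textBytesFound, details, vectors = cld2.detect(
        text, returnVectors=True)

    print(details)   # up to three (languageName, languageCode, percent, score) tuples
    for offset, num_bytes, lang_name, lang_code in vectors:
        # Offsets index into the UTF-8 encoding of the text.
        chunk = text.encode("utf-8")[offset:offset + num_bytes]
        print(lang_name, chunk[:40])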
I'm trying to run a Multinomial Naive Bayes classifier on variously balanced data sets, comparing two different vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10,000 documents: NEG has 2,474, NEU 5,894 and POS 1,632. Out of these I have made 3 differently balanced data sets, like this (text counts):

                              NEU     NEG     POS     Total
        NEU-balanced dataset  5894    2474    1632    10000
        NEG-balanced dataset  2474    2474    1632     6580
        POS-balanced dataset  1632    1632    …
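A direct way to compare the two vectorizers on identical splits; macro-F1 is more informative than accuracy given the class imbalance. The docs and labels here are tiny stand-ins for one of the balanced datasets:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["great product", "terrible service", "it was okay",
            "loved it", "hated it", "average experience"]
    labels = ["POS", "NEG", "NEU", "POS", "NEG", "NEU"]

    for vec in (CountVectorizer(), TfidfVectorizer()):
        pipe = make_pipeline(vec, MultinomialNB())
        scores = cross_val_score(pipe, docs, labels, cv=2, scoring="f1_macro")
        print(type(vec).__name__, scores.mean())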
I want to classify the sentences in my dataset as declarative, interrogative, imperative or exclamative. Although they can be classified with respect to punctuation marks such as ?, ! and ., there are many cases and situations in which such rules fail. In NLP, is there any model or solution that can be applied to reach this goal?
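One model-based option that needs no punctuation rules and no labeled data: score each sentence against the four sentence types with a zero-shot NLI classifier. The checkpoint name below is one common publicly available choice, not the only one:

    from transformers import pipeline

    clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    labels = ["declarative", "interrogative", "imperative", "exclamative"]
    # No question mark, yet clearly interrogative:
    print(clf("Could you pass me the report", candidate_labels=labels))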
Given a collection of documents, each corresponding to some economic entity, I am looking to extract information and populate a table with predetermined headings. I have a small sample of this already done by humans, and I was wondering if there's an efficient way to automate it. Grateful for any suggestions.
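A lightweight starting point, sketched with off-the-shelf NER: map entity types to the table headings, then use the human-labeled sample to evaluate (and later fine-tune) the extractor. The headings below are hypothetical:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Acme Corp reported revenue of $1.2 billion in 2023.")
    row = {"organisation": None, "amount": None, "year": None}
    for ent in doc.ents:
        if ent.label_ == "ORG":
            row["organisation"] = ent.text
        elif ent.label_ == "MONEY":
            row["amount"] = ent.text
        elif ent.label_ == "DATE":
            row["year"] = ent.text
    print(row)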
I have trained an XGBClassifier to classify text issues to the rightful assignee (a simple 50-way classification). The source from which I fetch the data also provides a datetime object giving the timestamp at which each issue was created. Logically, a person who has recently worked on an issue (say, 2 weeks ago) should be a better suggestion than another person who worked on a similar issue 2 years ago. That is, if there are two examples from …
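One way to encode recency without changing the model class: pass exponentially decayed sample weights to fit, so recent issues count more. The half-life is a tunable assumption, and the feature matrix here is a toy stand-in:

    import numpy as np
    import pandas as pd
    from xgboost import XGBClassifier

    # Toy stand-ins for the real issue data.
    df = pd.DataFrame({"created_at": pd.to_datetime(["2024-01-01", "2022-01-01"])})
    X = np.array([[0.1], [0.9]])
    y = np.array([0, 1])

    half_life_days = 90   # tunable assumption
    age_days = (pd.Timestamp.now() - df["created_at"]).dt.days
    # ~0.9 for a 2-week-old issue, ~0.004 for a 2-year-old one:
    weights = np.power(0.5, age_days / half_life_days)

    model = XGBClassifier()
    model.fit(X, y, sample_weight=weights)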