Text2Slide multiclass classification

I am considering the idea of stitching together a slide deck based on text input. For example, given "An all-hands presentation with business updates, project timelines, and financial report charts", the output could be a deck with slides corresponding to Title, List, Calendar, Pie Chart, Conclusion. I have preexisting slides that are mostly categorized by "form", ranging from very general (like List) to more specific (like Decision Tree or Venn Diagram). Am I on the right track that this sounds …
Category: Data Science

How to train a model to predict if 2 samples refer to the same thing?

I have 2 databases with around 60,000 samples each. Both have the same features (same column names), which describe particular things using text or categories (encoded as numbers). Each sample within a database is assumed to refer to a different thing. However, some objects are represented in both databases, with somewhat different values in the same-named columns (for example, different free-text descriptions, or a different assigned category). The aim is to train a machine learning model …
Category: Data Science

TF Keras Text Processing - Classification Model

I'm trying to put together a script that classifies comments as either adequate or inadequate. I posted a question here earlier with all my code, but I think I've narrowed the problem down to the setup of the model, so I deleted that one; hopefully this one is more streamlined and easier to follow. The example I'm trying to follow is the classic IMDB comment example, where the comments are either positive or negative, but again in my instance, adequate …
Category: Data Science

Is there any way to classify the category of given Amazon reviews using Python?

I am trying to find a model or method to classify text into a category and to tell whether it is positive or negative feedback. For example, we have three columns. Review: Camera's not good battery backup is not very good. Ok ok product camera's not very good and battery backup is not very good. Rating: 2. Topic: ['Camera (Neutral)', 'Battery (Neutral)']. My whole dataset looks like the above, and Topic is not a standard one; the Topic value is based …
Category: Data Science

How to use scikit-learn to extract features from text when I only have positive and unlabeled data?

I'm looking for something similar to this https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py But instead of positive and negative examples, I have positive examples and a bunch of unlabeled data that will contain some positive examples but is mostly negative. I'm planning on using this in a pipeline to transform text data into a vector, then feeding it into a classifier using https://pulearn.github.io/pulearn/doc/pulearn/ The issue is I'm not sure the best way to build the preprocessing stage where I transform the raw text data into …
Category: Data Science

Using a fine-tuned model for a different dataset

I have a dataset of different sentences from news articles which I need to classify by their sentiment. For that goal I'm planning to use a fine-tuned model which was fine-tuned on different datasets, for example various comments from forums, reviews, tweets. However, news articles are supposedly quite different from that dataset as they are usually more neutral. I understand that a correct way to approach this issue would be by training a model on my own labeled dataset, however …
Category: Data Science

Interpreting confidence interval results for datasets

I have created a dataset automatically and wanted to clarify my interpretation of the amount of noise using the confidence interval. I selected a random sample and manually annotated the sample and found that 98% of the labels were correct. Based on these values I then calculated the confidence interval at 99% which gave a lower bound of 0.9614 and upper bound of 0.9949. Does this mean that the noise in the overall dataset is between the lower and upper …
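As a rough illustration of the calculation described in this question, the sketch below computes a Wilson score interval for a proportion of correct labels. The sample size (500) is a made-up assumption, since the excerpt does not state it, and the excerpt does not say which interval method the asker used, so the numbers here are not expected to reproduce the bounds quoted above.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.99):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical sample: 490 of 500 manually checked labels were correct (98%).
low, high = wilson_interval(490, 500, confidence=0.99)
print(f"label accuracy is estimated to lie in [{low:.4f}, {high:.4f}]")
print(f"label noise is estimated to lie in [{1 - high:.4f}, {1 - low:.4f}]")
```

The interval describes the proportion of correct labels, so the corresponding range for the noise rate is one minus each bound.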
Category: Data Science

Contextual word embeddings from pretrained word2vec vectors

I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand]. As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple way with taking an average vector of the word and category word, for example like this. Now I would …
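The averaging idea mentioned in this question might look roughly like the sketch below. It is not the asker's code (their example is not included in the excerpt), and the 3-dimensional vectors are toy values invented for illustration; real pretrained word2vec vectors would typically have hundreds of dimensions.

```python
def average_vectors(word_vec, context_vec):
    """Element-wise mean of a word vector and a category (context) vector."""
    if len(word_vec) != len(context_vec):
        raise ValueError("vectors must have the same dimensionality")
    return [(a + b) / 2 for a, b in zip(word_vec, context_vec)]


# Toy embeddings (hypothetical values, not taken from any pretrained model).
embeddings = {
    "jaguar": [0.5, 0.5, 0.0],
    "animal": [0.0, 1.0, 0.0],
    "car": [1.0, 0.0, 0.0],
}

jaguar_animal = average_vectors(embeddings["jaguar"], embeddings["animal"])
jaguar_car = average_vectors(embeddings["jaguar"], embeddings["car"])
print(jaguar_animal)  # [0.25, 0.75, 0.0]
print(jaguar_car)     # [0.75, 0.25, 0.0]
```

The two senses now get different vectors, but each is only shifted toward its category word; the surrounding sentence is not used at all.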
Category: Data Science

Classification using texts as features

I want to build a classification model to match customers and products. I have a description of each product, a description of each customer, and the label: customer *i* bought/did not buy product *j*. Each sample/row is a pair (customer, product), so Feature 1 is the customer's description, Feature 2 is the product's description, and the target variable y is: "y = 1 : customer buys product", "y = 0 otherwise". The goal is to predict for newly arriving products …
Category: Data Science

Optimal input setup for character-level text classification RNN

I want to classify 500-character long text samples as to whether they look like natural language using a character-level RNN. I'm unsure as to the best way to feed the input to the RNN. Here are two approaches I've thought of: Provide the whole 500 characters (one per time step) to the RNN, and predict a binary class, $\{0,1\}$. Provide shorter overlapping segments (e.g. 10 characters) and predict the next (e.g. 11th) character. Convert this to classification by taking the …
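To make the second approach in this question concrete, here is a minimal sketch of how overlapping segments and their next-character targets could be generated. The function name and the default window size are assumptions for illustration; the model itself is not shown.

```python
def sliding_windows(text, window=10):
    """Return (segment, next_char) pairs for every overlapping window in text."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        (text[i:i + window], text[i + window])
        for i in range(len(text) - window)
    ]


pairs = sliding_windows("hello world!", window=10)
for segment, target in pairs:
    print(repr(segment), "->", repr(target))
```

With this scheme a 500-character sample and a window of 10 yields 490 training pairs, whereas the first approach yields a single labeled sequence per sample.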
Category: Data Science

Can I use Transformer-XL for a text classification task?

I want to use Transformer-XL for text classification tasks, but I don't know the right model architecture for text classification. I use dense layers with softmax activation on the logits output from the Transformer-XL model, but this doesn't seem right. When training, I see that accuracy is very low. Output of my model: My training step:
Category: Data Science

How to use text classification where the training sources are .txt files in categorized folders?

I have 200 unique *.txt files in each folder: Each file is the initial text of a lawsuit, and the files are separated by legal area (folders) of public advocacy. I would like to create training data to predict the legal area of new lawsuits. Last year I tried using PHP-ML, but it consumed too much memory, so I would like to migrate to Python. I started the code, loading each text file into a JSON-like structure, but I don't know the next steps: import …
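One possible next step for this question is to turn the folder layout into parallel lists of texts and labels, using each folder name as the class label. The sketch below assumes a layout of one sub-folder per legal area containing UTF-8 *.txt files; the function name and that layout are assumptions, not the asker's code.

```python
from pathlib import Path


def load_labeled_texts(root):
    """Read every *.txt file under root/<label>/ and return (texts, labels)."""
    texts, labels = [], []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files next to the category folders
        for file in sorted(folder.glob("*.txt")):
            texts.append(file.read_text(encoding="utf-8"))
            labels.append(folder.name)  # the folder name is the legal area
    return texts, labels
```

Calling `load_labeled_texts("lawsuits")` (with a hypothetical root folder name) would return two lists of equal length that can then be passed to a vectorizer and a classifier.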
Category: Data Science

Binary document classification using keywords for a very small dataset

I have a set of 150 documents with their assigned binary class. I also have 1000 unlabeled documents. Each document is about the length of a journal paper. Each class has 15 associated keywords. I want to be able to predict the assigned class of the documents using this information. Does anyone have any ideas of how I could approach this problem?
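A very simple baseline for this kind of setup is to count how often each class's keywords occur in a document and pick the class with the higher count. The sketch below is only one possible starting point; the keyword lists are invented placeholders, and it handles single-word keywords only.

```python
import re
from collections import Counter


def keyword_score(document, keywords):
    """Total number of occurrences of the given single-word keywords."""
    tokens = Counter(re.findall(r"[a-z0-9']+", document.lower()))
    return sum(tokens[keyword.lower()] for keyword in keywords)


def predict_class(document, class_keywords):
    """Return the best-scoring label and the per-class scores (ties go to the first label)."""
    scores = {
        label: keyword_score(document, keywords)
        for label, keywords in class_keywords.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical keyword lists; the real ones would have 15 entries per class.
class_keywords = {"A": ["neural", "network"], "B": ["protein", "gene"]}
label, scores = predict_class("The gene encodes a protein; the protein folds.", class_keywords)
print(label, scores)  # B {'A': 0, 'B': 3}
```

The 150 labeled documents could be used to check how well such a baseline does before moving to a trained model, and the per-class scores could also serve as extra features.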
Category: Data Science

Suggestions for a multi-class text classification model with a large number of classes?

I am working on a text classification problem where I currently have around 40-45 different labels. The input is a text sentence with a keyword. For example, "This phone is the most durable in the market" is the input sentence and the output label is X, and all the words in the output with label X will have durable as a keyword. What would be a good model to fit this? I tried basic SVM, Random Forest but to no …
Category: Data Science

Language Detection using pycld2

I am trying to use the pycld2 package to detect multiple languages in text. This package provides Python bindings for the Compact Language Detect 2 (CLD2). This is the example I am testing out: import pycld2 as cld2 text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: …
Category: Data Science

Naive Bayes TfidfVectorizer predicts everything to one class

I'm trying to run a multinomial naive Bayes classifier on various balanced datasets and compare 2 different vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10000 documents: the NEG class has 2474, NEU 5894 and POS 1632. From those I have made 3 differently balanced datasets like this:
Text counts            NEU    NEG    POS    Total number
NEU balance dataset    5894   2474   1632   10000
NEG balance dataset    2474   2474   1632   6580
POS balance dataset    1632   1632   …
Category: Data Science

Sentence type classification

I want to classify the sentences in my dataset as declarative, interrogative, imperative or exclamative. Although sentences can be classified by punctuation marks such as ?, ! and ., there are many cases where these rules fail. Is there any NLP model or approach that can be applied to achieve this?
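For reference, the punctuation-based rules mentioned in this question can be written down as a small baseline, which also makes their failure modes easy to see. The list of imperative starter words below is an arbitrary illustration, not a linguistic resource.

```python
# Hypothetical, deliberately tiny list of words that often start a command.
IMPERATIVE_STARTERS = {"please", "close", "open", "stop", "go", "take", "let", "don't"}


def classify_sentence(sentence):
    """Rule-based guess of the sentence type from punctuation and the first word."""
    stripped = sentence.strip()
    if not stripped:
        raise ValueError("empty sentence")
    if stripped.endswith("?"):
        return "interrogative"
    if stripped.endswith("!"):
        return "exclamative"
    first_word = stripped.split()[0].lower().strip(",.")
    if first_word in IMPERATIVE_STARTERS:
        return "imperative"
    return "declarative"


for example in ["Where are you going?", "What a beautiful day!",
                "Close the door.", "The door is closed.", "Stop!"]:
    print(example, "->", classify_sentence(example))
```

The last example, "Stop!", is labeled exclamative by these rules even though it is a command, which is the kind of case where a trained classifier would be needed.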
Category: Data Science

How to provide Intentional Bias towards recent examples in Text Classification?

I have trained an XGBClassifier to assign text issues to the right assignee (a simple 50-way classification). The source I am fetching the data from also provides a datetime object with the timestamp at which each issue was created. Logically, a person who recently worked on an issue (say 2 weeks ago) should be a better suggestion than another person who worked on a similar issue 2 years ago. That is, if there are two examples from …
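One common way to express this kind of preference is to give each training example a weight that decays with its age. The sketch below computes exponential-decay weights from creation timestamps; the 90-day half-life and the function name are assumptions chosen for illustration and would need tuning on held-out data.

```python
from datetime import datetime, timedelta


def recency_weights(created_at, now, half_life_days=90.0):
    """Weight each example by 0.5 ** (age / half_life); newest examples get ~1.0."""
    weights = []
    for timestamp in created_at:
        # Clamp at zero so timestamps in the future do not get weights above 1.
        age_days = max((now - timestamp).total_seconds() / 86400.0, 0.0)
        weights.append(0.5 ** (age_days / half_life_days))
    return weights


now = datetime(2024, 1, 1)
created = [now, now - timedelta(days=90), now - timedelta(days=730)]
weights = recency_weights(created, now)
print(weights)  # first weight is 1.0, second is 0.5, third is much smaller
```

Weights like these can be passed through the `sample_weight` argument of `XGBClassifier.fit`, so that older issues still contribute to training but count for less.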
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.