Cluster words into groups of similar meaning (synonyms)

How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which are great but not perfect: a limitation arises because the word embeddings are based on surrounding words. This produces some challenging results. One example is polar meanings: word embeddings might find opposites to be similar. Even though these words are semantic opposites, they can quite readily be interchanged given the same preceding and following words. For example, "terrible" and …
Category: Data Science
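
A minimal sketch of the effect described in the question above, assuming a simple threshold-based grouping on cosine similarity. The three-dimensional vectors and the words are invented for illustration and do not come from any real embedding model; the antonym "cold" is deliberately placed close to "hot" to mimic what context-based embeddings tend to do.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cluster_by_threshold(vectors, threshold):
    """Single-link grouping: two words share a cluster if similarity >= threshold."""
    parent = {word: word for word in vectors}

    def find(word):
        # Follow parent links to the cluster representative.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for word in vectors:
        groups.setdefault(find(word), []).append(word)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy vectors, NOT taken from a real model.
toy_vectors = {
    "hot": [0.90, 0.10, 0.30],
    "warm": [0.88, 0.15, 0.28],
    "cold": [0.85, 0.20, 0.35],
    "banana": [0.10, 0.90, 0.20],
}
clusters = cluster_by_threshold(toy_vectors, threshold=0.95)
```

With these toy values the antonym ends up in the same group as the two near-synonyms, which is the failure mode the question describes. Telling synonyms from antonyms would need an extra signal beyond context similarity.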

How to match a word from column and compare with other column in pandas dataframe

I have the dataframe below: Text Keywords Type It’s a roll-on tube roll-on ball It is barrel barrel barr An unknown shape others it’s a assembly assembly assembly it’s a sealing assembly assembly factory its a roll-on double roll-on factory I have first found the keywords, and based on each keyword and its corresponding type, the result should be true or false. For example, when the keyword is roll-on, the type should be "ball" or "others"; when the keyword is …
Category: Data Science
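
One way to express the keyword-to-type check from the question above is a lookup table of allowed types per keyword. This is only a sketch: the roll-on rule is the one stated in the question, the rules for other keywords are cut off there, so the helper (a hypothetical name, `type_matches`) returns None for any keyword without a known rule.

```python
# Only the "roll-on" rule is given in the question; the other rules are truncated
# there, so they are intentionally left out rather than guessed.
ALLOWED_TYPES = {
    "roll-on": {"ball", "others"},
}


def type_matches(keyword, type_):
    """Return True/False when a rule exists for the keyword, else None."""
    allowed = ALLOWED_TYPES.get(keyword)
    if allowed is None:
        return None
    return type_ in allowed


rows = [
    {"Text": "It's a roll-on tube", "Keywords": "roll-on", "Type": "ball"},
    {"Text": "its a roll-on double", "Keywords": "roll-on", "Type": "factory"},
    {"Text": "It is barrel", "Keywords": "barrel", "Type": "barr"},
]
results = [type_matches(row["Keywords"], row["Type"]) for row in rows]
```

On a pandas dataframe the same function could be applied row by row, for example with df.apply(lambda r: type_matches(r["Keywords"], r["Type"]), axis=1).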

Can I get un-normalized vectors from the TF USE model?

I'm using this Universal Sentence Encoder (USE) model to get embeddings of a set of texts, each text corresponding to a newspaper article. In order to build a Recommender System, I generate user embeddings by averaging the embeddings of items a user has read, and then I look for other texts that are cosine-similar to this user (basically, the method returns a set of items that are similar to this user embedding). Now, the problem is that the mentioned model …
Category: Data Science
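
The recommendation method described in the question above (average the embeddings of read items, then rank the other items by cosine similarity) can be sketched as follows. The two-dimensional vectors and item ids are made up for illustration; a real USE embedding has far more dimensions. Note that cosine similarity ignores vector length, so the averaged user vector does not need to be re-normalised for this ranking.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recommend(item_embeddings, read_ids, top_k=2):
    """Rank unread items by cosine similarity to the mean of the read items."""
    user = mean_vector([item_embeddings[i] for i in read_ids])
    scored = [
        (cosine(user, vector), item_id)
        for item_id, vector in item_embeddings.items()
        if item_id not in read_ids
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical toy embeddings keyed by article id.
item_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [0.6, 0.8],
}
recommendations = recommend(item_embeddings, read_ids={"a", "b"}, top_k=2)
```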

How to extract and classify data from a column in excel?

I have a column in an Excel sheet that contains a lot of data separated by || delimiters. The data can be sorted into classes such as entity, IFSC code, transaction reference ID, etc. A single cell looks like this: EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5 Not every cell has the same number of classes or even the same types of classes. Another example: COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5 I tried extracting this information using regular expressions and …
Category: Data Science
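
A rough sketch of the split-then-classify idea from the question above: split each cell on the || delimiter and label every field with the first pattern that matches. The patterns and label names are assumptions for illustration. In particular, the IFSC pattern assumes the usual shape of four letters, a zero, and six alphanumeric characters, and would need checking against real data.

```python
import re

# Hypothetical patterns; each field gets the label of the first one that matches.
PATTERNS = [
    ("transaction_ref", re.compile(r"^TRN REF NO:\s*(\S+)$")),
    ("ifsc", re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")),
    ("number", re.compile(r"^\d+(\.\d+)?$")),
]


def classify_cell(cell):
    """Split a cell on '||' and return (label, field) pairs."""
    labelled = []
    for field in cell.split("||"):
        field = field.strip()
        label = "other"
        for name, pattern in PATTERNS:
            if pattern.match(field):
                label = name
                break
        labelled.append((label, field))
    return labelled


cell = (
    "EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S||"
    "some-name ||some-purpose||TRN REF NO:a1b2c3d4e5"
)
fields = classify_cell(cell)
```

Fields that match no pattern fall into "other", which makes it easy to review what the rules still miss.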

How to remove 'wordpress...' text from page titles in tabs

I am working on a site and sometimes I run into an error when logging out, and the site tab says 'WordPress Failure Notice'. I am trying to remove all instances of WordPress so users don't know I'm using it, but I cannot figure out how to remove the text from the tab. I don't have any code to show because I'm not even sure where to start. The text shows up on the wp-login.php …
Category: Web

Advantages of CNN vs. LSTM for sequence data like text or log-files

When do you tend to use CNN rather than LSTM (or the other way round) in classification or generation tasks on sequential data like text or log data? What are the reasons for the decision and what does it depend on? Are there any papers or statistics that confirm this? I'm thinking of data like Linux log entries or short sentences of fewer than 20 words/tokens. Personally I would almost always use LSTM, but I'm curious if CNN wouldn't …
Category: Data Science

What is the logic/algorithm behind 'did you mean' suggestion by search engines, command suggestion in command prompt like git?

For example, https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work explains the logic behind Google's "did you mean" algorithm, which is used for spelling-correction suggestions. What algorithm do other search applications use for spell correction or for finding similar text, e.g. a music/OTT search app such as Amazon Music? Similarly, what logic is used in the case of git commands? How does one usually work out the algorithm behind an application from its usage? Any general ideas will also …
Category: Data Science
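
A common building block for "did you mean" suggestions is edit distance: suggest the known term that needs the fewest single-character edits to reach. The sketch below uses plain Levenshtein distance and a hypothetical command list. It is not claimed to be what git or any particular search engine actually implements, and real systems usually also weigh term popularity and query logs.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def did_you_mean(typed, known, max_distance=2):
    """Return the closest known term, or None if nothing is close enough."""
    best = min(known, key=lambda term: edit_distance(typed, term))
    return best if edit_distance(typed, best) <= max_distance else None


# Hypothetical list of valid commands.
commands = ["commit", "checkout", "status", "stash", "push", "pull"]
suggestion = did_you_mean("comit", commands)
```

Python's standard library also offers difflib.get_close_matches, which ranks candidates by a similarity ratio instead of an edit count.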

How to implement hierarchical labeling classification?

I am currently working on the task of eCommerce product name classification, so I have categories and subcategories in the product data. I noticed that using subcategories as labels delivers worse results (84% acc) than categories (94% acc). But subcategories are more precise as labels, which is important for the whole task. I then had the idea to first do category classification and then, based on the results, continue with subcategories within the predicted category. The problem here is that I …
Category: Data Science
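
The two-stage idea from the question above can be sketched as a router: one classifier picks the category, then a category-specific classifier picks the subcategory. The keyword-based "classifiers" and the category names below are hypothetical stand-ins for trained models.

```python
def predict_hierarchical(text, category_clf, subcategory_clfs):
    """Predict a category first, then a subcategory within that category."""
    category = category_clf(text)
    subcategory = subcategory_clfs[category](text)
    return category, subcategory


# Hypothetical toy classifiers; real ones would be trained models.
def toy_category_clf(text):
    return "electronics" if "usb" in text.lower() else "clothing"


toy_subcategory_clfs = {
    "electronics": lambda t: "cables" if "cable" in t.lower() else "chargers",
    "clothing": lambda t: "shirts" if "shirt" in t.lower() else "trousers",
}

prediction = predict_hierarchical(
    "USB-C cable 2m", toy_category_clf, toy_subcategory_clfs
)
```

One property of this design is that a wrong category at the first stage cannot be corrected at the second, so the category accuracy is an upper bound on the end-to-end subcategory accuracy.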

Clustering Tweet Data using DBSCAN Algorithm

I am clustering tweets using the DBSCAN algorithm. I apply all the preprocessing steps and convert the sentences to vectors before applying the algorithm. However, it always puts a lot of tweets into the '0' class. The following is the plot showing eps against the number of clusters. The following are the parameters that I pass: dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x) The following are the resulting clusters: label -1: 1221, 0: 1349, 1: 2, 2: 2, 3: 4, ... …
Category: Data Science
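
One common heuristic for choosing eps, offered here as a sketch and not as a diagnosis of the result above, is to look at the sorted distance from each point to its k-th nearest neighbour and pick eps near the "elbow" of that curve. The version below uses cosine distance to match metric='cosine'; the two-dimensional points are invented. In scikit-learn's DBSCAN, min_samples counts the point itself, so min_samples=2 corresponds to k=1 here, and the label -1 marks noise points.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the distance used with metric='cosine'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def k_distance_curve(points, k):
    """Sorted distance from every point to its k-th nearest neighbour."""
    curve = []
    for i, p in enumerate(points):
        distances = sorted(
            cosine_distance(p, q) for j, q in enumerate(points) if j != i
        )
        curve.append(distances[k - 1])
    return sorted(curve)


# Hypothetical toy vectors: two tight groups and one isolated point.
points = [
    [1.0, 0.0], [0.99, 0.1], [0.98, 0.15],
    [0.0, 1.0], [0.1, 0.99],
    [0.7, 0.7],
]
curve = k_distance_curve(points, k=1)
```

A sharp jump near the end of the curve suggests a distance beyond which points are isolated; an eps well above that jump tends to merge most points into one large cluster.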

Needed: Java library to calculate text readability/complexity

In principle the same as this but for Java (and ideally for multiple languages), e.g. Flesch reading ease, SMOG index, Flesch-Kincaid grade, Coleman-Liau index, automated readability index, Dale-Chall readability score, Linsear Write formula, Gunning fog, etc. I guess there must be plenty of libs but I just can't find them ...
Topic: text java nlp
Category: Data Science
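
For reference, the first metric in that list is a simple formula: Flesch reading ease = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below is in Python and not Java, purely to illustrate the computation. The syllable counter is a naive assumption (it counts groups of vowels), so its scores will differ from those of a dictionary-based library.

```python
import re


def count_syllables(word):
    """Naive estimate: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch reading ease for English text; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(word) for word in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


score = flesch_reading_ease("The cat sat on the mat.")
```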

What methods to create singular content classification from inconsistent inbound info?

I am attempting to aggregate professional profile info from multiple sources, imposing a consistent taxonomy. Specifically, the current problem is how to impose a preferred taxonomy on profiles with inconsistent or absent inbound taxonomy terms. The primary source of profile info is biography pages on people's employer websites. Some of those sites state employees' multiple specialist topics, some make only narrative biographies available, and some do both. I have collected all available info, using Python's Scrapy, into CSV files - …
Category: Data Science

Data transformations in hierarchical classification

I am building a hierarchical text classifier using the Local Classifier Per Parent Node (LCPN) approach with the 'siblings' policy, as described in "A survey of hierarchical classification across different application domains". E.g. if we have the classes 1.1, 1.2, 2.1, 2.2, 2.3, then at the first level we use the whole training set to train a classifier to distinguish between class 1 (1.1, 1.2) and class 2 (2.1, 2.2, 2.3); at the second level we use two multiclass classifiers, the first one …
Category: Data Science
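
The training-set split described in the question above can be sketched like this, assuming a two-level hierarchy whose leaf labels have the form "parent.child". The root classifier sees every sample labelled with its parent class, and each parent node gets only the samples that belong to its own children. The function name and sample texts are hypothetical.

```python
from collections import defaultdict


def build_lcpn_training_sets(samples):
    """samples: list of (text, leaf_label) pairs with labels such as '2.3'.

    Returns a dict mapping 'root' and each parent class to its training pairs.
    """
    training_sets = defaultdict(list)
    for text, leaf in samples:
        parent = leaf.split(".")[0]
        training_sets["root"].append((text, parent))  # level 1: parent classes
        training_sets[parent].append((text, leaf))    # level 2: siblings only
    return dict(training_sets)


# Hypothetical samples using the class names from the question.
samples = [
    ("doc a", "1.1"),
    ("doc b", "1.2"),
    ("doc c", "2.1"),
    ("doc d", "2.3"),
]
training_sets = build_lcpn_training_sets(samples)
```

Any data transformation fitted on the training data (for example a vocabulary) can then be fitted per node on the corresponding subset, or once on the full set and shared; which is better is an empirical question.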

How to remove irrelevant text data from a large dataset

I am working on an ML project where the data come from social media, and the topic of the data should be depression under Covid-19. However, when I read some of the retrieved data, I noticed that even though some of the texts (around 1-5 %) mention covid-related keywords, the context of those texts is not actually the pandemic: they tell a life story (from 5 years old to 27 years old) instead of how covid affects their lives. The data I …
Category: Data Science

How to convert a string variable containing comments to a variable with integers to be used in neural networks?

I am working with data that contains a comment variable, like the IMDB data. imdb <- dataset_imdb(num_words = 500) c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb train_x[[3]] These are movie reviews, so they contained actual English text. However, train_x[[3]] gives a vector of integers. I don't have enough experience with strings in R and would like to convert a vector of comment data to a vector of integers based on the overall frequency in that vector. I cannot share a sample of my …
Category: Data Science
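
The general idea behind such an integer encoding can be sketched as follows: count every word across all comments, give the most frequent word index 1, the next index 2, and so on, and map anything outside the top num_words to a reserved "unknown" index. This is a Python sketch of the technique, not R, and it is not meant to reproduce the exact indices of dataset_imdb; the comments are invented.

```python
from collections import Counter


def build_vocab(comments, num_words):
    """Map the num_words most frequent words to ranks 1..num_words."""
    counts = Counter(word for c in comments for word in c.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ranked[:num_words], start=1)}


def encode(comment, vocab, unknown=0):
    """Replace each word with its rank; out-of-vocabulary words become `unknown`."""
    return [vocab.get(word, unknown) for word in comment.lower().split()]


# Hypothetical comments.
comments = ["great movie great cast", "bad movie", "great plot"]
vocab = build_vocab(comments, num_words=3)
encoded = [encode(c, vocab) for c in comments]
```

Real pipelines usually also strip punctuation and pad or truncate the integer sequences to a fixed length before feeding them to a network.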

NLP text representation techniques that preserve word order in sentence?

I see people talking mostly about bag-of-words, tf-idf and word embeddings. But these work at the word level. BoW and tf-idf fail to represent word order, and word embeddings are not meant to represent any order at all. What's the best practice/most popular way of representing word order for texts of varying lengths? Simply concatenating the word embeddings of individual words into a long vector apparently does not work for texts of varying lengths... Or does there exist no method of doing that except …
Category: Data Science
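
One simple representation that keeps some local word order while still giving a fixed-size vector for texts of any length is a bag of n-grams. The sketch below only illustrates that point with whitespace tokenisation, which is an assumption; it is not presented as the best practice the question asks about.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count contiguous word n-grams; n=1 is the ordinary bag of words."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


unigrams_a = ngram_counts("dog bites man", n=1)
unigrams_b = ngram_counts("man bites dog", n=1)
bigrams_a = ngram_counts("dog bites man", n=2)
bigrams_b = ngram_counts("man bites dog", n=2)
```

The two sentences have identical unigram counts but different bigram counts, so the bigram representation distinguishes them where plain bag-of-words cannot.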

Extracting structure and content from invoices

Lately, I have been largely inspired by this https://rossum.ai/, which is able to extract text from invoice documents. Do you have any ideas on how this could be implemented? It's clear that they did a lot of research to reach this performance level, but in my case I am interested in the overall approach to such problems. If I understand correctly, the first part of the pipeline is to extract different blocks from the document. In that case, is object …
Category: Data Science

Clustering mixed data types - numeric, categorical, arrays, and text

I have a dataset with 4 types of data columns:

id  numeric  categorical  tags            text
1   51585    27           [A, B, C, ...]  "Some text bla bla bla"
2   53596    27           [B, D, E]       "Other text..."
3   1176345  27           [D, A, F, ...]  "..."
4   168      24           NaN             "..."
5   88564    22           NaN             "..."

numeric - continuous numeric values.
categorical - discrete categories, either numbers or strings (the type doesn't really matter because I can convert it to whatever works)
tags …
Category: Data Science
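
One possible starting point for mixed columns is a Gower-style distance: compute a distance in [0, 1] per column type and average them. The sketch below covers three of the four column types (range-scaled difference for numeric, mismatch for categorical, Jaccard distance for tags) and leaves the text column out, since that would need an embedding or tf-idf step. The rows, the numeric range, and the choice to treat two missing tag lists as distance 0 are all assumptions.

```python
def mixed_distance(a, b, numeric_range):
    """Average of per-column distances, each scaled to [0, 1]."""
    d_numeric = abs(a["numeric"] - b["numeric"]) / numeric_range
    d_categorical = 0.0 if a["categorical"] == b["categorical"] else 1.0

    # Missing tags (None) are treated as an empty set: an assumption.
    tags_a, tags_b = set(a["tags"] or []), set(b["tags"] or [])
    union = tags_a | tags_b
    d_tags = 1.0 - len(tags_a & tags_b) / len(union) if union else 0.0

    return (d_numeric + d_categorical + d_tags) / 3


# Hypothetical rows in the same shape as the table in the question.
row_1 = {"numeric": 10, "categorical": 27, "tags": ["A", "B"]}
row_2 = {"numeric": 20, "categorical": 27, "tags": ["B", "C"]}
row_3 = {"numeric": 10, "categorical": 24, "tags": None}

d_12 = mixed_distance(row_1, row_2, numeric_range=100)
d_13 = mixed_distance(row_1, row_3, numeric_range=100)
```

A precomputed matrix of such distances can then be passed to any clustering method that accepts one, such as hierarchical clustering or DBSCAN.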

How to down/up sample text?

I have a data set of 5566 samples: one column is the text of the receipt description and the other is its tax class. I wish to make a classifier that would classify receipts using ML only. I have a huge imbalance in the data. What is a good method for dealing with this kind of data? How should I downsample or upsample? From what I understand, SMOTE will not work.
Category: Data Science
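
A minimal sketch of random oversampling for text, assuming the data is a list of (text, label) pairs: minority classes are topped up by duplicating randomly chosen samples until every class is as large as the biggest one. Unlike SMOTE it does not interpolate between feature vectors, so it works directly on raw text. The sample data and labels are invented.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical imbalanced data: four samples of one class, one of another.
samples = [
    ("milk 1l", "food"), ("bread", "food"), ("cheese", "food"),
    ("apples", "food"), ("printer ink", "office"),
]
balanced = random_oversample(samples)
label_counts = Counter(label for _, label in balanced)
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated samples do not leak into the evaluation data. Class weights in the classifier are an alternative that avoids duplicating data.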

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.