text - Geeks Mental

Cluster words into groups of similar meaning (synonyms)

Ben

May 31, 2022, 22:08

How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which work well but are not perfect: because the embeddings are learned from surrounding words, they produce some problematic results. For example, polar meanings: word embeddings might find opposites to be similar. Even though these words are semantic opposites, they can readily be interchanged given the same preceding and following words. For example, "terrible" and …

Topic: semantic-similarity text word-embeddings nlp clustering

Category: Data Science

How to match a word from one column and compare it with another column in a pandas dataframe

anagha s

May 31, 2022, 14:13

I have the dataframe below, with columns Text, Keywords, and Type: It’s a roll-on tube roll-on ball It is barrel barrel barr An unknown shape others it’s a assembly assembly assembly it’s a sealing assembly assembly factory its a roll-on double roll-on factory. I first extracted the keywords; then, based on each keyword and its corresponding type, the result should be true or false. For example, when the keyword is roll-on, the type should be "ball" or "others"; when the keyword is …

Topic: dataframe text pandas python

Category: Data Science

Can I get un-normalized vectors from the TF USE model?

kevin_was_here

May 25, 2022, 13:05

I'm using this Universal Sentence Encoder (USE) model to get embeddings of a set of texts, each text corresponding to a newspaper article. In order to build a Recommender System, I generate user embeddings by averaging the embeddings of items a user has read, and then I look for other texts that are cosine-similar to this user (basically, the method returns a set of items that are similar to this user embedding). Now, the problem is that the mentioned model …

Topic: embeddings tensorflow text word-embeddings recommender-system

Category: Data Science
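
A minimal sketch of the approach described in the question above: a user embedding is the mean of the item embeddings the user has read, and candidates are ranked by cosine similarity to it. The vectors are toy 3-dimensional values invented for illustration (not USE output), and the names mean_vector, cosine_similarity, and recommend are hypothetical helpers.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(user_history, candidates, top_k=2):
    """Rank candidate items by similarity to the averaged user embedding."""
    user = mean_vector(user_history)
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(user, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy embeddings (hypothetical values for illustration only).
read_items = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
candidates = {
    "sports": [0.9, 0.1, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "tech": [0.0, 0.0, 1.0],
}
print(recommend(read_items, candidates, top_k=1))  # ['sports']
```

Cosine similarity ignores vector length, so in this sketch the scale of the item vectors matters only in the averaging step, not in the final ranking.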

How to extract and classify data from a column in excel?

Arjun Arora

May 23, 2022, 03:07

I have a column in an Excel sheet that contains a lot of data separated by || delimiters. The data can be classified into classes such as entity, IFSC code, transaction reference ID, etc. A single cell looks like this:
EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5
Not every cell has the same number of classes or even the same types of classes. Another example:
COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5
I tried extracting this information using regular expressions and …

Topic: text preprocessing named-entity-recognition classification python

Category: Data Science
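
One possible starting point for the question above is to split each cell on the || delimiter and label every field with the first regular expression that matches. This is only a sketch: the class names and patterns are assumptions made for illustration (the IFSC pattern follows the usual format of four letters, a zero, then six alphanumeric characters), and real bank narrations would need more rules.

```python
import re

# Hypothetical label/pattern pairs, checked in order; the first match wins.
PATTERNS = [
    ("transaction_ref", re.compile(r"^TRN REF NO:\s*\w+$")),
    ("ifsc", re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")),
    ("amount", re.compile(r"^\d+\.\d{2}$")),
    ("number", re.compile(r"^\d+$")),
]


def classify_field(field):
    """Return the label of the first matching pattern, or 'other'."""
    for label, pattern in PATTERNS:
        if pattern.match(field):
            return label
    return "other"


def parse_cell(cell):
    """Split a cell on '||' and pair each non-empty field with a label."""
    fields = [part.strip() for part in cell.split("||")]
    return [(field, classify_field(field)) for field in fields if field]


cell = (
    "EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S"
    "||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5"
)
for field, label in parse_cell(cell):
    print(f"{label:16} {field}")
```

Fields that fall through to "other" (free text such as names or purposes) are the ones where a trained classifier or NER model might help more than hand-written rules.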

How to remove 'wordpress...' text from page titles in tabs

730wavy

May 21, 2022, 09:32

I am working on a site, and sometimes I run into an error when logging out, and the site tab says 'WordPress Failure Notice'. I am trying to remove all instances of WordPress so users don't know I'm using it, but I cannot figure out how to remove the text from the tab. I don't have any code to show because I'm not even sure where to start. The text shows up on the wp-login.php …

Topic: notices logout text login Wordpress

Category: Web

Advantages of CNN vs. LSTM for sequence data like text or log-files

moooo112

May 20, 2022, 02:07

When do you tend to use a CNN rather than an LSTM (or the other way round) in classification or generation tasks on sequential data like text or log data? What are the reasons for the decision, and what does it depend on? Are there any papers or statistics that confirm this? I'm thinking of data like Linux log entries or short sentences of fewer than 20 words/tokens. Personally I would almost always use LSTM but I'm curious if CNN wouldn't …

Topic: cnn lstm text sequence deep-learning

Category: Data Science

What is the logic/algorithm behind 'did you mean' suggestion by search engines, command suggestion in command prompt like git?

jarvis

May 19, 2022, 14:47

For example, https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work describes the logic behind Google's "did you mean" algorithm, which is used for spelling-correction suggestions. What algorithm do other search applications use for spelling correction or for finding similar text, for example a music/OTT search app such as Amazon Music? Similarly, what logic is used for git command suggestions? How does one usually work backward from an application's behavior to the algorithm behind it? Any general ideas will also …

Topic: text nlp similarity search

Category: Data Science
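
One common building block for suggestions of this kind is edit distance: rank the known vocabulary by how many single-character edits separate each entry from the user's input. The sketch below uses plain Levenshtein distance and is not a claim about what Google, Amazon Music, or git actually implement; did_you_mean, COMMANDS, and the max_distance threshold are hypothetical choices for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def did_you_mean(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, cand), cand) for cand in vocabulary)
    return [cand for dist, cand in scored if dist <= max_distance]


COMMANDS = ["commit", "checkout", "clone", "config", "status"]
print(did_you_mean("comit", COMMANDS))  # ['commit']
```

Production systems typically combine a distance like this with other signals, such as how frequently each candidate is queried, but those details vary by application.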

How to implement hierarchical labeling classification?

chacid

May 13, 2022, 21:03

I am currently working on eCommerce product name classification, so I have categories and subcategories in the product data. I noticed that using subcategories as labels delivers worse results (84% acc) than categories (94% acc). But subcategories are more precise as labels, which is important for the whole task. So I got the idea to first do category classification and then, based on the results, continue with subcategories within the predicted category. The problem here is that I …

Topic: keras text neural-network classification nlp

Category: Data Science

Clustering Tweet Data using DBSCAN Algorithm

Nilani Algiriyage

April 29, 2022, 20:22

I am clustering tweets using the DBSCAN algorithm. I apply all the preprocessing steps and convert the sentences to vectors before applying the algorithm. However, it always puts a lot of tweets into the '0' class. The following is the plot showing eps against the number of clusters. The following are the parameters that I pass: dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x) The following are the resulting clusters (label: count): -1: 1221; 0: 1349; 1: 2; 2: 2; 3: 4; ... …

Topic: python-3.x text dbscan scikit-learn clustering

Category: Data Science

Needed: Java library to calculate text readability/complexity

lordy

April 29, 2022, 14:07

In principle the same as this but for Java (and ideally for multiple languages), e.g. Flesch reading ease, SMOG index, Flesch-Kincaid grade, Coleman-Liau index, automated readability index, Dale-Chall readability score, Linsear Write formula, Gunning fog, etc. I guess there must be plenty of libraries, but I just can't find them ...

Topic: text java nlp

Category: Data Science

What methods to create singular content classification from inconsistent inbound info?

Robert Andrews

April 28, 2022, 13:00

I am attempting to aggregate professional profile info from multiple sources, imposing a consistent taxonomy. Specifically, the current problem is how to impose a preferred taxonomy on profiles with inconsistent or absent inbound taxonomy terms. The primary source of profile info is biography pages on people's employer websites. Some of those sites state employees' multiple specialist topics, some make only narrative biographies available, some both. I have collected all available info, using Python's Scrapy, into CSV files - …

Topic: text classification

Category: Data Science

Change the "Register" headline in Woocommerce

Nostradamyys

April 22, 2022, 15:12

I want to change the Register text on the "My account" page, but I can't figure out how to do it.

Topic: woocommerce-offtopic text Wordpress

Category: Web

Data transformations in hierarchical classification

matentzn

April 11, 2022, 19:00

I am building a hierarchical text classifier using the Local Classifier Per Parent Node (LCPN) approach with the 'siblings' policy, as described in "A survey of hierarchical classification across different application domains". E.g., if we have the classes 1.1, 1.2, 2.1, 2.2, 2.3, then at the first level we use the whole training set to train a classifier to distinguish between class 1 (1.1, 1.2) and class 2 (2.1, 2.2, 2.3); at the second level we use two multiclass classifiers, the first one …

Topic: text multiclass-classification classification

Category: Data Science
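
The data split described in the question above can be sketched as follows: the root classifier sees every sample relabeled with its parent class, and each parent node gets its own training set containing only the samples that fall under it. This is a sketch assuming dotted labels such as "2.3" as in the excerpt; parent_of, build_lcpn_training_sets, and predict are hypothetical helper names, and the classifiers are passed in as plain callables rather than any specific library's models.

```python
from collections import defaultdict


def parent_of(label):
    """Map a leaf label such as '2.3' to its parent '2' (assumed format)."""
    return label.split(".")[0]


def build_lcpn_training_sets(samples):
    """Split (text, leaf_label) pairs into root and per-parent training sets."""
    # Level 1: every sample, relabeled with its parent class.
    root_set = [(text, parent_of(label)) for text, label in samples]
    # Level 2: one training set per parent, keeping the original leaf labels.
    child_sets = defaultdict(list)
    for text, label in samples:
        child_sets[parent_of(label)].append((text, label))
    return root_set, dict(child_sets)


def predict(text, root_clf, child_clfs):
    """Top-down prediction: pick a parent, then let its classifier decide."""
    parent = root_clf(text)
    return child_clfs[parent](text)


# Toy data (invented for illustration).
samples = [
    ("a", "1.1"), ("b", "1.2"),
    ("c", "2.1"), ("d", "2.2"), ("e", "2.3"),
]
root_set, child_sets = build_lcpn_training_sets(samples)
print(root_set)         # [('a', '1'), ('b', '1'), ('c', '2'), ('d', '2'), ('e', '2')]
print(child_sets["2"])  # [('c', '2.1'), ('d', '2.2'), ('e', '2.3')]
```

Any feature transformation fitted per level would be fitted on the matching training set from this split; whether to share one transformation across levels is a separate design choice.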

How to utilize dictionary data set for text classification?

Ananthakrishnan M A

April 7, 2022, 22:05

I have a dataset similar to newsgroup20 for classification. Along with the training dataset, I have a dictionary dataset that explains some of the jargon in the training dataset. These are two separate datasets, so how can I use the dictionary dataset to improve my model's accuracy?

Topic: text word2vec word-embeddings classification nlp

Category: Data Science

How to remove irrelevant text data from a large dataset

zxcisnoias

April 5, 2022, 18:04

I am working on an ML project where the data come from social media, and the topic of the data should be depression during Covid-19. However, when I read some of the retrieved data, I noticed that even though some of the texts (around 1-5%) mention covid-related keywords, their context is not actually the pandemic; they tell a life story (from 5 years old to 27 years old) instead of describing how covid affects their lives. The data I …

Topic: text nlp data-cleaning machine-learning

Category: Data Science

How to convert a string variable containing comments to a variable with integers to be used in neural networks?

Mustafa Kamal

April 3, 2022, 19:42

I am working with data that contains a comment variable, like the IMDB data.
imdb <- dataset_imdb(num_words = 500)
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb
train_x[[3]]
These are movie reviews, so they contain actual English text. However, train_x[[3]] gives a vector of integers. I don't have enough experience with strings in R and would like to convert a vector of comment data to a vector of integers based on the overall frequency in that vector. I cannot share a sample of my …

Topic: lstm text text-mining r

Category: Data Science
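
The idea of frequency-based integer encoding can be sketched as below. Note that this sketch is in Python rather than R, and it only illustrates the general technique (rank words by frequency, then replace each word with its rank); it does not reproduce the exact indexing scheme of dataset_imdb. The helper names build_vocab and encode, the whitespace tokenization, and the out-of-vocabulary index 0 are all assumptions.

```python
from collections import Counter


def build_vocab(comments, num_words=None):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for c in comments for word in c.lower().split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    if num_words is not None:
        ranked = ranked[:num_words]  # keep only the most frequent words
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(comment, vocab, oov_index=0):
    """Replace each word with its rank; unknown words get oov_index."""
    return [vocab.get(word, oov_index) for word in comment.lower().split()]


comments = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(comments)
print(vocab)                             # {'good': 1, 'movie': 2, 'bad': 3, 'plot': 4}
print(encode("good plot twist", vocab))  # [1, 4, 0]
```

Passing num_words limits the vocabulary to the most frequent words, so rarer words fall back to the out-of-vocabulary index.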

NLP text representation techniques that preserve word order in sentence?

Hing

April 3, 2022, 11:44

I see people talking mostly about bag-of-words, tf-idf, and word embeddings. But these work at the word level. BoW and tf-idf fail to represent word order, and word embeddings are not meant to represent order at all. What's the best practice or most popular way of representing word order for texts of varying lengths? Simply concatenating the embeddings of individual words into one long vector apparently does not work for texts of varying lengths... Or is there no method of doing that except …

Topic: text feature-engineering text-mining feature-extraction nlp

Category: Data Science

Extracting structure and content from invoices

Don Draper

March 31, 2022, 14:01

Lately, I have been inspired by https://rossum.ai/, which is able to extract text from invoice documents. Do you have any ideas on how this could be implemented? It's clear that they did a lot of research to reach this performance level, but in my case I am interested in the overall approach to such problems. If I understand correctly, the first part of the pipeline is to extract different blocks from the document. In that case, is object …

Topic: object-detection ocr text

Category: Data Science

Clustering mixed data types - numeric, categorical, arrays, and text

Malki

March 30, 2022, 08:06

I have a dataset with 4 types of data columns:
id  numeric  categorical  tags            text
1   51585    27           [A, B, C, ...]  "Some text bla bla bla"
2   53596    27           [B, D, E]       "Other text..."
3   1176345  27           [D, A, F, ...]  "..."
4   168      24           NaN             "..."
5   88564    22           NaN             "..."
numeric - continuous numeric values.
categorical - discrete categories, either numbers or strings (the type doesn't really matter because I can convert it to whatever works)
tags …

Topic: text nlp categorical-data k-means clustering

Category: Data Science

How to down/up sample text?

JamseGoldman

March 27, 2022, 11:02

I have a dataset of 5566 samples: one column is the text of the recipe description and the other is its tax class. I want to build a classifier that classifies receipts using ML only. I have a huge imbalance in the data. What is a good method for dealing with this kind of data? How should I downsample or upsample? From what I understood, SMOTE will not work.

Topic: text-classification text

Category: Data Science
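
One simple option for the question above is random resampling on the raw texts: classes above a target size are downsampled without replacement, and classes below it are upsampled by duplicating randomly chosen examples. The sketch below is only an illustration; the resample function, the target size, and the toy data are hypothetical, and duplicated texts add no new information.

```python
import random
from collections import Counter, defaultdict


def resample(texts, labels, target, seed=0):
    """Return texts/labels with exactly `target` examples per class."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Downsample: draw `target` distinct examples.
            chosen = rng.sample(items, target)
        else:
            # Upsample: keep all examples, then add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_texts.extend(chosen)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy imbalanced data (invented for illustration).
texts = [f"doc {i}" for i in range(10)]
labels = ["A"] * 8 + ["B"] * 2
new_texts, new_labels = resample(texts, labels, target=4)
print(Counter(new_labels))
```

Resampling is usually applied to the training split only, so that duplicated examples do not leak into validation or test data.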
