Initializing weights that are a pointwise product of multiple variables

In two-layer perceptrons that slide across words of text, such as word2vec and fastText, the hidden layer weights may be a product of two random variables, such as positional embeddings and word embeddings (Mikolov et al. 2017, Section 2.2): $$v_c = \sum_{p\in P} d_p \odot u_{t+p}$$ However, it's unclear to me how best to initialize the two variables. When only word embeddings are used for the hidden layer weights, word2vec and fastText initialize them to $\mathcal{U}(-1 / \text{fan\_out};\ 1 / \text{fan\_out})$. …
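To make the question concrete, here is a minimal numpy sketch of what I mean, assuming the same uniform initialization is applied to both factors (dimensions and names are only illustrative):

import numpy as np

dim = 300            # embedding dimension (illustrative)
vocab_size = 10000   # number of words (illustrative)
window = 5           # positions p in P (illustrative)

# Word embeddings u and positional embeddings d, both initialized
# uniformly in (-1/fan_out, 1/fan_out) as word2vec/fastText do when only
# word embeddings are used; whether this is right for the product is
# exactly the open question.
bound = 1.0 / dim
u = np.random.uniform(-bound, bound, size=(vocab_size, dim))
d = np.random.uniform(-bound, bound, size=(2 * window, dim))

# Hidden representation for one context: sum over positions of the
# pointwise product d_p * u_{t+p}.
context_ids = np.random.randint(vocab_size, size=2 * window)
v_c = np.sum(d * u[context_ids], axis=0)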
Category: Data Science

Keep word2vec/fasttext model loaded in memory without using API

I have to use a fastText model to return word embeddings. In testing I was calling it through an API. Since there are too many words to compute embeddings for, the API calls turn out to be expensive. I would like to use fastText without the API. For that I need to load the model once and keep it in memory for further calls. How can this be done without using the API? Any help is highly appreciated.
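To make the intent concrete, a minimal sketch of what I have in mind, assuming the official fasttext Python package and a locally downloaded .bin model (the file name is only illustrative); the model is loaded once and reused for every lookup:

import fasttext

# Load the binary model a single time and keep the handle around for the
# lifetime of the process, instead of reloading it per request.
model = fasttext.load_model("cc.en.300.bin")  # illustrative path

def embed(word):
    # get_word_vector works for in-vocabulary and OOV words alike,
    # since fastText composes subword n-grams.
    return model.get_word_vector(word)

vec = embed("example")
print(vec.shape)  # (300,) for the cc.en.300 model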
Category: Data Science

FastText Model Explained

I was reading the FastText paper and I have a few questions about the model used for classification. Since I am not from an NLP background, I am unfamiliar with some of the jargon. In the figure, what exactly are the $x_i$? I am not sure what "$N$ ngram features" means. If my document has $L$ words in total, how can I represent the entire document using $N$ variables ($x_1, \ldots, x_N$)? What exactly is $N$? $$-\frac{1}{N}\sum_{n=1}^N y_n \log(f(BAx_n))$$ If $y_n$ is the label, …
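For concreteness, here is my current reading of the classifier inside that loss as a small numpy sketch, assuming $x_n$ is a (normalized) bag-of-ngrams vector, $A$ the embedding matrix, $B$ the classifier weights and $f$ the softmax (all dimensions are illustrative):

import numpy as np

vocab_ngrams = 1000   # size of the n-gram vocabulary (illustrative)
dim = 10              # embedding dimension (illustrative)
num_classes = 3       # number of labels (illustrative)

A = np.random.randn(dim, vocab_ngrams) * 0.01   # n-gram embeddings
B = np.random.randn(num_classes, dim) * 0.01    # classifier weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# x: averaged bag-of-ngrams representation of one document
x = np.zeros(vocab_ngrams)
x[[3, 17, 512]] = 1.0
x /= x.sum()

probs = softmax(B @ (A @ x))   # f(B A x): one probability per class
print(probs)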
Category: Data Science

DBSCAN getting one huge cluster with noisy points

I'm currently trying to cluster customer service email answers (NLP). When I use DBSCAN with TF-IDF embeddings + Annoy indexes, I get good clusters. But when I use DBSCAN with FastText embeddings + Annoy indexes, I get good clusters except for the cluster with label zero (0), which seems to include lots of noisy points (that should be labeled -1 instead of 0). Does anyone have an idea of what this could be? I'm using eps=0.5 in both cases.
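For reference, a minimal sketch of my setup, assuming dense fastText document vectors and scikit-learn's DBSCAN with cosine distance (the data and the eps sweep are only illustrative); sweeping eps per embedding type is how I am checking whether 0.5 is simply too large for the fastText space:

import numpy as np
from sklearn.cluster import DBSCAN

# X: one fastText (or TF-IDF) vector per email, shape (n_emails, dim)
X = np.random.rand(200, 300)  # illustrative stand-in for real embeddings

for eps in (0.2, 0.3, 0.4, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5, metric="cosine").fit_predict(X)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")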
Category: Data Science

Should I use Pad Sequence when using Word Vectors?

I have an unbalanced text data set. I want to use word vectors to embed words. When should I apply pad sequence: before or after looking up the word vectors? I tried applying pad sequence after the word vectors, but my model accuracy was low. If I apply pad sequence before the word vectors, how do I interpret the result? Both pad sequence and the word vectors give me numeric results, yet the word vector input is a token, while the pad sequence input …
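To illustrate the order I have tried, a minimal Keras sketch where the integer token ids are padded first and the (pretrained) word vectors are looked up afterwards inside an Embedding layer (vocabulary size, dimensions and the embedding matrix are only illustrative):

import numpy as np
from tensorflow.keras import layers, models, initializers
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, dim, max_len = 5000, 100, 50            # illustrative sizes
embedding_matrix = np.random.rand(vocab_size, dim)  # stand-in for real word vectors

# 1) pad the integer token sequences ...
sequences = [[4, 12, 7], [9, 2, 2, 31, 5]]
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

# 2) ... then map the padded ids to word vectors inside the model.
model = models.Sequential([
    layers.Embedding(vocab_size, dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     mask_zero=True, trainable=False),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(padded, labels, ...) would then consume the padded ids directly.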
Category: Data Science

Training fasttext on your own corpus

I want to train fastText on my own corpus. However, I have a small question before continuing. Do I need each sentence as a separate item in the corpus, or can I have many sentences as one item? For example, I have this DataFrame:

text                                               | summary
---------------------------------------------------|---------------
this is sentence one this is sentence two continue | one two other
other similar sentences some other                 | word word sent

Basically, the column text is an article, so it has many …
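For reference, a minimal sketch of what I currently do with gensim's FastText, treating each row of the text column as one tokenized item (splitting rows into individual sentences first is the alternative I am asking about):

import pandas as pd
from gensim.models import FastText

df = pd.DataFrame({
    "text": ["this is sentence one this is sentence two continue",
             "other similar sentences some other"],
    "summary": ["one two other", "word word sent"],
})

# One tokenized list per row of the text column.
corpus = [row.split() for row in df["text"]]

model = FastText(vector_size=100, window=5, min_count=1)
model.build_vocab(corpus_iterable=corpus)
model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=10)

print(model.wv["sentence"][:5])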
Category: Data Science

How can I use Ensemble learning of two models with different features as an input?

I have a fake news detection problem: it predicts the binary labels "1" & "0" by vectorizing the 'tweet' column. I use three different models for detection, but I want to use an ensemble method to increase the accuracy, even though they use different vectorizers. I have 3 KNN models; the first and the second vectorize the 'tweet' column using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_train = vector.fit_transform(X_train['tweet']).toarray()
X_test = vector.transform(X_test['tweet']).toarray()  # transform only, so test shares the fitted vocabulary

For the third model I …
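One direction I am considering is wrapping each vectorizer + KNN pair in its own scikit-learn Pipeline and combining the pipelines with a soft VotingClassifier, so every member keeps its own features while the ensemble sees the same raw 'tweet' text; a minimal sketch (the character-level TF-IDF pipeline is only an illustrative stand-in for my third model):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# Each member pipeline owns its vectorizer.
knn_word = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 3)),
    KNeighborsClassifier(n_neighbors=5),
)
knn_char = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # illustrative stand-in
    KNeighborsClassifier(n_neighbors=5),
)

ensemble = VotingClassifier(
    estimators=[("word", knn_word), ("char", knn_char)],
    voting="soft",
)

# X_train_text / y_train: the raw 'tweet' strings and the binary labels.
# ensemble.fit(X_train_text, y_train)
# preds = ensemble.predict(X_test_text)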
Category: Data Science

Gensim FastText: get vocab or word index

Trying to use gensim's FastText, testing the sample code from gensim with a small change of replacing the argument with corpus_iterable: https://radimrehurek.com/gensim/models/fasttext.html (gensim_version == 4.0.1)

from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences

print(common_texts[0])
# ['human', 'interface', 'computer']
print(len(common_texts))
# 9

model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
model.build_vocab(corpus_iterable=common_texts)
model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=10)

It works, but is there any way to get the vocab for the model? For example, in Tensorflow Tokenizer there is a word_index which will return …
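What I have found so far, continuing from the model trained above (in gensim 4.x the mapping lives on the keyed vectors), though I am not sure this is the intended equivalent of Tokenizer's word_index:

# word -> integer index, analogous to Tokenizer.word_index
print(model.wv.key_to_index)

# integer index -> word
print(model.wv.index_to_key[:5])

# number of words in the vocabulary
print(len(model.wv))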
Category: Data Science

Data Set and guidance for Occupations/ Roles classification problem

I am working on a project where I need to find similar roles -- for example, Software Engineer, Soft. Engineer, and Software Eng should all be marked as similar. So far I have tried the Standard Occupational Classification dataset with LSA, Levenshtein distance, and unsupervised FastText with Word Mover's Distance. The last option works but isn't great. I am wondering if there are more comprehensive data sets or better ways to solve this problem? Any lead would be helpful!
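For reference, a minimal sketch of the fastText + Word Mover's Distance approach I tried, with an illustrative model trained on the titles themselves (in practice the model is trained on a much larger corpus or loaded from pretrained vectors):

from gensim.models import FastText

titles = [["software", "engineer"],
          ["soft", "engineer"],
          ["software", "eng"],
          ["registered", "nurse"]]

# Illustrative model trained directly on the titles.
model = FastText(sentences=titles, vector_size=50, window=2, min_count=1, epochs=50)

query = ["software", "eng"]
for title in titles:
    # Word Mover's Distance: lower means more similar
    # (requires gensim's optional POT/pyemd dependency).
    print(title, model.wv.wmdistance(query, title))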
Category: Data Science

Gensim most_similar: find synonyms only (not antonyms)

Is there a way to make model.wv.most_similar in gensim return positive-meaning words only (i.e. show synonyms but not antonyms)? For example, if I do:

import fasttext.util
from gensim.models.fasttext import load_facebook_model
from gensim.models.fasttext import FastTextKeyedVectors

fasttext.util.download_model('en', if_exists='ignore')  # English
model = load_facebook_model('cc.en.300.bin')
model.wv.most_similar(positive=['honest'], topn=2000)

then the model also returns words such as "dishonest":

('dishonest', 0.5542981028556824),

However, what if I want words with a positive meaning only? I have tried the following - subtracting "not" from "honest" in the …
Category: Data Science

How to detect out-of-domain text input?

I have a text classifier which can classify around 40 classes. The problem is that there is no way to handle the case where a user gives the model an input that doesn't match any of the classes; the model still maps it to one of the valid intents. So my question is: what are the ways to identify whether the input is out of domain? Right now I am using Facebook's fastText as the supervised classifier, …
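One baseline I am considering is thresholding the classifier's own confidence; a minimal sketch, assuming a trained fastText supervised model and a cutoff tuned on held-out data (the path, labels and threshold are only illustrative):

import fasttext

model = fasttext.load_model("intent_classifier.bin")  # illustrative path
THRESHOLD = 0.5  # tune on a validation set that contains out-of-domain examples

def classify(text):
    labels, probs = model.predict(text, k=1)
    if probs[0] < THRESHOLD:
        return "out_of_domain"   # reject: no intent is confident enough
    return labels[0]             # e.g. '__label__check_balance'

print(classify("what's my account balance"))
print(classify("purple monkey dishwasher"))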
Category: Data Science

When are subword ngrams trained in fasttext? (Enriching Word Vectors with Subword Information)

When is the training for subword ngrams done? Is it done simultaneously with the training of the word representations, or is it done at the end, after the word representations are created? fastText implements this paper, where word representations are enriched with subword information: the representation of each word is the sum of the representations of its character ngrams. Just as the skipgram model is trained, so are the character ngrams, where the ngrams are the context and the …
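To check my understanding, here is a toy numpy sketch of a single skipgram step as I read the paper: the input word is scored as the sum of its character n-gram vectors, so the n-gram vectors receive gradients at the same time as everything else (hashing, negative sampling and real data are omitted):

import numpy as np

dim = 8
ngrams = ["<wh", "whe", "her", "ere", "re>"]           # n-grams of "where"
ngram_vecs = {g: np.random.randn(dim) * 0.1 for g in ngrams}
context_vec = np.random.randn(dim) * 0.1                # output vector of one context word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for step in range(3):
    # word representation = sum of its n-gram representations
    word_vec = sum(ngram_vecs[g] for g in ngrams)
    score = sigmoid(word_vec @ context_vec)
    grad = score - 1.0                                   # positive pair, target = 1

    # the same gradient signal updates the context vector AND every n-gram:
    g_word = grad * context_vec
    context_vec -= lr * grad * word_vec
    for g in ngrams:
        ngram_vecs[g] -= lr * g_word

    print(f"step {step}: score={score:.3f}")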
Category: Data Science

Pre-trained models for finding similar word n-grams

Are there any pre-trained models for finding similar word n-grams, where n > 1? FastText, for instance, seems to work only on unigrams:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)
[('dogs', 0.8463464975357056), ('puppy', 0.7873005270957947),
 ('pup', 0.7692237496376038), ('canine', 0.7435278296470642), ...

but it fails on longer n-grams:

model.nearest_neighbors('Gone with the Wind', k=2000)
[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi', 0.71047443151474),

or

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo', 0.5197194218635559),
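The only workaround I can think of (not a pre-trained n-gram model) is composing a phrase vector from the unigram vectors and searching with that; a minimal sketch with gensim's keyed vectors, assuming cc.en.300.bin is available locally (the averaging is my assumption, not built-in phrase handling):

import numpy as np
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors('cc.en.300.bin')

def phrase_vector(phrase):
    # average of the word vectors; fastText's subword handling covers rare tokens
    vecs = [wv[w] for w in phrase.lower().split()]
    return np.mean(vecs, axis=0)

# most_similar accepts raw vectors, so the query can be a composed phrase
print(wv.most_similar(positive=[phrase_vector('gone with the wind')], topn=10))
print(wv.most_similar(positive=[phrase_vector('star wars')], topn=10))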
Topic: fasttext nlp
Category: Data Science

Explain FastText model using SHAP values

I have trained a fastText model and a fully connected network built on its embeddings. I figured out how to use LIME on it: a complete example can be found in "Natural Language Processing Is Fun Part 3: Explaining Model Predictions". The idea is clear - put one sentence into LIME, it drops words, generates some new sentences from mine, and checks how the score changes. My next idea is to use SHAP values for this. SHAP values can be used for any …
Category: Data Science

Extracting vectors from my own FastText model to use them in a NN

I have trained my own fastText model, starting from the pretrained English model available on their website, with the following code:

from gensim.models.fasttext import load_facebook_model

mod = load_facebook_model('fasttext/cc.en.300.bin')
mod.build_vocab(sentences=list(df_train.text), update=True)
mod.train(sentences=list(df_train.text), total_examples=len(df_train.text), epochs=10)

Now I would like to extract the vectors of this embedding to train an LSTM neural network with it. Any tip on how to do so? Thanks in advance.
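What I have in mind is roughly the following sketch, assuming a Keras Embedding layer filled from the fine-tuned gensim model (mod is the model from the snippet above; the vocabulary handling is only illustrative):

import numpy as np
from tensorflow.keras import layers, models, initializers

dim = mod.wv.vector_size                                # 300 for cc.en.300
vocab = mod.wv.index_to_key                             # gensim 4.x vocabulary list
word_index = {w: i + 1 for i, w in enumerate(vocab)}    # index 0 reserved for padding

embedding_matrix = np.zeros((len(vocab) + 1, dim))
for w, i in word_index.items():
    embedding_matrix[i] = mod.wv[w]                     # fastText also handles OOV subwords

lstm = models.Sequential([
    layers.Embedding(len(vocab) + 1, dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     mask_zero=True, trainable=False),
    layers.LSTM(128),
    layers.Dense(1, activation="sigmoid"),
])
lstm.compile(optimizer="adam", loss="binary_crossentropy")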
Category: Data Science

Removing duplicate records before training

I am currently working on a project classifying text into classes. The specific problem is classifying job titles into various industry codes. For example "McDonalds Employee" might get classified to 11203 (there are a few hundred classes in the problem). For this we are using FastText. The person that I am working with insists on removing duplicate records from the data before training our model. That is, we might see 100 records with "McDonalds Employee" and class 11203 and he …
Category: Data Science
