Initializing weights that are a pointwise product of multiple variables

In two-layer perceptrons that slide across words of text, such as word2vec and fastText, the hidden layer weights may be a product of two random variables, such as positional embeddings and word embeddings (Mikolov et al. 2017, Section 2.2): $$v_c = \sum_{p\in P} d_p \odot u_{t+p}$$ However, it's unclear to me how best to initialize the two variables. When only word embeddings are used for the hidden layer weights, word2vec and fastText initialize them to $\mathcal{U}(-1 / \text{fan\_out};\ 1 / \text{fan\_out})$. …
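To make the question concrete, here is a minimal numpy sketch of what I mean, assuming the same uniform initialization is applied to both factors (dimensions and names are only illustrative):

import numpy as np

dim = 300            # embedding dimension (illustrative)
vocab_size = 10000   # number of words (illustrative)
window = 5           # positions p in P (illustrative)

# Word embeddings u and positional embeddings d, both initialized
# uniformly in (-1/fan_out, 1/fan_out) as word2vec/fastText do when only
# word embeddings are used; whether this is right for the product is
# exactly the open question.
bound = 1.0 / dim
u = np.random.uniform(-bound, bound, size=(vocab_size, dim))
d = np.random.uniform(-bound, bound, size=(2 * window, dim))

# Hidden representation for one context: sum over positions of the
# pointwise product d_p * u_{t+p}.
context_ids = np.random.randint(vocab_size, size=2 * window)
v_c = np.sum(d * u[context_ids], axis=0)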
Category: Data Science

Keep word2vec/fasttext model loaded in memory without using API

I have to use a fastText model to return word embeddings. In testing I was calling it through an API. Since there are too many words to compute embeddings for, the API calls turn out to be expensive. I would like to use fastText without the API. For that I need to load the model once and keep it in memory for further calls. How can this be done without using the API? Any help is highly appreciated.
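To make the intent concrete, a minimal sketch of what I have in mind, assuming the official fasttext Python package and a locally downloaded .bin model (the file name is only illustrative); the model is loaded once and reused for every lookup:

import fasttext

# Load the binary model a single time and keep the handle around for the
# lifetime of the process, instead of reloading it per request.
model = fasttext.load_model("cc.en.300.bin")  # illustrative path

def embed(word):
    # get_word_vector works for in-vocabulary and OOV words alike,
    # since fastText composes subword n-grams.
    return model.get_word_vector(word)

vec = embed("example")
print(vec.shape)  # (300,) for the cc.en.300 model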
Category: Data Science

FastText Model Explained

I was reading the FastText paper and I have a few questions about the model used for classification. Since I am not from an NLP background, I am unfamiliar with some of the jargon. In the figure, what exactly are the $x_i$? I am not sure what "$N$ ngram features" means. If my document has $L$ words in total, how can I represent the entire document using $N$ variables ($x_1, \ldots, x_N$)? What exactly is $N$? $$-\frac{1}{N}\sum_{n=1}^N y_n \log(f(BAx_n))$$ If $y_n$ is the label, …
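For concreteness, here is my current reading of the classifier inside that loss as a small numpy sketch, assuming $x_n$ is a (normalized) bag-of-ngrams vector, $A$ the embedding matrix, $B$ the classifier weights and $f$ the softmax (all dimensions are illustrative):

import numpy as np

vocab_ngrams = 1000   # size of the n-gram vocabulary (illustrative)
dim = 10              # embedding dimension (illustrative)
num_classes = 3       # number of labels (illustrative)

A = np.random.randn(dim, vocab_ngrams) * 0.01   # n-gram embeddings
B = np.random.randn(num_classes, dim) * 0.01    # classifier weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# x: averaged bag-of-ngrams representation of one document
x = np.zeros(vocab_ngrams)
x[[3, 17, 512]] = 1.0
x /= x.sum()

probs = softmax(B @ (A @ x))   # f(B A x): one probability per class
print(probs)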
Category: Data Science

DBSCAN getting one huge cluster with noisy points

I'm currently trying to cluster customer service email answers (NLP). When I use DBSCAN with TF-IDF embeddings + Annoy indexes, I get good clusters. But when I use DBSCAN with FastText embeddings + Annoy indexes, I get good clusters except for the cluster with label zero (0), which seems to include lots of noisy points (that should be labeled -1 instead of 0). Does anyone have an idea of what this could be? I'm using eps=0.5 in both cases.
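For reference, a minimal sketch of my setup, assuming dense fastText document vectors and scikit-learn's DBSCAN with cosine distance (the data and the eps sweep are only illustrative); sweeping eps per embedding type is how I am checking whether 0.5 is simply too large for the fastText space:

import numpy as np
from sklearn.cluster import DBSCAN

# X: one fastText (or TF-IDF) vector per email, shape (n_emails, dim)
X = np.random.rand(200, 300)  # illustrative stand-in for real embeddings

for eps in (0.2, 0.3, 0.4, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5, metric="cosine").fit_predict(X)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")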
Category: Data Science

Should I use Pad Sequence when using Word Vectors?

I have an unbalanced text data set. I want to use word vectors to embed words. When should I apply pad sequence: before or after looking up the word vectors? I tried applying pad sequence after the word vectors, but my model accuracy was low. If I apply pad sequence before the word vectors, how do I interpret the result? Both pad sequence and the word vectors give me numeric results, yet the word vector input is a token, while the pad sequence input …
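To illustrate the order I have tried, a minimal Keras sketch where the integer token ids are padded first and the (pretrained) word vectors are looked up afterwards inside an Embedding layer (vocabulary size, dimensions and the embedding matrix are only illustrative):

import numpy as np
from tensorflow.keras import layers, models, initializers
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, dim, max_len = 5000, 100, 50            # illustrative sizes
embedding_matrix = np.random.rand(vocab_size, dim)  # stand-in for real word vectors

# 1) pad the integer token sequences ...
sequences = [[4, 12, 7], [9, 2, 2, 31, 5]]
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

# 2) ... then map the padded ids to word vectors inside the model.
model = models.Sequential([
    layers.Embedding(vocab_size, dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     mask_zero=True, trainable=False),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(padded, labels, ...) would then consume the padded ids directly.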
Category: Data Science

Training fasttext on your own corpus

I want to train fastText on my own corpus. However, I have a small question before continuing. Do I need each sentence as a separate item in the corpus, or can I have many sentences as one item? For example, I have this DataFrame:

text                                               | summary
---------------------------------------------------|---------------
this is sentence one this is sentence two continue | one two other
other similar sentences some other                 | word word sent

Basically, the column text is an article, so it has many …
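For reference, a minimal sketch of what I currently do with gensim's FastText, treating each row of the text column as one tokenized item (splitting rows into individual sentences first is the alternative I am asking about):

import pandas as pd
from gensim.models import FastText

df = pd.DataFrame({
    "text": ["this is sentence one this is sentence two continue",
             "other similar sentences some other"],
    "summary": ["one two other", "word word sent"],
})

# One tokenized list per row of the text column.
corpus = [row.split() for row in df["text"]]

model = FastText(vector_size=100, window=5, min_count=1)
model.build_vocab(corpus_iterable=corpus)
model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=10)

print(model.wv["sentence"][:5])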
Category: Data Science

How can I use Ensemble learning of two models with different features as an input?

I have a fake news detection problem: it predicts the binary labels "1" & "0" by vectorizing the 'tweet' column. I use three different models for detection, but I want to use an ensemble method to increase the accuracy, even though they use different vectorizers. I have 3 KNN models; the first and the second vectorize the 'tweet' column using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_train = vector.fit_transform(X_train['tweet']).toarray()
X_test = vector.transform(X_test['tweet']).toarray()  # transform only, so test shares the fitted vocabulary

For the third model I …
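One direction I am considering is wrapping each vectorizer + KNN pair in its own scikit-learn Pipeline and combining the pipelines with a soft VotingClassifier, so every member keeps its own features while the ensemble sees the same raw 'tweet' text; a minimal sketch (the character-level TF-IDF pipeline is only an illustrative stand-in for my third model):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# Each member pipeline owns its vectorizer.
knn_word = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 3)),
    KNeighborsClassifier(n_neighbors=5),
)
knn_char = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # illustrative stand-in
    KNeighborsClassifier(n_neighbors=5),
)

ensemble = VotingClassifier(
    estimators=[("word", knn_word), ("char", knn_char)],
    voting="soft",
)

# X_train_text / y_train: the raw 'tweet' strings and the binary labels.
# ensemble.fit(X_train_text, y_train)
# preds = ensemble.predict(X_test_text)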
Category: Data Science

Gensim FastText: get vocab or word index

Trying to use gensim's FastText, testing the sample code from gensim with a small change of replacing the argument with corpus_iterable: https://radimrehurek.com/gensim/models/fasttext.html (gensim_version == 4.0.1)

from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences

print(common_texts[0])
# ['human', 'interface', 'computer']
print(len(common_texts))
# 9

model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
model.build_vocab(corpus_iterable=common_texts)
model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=10)

It works, but is there any way to get the vocab for the model? For example, in Tensorflow Tokenizer there is a word_index which will return …
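What I have found so far, continuing from the model trained above (in gensim 4.x the mapping lives on the keyed vectors), though I am not sure this is the intended equivalent of Tokenizer's word_index:

# word -> integer index, analogous to Tokenizer.word_index
print(model.wv.key_to_index)

# integer index -> word
print(model.wv.index_to_key[:5])

# number of words in the vocabulary
print(len(model.wv))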
Category: Data Science

Data Set and guidance for Occupations/ Roles classification problem

I am working on a project where I need to find similar roles -- for example, Software Engineer, Soft. Engineer, and Software Eng should all be marked as similar. So far I have tried the Standard Occupational Classification dataset with LSA, Levenshtein distance, and unsupervised FastText with Word Mover's Distance. The last option works but isn't great. I am wondering if there are more comprehensive data sets or better ways to solve this problem? Any lead would be helpful!
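For reference, a minimal sketch of the fastText + Word Mover's Distance approach I tried, with an illustrative model trained on the titles themselves (in practice the model is trained on a much larger corpus or loaded from pretrained vectors):

from gensim.models import FastText

titles = [["software", "engineer"],
          ["soft", "engineer"],
          ["software", "eng"],
          ["registered", "nurse"]]

# Illustrative model trained directly on the titles.
model = FastText(sentences=titles, vector_size=50, window=2, min_count=1, epochs=50)

query = ["software", "eng"]
for title in titles:
    # Word Mover's Distance: lower means more similar
    # (requires gensim's optional POT/pyemd dependency).
    print(title, model.wv.wmdistance(query, title))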
Category: Data Science

Gensim most_similar: find synonyms only (not antonyms)

Is there a way to make model.wv.most_similar in gensim return positive-meaning words only (i.e. show synonyms but not antonyms)? For example, if I do:

import fasttext.util
from gensim.models.fasttext import load_facebook_model
from gensim.models.fasttext import FastTextKeyedVectors

fasttext.util.download_model('en', if_exists='ignore')  # English
model = load_facebook_model('cc.en.300.bin')
model.wv.most_similar(positive=['honest'], topn=2000)

then the model also returns words such as "dishonest":

('dishonest', 0.5542981028556824),

However, what if I want words with a positive meaning only? I have tried the following - subtracting "not" from "honest" in the …
Category: Data Science

How to detect out-of-domain text input?

I have a text classifier which can classify around 40 classes. The problem is that there is no way to handle the case where a user gives the model an input that doesn't match any of the classes; the model still maps it to one of the valid intents. So my question is: what are the ways to identify whether the input is out of domain? Right now I am using Facebook's fastText as the supervised classifier, …
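One baseline I am considering is thresholding the classifier's own confidence; a minimal sketch, assuming a trained fastText supervised model and a cutoff tuned on held-out data (the path, labels and threshold are only illustrative):

import fasttext

model = fasttext.load_model("intent_classifier.bin")  # illustrative path
THRESHOLD = 0.5  # tune on a validation set that contains out-of-domain examples

def classify(text):
    labels, probs = model.predict(text, k=1)
    if probs[0] < THRESHOLD:
        return "out_of_domain"   # reject: no intent is confident enough
    return labels[0]             # e.g. '__label__check_balance'

print(classify("what's my account balance"))
print(classify("purple monkey dishwasher"))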
Category: Data Science

When are subword ngrams trained in fasttext? (Enriching Word Vectors with Subword Information)

When is the training for subword ngrams done? Is it done simultaneously with the training of the word representations, or is it done at the end, after the word representations are created? fastText implements this paper, where word representations are enriched with subword information: the representation of each word is the sum of the representations of its character ngrams. Just as the skipgram model is trained, so are the character ngrams, where the ngrams are the context and the …
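To check my understanding, here is a toy numpy sketch of a single skipgram step as I read the paper: the input word is scored as the sum of its character n-gram vectors, so the n-gram vectors receive gradients at the same time as everything else (hashing, negative sampling and real data are omitted):

import numpy as np

dim = 8
ngrams = ["<wh", "whe", "her", "ere", "re>"]           # n-grams of "where"
ngram_vecs = {g: np.random.randn(dim) * 0.1 for g in ngrams}
context_vec = np.random.randn(dim) * 0.1                # output vector of one context word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for step in range(3):
    # word representation = sum of its n-gram representations
    word_vec = sum(ngram_vecs[g] for g in ngrams)
    score = sigmoid(word_vec @ context_vec)
    grad = score - 1.0                                   # positive pair, target = 1

    # the same gradient signal updates the context vector AND every n-gram:
    g_word = grad * context_vec
    context_vec -= lr * grad * word_vec
    for g in ngrams:
        ngram_vecs[g] -= lr * g_word

    print(f"step {step}: score={score:.3f}")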
Category: Data Science

Pre-trained models for finding similar word n-grams

Are there any pre-trained models for finding similar word n-grams, where n > 1? FastText, for instance, seems to work only on unigrams:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)
[('dogs', 0.8463464975357056), ('puppy', 0.7873005270957947),
 ('pup', 0.7692237496376038), ('canine', 0.7435278296470642), ...

but it fails on longer n-grams:

model.nearest_neighbors('Gone with the Wind', k=2000)
[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi', 0.71047443151474),

or

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo', 0.5197194218635559),
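The only workaround I can think of (not a pre-trained n-gram model) is composing a phrase vector from the unigram vectors and searching with that; a minimal sketch with gensim's keyed vectors, assuming cc.en.300.bin is available locally (the averaging is my assumption, not built-in phrase handling):

import numpy as np
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors('cc.en.300.bin')

def phrase_vector(phrase):
    # average of the word vectors; fastText's subword handling covers rare tokens
    vecs = [wv[w] for w in phrase.lower().split()]
    return np.mean(vecs, axis=0)

# most_similar accepts raw vectors, so the query can be a composed phrase
print(wv.most_similar(positive=[phrase_vector('gone with the wind')], topn=10))
print(wv.most_similar(positive=[phrase_vector('star wars')], topn=10))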
Topic: fasttext nlp
Category: Data Science

Explain FastText model using SHAP values

I have trained a fastText model and a fully connected network built on its embeddings. I figured out how to use LIME on it: a complete example can be found in "Natural Language Processing Is Fun Part 3: Explaining Model Predictions". The idea is clear - put one sentence into LIME, it drops words, generates some new sentences from mine, and checks how the score changes. My next idea is to use SHAP values for this. SHAP values can be used for any …
Category: Data Science

Extracting vectors from my own FastText model to use them in a NN

I have trained my own fastText model, starting from the pretrained English model available on their website, with the following code:

from gensim.models.fasttext import load_facebook_model

mod = load_facebook_model('fasttext/cc.en.300.bin')
mod.build_vocab(sentences=list(df_train.text), update=True)
mod.train(sentences=list(df_train.text), total_examples=len(df_train.text), epochs=10)

Now I would like to extract the vectors of this embedding to train an LSTM neural network with it. Any tip on how to do so? Thanks in advance.
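What I have in mind is roughly the following sketch, assuming a Keras Embedding layer filled from the fine-tuned gensim model (mod is the model from the snippet above; the vocabulary handling is only illustrative):

import numpy as np
from tensorflow.keras import layers, models, initializers

dim = mod.wv.vector_size                                # 300 for cc.en.300
vocab = mod.wv.index_to_key                             # gensim 4.x vocabulary list
word_index = {w: i + 1 for i, w in enumerate(vocab)}    # index 0 reserved for padding

embedding_matrix = np.zeros((len(vocab) + 1, dim))
for w, i in word_index.items():
    embedding_matrix[i] = mod.wv[w]                     # fastText also handles OOV subwords

lstm = models.Sequential([
    layers.Embedding(len(vocab) + 1, dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     mask_zero=True, trainable=False),
    layers.LSTM(128),
    layers.Dense(1, activation="sigmoid"),
])
lstm.compile(optimizer="adam", loss="binary_crossentropy")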
Category: Data Science

Removing duplicate records before training

I am currently working on a project classifying text into classes. The specific problem is classifying job titles into various industry codes. For example "McDonalds Employee" might get classified to 11203 (there are a few hundred classes in the problem). For this we are using FastText. The person that I am working with insists on removing duplicate records from the data before training our model. That is, we might see 100 records with "McDonalds Employee" and class 11203 and he …
Category: Data Science
