Training fasttext on your own corpus
I want to train fastText on my own corpus. However, I have a small question before continuing. Do I need each sentence as a different item in the corpus, or can I have many sentences as one item?
For example, I have this DataFrame:
text                                               | summary
---------------------------------------------------|---------------
this is sentence one this is sentence two continue | one two other
other similar sentences some other                 | word word sent
Basically, the text column is an article, so it has many sentences. Because of preprocessing, I no longer have the full stops (.). So the question is: can I do something like this directly, or do I need to split each article into sentences?
from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(docs)
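For context, a minimal self-contained sketch of what that call does (using made-up example articles, not my real data): TfidfVectorizer treats each item of the iterable as one document, so passing whole articles produces one row per article.

```python
# Sketch: TfidfVectorizer treats each item as one document,
# so two articles yield a matrix with two rows.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is sentence one this is sentence two continue",
    "other similar sentences some other",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: (n_docs, vocab_size)
print(matrix.shape[0])  # one row per article
```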
What are the differences? Is this the right way of training fastText on my own corpus?
Thank you!
Topic fasttext gensim tensorflow word-embeddings python
Category Data Science