Training fasttext on your own corpus
I want to train fastText on my own corpus. However, I have a small question before continuing. Do I need each sentence as a different item in the corpus, or can I have many sentences as one item?
For example, I have this DataFrame:
text                                               | summary
---------------------------------------------------|---------------
this is sentence one this is sentence two continue | one two other
other similar sentences some other                 | word word sent
Basically, the text column is an article, so it has many sentences. Because of preprocessing, I no longer have the full stops (.). So the question is: can I do something like this directly, or do I need to split each article into sentences?
from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(docs)
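For context, a minimal self-contained sketch of what that call does (using made-up example articles, not my real data): TfidfVectorizer treats each item of the iterable as one document, so passing whole articles produces one row per article.

```python
# Sketch: TfidfVectorizer treats each item as one document,
# so two articles yield a matrix with two rows.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is sentence one this is sentence two continue",
    "other similar sentences some other",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: (n_docs, vocab_size)
print(matrix.shape[0])  # one row per article
```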
What are the differences? Is this the right way of training fastText on my own corpus?
Thank you!
Topic fasttext gensim tensorflow word-embeddings python
Category Data Science