Term document matrix for webpages
Consider the following code for obtaining a term-document matrix from a list of texts:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)
Here the docs list contains the contents of three text files. Now, I need to build docs from three wiki pages instead: text #1, text #2, text #3
How can I build the term-document matrix from the links provided? Is there any package in Python that makes this task easier?
Topic document-term-matrix encoding
Category Data Science