Term document matrix for webpages
Consider the following code for obtaining a term-document matrix from a list of texts:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)
Here the docs list contains the contents of three text files. Now, I need to build docs from three wiki pages instead: text #1, text #2, text #3
How can I build the term-document matrix from the links provided? Is there any package in Python that makes this task easier?
Topic document-term-matrix encoding
Category Data Science