How to combine NLP and numeric data for a linear regression problem

I'm very new to data science (this is my hello-world project), and I have a data set that combines review text with numerical data such as the number of tables. There is also a rating column, which is a float (the average of all user reviews for that restaurant). So a row of data could look like:

{ 
    rating: 3.765, 
    review: `Food was great, staff was friendly`, 
    tables: 30, 
    staff: 15, 
    parking: 20,
    ... 
}

So following tutorials, I have been able to do the following:

  1. Created a linear regression model to predict rating with the inputs being all the numerical data columns.
  2. Created a regression model to predict rating based on review text using sklearn.feature_extraction.text.TfidfVectorizer.

But now I'd like to combine models or combine the data from both into one to create a linear regression model. So how can I utilize the vectorized text data in my linear regression model?



It sounds like you could use FeatureUnion for this. Here's an example:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
print("Combined space has", X_features.shape[1], "features")

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)

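The same pattern applies to your case: FeatureUnion concatenates the outputs of its transformers column-wise, so you can merge the TfidfVectorizer output with your original numeric features and feed the combined matrix to a single regression model.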
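For the data described in the question, a minimal sketch might look like the following. Note that it swaps in ColumnTransformer rather than FeatureUnion: FeatureUnion passes the same input to every transformer, while ColumnTransformer routes each column to its own transformer, which seems a closer fit when the text and the numbers live in separate columns. The column names come from the question; the toy rows, the ratings, and the use of StandardScaler on the numeric columns are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for illustration; only the column names follow the question.
df = pd.DataFrame({
    "review": [
        "Food was great, staff was friendly",
        "Slow service and cold food",
        "Great location, plenty of parking",
        "Terrible experience, rude staff",
        "Friendly staff and tasty food",
        "Average food, long wait",
    ],
    "tables": [30, 12, 45, 8, 25, 18],
    "staff": [15, 5, 20, 4, 12, 9],
    "parking": [20, 0, 60, 5, 15, 10],
    "rating": [4.5, 2.1, 3.9, 1.5, 4.7, 3.0],
})

numeric_cols = ["tables", "staff", "parking"]
X = df.drop(columns="rating")
y = df["rating"]

# Route the text column to TF-IDF and the numeric columns to a scaler.
# A single string ("review") hands the vectorizer a 1-D column of documents,
# while a list of names hands the scaler a 2-D block of numbers.
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),
    ("num", StandardScaler(), numeric_cols),
])

# The outputs of both transformers are concatenated column-wise
# and passed to one linear regression.
model = Pipeline([
    ("features", preprocess),
    ("reg", LinearRegression()),
])
model.fit(X, y)
preds = model.predict(X)

# Width of the combined matrix: one column per TF-IDF term plus the numeric columns.
fitted = model.named_steps["features"]
vocab_size = len(fitted.named_transformers_["text"].vocabulary_)
n_features = fitted.transform(X).shape[1]
print("TF-IDF terms:", vocab_size, "| combined features:", n_features)
```

With real data you would fit on a training split and evaluate on held-out rows. Because TF-IDF usually produces far more columns than there are numeric features, a regularized model such as Ridge may be worth trying in place of LinearRegression.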
