Convert Pandas Dataframe with mixed datatypes to LibSVM format

I have a pandas data frame with about a million rows and 3 feature columns, each of a different data type: NumberOfFollowers is numerical, UserName is categorical, and Embeddings is of categorical-set type.

df:

Index  NumberOfFollowers  UserName  Embeddings     Target Variable
0      15                 name1     [0.5 0.3 0.2]  0
1      4                  name2     [0.4 0.2 0.4]  1
2      8                  name3     [0.5 0.5 0.0]  0
3      10                 name1     [0.1 0.0 0.9]  0
...    ...                ...       ...            ...

I would like to convert this pandas data frame into the LibSVM input format.

Desired Output:

0 0:15 4:1 1:0.5 2:0.3 3:0.2
1 0:4 5:1 1:0.4 2:0.2 3:0.4
0 0:8 6:1 1:0.5 2:0.5 3:0.0
0 0:10 4:1 1:0.1 2:0.0 3:0.9
...

One solution I found was using:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html

It takes input as a NumPy array or a sparse matrix.

The UserName column has a million unique values, so calling pd.get_dummies on it and storing the result as a dense NumPy array is not an option: it will not fit in memory.

I know this can be done with sparse matrices, but I don't know how to convert the above data with mixed data types into a sparse matrix and then pass it to sklearn.datasets.dump_svmlight_file.

In reality, I have many columns with mixed data types that I need to convert into LibSVM format, but every column falls into one of the three types above.

Thanks in advance for any thoughts on how to solve the above problem.

Topic sparse scikit-learn pandas libsvm

Category Data Science


As you mentioned, you can use the function from sklearn. I don't see a problem with using it (perhaps I'm missing something):

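import pandas as pd
from sklearn.datasets import dump_svmlight_file

def df_to_libsvm(df: pd.DataFrame, label_col: str = 'label', path: str = 'libsvm.dat') -> None:
    # Every column other than label_col must already be numeric.
    x = df.drop(label_col, axis=1)
    y = df[label_col]
    # zero_based=True numbers the feature columns starting from 0.
    dump_svmlight_file(X=x, y=y, f=path, zero_based=True)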
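
The function above expects an all-numeric frame, so the string and list columns from the question have to be turned into numbers first. The following is only a rough sketch of one way to do that with sparse matrices, using the question's column names on a toy frame. The helper name to_sparse_features is made up for illustration, and the sketch assumes every embedding has the same length and that UserName has no missing values.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.datasets import dump_svmlight_file

# Toy frame mirroring the columns in the question.
df = pd.DataFrame({
    "NumberOfFollowers": [15, 4, 8, 10],
    "UserName": ["name1", "name2", "name3", "name1"],
    "Embeddings": [[0.5, 0.3, 0.2], [0.4, 0.2, 0.4],
                   [0.5, 0.5, 0.0], [0.1, 0.0, 0.9]],
    "Target Variable": [0, 1, 0, 0],
})


def to_sparse_features(frame: pd.DataFrame) -> sparse.csr_matrix:
    """Hypothetical helper: stack numeric, embedding and one-hot blocks."""
    n = len(frame)
    # Numerical column: one sparse column.
    numeric = sparse.csr_matrix(
        frame[["NumberOfFollowers"]].to_numpy(dtype=np.float64))
    # Embeddings: assumes all lists have equal length.
    embeddings = sparse.csr_matrix(
        np.array(frame["Embeddings"].tolist(), dtype=np.float64))
    # Categorical column: integer codes, then a sparse one-hot block
    # with exactly one stored value per row.
    codes, uniques = pd.factorize(frame["UserName"])
    one_hot = sparse.csr_matrix(
        (np.ones(n), (np.arange(n), codes)), shape=(n, len(uniques)))
    return sparse.hstack([numeric, embeddings, one_hot], format="csr")


X = to_sparse_features(df)
y = df["Target Variable"].to_numpy()

# A file path such as 'libsvm.dat' works here as well; an in-memory
# buffer is used only so the result can be inspected.
buf = io.BytesIO()
dump_svmlight_file(X, y, buf, zero_based=True)
text = buf.getvalue().decode()
print(text)
```

With this column order the follower count lands at index 0, the embedding at indices 1 to 3, and the usernames from index 4 onward. Zero entries are not stored in the sparse matrix, so they do not appear in the written file.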

Regarding the categorical feature with 10^6 unique categories, you can embed it into a simple binary vector. One way to do this is to map each username to a unique integer and then write that integer in binary. That gives a simple embedding of about 20 bits (2^20 = 1,048,576), i.e. the feature is represented by 20 binary features.

Of course, if the usernames are all unique, they probably shouldn't be a feature at all, since they are equivalent to a row id.
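
As a rough illustration of that binary encoding, here is a minimal sketch. The function name binary_encode is hypothetical, the integer assigned to each username is simply its order of first appearance, and the bits are written least significant first; none of these choices come from the answer itself.

```python
import numpy as np
import pandas as pd


def binary_encode(values: pd.Series):
    """Hypothetical helper: map each category to the bits of an integer code."""
    # One integer per distinct value, numbered by first appearance.
    codes, uniques = pd.factorize(values)
    # Number of bits needed to represent the largest code (at least 1).
    n_bits = max(1, int(len(uniques) - 1).bit_length())
    # Column k holds bit k of the code (least significant bit first).
    bits = (codes[:, None] >> np.arange(n_bits)) & 1
    return bits.astype(np.uint8), n_bits


names = pd.Series(["name1", "name2", "name3", "name1"])
bits, n_bits = binary_encode(names)
print(n_bits)
print(bits)
```

The resulting columns are numeric, so they can be placed next to the other features before the data is written out. With a million distinct usernames the same function would produce 20 columns.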
