How to get dummy variables from "first name"

I intend to predict the age of customers using some features. There are some categorical features that I need to convert to dummy variables before the modelling stage.

The datasets are large (millions of rows), and when I used StringIndexer in PySpark to get dummies from the first names, I got the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 399, 10.139.64.28, executor 2): org.apache.spark.SparkException: Failed to execute user defined function(StringIndexerModel$$Lambda$6517/699548305: (string) => double)

Can you suggest any better approach to convert the first names to dummy variables?

Topic dummy-variables pyspark feature-extraction categorical-data bigdata

Category Data Science


It appears that you are writing a user-defined function to parse the data, judging by this part of the error: "Failed to execute user defined function(StringIndexerModel$$Lambda$6517/699548305: (string)". It is better to use the built-in Spark functions, which are more scalable.

After you parse the string, use OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0) to generate the dummy variables.
