Separate numerical and categorical variables

I have a dataset of shape (42000, 10) that contains 7 categorical features and 3 numerical ones. I would like to split it into 2 data frames, one containing only the numerical data (42000, 3) and the other only the categorical data (42000, 7), perform some pre-processing on each of them, and finally concatenate them back into one data frame.

So, my question is how do I separate my initial dataframe into 2 based on numerical and categorical data?

Topic numerical preprocessing pandas categorical-data

Category Data Science


The simplest way is to use the select_dtypes method in pandas. It returns the subset of a dataframe's columns that match the given dtypes:

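  # All integer and float columns
  df_numerical_features = df.select_dtypes(include='number')
  # Only columns with the pandas 'category' dtype (not plain strings stored as object)
  df_categorical_features = df.select_dtypes(include='category')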

Reference: the pandas documentation for DataFrame.select_dtypes.

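Which call you need depends on how the columns are stored in your dataframe. If the categorical columns hold strings (object dtype) and the remaining columns are int64 or float64, you can use the following. Note that exclude='object' keeps every non-object column, so it yields only numerical data when the dataframe has no datetime, boolean, or other non-object columns: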

  df_numerical_features = df.select_dtypes(exclude='object')
  df_categorical_features = df.select_dtypes(include='object')
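
To cover the rest of the question (pre-process each part, then concatenate), here is a minimal sketch of the whole round trip. The toy dataframe, its column names, and the two pre-processing steps (standardizing the numbers and one-hot encoding the strings) are assumptions made for illustration; substitute whatever pre-processing you actually need. The split uses include='number' and its complement exclude='number' so that it does not depend on how strings are stored.

```python
import pandas as pd

# Hypothetical toy data: 2 numerical columns and 2 string columns.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000.0, 52000.0, 61000.0],
    "city": ["Paris", "Lyon", "Paris"],
    "plan": ["basic", "pro", "basic"],
})

# Split by dtype: numeric columns, and everything that is not numeric.
df_numerical_features = df.select_dtypes(include="number")
df_categorical_features = df.select_dtypes(exclude="number")

# Example pre-processing (an assumption, not a recommendation):
# standardize each numerical column to zero mean and unit variance.
num_scaled = (df_numerical_features - df_numerical_features.mean()) / df_numerical_features.std()

# One-hot encode every categorical column explicitly.
cat_encoded = pd.get_dummies(
    df_categorical_features, columns=list(df_categorical_features.columns)
)

# Concatenate column-wise; both parts still share the original row index.
df_processed = pd.concat([num_scaled, cat_encoded], axis=1)
```

Because both parts keep the original index, pd.concat with axis=1 lines the rows up again. If a pre-processing step drops or reorders rows, check the index before concatenating.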

Use the include and exclude parameters to select columns by dtype. The pandas documentation lists the following conventions:

  • To select all numeric types, use np.number or 'number'

  • To select strings you must use the object dtype, but note that this will return all object dtype columns

  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'

  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'

  • To select Pandas categorical dtypes, use 'category'

  • To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
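
The selectors listed above can be tried on a small frame. This is only a sketch: the dataframe and its column names are made up for illustration, and each column is built with an explicit dtype so the result does not depend on type inference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column per dtype family.
df = pd.DataFrame({
    "qty": np.array([1, 2, 3], dtype="int64"),
    "price": np.array([0.5, 1.5, 2.5], dtype="float64"),
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "grade": pd.Categorical(["a", "b", "a"]),
})

# Inspect the dtypes first so you know which selector applies.
print(df.dtypes)

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
datetime_cols = df.select_dtypes(include="datetime").columns.tolist()
category_cols = df.select_dtypes(include="category").columns.tolist()

# include also accepts a list, to pick several dtype families at once.
numeric_and_category = df.select_dtypes(include=["number", "category"])
```

select_dtypes keeps the selected columns in their original order, so the resulting frames can be compared against df.columns directly.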
