How to deal with errors when defining data types in pandas' read_csv()?

I have a table with 118,000 rows and 80 columns. I would like to select 8 columns from the table. I am reading the file with the pandas pd.read_csv function as:

df = pd.read_csv(filename, header=None, sep='|',
                 usecols=[1,3,4,5,37,40,51,76])

I would like to change the data type of each column inside of read_csv using dtype={'5': np.float, '37': np.float, ....}, but this does not work.

There is a warning that column 5 has mixed types. The command print(df.dtypes) shows that all columns are of type object. When I examine column 5, I cannot see any problem. I have to change the data type of each column separately using pd.to_numeric.

My question is: Is there a way to set the data types inside read_csv, and what is the problem in this case?
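The per-column workaround mentioned above can be sketched like this (with made-up sample data standing in for the real file):

```python
import io
import pandas as pd

# Made-up sample standing in for the real file; columns are
# integer-labelled because header=None.
data = io.StringIO("a|1|2.5\nb|2|oops\n")
df = pd.read_csv(data, header=None, sep='|')

# The per-column workaround: coerce each column after reading,
# turning unparseable entries into NaN.
df[2] = pd.to_numeric(df[2], errors='coerce')
print(df.dtypes)
```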

Topic pandas python

Category Data Science


You could pass your own conversion functions per column, but not through dtype: that argument only accepts actual dtypes, so entries like pd.to_numeric there will not work. read_csv has a separate converters argument for exactly this. Note also that with header=None the columns are labelled with integers, so the keys should be 5 and 37 rather than '5' and '37':

converters={5: pd.to_numeric, 37: float, ....}

Or make a function that does what you want:

def convert(val):
    try:
        return float(val)
    except (TypeError, ValueError):
        # Fall back to NaN for values that cannot be parsed
        return np.nan

Then:

converters={5: convert, 37: convert, ....}

That is a bit exaggerated, but you get the idea :)
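A runnable sketch of the converters approach, again with made-up sample data containing one stray non-numeric entry:

```python
import io
import numpy as np
import pandas as pd

def convert(val):
    # Coerce to float; unparseable values become NaN instead of raising.
    try:
        return float(val)
    except (TypeError, ValueError):
        return np.nan

# Hypothetical sample with a stray non-numeric entry in column 1.
data = io.StringIO("x|1.5\ny|oops\nz|2.5\n")
df = pd.read_csv(data, header=None, sep='|', converters={1: convert})
print(df[1].tolist())  # [1.5, nan, 2.5]
```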


If you see the warning that your column has mixed types, but you only see numbers there, it could be that missing values are causing the problem.

In pandas 1.0.0, a new method was introduced to address exactly that problem: DataFrame.convert_dtypes (docs).

You can use it like this:

df = pd.read_csv(filename, header=None, sep='|', usecols=[1,3,4,5,37,40,51,76])
df = df.convert_dtypes()

Then check the types of the columns:

print(df.dtypes)
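A small self-contained sketch of how this behaves when a missing value is present (made-up sample data; convert_dtypes picks the nullable Int64 dtype instead of falling back to float or object):

```python
import io
import pandas as pd

# Hypothetical sample where the second row is missing its number.
data = io.StringIO("a|1\nb|\nc|3\n")
df = pd.read_csv(data, header=None, sep='|')
df = df.convert_dtypes()

# Column 1 becomes the nullable Int64 dtype, with the gap held as <NA>.
print(df.dtypes)
```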
