ValueError: could not convert string to float: '���'

I have a NumPy array X of shape (2M, 23). Its dtype is U26, i.e. Unicode strings of up to 26 characters.

array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
   ['50905', '0', '0', ..., '110', '0', '0'],
   ['143899', '1325', '28.80434783', ..., '61', '0', '0'],
   ...,
   ['85', '0', '0', ..., '1980', '0', '0'],
   ['233', '54', '27', ..., '-1', '0', '0'],
   ['���', '�', '�����', ..., '�', '��', '���']], dtype='U26')

When I convert it to a float datatype, using

X_f = X.astype(float)

I get the error shown above. How do I resolve this conversion error for '���'?

I realize that some characters were not read properly into the dataframe, and that the Unicode replacement character (U+FFFD) is just a result of that.
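
For context, here is a minimal sketch of one common way the replacement character ends up in data: bytes are decoded with a codec that does not match how they were written. The byte string below is made up for illustration and is not taken from final.csv; whether this is what happened to the file in question is an assumption.

```python
# Hypothetical bytes: the text "143" stored as UTF-16 with a byte-order mark.
raw = b"\xff\xfe1\x004\x003\x00"

# Decoding with the wrong codec: each byte that is invalid in UTF-8
# is replaced by U+FFFD instead of raising an error.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the matching codec recovers the original text.
right = raw.decode("utf-16")

print("\ufffd" in wrong)  # True
print(float(right))       # 143.0
```

If the source file turns out to be in a different encoding, fixing it at read time is usually preferable to patching the values afterwards.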

My questions:

  1. How do I handle this misreading?
  2. Should I ignore these characters? Or should I transform them to zero maybe?

Additional information on how the data was read:

importing relevant packages

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

loading the dataset in a pyspark dataframe

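# sql_sc is not defined in the original snippet; assumed here to be an SQLContext
sql_sc = SQLContext(SparkContext.getOrCreate())

def loading_data(dataset):
    # read the CSV with a header row and let Spark infer the column types
    dataset = sql_sc.read.format('csv').options(header='true', inferSchema='true').load(dataset)
    # rename the ' Label' column (note the leading space) to 'Label'
    dataset = dataset.select(*[col(s).alias('Label') if s == ' Label' else s for s in dataset.columns])
    # drop the 'External IP' column
    dataset = dataset.drop('External IP')
    # drop rows that have no label
    dataset = dataset.filter(dataset.Label.isNotNull())
    # drop repeated header rows, where the Label value is the header text itself
    dataset = dataset.filter(dataset.Label != ' Label')
    print(dataset.groupBy('Label').count().collect())
    return dataset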

# invoking
ds_path = '../final.csv'
dataset=loading_data(ds_path)

check type of dataset.

type(dataset)

pyspark.sql.dataframe.DataFrame

convert to np array

import numpy as np
np_dfr = np.array(data_preprocessing(dataset).collect())

split features and labels

X = np_dfr[:,0:22]
Y = np_dfr[:,-1]

show X

 X
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
       ['50905', '0', '0', ..., '110', '0', '0'],
       ['143899', '1325', '28.80434783', ..., '61', '0', '0'],
       ...,
       ['85', '0', '0', ..., '1980', '0', '0'],
       ['233', '54', '27', ..., '-1', '0', '0'],
       ['���', '�', '�����', ..., '�', '��', '���']], dtype='U26')
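
Before deciding how to handle the affected cells, it can help to see how many rows contain them. The following is a minimal sketch on a small stand-in array (X_demo and its values are made up); it assumes the bad cells all contain U+FFFD, as in the output above.

```python
import numpy as np

# Small stand-in for X (made-up values, same kind of dtype as in the question)
X_demo = np.array([
    ["85", "0", "1980"],
    ["233", "54", "-1"],
    ["\ufffd\ufffd\ufffd", "\ufffd", "\ufffd\ufffd"],
])

# np.char.find returns, per element, the index of the substring or -1 if absent
has_urc = np.char.find(X_demo, "\ufffd") >= 0

# Indices of rows where at least one cell contains U+FFFD
bad_rows = np.flatnonzero(has_urc.any(axis=1))

print(bad_rows)       # [2]
print(has_urc.sum())  # 3
```

If only a handful of rows are affected, and every cell in them is unreadable, that points to a few corrupt lines in the file rather than a problem with individual values.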



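One option is to load the data into a pandas DataFrame and encode the strings as numeric class labels. Note that LabelEncoder assigns an integer code to each distinct string, so the result is a class index, not the number the string spells out.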

from sklearn import preprocessing

def convert(data):
    number = preprocessing.LabelEncoder()
    data['column_name'] = number.fit_transform(data['column_name'])
    data=data.fillna(-999) # fill holes with default value
    return data

Call convert() on your dataframe, e.g. test = convert(test)


Though not the best solution, I had some success converting the array to a pandas DataFrame and working from there.

code snippet

import pandas as pd

# convert X into a dataframe
X_pd = pd.DataFrame(data=X)
# replace every cell containing the Unicode replacement character (URC) with 0
X_replace = X_pd.replace('�', 0, regex=True)
# convert it back to a numpy array
X_np = X_replace.values
# cast the array to float
X_fa = X_np.astype(float)

input

array([['85', '0', '0', '1980', '0', '0'],
       ['233', '54', '27', '-1', '0', '0'],
       ['���', '�', '�����', '�', '��', '���']], dtype='<U5')

output

array([[ 8.50e+01,  0.00e+00,  0.00e+00,  1.98e+03,  0.00e+00,  0.00e+00],
       [ 2.33e+02,  5.40e+01,  2.70e+01, -1.00e+00,  0.00e+00,  0.00e+00],
       [ 0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00]])
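
On the question of whether to zero out or ignore the bad cells: replacing them with 0 makes them indistinguishable from real zeros. An alternative, sketched below under the assumption that unreadable rows can simply be discarded, is to coerce unparseable cells to NaN with pd.to_numeric and then drop those rows. X_demo and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for X (made-up values)
X_demo = np.array([
    ["85", "0", "1980"],
    ["233", "54", "-1"],
    ["\ufffd\ufffd\ufffd", "\ufffd", "\ufffd\ufffd"],
])

df = pd.DataFrame(X_demo)

# errors="coerce" turns every cell that cannot be parsed as a number into NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Count the rows that contain at least one unparseable cell
n_bad = int(numeric.isna().any(axis=1).sum())

# Keep only fully numeric rows and convert back to a float array
X_f = numeric.dropna().to_numpy(dtype=float)

print(n_bad)      # 1
print(X_f.shape)  # (2, 3)
```

If X is split from the labels before this step, the same row mask has to be applied to Y so that features and labels stay aligned.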
