ValueError: could not convert string to float: '���'
I have a (2M, 23) dimensional numpy array X. It has a dtype of U26, i.e. unicode string of 26 characters.
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
['50905', '0', '0', ..., '110', '0', '0'],
['143899', '1325', '28.80434783', ..., '61', '0', '0'],
['85', '0', '0', ..., '1980', '0', '0'],
['233', '54', '27', ..., '-1', '0', '0'],
['���', '�', '�����', ..., '�', '��', '���']], dtype='U26')
When I convert it to a float datatype, using
X_f = X.astype(float)
I get the error shown above. How can I solve this conversion error for '���'?
I realize that some characters were not read properly into the dataframe, and that the Unicode replacement character (U+FFFD) is just a result of that.
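To illustrate where a replacement character can come from, here is a minimal sketch using only the standard library. The sample byte is made up for the demonstration and is not taken from final.csv; it only shows that bytes decoded with the wrong codec turn into U+FFFD, which float() then rejects.

```python
# Bytes for 'é' in Latin-1; this single byte is not valid UTF-8 on its own.
raw = "é".encode("latin-1")  # b'\xe9'

# Decoding with the wrong codec and errors="replace" substitutes U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

# float() cannot parse the replacement character, so it raises ValueError.
try:
    float(decoded)
    conversion_failed = False
except ValueError:
    conversion_failed = True
```

If this is what happened here, the original digits are already lost by the time the array is built, so the fix would have to happen where the file is read.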
My questions:
- How do I handle this misreading?
- Should I ignore these characters, or should I replace them with zero?
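As a rough sketch of the "ignore" option, the rows that cannot be parsed could be located and dropped before calling astype(float). The names X_demo, is_number, parseable, bad_rows and X_clean are hypothetical, and the small array only stands in for the real X. Note that np.vectorize is a plain Python loop underneath, so on 2M rows it may be slow.

```python
import numpy as np

def is_number(text):
    """Return True if float() can parse the string."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Tiny stand-in for X, with one corrupted row of replacement characters.
X_demo = np.array([
    ["143347", "1325", "28.19148936"],
    ["85", "0", "0"],
    ["\ufffd\ufffd\ufffd", "\ufffd", "\ufffd\ufffd"],
])

# Boolean array of the same shape: True where the cell is parseable.
parseable = np.vectorize(is_number)(X_demo)

# Indices of rows with at least one unparseable cell.
bad_rows = np.where(~parseable.all(axis=1))[0]

# Keep only the fully parseable rows and convert those to float.
X_clean = X_demo[parseable.all(axis=1)].astype(float)
```

Dropping the rows avoids inventing values. Replacing them with zero would be indistinguishable from the genuine zeros already present in the data.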
Additional information on how the data was read:
importing relevant packages
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
loading the dataset in a pyspark dataframe
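def loading_data(dataset):
    # NOTE: the text before "'csv')" was lost from the post; "sqlContext.read.format(" is an assumed reconstruction
    dataset = sqlContext.read.format('csv').options(header='true', inferSchema='true').load(dataset)
    # changing column header name
    dataset = dataset.select(*[col(s).alias('Label') if s == ' Label' else s for s in dataset.columns])
    # drop the unused column
    dataset = dataset.drop('External IP')
    # keep only rows that have a label
    dataset = dataset.filter(dataset.Label.isNotNull())
    # filter out repeated header rows, where the label is the literal ' Label'
    dataset = dataset.filter(dataset.Label != ' Label')
    return dataset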
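Before loading, the raw file could be checked for bytes that are not valid in the expected encoding. The following is a minimal sketch, assuming the CSV is supposed to be UTF-8. The helper find_undecodable_lines and the throwaway demo file are hypothetical and only illustrate the idea; on the real data the path would be ds_path.

```python
import os
import tempfile

def find_undecodable_lines(path, encoding="utf-8", limit=10):
    """Return (line_number, raw_bytes) pairs for lines that fail to decode."""
    bad = []
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((line_number, raw))
                if len(bad) >= limit:
                    break
    return bad

# Demo on a throwaway file whose second line contains invalid UTF-8 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"143347,1325\n\xff\xfe\xff,0\n")
    demo_path = tmp.name

bad_lines = find_undecodable_lines(demo_path)
os.remove(demo_path)
```

If this reports lines, the file is either in a different encoding or partly corrupted. If it reports nothing, the damage was probably introduced later in the pipeline.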
# invoking
ds_path = '../final.csv'
check type of dataset.
convert to np array
import numpy as np
np_dfr = np.array(data_preprocessing(dataset).collect())
split features and labels
X = np_dfr[:,0:22]
Y = np_dfr[:,-1]
show X
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
['50905', '0', '0', ..., '110', '0', '0'],
['143899', '1325', '28.80434783', ..., '61', '0', '0'],
['85', '0', '0', ..., '1980', '0', '0'],
['233', '54', '27', ..., '-1', '0', '0'],
['���', '�', '�����', ..., '�', '��', '���']], dtype='U26')