Getting 'ValueError: setting an array element with a sequence.' when attempting to fit mixed-type data
I have already seen this, this and this question, but none of the suggestions seemed to fix my problem (so I have reverted them).
I have the following code:
import pandas as pd
import spacy
from spacy.lang.en import English
from sklearn.base import TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

nlp = spacy.load('en_core_web_sm')
parser = English()


class CleanTextTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}


def cleanText(text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text


def tokenizeText(sample):
    # STOPLIST and SYMBOLS are defined exactly as in the spaCy tutorial linked below
    tokens = parser(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in STOPLIST]
    tokens = [nlp(tok)[0].lemma_ for tok in tokens if tok not in SYMBOLS]
    return tokens


class multilabelbin(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = MultiLabelBinarizer(*args, **kwargs)

    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=0):
        return self.encoder.transform(x)


def represent(rd, ed, number, category, text):
    doc_train = rd
    doc_test = ed
    for column in category:
        # split the comma-delimited categorical values into tuples of labels
        doc_train[column] = [tuple(doc.split(",")) for doc in rd[column]]
        doc_test[column] = [tuple(doc.split(",")) for doc in ed[column]]
        print("columns split")
        mlb = multilabelbin(sparse_output=False)
        mlb.fit(doc_train)
        transformed_r = mlb.transform(doc_train)
        for row in range(len(doc_train[column])):
            print(doc_train[column][row])
            doc_train[column][row] = transformed_r[row]
        transformed_e = mlb.transform(doc_test)
        for row in range(len(doc_test[column])):
            print(doc_test[column][row])
            doc_test[column][row] = transformed_e[row]
        print("categorical columns encoded using MultiLabelBinarizer()")
    for column in number:
        ss = StandardScaler()
        ss.fit(doc_train[column].values.reshape(-1, 1))
        doc_train[column] = ss.transform(doc_train[column].values.reshape(-1, 1))
        doc_test[column] = ss.transform(doc_test[column].values.reshape(-1, 1))
        print("numbers scaled using StandardScaler()")
    for column in text:
        cleaner = CleanTextTransformer()
        cleaner.fit(doc_train[column].tolist())
        doc_train[column] = cleaner.transform(doc_train[column])
        doc_test[column] = cleaner.transform(doc_test[column])
        print(doc_train[column])
        vec = TfidfVectorizer(tokenizer=tokenizeText, ngram_range=(1, 1))
        vec.fit(doc_train[column].tolist())
        doc_train[column] = vec.transform(doc_train[column]).todense()
        doc_test[column] = vec.transform(doc_test[column]).todense()
        print(doc_train[column])
        print("text vectorized")
    print("preprocessing completed successfully")
    return doc_train, doc_test


def train_classifier(train_docs, classAxis):
    clf = OneVsRestClassifier(LogisticRegression(solver='saga'))
    X = [list(train_docs[list(train_docs)[i]]) for i in range(1, len(train_docs))]
    y = list(train_docs[classAxis])
    classifier = clf.fit(list(X), y)
    return classifier


df = pd.DataFrame(pd.read_csv("testdata.csv", header=0))
test_data = pd.DataFrame(pd.read_csv("test.csv", header=0))
train, test = represent(df, test_data, ["Cat2", "Cat5"], ["Cat6"], ["Cat1", "Cat3", "Cat4", "Cat7"])
print(train, test)
model = train_classifier(train, "Class")
train.csv contains data in this format (sample table not reproduced here); test.csv is of the same format.
As you can see, there are text values, number values and categorical values. My code first splits up the categorical values (which are comma-delimited) before running them through MultiLabelBinarizer(). Then I simply scale the numbers with StandardScaler(). Next, I process the text using the spaCy settings found in this tutorial. I make sure to apply the same transformations to the test data, too, so there can be no inconsistency there. Finally, I convert everything to plain lists in the train_classifier function, which supposedly should help... but it didn't. On the line classifier = clf.fit(list(X), y), I get the following error:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\User\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\191.7141.48\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Users\User\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\191.7141.48\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/User/PycharmProjects/ml/ml.py", line 148, in <module>
    model = train_classifier(train, "Class")
  File "C:/Users/User/PycharmProjects/ml/ml.py", line 124, in train_classifier
    classifier = clf.fit(list(X), y)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\multiclass.py", line 215, in fit
    for i, column in enumerate(columns))
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\linear_model\logistic.py", line 1288, in fit
    accept_large_sparse=solver != 'liblinear')
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
I have tried to read through the docs, and I am not one to shy away from reading source code (PyCharm helped me pinpoint where the error is raised), but I am no closer to fixing it. I feel like I have honestly tried everything on the first three pages of Google, but with no success.
How can I fix this error? Why is it happening? Is my preprocessing wrong? I know it is a bit dodgy in places, but does that make it non-functional? If so, how could I fix these issues in the preprocessor, and would that fix the ValueError: setting an array element with a sequence. error?
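In case it helps narrow things down, here is a minimal, self-contained sketch (toy values I made up, nothing to do with my real CSVs) of what I suspect my preprocessing ends up producing: some columns hold plain scalars while others hold a whole NumPy array per cell, and converting that column-major structure to a float array reproduces the exact same message for me:

import numpy as np
import pandas as pd

# Toy frame imitating the layout after my represent() function (hypothetical values):
# one scaled numeric column plus one column whose cells each hold a binarized array.
df = pd.DataFrame({
    "num": [0.5, -1.2, 0.7],
    "cat": [np.array([1, 0, 1]), np.array([0, 1, 0]), np.array([1, 1, 0])],
})

# Column-major list of lists, like the X built in my train_classifier()
X = [list(df[col]) for col in ["num", "cat"]]

# Roughly what sklearn's check_array() does internally before fitting
np.asarray(X, dtype=np.float64)
# ValueError: setting an array element with a sequence.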
Some notes:
- For some reason, spaCy seems to return 0.0 for most values in each column.
- I am unsure if I can just insert my MultiLabelBinarizer() output into the DataFrame like this (simply as the 2D arrays) - is this OK? Are there any more steps required? (See the sketch after this list.)
- I have tried Pipelines for more semantic code, as well as using different classifiers for the different data types (e.g. using Chi^2 for text, and other things for other types), but it always seemed to result in an endless well of bugs.
- I am unable to even pinpoint what throws this error: is it the categorical data, the text data or the number data? I don't know.
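Regarding the second note above, the alternative I was wondering about would look roughly like this (a toy sketch with a made-up column and values, not my actual code): instead of storing a whole binarized array in each cell, expand the MultiLabelBinarizer() output into one flat 0/1 column per label:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical comma-delimited categorical column
df = pd.DataFrame({"Cat6": ["a,b", "b", "a,c"]})

mlb = MultiLabelBinarizer()
binarized = mlb.fit_transform(df["Cat6"].str.split(","))

# One plain numeric column per label instead of an array per cell
expanded = pd.DataFrame(binarized, columns=mlb.classes_, index=df.index)
df = pd.concat([df.drop(columns=["Cat6"]), expanded], axis=1)
print(df)

Is that the kind of extra step that is needed here, or is storing the arrays directly fine?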