Getting 'ValueError: setting an array element with a sequence.' when attempting to fit mixed-type data

I have already seen this, this and this question, but none of the suggestions seemed to fix my problem (so I have reverted them).

I have the following code:

# imports used below
import pandas as pd
import spacy
from spacy.lang.en import English
from sklearn.base import TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# STOPLIST and SYMBOLS are defined as in the spaCy tutorial mentioned below (not shown here)

nlp = spacy.load('en_core_web_sm')
parser = English()

class CleanTextTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}


def cleanText(text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text


def tokenizeText(sample):
    tokens = parser(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in STOPLIST]
    tokens = [nlp(tok)[0].lemma_ for tok in tokens if tok not in SYMBOLS]
    return tokens

class multilabelbin(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = MultiLabelBinarizer(*args, **kwargs)

    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=0):
        return self.encoder.transform(x)


def represent(rd, ed, number, category, text):
    doc_train = rd
    doc_test = ed

    for column in category:
        doc_train[column] = [tuple(doc.split(",")) for doc in rd[column]]
        doc_test[column] = [tuple(doc.split(",")) for doc in ed[column]]

        print("columns split")

        mlb = multilabelbin(sparse_output=False)
        mlb.fit(doc_train)

        transformed_r = mlb.transform(doc_train)
        for row in range(len(doc_train[column])):
            print(doc_train[column][row])
            doc_train[column][row] = transformed_r[row]

        transformed_e = mlb.transform(doc_test)
        for row in range(len(doc_test[column])):
            print(doc_test[column][row])
            doc_test[column][row] = transformed_e[row]

        print("categorical columns encoded using MultiLabelBinarizer()")

    for column in number:
        ss = StandardScaler()
        ss.fit(doc_train[column].values.reshape(-1, 1))

        doc_train[column] = ss.transform(doc_train[column].values.reshape(-1, 1))
        doc_test[column] = ss.transform(doc_test[column].values.reshape(-1, 1))
        print("numbers scaled using StandardScaler()")

    for column in text:
        cleaner = CleanTextTransformer()
        cleaner.fit(doc_train[column].tolist())

        doc_train[column] = cleaner.transform(doc_train[column])
        doc_test[column] = cleaner.transform(doc_test[column])

        print(doc_train[column])

        vec = TfidfVectorizer(tokenizer=tokenizeText, ngram_range=(1, 1))
        vec.fit(doc_train[column].tolist())

        doc_train[column] = vec.transform(doc_train[column]).todense()
        doc_test[column] = vec.transform(doc_test[column]).todense()

        print(doc_train[column])

        print("text vectorized")

    print("preprocessing completed successfully")

    return doc_train, doc_test


def train_classifier(train_docs, classAxis):
    clf = OneVsRestClassifier(LogisticRegression(solver='saga'))

    X = [list(train_docs[list(train_docs)[i]]) for i in range(1, len(train_docs))]
    y = list(train_docs[classAxis])

    classifier = clf.fit(X, y)
    return classifier

df = pd.DataFrame(pd.read_csv("testdata.csv", header=0))
test_data = pd.DataFrame(pd.read_csv("test.csv", header=0))

train, test = represent(df, test_data, ["Cat2", "Cat5"], ["Cat6"], ["Cat1", "Cat3", "Cat4", "Cat7"])

print(train, test)

model = train_classifier(train, "Class")

train.csv contains data in this format:

test.csv is of the same format.

As you can see, there are text values, numeric values and categorical values. My code first splits up the categorical values (which are comma-delimited) before running them through MultiLabelBinarizer(). Then I scale the numbers with StandardScaler(). Finally, I process the text using the spaCy setup from this tutorial. I make sure to apply the fitted transformations to the test data too, so there should be no inconsistency there. In the train_classifier function I also convert everything to lists, which supposedly should help... but it didn't. On the line classifier = clf.fit(list(X), y), I get the following error:

Traceback (most recent call last):
  File "input", line 1, in module
  File "C:\Users\User\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\191.7141.48\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Users\User\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\191.7141.48\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/User/PycharmProjects/ml/ml.py", line 148, in module
    model = train_classifier(train, "Class")
  File "C:/Users/User/PycharmProjects/ml/ml.py", line 124, in train_classifier
    classifier = clf.fit(list(X), y)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\multiclass.py", line 215, in fit
    for i, column in enumerate(columns))
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in listcomp
    for func, args, kwargs in self.items]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\linear_model\logistic.py", line 1288, in fit
    accept_large_sparse=solver != 'liblinear')
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

I have tried reading through the docs, and I am not one to shy away from reading source code (PyCharm helped me pinpoint where the error is raised), but I am no closer to fixing it. I feel like I have honestly tried everything on the first three pages of Google, but to no avail.

How can I fix this error? Why is it happening? Is my preprocessing wrong? I know it's a bit dodgy in places, but does that make it non-functional? If so, how could I fix those issues in the preprocessing? Would that resolve the ValueError: setting an array element with a sequence. error?

Some notes:

  • For some reason, the spaCy/TF-IDF vectorization seems to return 0.0 for most values in each text column.
  • I am unsure whether I can just insert my MultiLabelBinarizer() output back into the DataFrame like this (row by row, as arrays in the cells) - is this OK? Are there any more steps required?
  • I have tried Pipelines for cleaner code, as well as using different processing for the different data types (e.g. Chi^2 for text, and other things for other types), but it always seemed to result in an endless well of bugs.
  • I am unable to even pinpoint what throws this error: is it the categorical data, the text data or the numeric data? I don't know.

Topic: vector-space-models, scikit-learn, python, machine-learning

Category: Data Science


The packages you are using are designed to work in a very specific way, and at each stage your data may not be in the form they expect.

A NumPy array has to have a consistent dtype throughout. For machine learning that dtype has to be numerical, typically float. If you pass an object-dtype array into scikit-learn - which is what you get when DataFrame cells hold tuples or whole arrays, as your represent() loops produce - it will not work, and that is exactly what raises "setting an array element with a sequence."
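A minimal sketch of how that failure mode arises (the column names are just illustrative): storing whole arrays inside individual DataFrame cells, as the loops in represent() do, makes the column dtype object, and NumPy then cannot pack it into a rectangular float array.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Cat6": [1.0, 2.0]})
# Putting an entire array into each cell makes the column dtype "object" -
# every cell is now itself a sequence.
df["Cat2"] = [np.array([0, 1, 0]), np.array([1, 0, 0])]
print(df.dtypes)  # Cat2 -> object

# scikit-learn does the equivalent of this inside check_array(), and it fails
# because a single float cell cannot hold a 3-element sequence:
np.asarray(df.values, dtype=float)  # ValueError: setting an array element with a sequence.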

Scikit-learn expects something it can cleanly convert into a 2-D numeric NumPy array, not nested Python lists or columns whose cells are themselves sequences. If the data stays a flat, numeric NumPy array from one stage to the next, the code is far more likely to work.
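One way to check this before calling fit is to attempt the conversion yourself. This is a hypothetical helper, not part of scikit-learn:

import numpy as np

def as_feature_matrix(X):
    # Coerce X into the 2-D float array scikit-learn will try to build internally;
    # this raises the same ValueError early if any cell is still a sequence.
    arr = np.asarray(X, dtype=float)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D feature matrix, got shape %s" % (arr.shape,))
    return arr

# X = as_feature_matrix(X)  # run right before clf.fit(X, y) to fail early and loudly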

You are currently looping through the data manually to transform it. If you refactor your code to use scikit-learn Pipelines, that bookkeeping is handled for you, and the error messages tend to be more informative.
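For example, the text part of your preprocessing could be written as one Pipeline that reuses your CleanTextTransformer and tokenizeText. This is a rough sketch, with column names taken from your question:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One raw-text column in, predictions out; fitting on train and transforming
# test consistently is handled by the Pipeline itself.
text_clf = Pipeline([
    ("clean", CleanTextTransformer()),
    ("tfidf", TfidfVectorizer(tokenizer=tokenizeText, ngram_range=(1, 1))),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="saga"))),
])

# text_clf.fit(df["Cat1"], df["Class"])
# predictions = text_clf.predict(test_data["Cat1"])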

Since you have heterogeneous data (text, numeric and multi-label categorical columns), FeatureUnion is the best-practice way to process that kind of data in scikit-learn: each group of columns gets its own transformer and the outputs are concatenated into a single feature matrix.
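As a sketch of that idea: in scikit-learn 0.20+ the same column-wise combination is usually written with ColumnTransformer (FeatureUnion plus column-selector transformers achieves the same thing). The column split below mirrors the one in your question; the multi-label categorical columns ("Cat2", "Cat5") would need a small custom transformer (your multilabelbin plus the comma split) plugged in the same way:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

preprocess = ColumnTransformer(
    transformers=[
        # numeric column(s): scaled
        ("num", StandardScaler(), ["Cat6"]),
        # text columns: one vectorizer each (a single string selects a 1-D
        # column, which is what TfidfVectorizer expects)
        ("text1", TfidfVectorizer(tokenizer=tokenizeText), "Cat1"),
        ("text3", TfidfVectorizer(tokenizer=tokenizeText), "Cat3"),
        # "Cat2"/"Cat5" would go here with a custom multi-label transformer
    ],
    remainder="drop",
)

model = Pipeline([
    ("prep", preprocess),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="saga"))),
])

# model.fit(df.drop(columns=["Class"]), df["Class"])
# predictions = model.predict(test_data)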
