Bad Input Shape -- How to interpret and Diagnose; Also side ML question

I apologize, I am an ML novice, but I am trying to learn. I am building a classifier on this dataset to predict mental health disorders from features such as age, ethnicity, and gender. I wanted to run a very simple Naive Bayes (NB) classifier, but I keep getting a "bad input shape" error. I am having trouble working out where the error comes from and how to fix it. Any guidance? (Ignore the extra imports at the top; I was trying different things, but I assume the problem is in how I am passing the data to the model.)

Namely, for each of these labels (diagnoses) I want an output of 0 or 1 showing whether it is present, based on the features (which are numeric).

Feature names: ['YEAR', 'AGE', 'EDUC', 'ETHNIC', 'RACE']
Values: [9, -9, 4, 2]

Labels: ['ADHDFLG', 'CONDUCTFLG', 'DELIRDEMFLG', 'BIPOLARFLG', 'DEPRESSFLG', 'ODDFLG', 'PDDFLG', 'PERSONFLG', 'SCHIZOFLG', 'ALCSUBFLG']
Corresponding label values: [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

Also, a side question: does anyone have recommendations for other machine learning tasks I can try with this dataset? I am doing this for a class and am trying to push myself to learn new topics. Thanks in advance!

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB

import scipy

from sklearn.model_selection import train_test_split


df = pd.read_csv("https://csprojectdatavisualizationsample50k.s3.us-east-2.amazonaws.com/sample_df.csv")
df_columns = df.columns
df_feature_names = (df_columns[1:6]).to_list()
df_features = df.iloc[:,2:6].values
df_label_names = (df_columns[26:36]).to_list()
df_labels = df.iloc[:, 26:36].values
#Input
print(df_label_names)

# Split our data
train, test, train_labels, test_labels = train_test_split(df_features,
                                                          df_labels,
                                                          test_size=0.50,
                                                          random_state=42)

print(train.shape)
print(test.shape)

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
print(preds)


Topic naive-bayes-classifier machine-learning

Category Data Science


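According to the documentation, GaussianNB can handle multiple classes, but its fit function still expects the labels as a one-dimensional array of shape (n_samples,). That array may contain more than two class values, for example [0, 1, 2, 3, ...]. Your train_labels is two-dimensional (one column per diagnosis flag), which is what triggers the bad input shape error.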
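To see the shape requirement in isolation, here is a minimal sketch on synthetic data. The arrays X, y_2d, and y_1d are made-up stand-ins for the real features and flag columns, not the actual dataset, and the exact wording of the error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 100 samples, 4 numeric features,
# and 10 binary flag columns laid out like df.iloc[:, 26:36].
X = rng.integers(0, 10, size=(100, 4)).astype(float)
y_2d = rng.integers(0, 2, size=(100, 10))

# A 2-D label matrix is rejected by GaussianNB.fit with a ValueError.
try:
    GaussianNB().fit(X, y_2d)
    fit_2d_failed = False
except ValueError as err:
    fit_2d_failed = True
    print("fit rejected 2-D labels:", err)

# A single flag column is 1-D, so fit accepts it.
y_1d = y_2d[:, 0]
model = GaussianNB().fit(X, y_1d)
preds = model.predict(X)
print(preds.shape)
```

Printing train_labels.shape before calling fit is a quick way to check for the same problem on the real data.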

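When I replaced your train labels with a one-dimensional array, as below, the fit function worked. The values here are random placeholders, used only to check the shape:

train_labels = np.random.randint(0, 9, 25000)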

So you need to build a one-dimensional label array. If each sample belongs to exactly one class, label it with that class. If some samples belong to several classes at once, either pick one of them, or define groups of classes and label each sample by whether or not it belongs to the group.
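The labelling options above can be sketched as follows, again on synthetic data rather than the real dataset. The choice of group_cols is purely hypothetical. The last part uses scikit-learn's MultiOutputClassifier, which is not something this answer proposes; it is shown only as one possible way to get a 0/1 prediction for every flag by fitting one GaussianNB per label column.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 200 samples, 4 features, 10 binary flags.
X = rng.integers(0, 10, size=(200, 4)).astype(float)
Y = rng.integers(0, 2, size=(200, 10))

# Case 1: every sample has exactly one flag set (one-hot rows).
# The column index of that flag can then serve as the class label.
one_hot = np.eye(10, dtype=int)[rng.integers(0, 10, size=200)]
y_single = one_hot.argmax(axis=1)            # shape (200,), values 0..9
single_model = GaussianNB().fit(X, y_single)

# Case 2: samples may have several flags. Define a group of flag columns
# (hypothetical choice: columns 0 and 1) and label each sample 1 if it has
# any flag in the group, else 0.
group_cols = [0, 1]
y_group = (Y[:, group_cols].sum(axis=1) > 0).astype(int)
group_model = GaussianNB().fit(X, y_group)
group_preds = group_model.predict(X)         # shape (200,)

# Alternative (not from the answer): one independent GaussianNB per flag.
multi_model = MultiOutputClassifier(GaussianNB()).fit(X, Y)
multi_preds = multi_model.predict(X)         # shape (200, 10), entries 0 or 1
print(group_preds.shape, multi_preds.shape)
```

The per-flag approach treats the ten diagnoses as independent problems, so it ignores any correlation between them; whether that is acceptable depends on the task.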
