I am new to machine learning and started solving the Titanic survivor problem on Kaggle. While solving the problem using logistic regression I used various models having polynomial features with degree $2, 3, 4, 5, 6$. Theoretically the accuracy on the training set should increase with the degree, however it started decreasing after degree $2$. The graph is shown below.
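One detail worth noting, since it can explain a non-monotonic training accuracy: scikit-learn's LogisticRegression is L2-regularized by default (C=1.0) and its solver may stop before converging once the polynomial expansion gets large. A minimal sketch of checking this, using a synthetic stand-in for the prepared Titanic features (the X/y below are placeholders, not your actual preprocessing):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic stand-in for the prepared Titanic features and labels
X, y = make_classification(n_samples=800, n_features=8, random_state=0)

for degree in (2, 3, 4, 5, 6):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        # large C ~ almost no regularization; raise max_iter so the solver converges
        LogisticRegression(C=1e6, max_iter=10000),
    )
    model.fit(X, y)
    print(degree, model.score(X, y))   # training accuracy per degree

With regularization effectively turned off and the solver allowed to converge, training accuracy is much less likely to drop at higher degrees.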
My dataset contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later. Word embedding techniques are preferably used for longer text sequences, not for single-word strings as in this case, so I don't think those techniques would work correctly here. Additionally, label encoding or label binarization may not be suitable ways to work with names, because of the many different values on the one side …
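One option sometimes used for high-cardinality string columns such as names is frequency encoding, which replaces each name by how often it occurs before feeding the result to Isolation Forest. A rough sketch with a made-up dataframe (column names and values here are hypothetical):

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    'first_name': ['anna', 'john', 'anna', 'mary'],
    'last_name': ['smith', 'doe', 'brown', 'smith'],
    'amount': [10.0, 250.0, 12.5, 9.0],
})

# replace each name with its relative frequency in the column
for col in ['first_name', 'last_name']:
    df[col + '_freq'] = df[col].map(df[col].value_counts(normalize=True))

features = df[['first_name_freq', 'last_name_freq', 'amount']]
iso = IsolationForest(random_state=0).fit(features)
print(iso.predict(features))   # -1 = outlier, 1 = inlier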
I have discrete values in the target variable (exactly 13 different values in total). When I give that as input to a random forest classifier, it gives an error that the input is continuous. And if I give it to a regressor, it predicts values in between the discrete values. How can I treat this problem?
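If those 13 values are stored as floats, scikit-learn's label check flags the target as continuous; casting the values to strings (or mapping them to integer codes) before fitting lets the classifier treat each value as a class. A small sketch with made-up data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 5)                                  # made-up features
y = np.random.choice(np.linspace(0.5, 6.5, 13), size=100)   # 13 distinct numeric levels

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y.astype(str))                   # each level becomes a class label
preds = clf.predict(X[:3]).astype(float)    # convert back if numeric values are needed
print(preds)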
For the purposes of a quite big project I am doing text mining on some documents. My steps are quite common: converting everything to lower case, tokenization, stop lists and stop words, lemmatization, stemming, and some other steps like removing symbols. Then I prepare a bag of words, make a DTF and classify into 3 classes with SVM and Naive Bayes. But the accuracy I get is not too high (50-60%). I think that may be because in the array of words after all the steps …
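For reference, a compact scikit-learn version of that kind of pipeline, which makes it easy to cross-validate the 50-60% figure; the documents and labels below are placeholders, and TfidfVectorizer stands in for the bag-of-words / DTF step:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["first example document", "another short text",
        "one more sample", "and a fourth one"] * 25      # placeholder documents
labels = [0, 1, 2, 0] * 25                               # placeholder 3-class labels

for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2)),
        clf,
    )
    scores = cross_val_score(pipe, docs, labels, cv=5)
    print(type(clf).__name__, scores.mean())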
I am using the CART classification technique, dividing a dataset into train and test sets. I have been using misclassification error, KS by rank ordering, AUC and Gini as MPMs (model performance measures). The problem I am facing is that the MPM values are quite far apart. I have tried minsplit values anywhere from 20 to 1400 and minbucket values from 5 to 100 but couldn't get the expected results. I have also tried oversampling/undersampling through the ROSE package but …
Let's say I have trained a classifier that classifies images of animals into 10 different classes. And let's say that I have 20 different images of a particular animal, and because I know the photographer, I know with certainty that all 20 images are of the same animal. So I use my classifier to make a prediction on what animal it is and get 20 predictions, one for each image. The model predicts all the images to be a dog …
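One common way to turn those 20 per-image predictions into a single answer is to average the predicted class probabilities and take the argmax. A sketch assuming a scikit-learn-style classifier with predict_proba (the model, features and class names below are stand-ins; the same idea applies to softmax outputs of a neural network):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def combined_prediction(model, image_features, class_names):
    # average the predicted class probabilities over all images of the same animal
    probs = model.predict_proba(image_features)   # shape (n_images, n_classes)
    mean_probs = probs.mean(axis=0)
    return class_names[int(np.argmax(mean_probs))], mean_probs

# stand-in 10-class problem; pretend the first 20 rows are the 20 photos of one animal
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

label, probs = combined_prediction(model, X[:20], class_names=np.arange(10))
print(label, probs.round(2))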
I am trying to build a classifier for a specific card dataset, let's say cards or no cards. I am using MobileNet trained on the ImageNet dataset as my classifier and further training it on my dataset. I am able to train it, and its performance is quite good on the dataset. Let's say my card has four different regions of interest as shown below: It is able to perfectly recognize the above-passed image as a card. But I am …
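For context, a rough outline of the MobileNet transfer-learning setup being described, assuming TensorFlow/Keras with a binary card / no-card head; the directory path, image size and epoch count are placeholders:

import tensorflow as tf

base = tf.keras.applications.MobileNet(weights='imagenet', include_top=False,
                                       input_shape=(224, 224, 3))
base.trainable = False                                    # keep the ImageNet features frozen at first

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),    # MobileNet expects inputs in [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),       # card vs. no card
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# hypothetical folder containing card/ and no_card/ subdirectories
train_ds = tf.keras.utils.image_dataset_from_directory('cards_dataset/train',
                                                       image_size=(224, 224))
model.fit(train_ds, epochs=5)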
I have the code below outputting the accuracy. How can I output the F1-score instead?

clf.fit(data_train, target_train)
preds = clf.predict(data_test)
# accuracy for the current fold only
r2score = clf.score(data_test, target_test)
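One way, assuming clf, data_test and target_test are the same objects as above, is sklearn.metrics.f1_score; note that for a multi-class target the average argument has to be set explicitly:

from sklearn.metrics import f1_score

preds = clf.predict(data_test)
# average='binary' (the default) for two classes; 'macro' or 'weighted' for multi-class
f1 = f1_score(target_test, preds, average='weighted')
print(f1)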
As part of a group project at university, we are given a series of videos of cell cultures over a 24 hour period. A number of these cells (the "knockout" cells) have had a particular gene removed, which is often absent or mutated in malignancy. We are using a blob detection algorithm to identify the cell centers and radii and further processing to match cells frame-to-frame to build up individual paths, which we then use to calculate various features. We …
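For context, this is roughly what the blob-detection step could look like with scikit-image's Laplacian-of-Gaussian detector; the frame below is synthetic, and the sigma/threshold values are placeholders to tune against real footage:

import numpy as np
from skimage.draw import disk
from skimage.feature import blob_log

# synthetic grayscale frame with two bright "cells"
frame = np.zeros((200, 200))
for center in [(60, 60), (140, 120)]:
    rr, cc = disk(center, 12)
    frame[rr, cc] = 1.0

blobs = blob_log(frame, min_sigma=5, max_sigma=20, num_sigma=10, threshold=0.1)
blobs[:, 2] *= np.sqrt(2)            # convert sigma to an approximate radius
for y, x, r in blobs:
    print(f"cell centre ({x:.0f}, {y:.0f}), radius {r:.1f}")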
I keep trying to run a new set of data through my KNN classifier but receive the message:

ValueError: query data dimension must match training data dimension

I then used:

x_new = pd.read_csv('NewFeaturePractice.csv', names=attributes)
x_new = x_new.values.reshape(52,84)

(which is the shape of the training data) but then receive:

ValueError: cannot reshape array of size 672 into shape (52,84)

The second data set doesn't have the same number of rows as the first, meaning that even …
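If 84 is the number of feature columns the classifier was fit on, only that dimension has to match; the number of rows is allowed to differ, so the reshape doesn't need to hard-code 52. A sketch under that assumption (knn is a hypothetical name for the fitted classifier):

x_new = pd.read_csv('NewFeaturePractice.csv', names=attributes)
x_new = x_new.values.reshape(-1, 84)   # keep 84 feature columns, let the row count vary
print(x_new.shape)                     # 672 values -> (8, 84)
preds = knn.predict(x_new)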
If I have a dataset of $(x, y)$ and target $f$, how do I learn a model based on that dataset that allows me to insert a value of $f$ and get the optimal conditions $(x, y)$ that correspond to it? Thanks in advance.
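One simple baseline is to invert the mapping: fit a multi-output regressor that predicts $(x, y)$ from $f$, keeping in mind that if several $(x, y)$ pairs produce the same $f$ the model will return something like their average. A sketch with made-up data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# made-up dataset: conditions (x, y) and the resulting target f
conditions = np.random.rand(500, 2)                   # columns: x, y
f = conditions[:, 0] ** 2 + 0.5 * conditions[:, 1]    # stand-in for the real response

# inverse model: predict (x, y) from f
inverse_model = RandomForestRegressor(n_estimators=200, random_state=0)
inverse_model.fit(f.reshape(-1, 1), conditions)

print(inverse_model.predict([[0.8]]))   # candidate (x, y) for a desired f = 0.8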
I have a dataset in the format:

Keywords                                            Disease/Drugs
bradycardia, insomnia, hypotension, hearinglos...   NSAIDS Poisoning
vomiting, nausea, diarrhea, seizure, edema, an...   NSAIDS Poisoning
pancreatitis, gi, symptoms, restlessness, leuk...   Chronic abacavir use (Nucleoside Analog Revers..
ards, apnea, hepatotoxicity, dyspnea, pulmonar...   Chronic stavudine and didanosine use (Nucleosi...

There is a lot of data, but it is all in this format. I converted the above data by splitting the Keywords column on "," and exploding it, creating new rows:

Keywords       Disease/Drugs
bradycardia    NSAIDS Poisoning
insomnia       NSAIDS Poisoning
pancreatitis   Chronic stavudine …
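For reference, the split-and-explode step itself can be done directly in pandas (the rows below are just a truncated stand-in for the real data):

import pandas as pd

df = pd.DataFrame({
    'Keywords': ['bradycardia, insomnia, hypotension', 'vomiting, nausea, diarrhea'],
    'Disease/Drugs': ['NSAIDS Poisoning', 'NSAIDS Poisoning'],
})

df['Keywords'] = df['Keywords'].str.split(',')   # turn the string into a list of keywords
df = df.explode('Keywords')                      # one row per keyword
df['Keywords'] = df['Keywords'].str.strip()
print(df)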
Consider a setup of one-dimensional data $\in \mathbb{R}^1$, where the hypothesis space $H$ is parametrized by $\{p, q\}$ and $x$ is classified as $1$ iff $p < x < q$. What will $\mathrm{VC}(H)$ be? Here's my approach: since the data is 1D, we can represent the hypothesis space on a number line. We will consider 2 points, try all possibilities, and see if they can all be classified correctly. Assume the data points are $d_1$ and $d_2$. Case 1: $p <$ …
So, I'm trying to work with decision trees on the Iris dataset. I've noticed by trying out different parameters (max_depth, number of leaves, etc.) that some of the classes are easier to predict (most of the trees give the same prediction for them). How do I justify this, and is there a way to visualize it based on different trees?
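One way to look at this is to combine per-class metrics with a plot of the fitted tree; on Iris, the tree usually separates one class (setosa) with a single split, which is why most parameter settings agree on it. A quick sketch with scikit-learn:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# per-class precision/recall shows which classes the tree finds easy
print(classification_report(iris.target, clf.predict(iris.data),
                            target_names=iris.target_names))

# visualize the fitted tree itself
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()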
I was wondering how image classifier networks perform on images that are not photographs. For example, if you were to feed a drawing of a car or a face to an image classifier that was only trained on photos, would the network still be able to classify the image correctly? Furthermore, what if you were to feed more and more abstract drawings into the network? As humans, we are able to recognize objects even in abstract forms (e.g., modern art) …
Betting markets offer betting lines for football matches, where you can bet over or under x offsides for a team. For example, for one match they can offer U4.5 offsides with odds 2.0/2.0 (let's assume there's no rake). For other matches, where for certain reasons there will be a lower likelihood of offsides during the match, they could offer U2.5 offsides at the same odds. Hence, I want to create a model (I have the data) which can give probabilities which can be …
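One common modelling route for this kind of over/under line is to estimate the expected number of offsides for the match (e.g. with a Poisson or negative-binomial regression on the match features) and then read the over/under probability off that count distribution. A minimal sketch, assuming offside counts are roughly Poisson and using a made-up expected value:

from scipy.stats import poisson

expected_offsides = 3.2                            # hypothetical model output for one team/match
p_under_4_5 = poisson.cdf(4, expected_offsides)    # P(offsides <= 4), i.e. "under 4.5"
p_over_4_5 = 1 - p_under_4_5

print(round(p_under_4_5, 3), round(p_over_4_5, 3))
print(round(1 / p_under_4_5, 2))                   # fair decimal odds for the under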
Could anyone point me to a blog or other content that talks about creating credit scorecards without logistic regression models? Instead, if we use an ensemble technique such as random forest, how can we create scorecards? Essentially, how do we create a scorecard model from a classifier that is difficult to interpret, like a random forest? Thanks in advance.
What I know: precision $= \frac{TP}{TP+FP}$ and recall $= \frac{TP}{TP+FN}$. What the book says: "A model that declares every record positive has high recall but low precision." I understand that if the number of predicted positives is high, precision will be low. But how will recall be high if the number of predicted positives is high? "A model that assigns a positive class to every test record that matches one of the positive records in the training set has very high precision but low recall." I am not able to properly …
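A quick worked example with made-up counts may help here: suppose the test set has $20$ actual positives and $80$ actual negatives, and the model declares every record positive. Then $TP = 20$, $FP = 80$, $FN = 0$, so
$$\text{Recall} = \frac{TP}{TP+FN} = \frac{20}{20+0} = 1, \qquad \text{Precision} = \frac{TP}{TP+FP} = \frac{20}{20+80} = 0.2.$$
Recall is high because predicting everything as positive means no actual positives are missed ($FN = 0$); precision is low because most of those predictions are wrong.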
I'm working with several estimators of all kinds. I then want to stack these estimators, and it is best if they have low correlation between them. I suppose that the correlation method depends on the type of the dependent variable, whether it's categorical or numerical. In my case it's categorical, and the estimators are classifiers. How can I compute the correlation between two estimators?
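For classifiers, one common proxy for "correlation" is how often two models agree on a held-out set, e.g. Cohen's kappa between their predictions (or the correlation between their predicted probabilities). A sketch with synthetic data and two hypothetical base estimators:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

preds_a = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
preds_b = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

# kappa close to 1 means the two classifiers make nearly identical predictions
print(cohen_kappa_score(preds_a, preds_b))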