I recently had to work on a problem where the best baseline was KNN (geolocated data). I have different targets (binary classification, multiclass classification, and regression) and associated metrics, so I use KNN interchangeably for classification or regression. This baseline was easy to implement in Python (sklearn). I was wondering how to improve on the baseline. I tried tuning the KNN hyperparameters: optimising k helped a bit, while modifying the distance metric didn't (the natural L2 distance worked best by far). Other models gave …
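A minimal sketch of that kind of hyperparameter search, assuming scikit-learn's GridSearchCV; the grid values and the synthetic data are illustrative stand-ins:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in data; replace with the real geolocated features.
    X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    param_grid = {
        "n_neighbors": [1, 3, 5, 11, 21],    # optimising k
        "weights": ["uniform", "distance"],  # plain vs. distance-weighted votes
        "p": [1, 2],                         # L1 vs. L2 Minkowski distance
    }
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)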
I have a data set with date features like 01/01/2019 and I would like to use KNN. However, I cannot find a good transformation for dates that yields a meaningful distance for the last feature. For example:

    f1 | 1  | 2 | 3  | 4 | 01/01/2019
    f2 | 10 | 3 | 12 | 1 | 14/01/2019

Does anyone have any recommendations?
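One possible transformation, assuming pandas: map each date to a numeric offset (e.g. days since the earliest date in the column) so that the usual L2 distance becomes meaningful:

    import pandas as pd

    df = pd.DataFrame({"date": ["01/01/2019", "14/01/2019"]})
    df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
    # Days elapsed since the earliest date; 0 and 13 for the rows above.
    df["days_since_start"] = (df["date"] - df["date"].min()).dt.days
    print(df)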
I'm working on a machine learning problem involving inventory (i.e. physical retail stock); through the cleaning (outlier removal) process, however, some of the items (via their corresponding transactions) will be removed. Therefore, I thought of using KNN to group similar items into respective categories. There are 1245 items. The info for each item is: Average Weighted Price, Total Quantity Sold, Total Revenue Achieved, Min Sold per Transaction, Max Sold per Transaction, Min Sell Price, Max Sell Price, Number of Unique …
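As a rough illustration of finding each item's most similar peers, here is a minimal sketch assuming scikit-learn; the 7 feature columns are random stand-ins for the item attributes listed above. Scaling matters because the features live on very different ranges (unit prices vs. total revenue):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    items = rng.random((1245, 7))       # 1245 items x 7 numeric features

    X = StandardScaler().fit_transform(items)
    nn = NearestNeighbors(n_neighbors=6).fit(X)   # 5 neighbours + the item itself
    distances, indices = nn.kneighbors(X)
    print(indices[0, 1:])               # the 5 items most similar to item 0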
I have two separate files for testing and training. In the training data, I am dropping rows that contain too many missing values. But in the test data, I cannot afford to drop rows, so I have chosen to impute the missing values using a KNN approach. My question is: to impute missing values in the test data using KNN, is it enough to consider only the test data? As in, neighbors …
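A minimal sketch of the usual pattern, assuming scikit-learn's KNNImputer: fit the imputer on the training data and apply it to the test data, so the test rows borrow neighbours from the training set rather than only from each other. The arrays are illustrative:

    import numpy as np
    from sklearn.impute import KNNImputer

    X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    X_test = np.array([[np.nan, 2.5], [4.0, np.nan]])

    imputer = KNNImputer(n_neighbors=2)
    imputer.fit(X_train)                # neighbours come from the training set
    X_test_imputed = imputer.transform(X_test)
    print(X_test_imputed)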
I have a study where I want to find users similar to a set of users (SEED). My data looks like a pivot by customer, e.g. a sample of SEED looks like this (note: I drop cust_id):

    cust_id | spend_food | spend_nike | spend_harrods
    1       | 145        | 45         | 32
    2       | 85         | 89         | 0
    4       | 23         | 67         | 1900
    5       | 84         | 12         | 900

So to find users similar …
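A minimal sketch of the "similar to the SEED set" idea, assuming scikit-learn; the candidate rows are illustrative. Each candidate is scored by its distance to the nearest SEED customer:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    seed = np.array([[145, 45, 32], [85, 89, 0], [23, 67, 1900], [84, 12, 900]])
    candidates = np.array([[150, 40, 25], [10, 5, 2000], [90, 85, 10]])

    scaler = StandardScaler().fit(seed)
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(seed))
    dist, _ = nn.kneighbors(scaler.transform(candidates))
    print(dist.ravel())   # small distance = similar to at least one SEED customer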
Here is a hypothetical, simplified dataframe for my problem, which would be low-dimensional (around 20 features), containing some made-up information about certain dog breeds:

    Breed   Min_Weight  Max_Weight  Min_Height  Max_Height  is_friendly  grp
    Husky   10          20          30          35          True         working
    Poodle  8           17          15          30          False        terrier

The algorithm would receive some information about a dog, and it would need to identify the k closest dog breeds based on the input data. It needs to be high-performance. Example: the algorithm receives an unknown breed …
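A minimal sketch under the assumption that scikit-learn is acceptable: numeric columns go straight into a KD-tree (fast for ~20 low-dimensional features), the boolean becomes 0/1, and the categorical grp column is one-hot encoded. The two rows mirror the made-up table above:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Columns: Min_Weight, Max_Weight, Min_Height, Max_Height,
    #          is_friendly, grp=working, grp=terrier
    breeds = np.array([
        [10, 20, 30, 35, 1, 1, 0],   # Husky
        [8, 17, 15, 30, 0, 0, 1],    # Poodle
    ])
    nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(breeds)
    unknown = np.array([[9, 18, 28, 33, 1, 1, 0]])
    dist, idx = nn.kneighbors(unknown)
    print(idx)   # index of the closest known breed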
I am doing hyperparameter tuning with cross-validation, and I keep finding that the optimal leaf size is 1. Should I worry? Is this a sign of overfitting?
I am trying to improve my KNN regression process (I would like to use sklearn / Python, but it doesn't matter). I would like to improve my results and gain insight. Here is an example: I have data measured from an electric motor: an input voltage (U) and current (I), and an output torque (T) and speed (S). The first attempt is a simple approach where I feed those data as-is to a KNN algorithm and I use the …
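A minimal sketch of that plain first approach, assuming scikit-learn; the motor data is synthetic. Scaling the inputs inside a pipeline is usually the first improvement to try, since voltage and current live on different scales:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.random((200, 2)) * [230.0, 10.0]               # voltage U, current I
    T = 0.05 * X[:, 0] * X[:, 1] + rng.normal(0, 1, 200)   # torque, toy relation

    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
    model.fit(X, T)
    print(model.predict([[120.0, 4.0]]))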
Let's say I have 100 values in my dataset and split it 80% train / 20% test. When predicting the last value, is the prediction based on the previous 99 values (the 80 train values plus the 19 already-predicted values) or only on the original 80 train values? For example: if a kd-tree is used, is every data point inserted into the tree during prediction? Is it possible to use KNN for the following scenario? I have 20 train values; when I add a new observation I …
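For the scikit-learn implementation at least, predictions only consult the data passed to fit(); already-predicted test points are never inserted into the kd-tree. A minimal sketch with illustrative data:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    X_train = np.arange(80).reshape(-1, 1)
    y_train = np.arange(80, dtype=float)
    knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
    print(knn.predict([[99]]))        # uses only the 80 fitted points

    # "Online" usage means refitting with the new observation included:
    X_new = np.vstack([X_train, [[80]]])
    y_new = np.append(y_train, 80.0)
    knn = KNeighborsRegressor(n_neighbors=3).fit(X_new, y_new)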
I have a question related to parallel work in Python. How can I use processors = 1, 2, 3, ... with the k-nearest-neighbour algorithm when k = 1, 2, 3, ..., to find the change in time spent, the speedup, and the efficiency? What is the appropriate code for that?
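A minimal sketch of one way to measure this, assuming scikit-learn's n_jobs parameter; the data is synthetic, and speedup is t(1 job) / t(n jobs):

    import time
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X, y = rng.random((20000, 20)), rng.integers(0, 2, 20000)

    for n_jobs in [1, 2, 3, 4]:
        knn = KNeighborsClassifier(n_neighbors=5, n_jobs=n_jobs).fit(X, y)
        start = time.perf_counter()
        knn.predict(X[:2000])
        print(n_jobs, "jobs:", round(time.perf_counter() - start, 3), "s")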
I'm trying to build an item-based recommender using k-NN. I have a list of items, all of which have some properties (features) in common.

    item    var_1        var_2  var_3        var_4        var_5
    item_1  0.171547232  a      0.908855471  0.292061808  0.285678293
    item_2  0.131694336  b      0.432665234  0.501300418  0.756824175
    item_3  0.144318764  b      0.238752071  0.487600679  0.203133779
    item_4  0.249241125  b      0.921229689  0.003638622  0.606875991
    item_5  0.414306046  b      0.190824352  0.937412611  0.1789091
    item_6  0.909501131  c      0.847112499  0.548322302  0.060136059
    item_7  0.37469644   c      0.282628025  0.211128351  0.125910578
    item_8  0.308634676  d      0.174650423  0.705026302  0.440098246
    item_9  0.039294192  …
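One way to handle the mixed numeric/categorical features, as a minimal sketch assuming pandas and scikit-learn: one-hot encode var_2, then query nearest neighbours. The values are a small illustrative subset:

    import pandas as pd
    from sklearn.neighbors import NearestNeighbors

    items = pd.DataFrame({
        "var_1": [0.17, 0.13, 0.14, 0.25],
        "var_2": ["a", "b", "b", "b"],
        "var_3": [0.91, 0.43, 0.24, 0.92],
    })
    X = pd.get_dummies(items, columns=["var_2"])   # one-hot encode var_2
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X.iloc[[0]])
    print(idx)   # item 0 itself plus its closest peer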
I have a data set with 6 variables that I'm trying to run the sknn function on, and then output a table of the results to show the k-NN results. I have converted the response variable to a factor to use as row and column headers in the table, and checked the data types of all the other variables to make sure they are compatible (int and num). For some reason, no matter what I try, R freezes trying to pull the …
The KNN algorithm is very handy and particularly suited to some of my problems, but I can't find any resources on how to implement it in production. As a comparison, when I use a neural network, I already have high-level tools at my disposal for applying it to examples (either libraries that let me exploit the hardware of my devices intelligently when I want to go embedded, or infrastructure that lets me use my neural …
I am analyzing a database and I want to perform KNN. I am using the tidymodels library, and when I run the model I get the following error:

    All models failed. See the `.notes` column.
    # Tuning results
    # 10-fold cross-validation repeated 5 times
    There were issues with some computations:
      - Error(s) x1000: Error in `check_outcome()`:
        ! For a classification model, the outcome should be a factor.
    Use `collect_notes(object)` for more information.

The database is composed of the following …
I need to save the results of a fit of the sklearn NearestNeighbors model:

    knn = NearestNeighbors(n_neighbors=10)
    knn.fit(my_data)

How do you save the trained knn to disk using Python?
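A minimal sketch of the usual answer, using joblib (which ships as a scikit-learn dependency); my_data below is a random stand-in for the real data:

    import joblib
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    my_data = np.random.rand(100, 3)               # stand-in for the real data
    knn = NearestNeighbors(n_neighbors=10)
    knn.fit(my_data)

    joblib.dump(knn, "knn_model.joblib")           # save the fitted model
    knn_loaded = joblib.load("knn_model.joblib")   # restore it later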
I am using a book and a video to learn how to use the KNN method to classify movies according to their genres. This is my code:

    import numpy as np
    import pandas as pd

    r_cols = ['user_id', 'movie_id', 'rating']
    # The file is u.data from MovieLens.
    ratings = pd.read_csv('C:/Users/dell/Downloads/DataScience/DataScience-Python3/ml-100k/u.data',
                          sep='\t', engine='python', names=r_cols, usecols=range(3))
    print(ratings.head())

    # Per-movie rating count and mean rating.
    movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
    print(movieProperties.head())

    # Min-max normalise the rating counts to [0, 1].
    movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
    movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
    print(movieNormalizedNumRatings.head())

    movieDict = {}
    with open('C:/Users/dell/Downloads/DataScience/DataScience-Python3/ml-100k/u.item') as …
I have a large dataset of the activities performed by multiple staff in a factory over a long period of time (01/01/2017 to present). The activities performed by the different staff are recorded at each point in time (since they interact with software). I have tabulated these to record the number of activities performed by each operator for each day. My table looks something like this:

    Date        Name    Activity   UnitsProcessed  Shift    Team
    01/10/2017  MMouse  Soldering  1000            Shift A  Team …
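A minimal sketch of that tabulation step, assuming pandas; the column names follow the table above and the rows are made up:

    import pandas as pd

    log = pd.DataFrame({
        "Date": ["01/10/2017", "01/10/2017", "02/10/2017"],
        "Name": ["MMouse", "MMouse", "DDuck"],
        "Activity": ["Soldering", "Packing", "Soldering"],
        "UnitsProcessed": [1000, 200, 800],
    })
    # Number of activities and total units per operator per day.
    per_day = (log.groupby(["Date", "Name"])
                  .agg(activities=("Activity", "count"),
                       units=("UnitsProcessed", "sum")))
    print(per_day)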
I am trying to develop a basic book recommender system to get in touch with the field and start learning methods and how to prepare the data. The DataFrame I am using is pretty plain; it has the following structure (this is a simplified example):

       number  type    username   product  publishing_dt  price  genres
    0  34      access  kerrigan   130365   2019-12-10     16.99  fantasy, kids
    1  1       order   kerrigan   76863    2020-01-15     4.66   action, crime
    2  1       order   45michael  76863    2020-01-15     4.66   action, crime
    3  …
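As a sketch of one data-preparation step, assuming pandas: the comma-separated genres column can be expanded into multi-hot columns that a k-NN recommender can consume. The two rows mirror the example above:

    import pandas as pd

    df = pd.DataFrame({
        "product": [130365, 76863],
        "price": [16.99, 4.66],
        "genres": ["fantasy, kids", "action, crime"],
    })
    genre_dummies = df["genres"].str.get_dummies(sep=", ")  # multi-hot encoding
    X = pd.concat([df[["price"]], genre_dummies], axis=1)
    print(X)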
This is my example of a KNN model (written in R):

    library(gmodels)
    library(caret)
    library(class)

    db_class <- iris
    row_train <- sample(nrow(db_class), nrow(db_class) * 0.8)
    db_train_x <- db_class[row_train, -ncol(db_class)]
    db_train_y <- db_class[row_train, ncol(db_class)]
    db_test_x <- db_class[-row_train, -ncol(db_class)]
    db_test_y <- db_class[-row_train, ncol(db_class)]

    model_knn <- knn(db_train_x, db_test_x, db_train_y, 12)
    summary(model_knn)
    CrossTable(x = db_test_y, y = model_knn, prop.chisq = FALSE)
    confusionMatrix(data = factor(model_knn), reference = factor(db_test_y))

So this is a supervised KNN model. How can I classify a new record? I have this new record:

    new_record <- c(5.3, 3.2, 2.0, 0.2)

How can I classify it using the previous model?