Does kNN extend the training dataset with test values during prediction?

Let's say I have 100 values in my dataset and split it 80% train / 20% test. When predicting the last value, is the prediction based on the previous 99 values (80 train + 19 already-predicted values) or only on the original 80 train values?

For example: if a kd-tree is used, is every data point inserted into the tree during prediction?

Is it possible to use kNN for the following scenario? I have 20 training values; when I add a new observation I classify it and add it to the training dataset, so there are 21 values. The next time I add a new value, I classify it based on the 21 values in the dataset. I understand that this is probably not how it should be done, but imagine I keep adding values up to 50k, so the last one is classified based on the previous 49,999 values.
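
A minimal sketch of that loop, assuming scikit-learn and entirely made-up data (the choice of k=3 is arbitrary): classify each new point with the current model, append it together with its predicted label, and refit. Note that each `fit` call rebuilds the neighbour index (e.g. the kd-tree) from scratch; scikit-learn's trees do not support incremental insertion.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))      # 20 initial training points (made up)
y_train = rng.integers(0, 2, size=20)   # binary labels (made up)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

for _ in range(5):                      # stream of new observations
    x_new = rng.normal(size=(1, 2))
    y_new = knn.predict(x_new)          # classify using the current dataset
    # self-training step: append the prediction and refit,
    # so the next query sees 21, 22, ... points
    X_train = np.vstack([X_train, x_new])
    y_train = np.append(y_train, y_new)
    knn.fit(X_train, y_train)           # rebuilds the neighbour index from scratch
```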

Another simplified example I came up with, with n=2: in pictures 1, 2 and 3 we see the points as they were trained and one new green point that will be classified. Then we take a new observation: are the distances calculated to the points as in 4a or as in 4b? link to visualization

Imagine it's Python's sklearn module doing the classification. Up until picture 1 we called .fit(X_train, y_train), where the training dataset had 4 points. Then we called .predict(X_test), where X_test had 2 points.
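
One way to check this behaviour empirically in scikit-learn (the coordinates below are hypothetical stand-ins for the pictured points): the documented `n_samples_fit_` attribute reports how many points the model was fitted on, and it does not grow when `.predict` is called, i.e. the 4b scenario.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical data matching the picture: 4 training points, 2 test points
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.1], [0.9, 0.8]])

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
print(knn.n_samples_fit_)   # 4

knn.predict(X_test)
print(knn.n_samples_fit_)   # still 4: predict does not insert the test points
```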

Topic: k-nn, supervised-learning, classification, machine-learning

Category: Data Science


It depends on which scenario you choose.

When you train any machine-learning model, it no longer changes afterwards. For example, if you train K-Means, you get as a result the centroid of every cluster. If you train a random forest, you get as a result your trees.

Then, when you apply your model, it gives you an answer based on those fixed parameters. The answer will always be the same if the input is the same.
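
To make this concrete, here is a minimal check with K-Means, using random data purely for illustration: applying the fitted model to new points leaves the learned centroids untouched.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))            # made-up data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centers_before = km.cluster_centers_.copy()

km.predict(rng.normal(size=(10, 2)))    # assign new points to clusters
# the centroids learned during fit did not move
assert np.allclose(centers_before, km.cluster_centers_)
```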

So if you trained your model on your 80 samples, after testing it on the 20 remaining ones, the model stays the same, trained on the 80 samples, and it will give the exact same answers if you test it again on the same 20 samples.
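
The same can be verified for kNN; a sketch on a made-up 100-sample dataset split 80/20, predicting the 20 test samples twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# made-up 100-sample dataset, split 80/20 as in the question
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=80, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
first = knn.predict(X_test)
second = knn.predict(X_test)
assert (first == second).all()   # the model did not change between calls
```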

However, it's possible to re-train your model: you run all your tests with your 80/20 split, and once you have found good parameters, you train a new model on all 100 samples, so it will be more precise when it has to classify new samples.
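
A sketch of that workflow, assuming scikit-learn's GridSearchCV and an arbitrary grid of k values: select the parameters using only the 80 training samples, then fit a fresh model on all 100.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# made-up 100-sample dataset, as in the question
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=80, random_state=0
)

# pick k using only the 80 training samples
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

# once the parameters are chosen, retrain a fresh model on all 100 samples
final_model = KNeighborsClassifier(**search.best_params_).fit(X, y)
```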

What you want to do is avoid retraining your model but still include each new sample as you encounter it: that's called online (or incremental) learning. It's hard, not all models can do it, and it's not a beginner topic at all. It also raises a lot of questions (I won't go into too much detail).

What I'd suggest is to manually set a threshold for re-training your model. For example: don't re-train your model until you have 1000 new samples to insert, or only re-train it each month on a defined date, with the data you know at that time.
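
Such a threshold could look like the following sketch (the data stream, the threshold of 1000 and k=5 are all arbitrary choices): predictions are buffered, and the model is refitted only once the buffer is full.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

RETRAIN_EVERY = 1000                    # the threshold; tune it to your workload

rng = np.random.default_rng(1)
X_known = rng.normal(size=(80, 2))      # hypothetical already-labelled data
y_known = (X_known[:, 0] > 0).astype(int)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_known, y_known)

pending_X, pending_y = [], []
for _ in range(2500):                   # stream of new observations (made up)
    x_new = rng.normal(size=(1, 2))
    y_pred = knn.predict(x_new)[0]      # classify with the current model
    pending_X.append(x_new[0])
    pending_y.append(y_pred)
    if len(pending_X) >= RETRAIN_EVERY: # re-train only once the buffer is full
        X_known = np.vstack([X_known, pending_X])
        y_known = np.append(y_known, pending_y)
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_known, y_known)
        pending_X, pending_y = [], []
```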
