ML modeling on data with a large number of rows

I want to do ML modeling with XGBoost, KNN, and similar models on data with 9 numerical features and more than 25 million rows; the dataset is almost 2.5 GB. I would prefer to use all the data for modeling rather than working with samples. Which platforms, such as Databricks, AWS, or GCP, would you suggest for this project? Do you think it is doable on a single machine?

Topic: pyspark, bigdata

Category: Data Science


Regarding cloud services for 2.5 GB: all the major providers offer instances with 64 GB of RAM or more, so you are fine with AWS, GCP, or others. That is not your problem.

What you need to do is make sure you are loading your data into Python efficiently. What are these 9 numerical features? Integers? Binary? Floats? How many digits of precision do you need? If the values are integers below 255, consider using uint8 (or int8 for small signed values); if they are small and precision is not that important, use float16. If you have character or categorical columns, make sure to have a function that converts them into numeric codes (fewer characters). A minimal loading sketch follows.
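As a sketch of that idea, here is one way to control dtypes with pandas when reading a CSV. The file name and column names are placeholders; adjust the dtype map to whatever your 9 features actually contain.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names -- replace with your own.
dtypes = {
    "feature_1": np.uint8,    # small non-negative integers (< 256)
    "feature_2": np.int16,    # signed integers with a modest range
    "feature_3": np.float32,  # floats where ~7 significant digits are enough
    # ... one entry per column
}

df = pd.read_csv("data.csv", dtype=dtypes)

# Check how much memory the frame actually uses.
df.info(memory_usage="deep")

# Alternatively, downcast after loading:
for col in df.select_dtypes(include="integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include="float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
```

With 9 compact numeric columns, 25 million rows often fit comfortably in a few hundred MB to a couple of GB of RAM.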

Regarding the boosting model: what does the histogram of each column look like? If you don't have much variation, you won't need many bins, so you can set max_bin to 10. Do you expect large trees? Start with small trees, for example set max_depth to 5. See the sketch below.
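Here is a minimal sketch of those settings with XGBoost's histogram-based tree method. `X` and `y` are placeholders for your feature matrix and target, and the objective is only an example; tune the parameters to your task.

```python
import xgboost as xgb

# X: 25M x 9 array/DataFrame with compact dtypes, y: target (placeholders).
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",            # histogram-based split finding, memory friendly
    "max_bin": 10,                    # few bins if columns have little variation
    "max_depth": 5,                   # start with shallow trees
    "objective": "reg:squarederror",  # change to match your task
}

booster = xgb.train(params, dtrain, num_boost_round=200)
```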

And yes, 2.5 GB is not that big; I am sure you can do it on a PC :-).
