ML modeling on data with a large number of rows

I want to do ML modeling with XGBoost, KNN, and similar models on data with 9 numerical features and more than 25 million rows; the dataset is almost 2.5 GB. I would prefer to use all the data for modeling rather than working with samples. Which platforms, such as Databricks, AWS, or GCP, would you suggest for this project? Do you think it is doable on a single machine?

Topic: pyspark, bigdata

Category: Data Science


Regarding cloud services for 2.5 GB: all the major providers offer instances with 64 GB of RAM or more, so you are fine with AWS, GCP, or others. That is not your problem.

What you need to do is make sure you are loading your data into Python efficiently. What are these 9 numerical features? Integers? Binary? Floats? How many digits of precision do you need? If the values are integers below 255, consider using uint8 (or int8 for small signed values); if they are small and precision is not that important, use float16. If you have character or categorical columns, make sure to have a function that converts them into numeric codes (fewer characters). A minimal loading sketch follows.
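As a sketch of that idea, here is one way to control dtypes with pandas when reading a CSV. The file name and column names are placeholders; adjust the dtype map to whatever your 9 features actually contain.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names -- replace with your own.
dtypes = {
    "feature_1": np.uint8,    # small non-negative integers (< 256)
    "feature_2": np.int16,    # signed integers with a modest range
    "feature_3": np.float32,  # floats where ~7 significant digits are enough
    # ... one entry per column
}

df = pd.read_csv("data.csv", dtype=dtypes)

# Check how much memory the frame actually uses.
df.info(memory_usage="deep")

# Alternatively, downcast after loading:
for col in df.select_dtypes(include="integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include="float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
```

With 9 compact numeric columns, 25 million rows often fit comfortably in a few hundred MB to a couple of GB of RAM.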

Regarding the boosting model: what does the histogram of each column look like? If you don't have much variation, you won't need many bins, so you can set max_bin to 10. Do you expect large trees? Start with small trees, for example set max_depth to 5. See the sketch below.
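Here is a minimal sketch of those settings with XGBoost's histogram-based tree method. `X` and `y` are placeholders for your feature matrix and target, and the objective is only an example; tune the parameters to your task.

```python
import xgboost as xgb

# X: 25M x 9 array/DataFrame with compact dtypes, y: target (placeholders).
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",            # histogram-based split finding, memory friendly
    "max_bin": 10,                    # few bins if columns have little variation
    "max_depth": 5,                   # start with shallow trees
    "objective": "reg:squarederror",  # change to match your task
}

booster = xgb.train(params, dtrain, num_boost_round=200)
```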

And yes, 2.5 GB is not that big; I am sure you can do it on a PC :-).
