How to partition data effectively?

Question

CyberPunk

March 31, 2022, 20:23

I have a pipeline which outputs model scores to S3. I need to partition the data by model_type and date. Which of the two layouts below is the more efficient way to partition the data?

s3://bucket/data/model_type=foo/dt=YYYY-MM-DD/a.csv
s3://bucket/data/dt=YYYY-MM-DD/model_type=foo/a.csv

Topic data-engineering

Category Data Science

Ashwiniku918 · Accepted Answer · March 31, 2022, 20:23

When partitioning data, keep in mind that too many partitions is not good practice, while partitioning only by model type may leave you with too few. So before you decide on the best partitioning, think about the following:

How will you access the data, and which access patterns are most frequent? If you usually filter by model and then by date, go for the first layout.
If you usually filter by date first, go for the second one.
Also, instead of creating a partition for each day, you could create biweekly partitions, which may give you the best of both worlds: far fewer partitions, while still being able to narrow reads by date.
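To make the options above concrete, here is a minimal sketch in Python that builds the object key for each layout and works out which biweekly bucket a date falls into. The function names, the "data/" prefix handling, and the anchor date of 2022-01-03 (a Monday) are my own assumptions for illustration, not something the pipeline in the question defines; pick whatever anchor suits your data.

```python
from datetime import date, timedelta

# Hypothetical anchor: 14-day buckets are counted from this Monday.
BUCKET_ANCHOR = date(2022, 1, 3)


def model_first_key(model_type: str, day: date, filename: str = "a.csv") -> str:
    """Layout 1: model_type is the outer partition, date the inner one."""
    return f"data/model_type={model_type}/dt={day.isoformat()}/{filename}"


def date_first_key(model_type: str, day: date, filename: str = "a.csv") -> str:
    """Layout 2: date is the outer partition, model_type the inner one."""
    return f"data/dt={day.isoformat()}/model_type={model_type}/{filename}"


def biweekly_bucket_start(day: date, anchor: date = BUCKET_ANCHOR) -> date:
    """Return the first day of the 14-day bucket that contains `day`."""
    # Python's % is non-negative for a positive divisor, so dates
    # before the anchor are bucketed correctly as well.
    offset = (day - anchor).days % 14
    return day - timedelta(days=offset)


if __name__ == "__main__":
    d = date(2022, 3, 31)
    print(model_first_key("foo", d))
    print(date_first_key("foo", d))
    # With biweekly partitions, dt holds the bucket start instead of the day.
    print(model_first_key("foo", biweekly_bucket_start(d)))
```

With the first layout, everything for one model sits under a single prefix (data/model_type=foo/), so a "one model, many dates" read only has to look there. With the second, everything for one day sits under data/dt=.../, which suits "all models for a given date" reads. If you go biweekly, the rows should probably still carry the exact date as a column so you can filter within a bucket.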
