Splitting train/test sets by an identifier?

I know sklearn has train_test_split() to split a train and test set. But I read that, even with setting a random seed, if your actual dataset is updated regularly, the random seed will reset with each updated dataset and take a different train/test split. Doing this, your ML algos will eventually cover the whole dataset, defeating the purpose of the train/test split because it'll eventually train on too much of the whole dataset over time.

The book I'm reading (Hands-On Machine Learning with Scikit-Learn and TensorFlow) gives this code to split train/test by id:

from zlib import crc32
import numpy as np

# Function to check test set's identifier.
def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

# Function to split train/test
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

And it says when there's no ID column given, to create one either by indexing the rows or creating a unique index from one of the variables.

My questions are:

  1. What is the 3rd line doing:

    crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

  2. What is the anonymous function doing in the 2nd to last line?

    lambda id_: test_set_check(id_, test_ratio)

  3. In practice, do you commonly split datasets by id in this manner?




If the data has an index column and you insert or delete a row, you change all the data following this row (as you change their indices). Consequently, the identifiers used to assign those rows to the train or test set change, and the split will not be consistent after data updates. One way to overcome this problem is to append new data to the old and never delete a record (as mentioned in the book Hands-On Machine Learning with Scikit...). Don't forget, we are assuming the random number generator is seeded.

crc32(np.int64(identifier)) = create a hash from a given value

crc32(np.int64(identifier)) & 0xffffffff = make sure the hash value stays below 2^32 (4294967296), i.e. it is at most 4294967295. Check the following simplified example for better understanding.

18 & 0xf = ?

18 in binary 0b10010

0xf in binary 0b1111. To make it the same length as the previous number, we prepend a 0 on the left side, so 0xf in binary is 0b01111.

Now, we can do the bitwise AND operation: 0b10010 & 0b01111 = 0b00010 = 2. So & 0xf ensures that the output is never above 15. Now, reflect on the original code to understand it properly.
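The small example can be checked directly in the interpreter. This is only a sanity check; the variable names are made up for illustration.

```python
# Bitwise AND keeps only the bits that are set in both operands.
value = 18   # 0b10010
mask = 0xF   # 0b01111, i.e. the four lowest bits

result = value & mask
print(bin(value), bin(mask), bin(result))  # 0b10010 0b1111 0b10
print(result)                              # 2

# Masking with 0xF can never give a value above 15.
never_above_15 = all((n & mask) <= 15 for n in range(1000))
print(never_above_15)                      # True
```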

crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32. This line returns True or False. Let test_ratio be 0.2. Then, any hash value less than 0.2 * 4294967296 returns True and will be added to the test set; otherwise, it returns False and will be added to the training set.
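To make the comparison concrete, here is a minimal sketch of just the threshold test, applied to a few hand-picked 32-bit values rather than real hashes. The names TEST_RATIO, THRESHOLD and goes_to_test are hypothetical and not from the book.

```python
TEST_RATIO = 0.2
THRESHOLD = TEST_RATIO * 2**32   # roughly 858993459.2

def goes_to_test(hash_value, threshold=THRESHOLD):
    """Return True when a 32-bit hash value falls below the threshold."""
    return (hash_value & 0xFFFFFFFF) < threshold

# Values just below and just above the threshold, plus both extremes.
for h in (0, 858_993_459, 858_993_460, 0xFFFFFFFF):
    label = "test" if goes_to_test(h) else "train"
    print(h, label)
```

Only the values below the threshold are labelled "test", which is about the lowest 20% of the 32-bit range.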

@n1k31t4 already provided a great answer, but I still had questions after reading it. I worked out the rest and share it here.

For the first question

What is the 3rd line doing: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

We can first ignore the & 0xffffffff part because, according to the documentation of the zlib package, https://docs.python.org/3/library/zlib.html

Changed in version 3.0: Always returns an unsigned value. To generate the same numeric value across all Python versions and platforms, use adler32(data) & 0xffffffff.

So if you are using Python 3, you can ignore this part, and crc32(np.int64(identifier)) < test_ratio * 2**32 still works.
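A quick way to convince yourself, assuming you are on Python 3, is to check that the mask leaves the checksum unchanged. The payloads below are arbitrary bytes chosen for illustration.

```python
import zlib

payloads = (b"", b"a", b"hello world", bytes(range(256)))

for payload in payloads:
    checksum = zlib.crc32(payload)
    # On Python 3 the result is already in the unsigned 32-bit range,
    # so applying the mask does not change it.
    assert 0 <= checksum < 2**32
    assert checksum == checksum & 0xFFFFFFFF

checksum_hello = zlib.crc32(b"hello world")
print(checksum_hello == checksum_hello & 0xFFFFFFFF)  # True
```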

To understand why the above works, you only need the fact that crc32 values are roughly evenly distributed across the unsigned int32 range (https://stackoverflow.com/questions/38315172/distribution-of-crc-checksums). Hence, if the sample size is large enough, about test_ratio * sample_size of the sample points will have a hash smaller than test_ratio * $2^{32}$.
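You can also check this empirically. The sketch below hashes consecutive integer IDs and measures the share that lands under the threshold. It uses struct.pack("<q", ...) as a stand-in for np.int64 (8 bytes, little-endian), which is an assumption on my side so that the example needs only the standard library; id_hash and fraction_in_test are hypothetical names.

```python
import struct
import zlib

def id_hash(identifier):
    # Assumption: encode the ID as 8 little-endian bytes, like an int64.
    return zlib.crc32(struct.pack("<q", identifier)) & 0xFFFFFFFF

def fraction_in_test(n_ids, test_ratio):
    """Share of the IDs 0..n_ids-1 whose hash is below the threshold."""
    threshold = test_ratio * 2**32
    hits = sum(id_hash(i) < threshold for i in range(n_ids))
    return hits / n_ids

# Should come out close to 0.2, though not exactly 0.2.
print(fraction_in_test(100_000, 0.2))
```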

And if you are still curious what & 0xffffffff is doing, it maps a signed int32 to an unsigned int32 (negative values are mapped to positive ones, while non-negative values are unchanged). 0xffffffff is the hexadecimal representation of 0b11111111111111111111111111111111 (thirty-two ones). So if you do a bitwise & with it, you will get the following:

>>> print(-1 & 0xFFFFFFFF)
4294967295
>>> print(-1 & 0b11111111111111111111111111111111)
4294967295

To understand the bitwise operation, you will need to look at how Python handles two's complement.
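One way to see the two's complement connection is to compare the mask with a reinterpretation of the same 4 bytes through the struct module. This is just a sketch; to_unsigned32 is a hypothetical helper name.

```python
import struct

MASK32 = 0xFFFFFFFF

def to_unsigned32(value):
    """Map a signed 32-bit integer to its unsigned counterpart by masking."""
    return value & MASK32

for signed in (-1, -2, -2**31, 0, 5, 2**31 - 1):
    # Pack as a signed 32-bit int, then read the same bytes as unsigned.
    reinterpreted = struct.unpack("<I", struct.pack("<i", signed))[0]
    assert to_unsigned32(signed) == reinterpreted
    print(signed, "->", to_unsigned32(signed))
```

Negative inputs come out as 2**32 plus the input, and non-negative inputs are returned unchanged.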


In Aurélien Géron's book, there is a short description of the approach:

you could compute a hash of each instance’s identifier, keep only the last byte of the hash, and put the instance in the test set if this value is lower or equal to 51 (~20% of 256). This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
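The last-byte variant described in that passage could look roughly like the sketch below. The passage does not say which hash function to use, so SHA-256 is my own arbitrary choice, and the 8-byte little-endian encoding of the ID is also an assumption; in_test_set_last_byte is a hypothetical name.

```python
import hashlib
import struct

def in_test_set_last_byte(identifier, test_ratio=0.2):
    # Assumption: hash the ID's 8-byte little-endian encoding with SHA-256.
    digest = hashlib.sha256(struct.pack("<q", identifier)).digest()
    # The last byte is in 0..255; "< 256 * 0.2" means "<= 51".
    return digest[-1] < 256 * test_ratio

ids = range(10_000)
share = sum(in_test_set_last_byte(i) for i in ids) / len(ids)
print(round(share, 3))  # expected to be near 52/256, i.e. about 0.2
```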

While a full explanation of what exactly happens and why is probably best placed on StackOverflow, I can try to answer your questions, first with some background info.

The method uses a cyclic redundancy check, which is a way of checking that raw blocks of memory have not been damaged or changed. It is a way to ensure data integrity, e.g. in network traffic, where it checks whether a message was altered between being sent and received.

For train/test splits, it is checking the unique identifier of each sample. We have a column that gives each sample an ID - this should never be changed! Don't delete rows, only append to the end with new unique IDs.
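Here is a minimal sketch of why appending with new unique IDs keeps the split stable. It uses plain lists of integer IDs instead of a DataFrame and struct.pack("<q", ...) instead of np.int64, both assumptions made to keep the example standard-library only.

```python
import struct
import zlib

def in_test_set(identifier, test_ratio=0.2):
    # Assumption: 8-byte little-endian encoding stands in for np.int64.
    data = struct.pack("<q", identifier)
    return (zlib.crc32(data) & 0xFFFFFFFF) < test_ratio * 2**32

old_ids = list(range(1000))
new_ids = old_ids + list(range(1000, 1200))  # appended rows with new IDs

old_test = {i for i in old_ids if in_test_set(i)}
new_test = {i for i in new_ids if in_test_set(i)}

# Old rows keep their assignment; only appended IDs can join the test set.
assert old_test == {i for i in new_test if i < 1000}
print(len(old_test), len(new_test))
```

Because the decision depends only on the ID itself, no row that was in the training set before the update can end up in the test set afterwards.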

In this part: test_ratio * 2**32, the part $2^{32}$ is the number of distinct values a 32-bit unsigned integer can hold, which is one more than the largest such value.

0xFFFFFFFF is a large number; it's the hexadecimal representation of $2^{32}-1$

To answer your questions:

  1. What is the 3rd line doing:

crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

Based on the information I gave above, we see that the crc32 function computes the checksum of the unique identifier's bytes in memory. If we know the unique ID has never changed, then crc32(np.int64(identifier)) & 0xffffffff will always return exactly the same numeric value, across all Python versions and platforms.

Imagine we give IDs in the range 0-80 to the training set and 81-100 to the test set. Now we want to find out which bucket a sample's ID falls into. We simply check whether its ID is less than 81, right? In the same way, the numeric value we computed above is compared with test_ratio * 2**32, where 2**32 is the number of possible 32-bit values. If the value is below test_ratio * 2**32, the function returns True and the sample goes into the test set; otherwise it returns False and the sample goes into the training set.

  2. What is the anonymous function doing in the 2nd to last line?

lambda id_: test_set_check(id_, test_ratio)

This simply applies our test_set_check function to each sample's unique identifier, using the apply method of a Pandas Series object (here, one column of a Pandas DataFrame).
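As a rough illustration of what apply with a lambda does, the sketch below uses a toy function in place of test_set_check and shows two equivalent ways of writing the same thing. The function is_small and the sample values are made up for the example.

```python
import pandas as pd

def is_small(value, limit):
    # Stand-in for test_set_check: a function of (element, extra argument).
    return value < limit

ids = pd.Series([3, 10, 7, 1])
limit = 5

# The lambda fixes the second argument and forwards each element.
with_lambda = ids.apply(lambda id_: is_small(id_, limit))

# Equivalent: an explicit loop over the elements.
with_loop = pd.Series([is_small(id_, limit) for id_ in ids], index=ids.index)

# Equivalent: let apply pass the extra positional argument itself.
with_args = ids.apply(is_small, args=(limit,))

print(with_lambda.tolist())  # [True, False, False, True]
```

The result is a boolean Series with the same index as the input, which is why it can be used directly inside data.loc[...] to select the test rows.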

  3. In practice, do you commonly split datasets by ID in this manner?

Not really... Scikit-Learn's train_test_split is often good enough. I think there are many other ways to remove bias and errors from your models before worrying too much about the impact of random splits.

For example, there is data snooping bias, whereby you analyse the entire dataset yourself before deciding on a model architecture/pipeline, thereby incorporating knowledge of the entire distribution, which inherently biases your model.

There is also bias in overfitting e.g. in sequential imaging data (think frames of videos) such that the background is consistent, even though the objects you might want to detect are not. Your model will learn what to expect based on the background, which is not robust! Here you might look into using a geographical split (not random at all).

On a side note, there is also a slightly more robust way of setting random seeds in Python (instead of using NumPy's random seed generator). Have a look at the resources below for some differences.

Helpful resources:
  1. https://stackoverflow.com/questions/36819849/detect-int32-overflow-using-0xffffffff-masking-in-python
  2. https://pynative.com/python-random-module/
  3. https://stackoverflow.com/questions/30092226/how-to-calculate-crc32-with-python-to-match-online-results
  4. https://stackoverflow.com/questions/49331030/bitwise-xor-0xffffffff/49332291#49332291

