Different approaches to creating the test set

I came across different approaches to creating a test set. In theory it's quite simple: just pick some instances at random, typically 20% of the dataset, and set them aside. The approaches are described below.

The naive way of creating the test set is:

import numpy as np

def split_train_test(data, test_set_ratio):
    # shuffle the row indices, then split off the first test_set_size of them
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_set_ratio)
    test_set_indices = shuffled_indices[:test_set_size]
    train_set_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_set_indices], data.iloc[test_set_indices]
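For example, assuming data is a pandas DataFrame (the housing.csv file below is just a hypothetical placeholder for whatever dataset you are using), it could be called like this:

    import pandas as pd

    housing = pd.read_csv("housing.csv")            # hypothetical dataset, for illustration only
    train_set, test_set = split_train_test(housing, 0.2)
    print(len(train_set), "train +", len(test_set), "test")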

The above splitting mechanism works, but if the program is run again and again, it generates a different split each time, and over time the machine learning algorithm gets to see all of the examples. The solutions to fix this problem (suggested by the author of the book) were:

  1. Save the test set on the first run and then load it in subsequent runs
  2. To set the random number generator's seed (np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices (see the sketch after this list)
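A minimal sketch of the seeding idea (item 2), reusing the split_train_test function and the hypothetical housing DataFrame from above:

    import numpy as np

    np.random.seed(42)                                     # fix the RNG state
    train_set, test_set = split_train_test(housing, 0.2)

    np.random.seed(42)                                     # same seed => same permutation => same split
    train_set2, test_set2 = split_train_test(housing, 0.2)
    assert train_set.index.equals(train_set2.index)

This only guarantees the same split for as long as the dataset itself is unchanged.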

But both of the above solutions break when we fetch an updated dataset. I am still not clear about this statement.

Can someone give me an intuition for how the above two solutions break when we fetch an updated dataset?

Then the author came up with a more reliable approach to create the test set.

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

Approach #1

import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio

Approach #2

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
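For context, a sketch of how these pieces could be wired together; the reset_index() call and the "index" column name are my assumptions for illustration (the identifier just needs to be a stable, unique value per row):

    import pandas as pd

    housing = pd.read_csv("housing.csv")            # hypothetical dataset again
    housing_with_id = housing.reset_index()         # adds an "index" column to use as the id
    train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")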

In approaches #1 and #2, why are we making use of crc32, 0xffffffff, and bytearray?

Just out of curiosity, I passed different values for the identifier variable into hash(np.int64(identifier)).digest() and got different results.

Is there any intuition behind these results?

Topic numpy preprocessing python machine-learning

Category Data Science


It gets a little complicated; I've attached links at the end of the answer that explain it as well.

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio

The hash(np.int64(identifier)).digest() part returns a hash value of the identifier (which we cast as an int of 8 bytes with np.int64()).

The bytearray call converts the hash digest into an array of bytes. The [-1] selects the last byte of that array. This byte will be a number between 0 and 255 (one byte ranges from 00000000 to 11111111, i.e. 0 to 255 in decimal).

Assuming our test_ratio is 0.20, the condition becomes byte value &lt; 51.2 (256 * 0.20 = 51.2), i.e. the last byte must be between 0 and 51. If it is, the row/instance is added to the test set.
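A small sketch to make this concrete (the identifier values below are arbitrary, chosen only to illustrate the check):

    import hashlib
    import numpy as np

    test_ratio = 0.20
    for identifier in [0, 1, 42, 20640]:
        digest = hashlib.md5(np.int64(identifier)).digest()   # 16-byte MD5 digest
        last_byte = bytearray(digest)[-1]                      # integer in 0..255
        print(identifier, last_byte, last_byte < 256 * test_ratio)

Because the last byte is (approximately) uniformly distributed over 0–255, roughly 20% of identifiers pass the check, and a given identifier always produces the same result, so an instance's assignment does not change when new rows are added to the dataset.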

The rest is explained well in these links:

https://stackoverflow.com/questions/50646890/how-does-the-crc32-function-work-when-using-sampling-data

https://github.com/ageron/handson-ml/issues/71

https://docs.python.org/3/library/hashlib.html
