Is there a corpus of toy datasets specifically designed for finding bugs in data science software?

Shoeboxam

May 25, 2021 00:16

I'm looking for a corpus of toy tabular datasets that can be used to test software for data profiling, machine learning, data manipulation, and similar tasks. Some example attributes:

Strange column names (empty string, long names, duplicate names, names with spaces, periods, syntax characters, escaped delimiters and tokens)
Non-rectangular data (rows with differing numbers of fields)
Mixed scientific notation in floats, inf literals
Entirely empty rows or columns
Mixed file encodings
Numeric and string values designed to overflow memory buffers or cause truncation or rounding to int
Ambiguous and invalid dates
Diacritics, emojis
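
As a rough illustration of what a few such files might look like, here is a minimal Python sketch that builds some of them in memory using only the standard library. The function name `make_edge_case_csvs` and the case names are hypothetical, chosen only for this example, and the set is nowhere near exhaustive.

```python
import csv
import io


def make_edge_case_csvs():
    """Return a dict mapping a (hypothetical) case name to raw CSV bytes."""
    cases = {}

    # Strange column names: empty, duplicated, with spaces, periods,
    # an embedded delimiter, and an embedded quote character.
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(["", "a", "a", "has space", "has.period", "has,comma", 'has"quote'])
    writer.writerow([1, 2, 3, 4, 5, 6, 7])
    cases["strange_column_names"] = buf.getvalue().encode("utf-8")

    # Non-rectangular: rows shorter and longer than the header.
    cases["non_rectangular"] = b"a,b,c\r\n1,2\r\n3,4,5,6\r\n"

    # Mixed scientific notation and inf literals in one column.
    cases["mixed_floats"] = b"x\r\n1e3\r\n1.5E-2\r\n0.001\r\ninf\r\n-Infinity\r\n"

    # Column "b" is empty throughout, and the second data row is empty.
    cases["empty_row_and_column"] = b"a,b\r\n1,\r\n,\r\n2,\r\n"

    # Diacritics in Latin-1, which is not valid UTF-8.
    cases["latin1_diacritics"] = "name\r\nJosé\r\nZoë\r\n".encode("latin-1")

    # Emoji in UTF-8 with a byte order mark.
    cases["utf8_bom_emoji"] = "name\r\n🙂\r\n".encode("utf-8-sig")

    # Ambiguous date (day/month/year order unclear) and an invalid date.
    cases["dates"] = b"d\r\n01/02/03\r\n2021-02-30\r\n"

    # Values past 64-bit integer range, past float64 range,
    # and an integer that float64 cannot represent exactly (2**53 + 1).
    cases["large_numbers"] = b"n\r\n18446744073709551616\r\n1e400\r\n9007199254740993\r\n"

    return cases
```

Writing each value to disk as `<name>.csv` and feeding the files to the software under test would be one way to use this; the bytes are kept raw so that encodings and line endings are not silently normalized.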

I was going to build a corpus myself, but surely there is some prior work here?

Topic software-development csv

Category Data Science
