Is there a corpus of toy datasets specifically designed for finding bugs in data science software?
I'm looking for a corpus of toy tabular datasets that can be used to test software for data profiling, machine learning, data manipulation, and similar tasks. Some example attributes:
- Strange column names (empty string, long names, duplicate names, names containing spaces, periods, syntax characters, or escaped delimiters and tokens)
- Non-rectangular (rows with differing numbers of fields)
- Mixed scientific notation in floats, inf literals
- Row-empty or column-empty
- Mixed file encodings
- Numeric and string values designed to overflow memory buffers, or to cause truncation or rounding to int
- Ambiguous and invalid dates
- Diacritics, emojis
I was going to build a corpus myself, but surely there is some prior work here?
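To make the attributes above concrete, here is a minimal sketch of the kind of generator I had in mind. It uses only the Python standard library; the function name `make_edge_cases` and the case names are hypothetical placeholders of my own, and the specific values are just illustrative picks, not an exhaustive or authoritative set.

```python
import csv
import io


def make_edge_cases():
    """Return a dict mapping a (hypothetical) case name to raw CSV bytes."""
    cases = {}

    # Strange column names: empty, duplicate, space, period,
    # embedded delimiter, embedded quote character.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["", "a", "a", "my col", "x.y", "has,comma", 'has"quote'])
    writer.writerow([1, 2, 3, 4, 5, 6, 7])
    cases["strange_headers"] = buf.getvalue().encode("utf-8")

    # Non-rectangular: rows shorter and longer than the header.
    cases["ragged"] = b"a,b,c\n1,2\n3,4,5,6\n"

    # Mixed scientific notation and inf/NaN literals in one float column.
    cases["floats"] = b"x\n1e3\n1.5E-2\n1000.0\ninf\n-Inf\nNaN\n"

    # Column b is entirely empty; the second data row is entirely empty.
    cases["empty_cells"] = b"a,b\n1,\n,\n2,\n"

    # Mixed encodings: one line in UTF-8, the next in Latin-1.
    cases["mixed_encoding"] = (
        "name\ncafé\n".encode("utf-8") + "café\n".encode("latin-1")
    )

    # Diacritics and an emoji, written as UTF-8 with a byte order mark.
    cases["utf8_bom_emoji"] = "name\nnaïve 😀\n".encode("utf-8-sig")

    # Values beyond the 64-bit integer and double-precision ranges,
    # plus a very long digit string.
    cases["overflow"] = f"n\n{2**64}\n1e400\n{'9' * 400}\n".encode("ascii")

    # Ambiguous (which field is the year?) and invalid dates.
    cases["dates"] = b"d\n01/02/03\n2023-02-30\n2023-13-01\n"

    return cases
```

Each value could be written to disk as its own `.csv` file (for example with `pathlib.Path(name + ".csv").write_bytes(data)`) and fed to the tool under test. Keeping the cases as raw bytes rather than strings is deliberate, since several of them are about encoding.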