Is there a corpus of toy datasets specifically designed for finding bugs in data science software?

I'm looking for a corpus of toy tabular datasets that can be used to test data science software such as data profiling, machine learning, and data manipulation tools. Some example attributes the datasets might have:

  • Strange column names (empty string, long names, duplicate names, names containing spaces, periods, language syntax, escaped delimiters and tokens)
  • Non-rectangular (rows with differing numbers of fields)
  • Mixed scientific notation in floats, inf literals
  • Entirely empty rows or columns
  • Mixed file encodings
  • Numeric and string values designed to overflow memory buffers or to cause truncation or rounding to int
  • Ambiguous and invalid dates
  • Diacritics, emojis

I was going to build a corpus myself, but surely there is some prior work here?
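
To make the idea concrete, here is a minimal sketch of how a few such files could be generated, assuming only Python's standard library. The function names and file names are hypothetical and only illustrate the attributes listed above; this is not an existing corpus.

```python
import csv
import io


def build_edge_case_csvs():
    """Return a dict mapping hypothetical file names to CSV text."""
    cases = {}

    # Strange column names: empty, duplicate, with spaces, periods,
    # and an embedded delimiter plus quote characters.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["", "a", "a", "has space", "dot.name", 'comma,"quote"'])
    writer.writerow([1, 2, 3, 4, 5, 6])
    cases["strange_headers.csv"] = buf.getvalue()

    # Non-rectangular: rows shorter and longer than the header.
    cases["ragged.csv"] = "x,y,z\n1,2\n3,4,5,6\n"

    # Mixed scientific notation and inf/NaN literals.
    cases["floats.csv"] = "value\n1e3\n1.5E-7\n0.001\ninf\n-Infinity\nNaN\n"

    # One entirely empty column (b) and one entirely empty row.
    cases["empty_cells.csv"] = "a,b,c\n1,,3\n,,\n4,,6\n"

    # Values too large for 64-bit ints or IEEE 754 doubles.
    cases["overflow.csv"] = "n\n" + str(2**64) + "\n" + "9" * 400 + "\n1e309\n"

    # Ambiguous (day/month order, two-digit year) and invalid dates.
    cases["dates.csv"] = "date\n01/02/2003\n12/31/99\n2021-02-30\n"

    # Diacritics and emoji.
    cases["unicode.csv"] = "name\nZoë\nnaïve café\n😀\n"

    return cases


def mixed_encoding_bytes():
    """Return CSV bytes whose rows use different encodings."""
    return b"name\n" + "Zoë\n".encode("utf-8") + "Zoë\n".encode("latin-1")
```

Each entry can be written to disk or fed straight to the tool under test through `io.StringIO` or `io.BytesIO`, which keeps the cases small enough to read at a glance.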
