Can metadata be used to adapt parsing for an unescaped, in-field use of the delimiter?

I have data coming from a source system that is pipe delimited. Pipe was selected over comma because it was believed no pipes appeared in any field, while it was known that commas do occur. After ingesting this data into Hive, however, it has been discovered that, rarely, a field does in fact contain a pipe character.

Due to a constraint, we are unable to regenerate the data from the source to escape the delimiter or change delimiters in the usual way. However, we do have the metadata used to create the Hive table. Could we use knowledge of the fields around the problem field to reprocess the file on our side, either escaping the stray delimiter or changing the file delimiter, before reloading the data into Hive?

Topic: parsing, metadata

Category: Data Science


So, as a result, a few of your rows will have one or more extra columns. That's easy to detect, but it's harder to infer where the error is: which two columns are actually one? Which delimiter is not really a delimiter?
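Detection itself is cheap: split each line on the pipe and compare the field count against the column count from your table metadata. A minimal sketch in Python, where the file path and the EXPECTED_COLUMNS value are assumptions:

```python
import csv

# Assumed column count, taken from the Hive table metadata (hypothetical value).
EXPECTED_COLUMNS = 12

def find_bad_rows(path, delimiter="|"):
    """Yield (line_number, fields) for rows that split into too many columns."""
    with open(path, newline="") as f:
        for lineno, fields in enumerate(csv.reader(f, delimiter=delimiter), start=1):
            if len(fields) > EXPECTED_COLUMNS:
                yield lineno, fields
```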

In some cases, you can use the metadata, because it tells you when an interpretation of the columns can't be right. For example, if only one column can hold a text value and all the others must be numeric, it's unambiguous where the error is: any extra columns created by the stray pipe appear right after the text column.
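Here is a rough sketch of that repair, assuming the Hive metadata has been reduced to a simple list of column types with exactly one free-text column; the SCHEMA list below is a made-up example:

```python
# Hypothetical type list derived from the Hive table metadata;
# exactly one column ("string") is allowed to contain arbitrary text.
SCHEMA = ["int", "int", "string", "int", "decimal"]

def repair_row(fields, schema=SCHEMA, delimiter="|"):
    """Re-join the pieces of the single text column that a stray pipe split apart."""
    extra = len(fields) - len(schema)
    if extra <= 0:
        return fields  # row is already the right width
    text_idx = schema.index("string")
    # The text value was split into (extra + 1) pieces starting at text_idx;
    # glue them back together with the original delimiter.
    return (
        fields[:text_idx]
        + [delimiter.join(fields[text_idx : text_idx + extra + 1])]
        + fields[text_idx + extra + 1 :]
    )
```

A repaired row can then be re-emitted with the pipe escaped, or with a different delimiter, before reloading into Hive.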

If they're all text, this doesn't work of course.

You might be able to leverage more than the metadata's column types. For example, you may know that some fields are drawn from an enumerated set of values, and you can use that to determine when a column assignment is wrong.
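Continuing the sketch above, if the metadata also records allowed values for some columns, you can enumerate the possible ways of merging adjacent fields and keep only the interpretations that satisfy those constraints. The SCHEMA list and ENUMS mapping below are invented for illustration:

```python
# Hypothetical metadata: a type list and a mapping of column index -> allowed values.
SCHEMA = ["int", "string", "string", "int"]
ENUMS = {2: {"ACTIVE", "INACTIVE", "PENDING"}}

def plausible(fields):
    """A candidate interpretation is plausible if every constrained column holds an allowed value."""
    return len(fields) == len(SCHEMA) and all(
        fields[i] in allowed for i, allowed in ENUMS.items()
    )

def candidate_repairs(fields, delimiter="|"):
    """For a row with one extra field, try merging each adjacent pair and keep the plausible results."""
    if len(fields) - len(SCHEMA) != 1:
        return []
    candidates = []
    for i in range(len(fields) - 1):
        merged = fields[:i] + [delimiter.join(fields[i:i + 2])] + fields[i + 2:]
        if plausible(merged):
            candidates.append(merged)
    return candidates
```

If exactly one candidate survives, the row can be repaired automatically; if several do, the row is genuinely ambiguous and is probably best flagged for manual review.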
