Can table metadata be used to repair parsing when a field contains an unescaped delimiter?
I have data coming from a source system that is pipe-delimited. Pipe was chosen over comma because commas were known to occur inside fields, while pipes were believed not to. After ingesting the data into Hive, however, we discovered that, on rare occasions, a field does contain a pipe character.
Due to a constraint, we cannot regenerate the data from the source to escape the delimiter or switch to a different one. We do, however, have the metadata used to create the Hive table. Could we use knowledge of the fields surrounding the problem field to reprocess the file on our side, either escaping the stray pipes or changing the file's delimiter, before reloading the data into Hive?
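To make the idea concrete, here is a rough Python sketch of the kind of reprocessing I have in mind. It assumes that the expected column count is known from the table metadata, that only one known column per row can contain stray pipes, and that the reloaded table would declare a backslash escape character. The function names, the example column layout, and the choice of backslash are all hypothetical, not something our pipeline already has.

```python
def repair_line(line, n_fields, problem_idx, delimiter="|"):
    """Split a line and merge surplus pieces back into the one suspect field.

    n_fields    -- expected column count (assumed to come from the Hive DDL)
    problem_idx -- zero-based index of the only column assumed to hold stray pipes
    """
    parts = line.rstrip("\n").split(delimiter)
    extra = len(parts) - n_fields
    if extra < 0:
        raise ValueError(f"too few fields: expected {n_fields}, got {len(parts)}")
    if extra == 0:
        return parts  # the line already parses cleanly
    # Fields before problem_idx are trusted as-is; the surplus pieces are
    # attributed to the suspect field, and everything after it is trusted too.
    merged = delimiter.join(parts[problem_idx:problem_idx + extra + 1])
    return parts[:problem_idx] + [merged] + parts[problem_idx + extra + 1:]


def to_escaped_line(fields, delimiter="|", escape="\\"):
    """Re-serialize fields, escaping the escape character and the delimiter."""
    escaped = [
        f.replace(escape, escape * 2).replace(delimiter, escape + delimiter)
        for f in fields
    ]
    return delimiter.join(escaped)


if __name__ == "__main__":
    # Hypothetical 4-column row whose second column contains a pipe.
    raw = "1|fo|o|bar|2020\n"
    fields = repair_line(raw, n_fields=4, problem_idx=1)
    print(fields)
    print(to_escaped_line(fields))
```

This only works when the surplus pipes can be attributed to a single column. If two or more columns in the same row could contain pipes, the split would be ambiguous, and the neighbouring fields' types or patterns (dates, numeric IDs) would have to be used to decide where each field ends.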