Is parsing files an application of machine learning?

I currently receive files from a device in a semi-CSV format. I have written a simple recursive descent parser to get information out of these files. Every time the device's firmware is updated, I have to write a new version of the parser to handle the changes the update brings.

Down the road, we will be taking data from other devices, which means more parsers and more firmware updates to keep up with. I'm wondering if I could define a basic structure of "this is the data I need" and use a neural network to get the parsed data without having to write a parser for each new file type that comes in.

Is this a pipe dream or is it a valid application of machine learning? I'm much more of a software engineer than I am a data scientist, but I'm starting to dip my toes into the machine learning realm.

Thanks in advance.

Topic parsing machine-learning

Category Data Science


Most programming languages and markup languages have a relatively simple syntax, so it is not usually necessary to use machine learning techniques to parse these languages. In machine learning, the process of automatically learning a formal grammar is also known as grammar induction. Several adaptive parsers have been designed for this purpose.


I had a similar problem in my business: parsing tons of emails with different formatting, some of which change slightly over time. I did not solve it myself, but I use the service from http://www.scriptminer.com/ which apparently uses machine learning tools to parse this sort of semi-structured message.


I would answer the question at two levels. The first level is "can it be done using machine learning?" Machine learning is essentially about learning from examples. So if you prepare enough sample documents, together with the output you expect from each of them, you can train a network to learn the structure of the documents and extract the relevant information. The more general form of extracting information from documents is a well-researched problem, commonly known as Information Extraction. It is not limited to machine learning techniques; you can use Natural Language Processing tools as well. So, in its general form, it is actually being done in practice.

The second level is "should you be doing it using machine learning?" Here I agree with what @NeilSlater said. The better and more feasible approach is to use good programming practices so that you can reuse parts of your parser as your dataset evolves.
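As a rough illustration of the "reusable parser" idea, here is a minimal Python sketch in which each device and firmware version is described by a small declarative spec instead of a new hand-written parser. Everything in it is an assumption made for illustration: the `Field` and `FormatSpec` classes, the `SPECS` table, the device name `sensor-a`, and the column layouts are hypothetical and not taken from the question's actual files. It also assumes the files are close enough to CSV that Python's `csv` module can read them.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    """One value we want out of every row (hypothetical helper)."""
    name: str                               # key in the output record
    column: int                             # zero-based column in the raw row
    convert: Callable[[str], object] = str  # how to turn the text into a value


@dataclass(frozen=True)
class FormatSpec:
    """Describes one file layout (hypothetical helper)."""
    delimiter: str
    skip_rows: int                          # header/preamble lines to ignore
    fields: Tuple[Field, ...]


# Hypothetical layouts, keyed by (device, firmware version).
# A firmware update becomes a new entry here rather than a new parser.
SPECS: Dict[Tuple[str, str], FormatSpec] = {
    ("sensor-a", "1.0"): FormatSpec(
        delimiter=",",
        skip_rows=1,
        fields=(Field("timestamp", 0), Field("temp_c", 1, float)),
    ),
    ("sensor-a", "2.0"): FormatSpec(
        delimiter=";",
        skip_rows=2,
        fields=(Field("timestamp", 0), Field("temp_c", 2, float)),
    ),
}


def parse(text: str, spec: FormatSpec) -> List[dict]:
    """Parse raw file text into records using a declarative spec."""
    reader = csv.reader(io.StringIO(text), delimiter=spec.delimiter)
    rows = list(reader)[spec.skip_rows:]
    return [
        {f.name: f.convert(row[f.column].strip()) for f in spec.fields}
        for row in rows
        if row  # ignore blank lines
    ]


v1_text = "time,temp\n2020-01-01T00:00,21.5\n"
v2_text = "# fw 2.0\ntime;status;temp\n2020-01-01T00:00;OK;21.5\n"

records_v1 = parse(v1_text, SPECS[("sensor-a", "1.0")])
records_v2 = parse(v2_text, SPECS[("sensor-a", "2.0")])
```

Under these assumptions both firmware versions yield the same record shape, so downstream code does not need to know which version produced a file. A real semi-CSV format may need more than column positions (for example, sections or key/value lines), in which case the spec would have to grow accordingly.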
