Advice on data wrangling for a large set of docx files
I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a solid week trying different approaches and none of them have been quite right. For context, this is my first big (for me, anyway) data science project, so I'm really in need of some wisdom on the best way to approach it.
Essentially I have a set (200+) of docx files that are semi-structured. By semi-structured I mean the information I want is organized into tables (each document is a form with different tables holding different fields to fill out), but unfortunately these tables are not consistently formatted. Sometimes when people enter data they accidentally hit backspace and merge two adjacent tables into one; other times they accidentally split a single table into two.
My first attempt used python-docx to extract the data by table index (document.tables[0] and so on) and pull it into a big Python dictionary for each document. It was quite neat, but it hit a snag: the table formatting problem above means the indices don't line up across documents.
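To make the first approach concrete, here is a minimal sketch of what I mean, assuming the real forms have a header row in each table (the folder name "forms" and the function names are just placeholders):

```python
# Sketch: read every table in a .docx by position and map header row -> values.
from pathlib import Path

from docx import Document  # python-docx


def tables_to_dict(path):
    """Return {table_index: list of row dicts} for one document."""
    doc = Document(path)
    data = {}
    for i, table in enumerate(doc.tables):
        rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
        if not rows:
            continue
        header, body = rows[0], rows[1:]
        data[i] = [dict(zip(header, row)) for row in body]
    return data


if __name__ == "__main__":
    for path in Path("forms").glob("*.docx"):
        print(path.name, tables_to_dict(path))
```

This works fine until someone merges or splits a table, at which point `document.tables[2]` in one file is not the same form section as `document.tables[2]` in another.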
I then tried python-docx again, this time iterating over all the text in the document and using the heading of each table (picked out with a regex) as a marker for where a sub-dataset should begin or end. This sort of works, and is more flexible, but it also picks up a lot of text from outside the tables, which makes the result difficult to manage and clean. A rough sketch of the idea is below.
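This is roughly the shape of that second approach, hedged heavily: the heading patterns here are invented placeholders, and `iter_block_items` is the standard python-docx recipe for walking paragraphs and tables in document order rather than anything from my actual code:

```python
import re

from docx import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table
from docx.text.paragraph import Paragraph

# Placeholder heading patterns; the real forms use different section titles.
HEADING_RE = re.compile(r"^(Section [A-Z]|Part \d+)", re.IGNORECASE)


def iter_block_items(doc):
    """Yield Paragraph and Table objects in the order they appear in the body."""
    for child in doc.element.body.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, doc)
        elif isinstance(child, CT_Tbl):
            yield Table(child, doc)


def split_into_sections(path):
    """Group blocks under whichever heading most recently preceded them."""
    sections, current = {}, None
    for block in iter_block_items(Document(path)):
        if isinstance(block, Paragraph) and HEADING_RE.match(block.text.strip()):
            current = block.text.strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(block)
    return sections
```

The problem is that everything between two headings lands in a section, including free text the form authors typed outside the tables.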
Anyway, I'm interested in how an experienced data scientist would approach the problem.
The end goal is to extract the data from each of these documents into an SQL database.
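For the loading step I'm imagining something simple like the sketch below; the table name and columns ("responses", "form_id", "field", "value") are made up for illustration, and the real schema will depend on which form fields end up mattering:

```python
import sqlite3


def load_rows(db_path, form_id, records):
    """Insert a list of {field: value} dicts extracted from one document."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "form_id TEXT, field TEXT, value TEXT)"
        )
        conn.executemany(
            "INSERT INTO responses (form_id, field, value) VALUES (?, ?, ?)",
            [(form_id, k, v) for record in records for k, v in record.items()],
        )
```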
If you're interested in the problem, let me know and I can send you the template documents I'm working with and some samples. If it's helpful, I can also post the code I've written so far (haven't done so because it's long).
Tags: similar-documents, data-wrangling, python
Category: Data Science