Data wrangling for a big set of docx files advice!

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a solid week trying different approaches and nothing seems to be quite right. Just FYI, this is my first big (for me anyway) data science project, so I'm really in need of some wisdom on the best way to approach it.

Essentially I have a set (200+) of docx files that are semi-structured. By semi-structured I mean the information I want is organized into tables (it's a form, with different tables containing different info to fill out), but unfortunately these tables are not consistently formatted. Sometimes, after entering data, people accidentally hit backspace and stick two tables together, or they accidentally break a table apart, for example.

My first attempt used python-docx to extract the data using document.tables[0] etc. I could then pull this into a big python dictionary for each document. It was quite neat but hit a snag - the table formatting problem above.
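To give a rough idea, that first attempt looked something like this (heavily simplified; the real code indexes several tables, and the two-column label/value layout here is just a placeholder for my actual form fields):

    from docx import Document

    def extract_form(path):
        # Assumes the first table is always the same part of the form - exactly
        # the assumption that breaks when people merge or split tables
        doc = Document(path)
        record = {}
        for row in doc.tables[0].rows:
            cells = [cell.text.strip() for cell in row.cells]
            if len(cells) >= 2:
                record[cells[0]] = cells[1]   # label column -> value column
        return record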

I then used python-docx again and tried to use the headings of each table as a marker (picked them out using regex) for when a sub data set should begin or end, and iterated across all text in the document. This sort of works, and is more flexible, but picks up a lot of text from outside of the table which makes it difficult to manage and clean.
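In outline, that second attempt does something like this (much simplified; the heading patterns are placeholders, and my real code also walks table cells rather than just body paragraphs):

    import re
    from docx import Document

    HEADINGS = re.compile(r'^(Section A|Section B|Section C)\b', re.I)  # placeholder headings

    def split_into_sections(path):
        doc = Document(path)
        sections, current = {}, None
        for para in doc.paragraphs:
            text = para.text.strip()
            match = HEADINGS.match(text)
            if match:
                current = match.group(1)        # a heading starts a new sub data set
                sections[current] = []
            elif current and text:
                sections[current].append(text)  # also picks up stray text outside the tables
        return sections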

Anyway, I'm interested in how an experienced data scientist would approach the problem.

The end goal is to extract the data from each of these documents into an SQL database.

If you're interested in the problem, let me know and I can send you the template documents I'm working with and some samples. If it's helpful, I can also post the code I've written so far (haven't done so because it's long).



This question has been sitting here for quite a while, but I will give an answer as I faced a very similar problem some weeks ago.

I had multiple different files (.docx, .pdf) sitting in different folders. To make matters more complicated, the .docx files fell into three groups: SS.docx, ss.docx and plain .docx. My information was in the plain .docx files, in one table per document.

I went the hard way and wrote a Python extraction algorithm which I will walk you through now:

import os
import pandas as pd
import docx
from docx.api import Document

#fixed variable
location = 'C:\\Data_analysis\\N_TRACKING'

I then added a function which iterates through the complete file tree starting from location and stores all the file names and their paths in the list targets_in_dir.

#the function 

def find_docx(location):
    all_files = []    # Will contain found files

    # Walk top-down through directory tree
    for (root, dirs, files) in os.walk(location, topdown=True):
   
        for f in files:
            f = f.lower()              # Make conditional case insensitive
            if f.endswith('.pdf'):
                continue               # Skipping pdf
            if f.endswith('ss.docx'):
                continue              # Skipping ss.docx or SS.docx
            if f.endswith('.docx'):
                # Desired docx (undesired docx has been filtered out by previous conditional)
                all_files.append(os.path.join(root, f))
                
    return all_files

The endswith function was very helpful here, as it compares the end of a string (here, each file name) with the string in the brackets. Essentially, I skip everything I do not want and only add what I want to all_files, which is returned by the function. I apply the function to the location and get my results stored in targets_in_dir.
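As a side note, str.endswith also accepts a tuple of suffixes, so several endings can be tested in a single call if you want to condense the conditionals (an optional refinement, not what my code above does):

    # endswith with a tuple tests several suffixes at once
    print('report.docx'.lower().endswith(('ss.docx', '.pdf')))      # False -> a file we keep
    print('tracker_SS.docx'.lower().endswith(('ss.docx', '.pdf')))  # True  -> a file we skip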

#main
targets_in_dir = find_docx(location) 

Then I feed it into the extraction loop:

error_log = []         # files that failed to open or lacked the table
processed_files = []   # files that were parsed successfully

final_df = pd.DataFrame()   # empty dataframe to collect the results

for idx, c in enumerate(targets_in_dir):
    try:
        document = Document(c)
        table = document.tables[6]
        processed_files.append(c)
    except Exception:
        error_log.append(c)
        continue               # skip files that cannot be parsed

    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)

        if i == 0:
            keys = tuple(text)   # first row supplies the column names
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)

    df = pd.DataFrame(data)
    df['report'] = str(c)        # track which file each row came from
    final_df = pd.concat([final_df, df], ignore_index=True)

I use the python-docx library here, which lets me select a table directly as an object. Strangely, I had to pip install python-docx but could then import it only as docx. Iterating through that many files often fails due to some kind of document error; that's what the try/except block is for.

Errors are stored in error_log, and the files the algorithm processed successfully go into processed_files. final_df starts as an empty dataframe to which I append each document's results.

Please note that I specify the table I want (index 6, i.e. the seventh table in each document) in this line:

 table = document.tables[6]

The code then iterates through the rows of the table, using the first row as the keys of the dictionary and mapping the remaining rows onto them. Then I add tracking information (always know where thy data comes from!) through the line

 df['report'] = str(c)

and turn the result into a dataframe, which I append to the empty dataframe from the beginning. Then you can start cleaning the data, as humans tend to be messy!
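And since the original question asks about ending up in an SQL database: once final_df is cleaned, pandas can write it straight to a table. A minimal sketch using SQLite (the database and table names are placeholders):

    import sqlite3

    # Push the cleaned dataframe into an SQLite table (placeholder names)
    conn = sqlite3.connect('extracted_forms.db')
    final_df.to_sql('form_data', conn, if_exists='replace', index=False)
    conn.close()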


The big problem with docx files is that they contain a ton of content related to formatting that most people find irrelevant when scraping them. Hence, one approach is to convert the docx files to something more friendly like JSON that excludes all the irrelevant bits, and then use existing frameworks like JSON Schema for pattern matching and jmespath for data extraction. There is a python package called simplify-docx for converting docx files to JSON, and a toy website where you can try it out. (Disclaimer: I'm the author of both)
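A rough sketch of that workflow, going from the simplify-docx README (check it for the exact options; the jmespath expression below is only a placeholder you would adapt after inspecting the JSON output):

    import docx
    import jmespath
    from simplify_docx import simplify

    # Convert the document into a plain nested dict/list structure
    doc = docx.Document('path/to/form.docx')   # placeholder path
    doc_as_json = simplify(doc)

    # Query the structure with jmespath; adapt the expression to wherever
    # your tables end up in the JSON
    tables = jmespath.search('VALUE[0].VALUE', doc_as_json)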
