After a few weeks of looking at what the data consists of, I wrote some modules in Python to automate most of the work. The modules use pyexcel and openpyxl (which has a faster read function). I was able to write functions that match specific templates within a tolerance using a collection of keywords. Below are the steps I took to break the data set down.
1) Remove all non-Excel files from the file system.
File Count : 100,300 -> 30,000
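For reference, step 1 boils down to something like the sketch below. The root path, the extension list, and deleting in place (rather than moving files aside) are my assumptions here; adapt them to your own layout.

```python
import os

# Extensions treated as Excel data; everything else gets deleted.
# The list is an assumption -- extend it if your data uses other formats.
EXCEL_EXTENSIONS = {".xls", ".xlsx", ".xlsm"}

def remove_non_excel(root):
    """Walk the tree under `root` and delete every file that is not an Excel workbook."""
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in EXCEL_EXTENSIONS:
                os.remove(os.path.join(dirpath, name))
                removed += 1
    return removed
```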
2) Remove all files whose filenames contain specific keywords I did not need.
File Count : 30,000 -> 23,000
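Step 2 is the same directory walk with a filename check instead of an extension check. The keyword list below is made up for illustration; the real blacklist depends on your data set.

```python
import os

# Example keywords only; the real blacklist came from inspecting the filenames.
KEYWORD_BLACKLIST = ("backup", "copy of", "old")

def remove_by_keyword(root, keywords=KEYWORD_BLACKLIST):
    """Delete files whose name contains any unwanted keyword (case-insensitive)."""
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lowered = name.lower()
            if any(keyword in lowered for keyword in keywords):
                os.remove(os.path.join(dirpath, name))
                removed += 1
    return removed
```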
3) Remove all files that do not match a specific pattern in the template. This was the most expensive operation, as each file had to be opened (via Python) and checked for specific row and column values. I also ran into an issue on Windows 10 because the file paths exceeded the 260-character limit in most cases. To work around this, I put everything on an external drive and transferred the files to a Linux VM.
File Count : 23,000 -> ~1000
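The gist of the step-3 check is to open each workbook in openpyxl's read-only mode, look at the header cells, and count how many expected keywords appear. The keywords, the single header row, and the 80% threshold below are placeholders; the real template definitions came from inspecting the files.

```python
from openpyxl import load_workbook

# Placeholder header keywords for one template; the real lists come from
# inspecting the undocumented templates in the data set.
EXPECTED_HEADERS = {"site id", "sector", "date"}
MATCH_THRESHOLD = 0.8  # fraction of keywords that must appear (assumed value)

def matches_template(path, expected=EXPECTED_HEADERS, threshold=MATCH_THRESHOLD):
    """Return True if enough of the expected keywords appear in the first row."""
    try:
        wb = load_workbook(path, read_only=True, data_only=True)
    except Exception:
        return False  # unreadable or corrupt files are treated as non-matches
    try:
        ws = wb.active
        header = next(ws.iter_rows(min_row=1, max_row=1, values_only=True), ())
        cells = {str(value).lower() for value in header if value is not None}
        hits = sum(1 for keyword in expected if any(keyword in cell for cell in cells))
        return hits / len(expected) >= threshold
    finally:
        wb.close()
```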
4) At this point I can extract data using a similar method to the one above. Since there are multiple Excel templates, step 3 will need to be repeated for each one.
File Count : 1000 -> ?
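For step 4, extraction is roughly the reverse of the matching check: once a file is known to follow a template, read the relevant rows and dump them somewhere tabular. The row/column bounds and the CSV output below are assumptions for illustration, not my final pipeline.

```python
import csv
from openpyxl import load_workbook

def extract_rows(path, min_row=2, max_col=3):
    """Yield non-empty rows of values from the first sheet, skipping the header.
    The default bounds are placeholders; each template defines its own range."""
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb.active
        for row in ws.iter_rows(min_row=min_row, max_col=max_col, values_only=True):
            if any(value is not None for value in row):
                yield row
    finally:
        wb.close()

def dump_to_csv(paths, out_path):
    """Append extracted rows from every matching workbook into one CSV file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for path in paths:
            writer.writerows(extract_rows(path))
```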
*This worked for me, and I abstracted the logic so it isn't tied to my data set. I will post the code on GitHub (repo: Anacell) once the project is done and I have removed all proprietary information.
**I also wanted to mention that there were a few inaccuracies in my initial question. As the years progressed, the data became more structured and formatted with templates. While there are multiple undocumented templates, it was easier to take advantage of them (once found) than if nothing had been structured.