30 Years Of Excel Test Data

I am a CS intern at an industrial company that has 30 years of Excel files that need to be analyzed. Only a fraction of the files actually needs to be examined and used. After those files are identified, I need to pull out values from specific columns. The real issue is that there is no standard Excel format for the tests, and each column name can be different (e.g. 'Front Axial Temperature' vs. 'axial temp front') while containing the same type of data.

So the data is semi-structured. Are there any Python/Java libraries that are good at pulling such data out? I could create some regex patterns, but as you can see, the names above can be drastically different (while still meaning the same thing).

I would also be interested in looking at services that do this already.

Tags: data, excel, java, python

Category: Data Science


After a few weeks of looking at what the data consists of, I wrote some modules in Python to automate most of the work. These modules use pyexcel and openpyxl (which has a faster read function). I was able to write some functions that match specific templates within a percentage of error using a collection of keywords. Below are the steps I took to break the data set down.

1) Remove all non-Excel files from the file systems.

File Count : 100,300 -> 30,000

2) Remove all files whose filenames contain specific keywords I did not need (steps 1 and 2 are sketched together after the file count below).

File Count : 30,000 -> 23,000
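
A minimal sketch of how steps 1 and 2 can be done in Python; the extensions and keyword list here are placeholders, not the actual ones I used:

```python
from pathlib import Path

# Placeholder values -- substitute the extensions and keywords for your own data set.
EXCEL_EXTS = {".xls", ".xlsx", ".xlsm"}
SKIP_KEYWORDS = {"calibration", "backup", "copy"}

def candidate_files(root):
    """Yield Excel files whose names do not contain any unwanted keyword."""
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in EXCEL_EXTS:
            continue  # step 1: drop non-Excel files
        if any(kw in path.name.lower() for kw in SKIP_KEYWORDS):
            continue  # step 2: drop files flagged by filename keywords
        yield path
```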

3) Remove all files that do not match a specific pattern in the template. This is the most expensive operation, as each file needed to be opened (via Python) and checked for specific row and column values. I also had an issue with Windows 10 because the file paths exceeded 260 characters in most cases. To resolve this, I put everything on an external drive and transferred the files to a Linux VM.

File Count : 23,000 -> ~1000
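
A rough sketch of the step 3 check, assuming a template is identified by a set of keywords expected in the first few rows; the keywords, threshold, and sheet layout here are illustrative, not the actual templates:

```python
from openpyxl import load_workbook

TEMPLATE_KEYWORDS = {"serial", "test date", "axial", "temperature"}  # illustrative
MATCH_THRESHOLD = 0.75  # assumed "percentage of error" tolerance

def matches_template(path, keywords=TEMPLATE_KEYWORDS, threshold=MATCH_THRESHOLD, header_rows=10):
    """Open the workbook read-only and check whether enough of the template's
    keywords appear somewhere in the first few rows of the first sheet."""
    try:
        wb = load_workbook(path, read_only=True, data_only=True)
    except Exception:
        return False  # unreadable or corrupt file
    ws = wb.worksheets[0]
    found = set()
    for row in ws.iter_rows(min_row=1, max_row=header_rows, values_only=True):
        for cell in row:
            if cell is None:
                continue
            text = str(cell).lower()
            found.update(kw for kw in keywords if kw in text)
    wb.close()
    return len(found) / len(keywords) >= threshold
```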

4) At this point I can extract data using a similar method to the one above. There are multiple Excel templates, so step 3 will need to be repeated for each of them.

File Count : 1000 -> ?
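
Once a file matches a template, the extraction itself can follow the same pattern. A sketch, assuming the header row position and the variant-to-canonical mapping are known for that template (both are hypothetical here):

```python
from openpyxl import load_workbook

# Hypothetical mapping: canonical name -> header variants seen in this template.
HEADER_MAP = {
    "front_axial_temp": {"front axial temperature", "axial temp front"},
}

def extract_values(path, header_map=HEADER_MAP, header_row=1):
    """Pull the values of the mapped columns from the first sheet."""
    wb = load_workbook(path, read_only=True, data_only=True)
    ws = wb.worksheets[0]
    headers = next(ws.iter_rows(min_row=header_row, max_row=header_row, values_only=True))
    headers = [str(h).strip().lower() if h is not None else "" for h in headers]
    col_index = {name: headers.index(h)
                 for name, variants in header_map.items()
                 for h in variants if h in headers}
    records = [{name: (row[i] if i < len(row) else None) for name, i in col_index.items()}
               for row in ws.iter_rows(min_row=header_row + 1, values_only=True)]
    wb.close()
    return records
```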

*This worked for me, and I abstracted the logic so it isn't tied to my data set. I will post the code on GitHub (repo: Anacell) once the project is done and I have removed all proprietary information.

**I also wanted to mention that there were a few inaccuracies in my initial question. As the years progressed, the data became more structured and formatted with templates. While there are multiple undocumented templates, it was easier to take advantage of them (once found) than it would have been if nothing were structured.


I see three main ways of comparing columns: automated comparison of column names (e.g. regex, Levenshtein distance), comparison of content (e.g. compare the mean and standard deviation of the data in the column; if the mean value of a column is 10,000, then it probably isn't recording Front Axial Temperature), and manual comparison. You can combine these, for instance by clustering on column names and content, then manually looking at the contents of each cluster. The smaller the number of distinct column names, the more you can rely on manual examination. You may also be able to find other metadata sources, such as documentation for whatever process generated the files.
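
As an illustration of the first two approaches, using only the standard library (a SequenceMatcher ratio stands in for a proper Levenshtein distance here, and the field names are just the examples from the question):

```python
from difflib import SequenceMatcher
from statistics import mean, stdev

def name_similarity(a, b):
    """Rough similarity score between two column names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def content_summary(values):
    """Mean and standard deviation of a numeric column, as a sanity check
    on what the column actually records."""
    nums = [v for v in values if isinstance(v, (int, float))]
    return (mean(nums), stdev(nums)) if len(nums) >= 2 else None

# 'Front Axial Temperature' and 'axial temp front' score as moderately similar
# despite the different word order:
# name_similarity("Front Axial Temperature", "axial temp front")
```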


How many useful Excel files are we talking about? More importantly, how many potential variants of the column names?

Because if it's less than, say, a thousand, you're probably better off manually curating the column names: it will take less time and give you more accurate data than implementing and testing a sophisticated string-matching system. You can't fully rely on automatic matching, so you would still need to evaluate how correct the result is, and any off-the-shelf string similarity method will leave you with many mistakes.

I would proceed in the following way (steps 1 and 3 are sketched in code after the list):

  1. Automatically extract all the column names from all the files
  2. Manually group the ones which represent the same information
  3. Automatically replace the variants with a standardized version of the name
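
A minimal sketch of steps 1 and 3, assuming .xlsx files with headers in the first row and openpyxl available (the mapping itself is the manual part, step 2):

```python
from pathlib import Path
from openpyxl import load_workbook

def collect_headers(root, header_row=1):
    """Step 1: gather every distinct column name across all .xlsx files under root."""
    names = set()
    for path in Path(root).rglob("*.xlsx"):
        wb = load_workbook(path, read_only=True, data_only=True)
        for ws in wb.worksheets:
            row = next(ws.iter_rows(min_row=header_row, max_row=header_row, values_only=True), ())
            names.update(str(c).strip() for c in row if c is not None)
        wb.close()
    return names

# Step 2 (manual): review sorted(collect_headers(...)) and build a mapping such as:
CANONICAL = {
    "Front Axial Temperature": "front_axial_temp",
    "axial temp front": "front_axial_temp",
}

def standardize(name):
    """Step 3: replace a variant with its standardized name, if known."""
    return CANONICAL.get(name.strip(), name)
```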

To me, the fastest way to analyze tabular data (rows and columns) from the same or different Excel files is Tableau.

You can get Tableau for free (but can't save your files).

Simplified: load your Excel files and create a key to join on (e.g. a new column called 'axial temp front'). Join on this key. Once you've done this successfully, you can bring the fields (dimensions and measures in Tableau) into your view to analyze.

Bonus trick: once you're happy with your view (columns and rows) you can download it as a CSV. Now you have exactly the data you want in CSV.

https://onlinehelp.tableau.com/current/pro/desktop/en-us/joining_tables.htm

Good luck,


It is difficult to give you a good answer without knowing the dataset. However, this is how I would approach the problem:

Create the base columns you want to use for your final headers. Then loop through the headers of each file and match each one to the base column with which it shares the most characters.
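
One possible sketch of that matching step, using difflib.get_close_matches from the standard library as the character-overlap measure (the base columns here are made up):

```python
from difflib import get_close_matches

# Hypothetical base columns for the final headers.
BASE_COLUMNS = ["front_axial_temperature", "rear_axial_temperature", "test_date", "serial_number"]

def best_base_column(header, cutoff=0.4):
    """Map a raw header to the closest base column, or None if nothing is close enough."""
    normalized = header.strip().lower().replace(" ", "_")
    matches = get_close_matches(normalized, BASE_COLUMNS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# e.g. best_base_column("Front Axial Temperature") -> "front_axial_temperature"
```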
