Do data scientists use Excel?

I would consider myself a journeyman data scientist. Like most (I think), I made my first charts and did my first aggregations in high school and college, using Excel. As I went through college, grad school and ~7 years of work experience, I quickly picked up what I consider to be more advanced tools, like SQL, R, Python, Hadoop, LaTeX, etc.

We are interviewing for a data scientist position and one candidate advertises himself as a "senior data scientist" (a very buzzy term these days) with 15+ years of experience. When asked what his preferred toolset was, he responded that it was Excel.

I took this as evidence that he was not as experienced as his resume would claim, but wasn't sure. After all, just because it's not my preferred tool, doesn't mean it's not other people's. Do experienced data scientists use Excel? Can you assume a lack of experience from someone who does primarily use Excel?



Excel can be an excellent tool. Sure, depending on what you do, it might not fit the bill, but if it does, it would be almost foolish to dismiss it. While setting up a pipeline in other tools takes a while, in Excel you can pretty much hit the ground running: built-in UI, easy extensibility via VBA or even Python (e.g. https://www.xlwings.org). It might not be ideal when it comes to things like version control, but there are ways to make it work with Git (e.g. https://www.xltrail.com/blog/auto-export-vba-commit-hook).


I'm surprised how many people are attached to the coolness of the profession rather than the actual job to be done. Excel is an excellent tool; with the free Power Pivot and Power Query add-ins it can do so much (these are not available on OS X). And if you know VBA, you can do some nice stuff. If you add knowledge of Python on top of that, you can do the very first steps of data extraction and manipulation in Python and then use Excel, especially if you are a visual person. With Excel you can really inspect aggregated data before feeding it into any further processes or visualizing it. It's a must-have tool.


Excel can be an excellent tool for exploratory data analysis. It really depends on your needs, and of course it has its limitations like any tool, but Excel definitely deserves a place in the data science hall of fame.

Worth remembering that in practice most users will be exploring a heavily reduced data set anyway (created from an SQL query).

Excel is powerful for exploring data when you use the "table" object in combination with pivot tables. Visualising takes 1-2 clicks at most, and a lot of Excel charts look great in PowerPoint, unless you're looking to create something very bespoke, e.g. in a scientific computing context. The interactive nature means you can explore rapidly.

The benefit of the "table" object is that, as you transform the data further in Excel to explore new distributions, the pivot tables all remember the variable.

Where Excel is weak is that the formula list is arguably limiting. For instance, a SQL CASE statement or a Python statement is way more flexible than an endless chain of IF functions.
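To make the comparison with chained IF functions concrete, here is a minimal Python sketch. The function name `age_band`, the thresholds and the labels are made up for illustration, and the Excel formula in the comment is only a rough equivalent, not something taken from a real workbook.

```python
def age_band(age):
    """Map an age to a label; each rule sits on its own readable line."""
    if age < 0:
        # Input validation like this gets awkward inside a nested formula.
        raise ValueError("age must be non-negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Roughly the same logic as a nested Excel formula such as
#   =IF(A2<18,"minor",IF(A2<65,"adult","senior"))
# Adding a fourth or fifth band there means nesting yet another IF.
labels = [age_band(a) for a in (5, 30, 70)]
print(labels)
```

Adding a new band here is one extra line, and the function can be reused and tested on its own.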

It really depends on your needs, but Excel definitely deserves a place in the data science hall of fame.

An interesting anecdote: the team who work on the Facebook newsfeed algorithm can all regularly be seen playing with Excel and lots of spreadsheets.


Most non-technical people often use Excel as a database replacement. I think that's wrong but tolerable. However, someone who is supposedly experienced in data analysis simply cannot use Excel as his main tool (excluding the obvious task of looking at the data for the first time). That's because Excel was never intended for that kind of analysis, and as a consequence it is incredibly easy to make mistakes in Excel. (That's not to say it isn't incredibly easy to make other kinds of mistakes with other tools, but Excel aggravates the situation even more.)

To summarize what Excel lacks but any analysis must have:

  1. Reproducibility. A data analysis needs to be reproducible.
  2. Version control. Good for collaboration and also good for reproducibility. Instead of using xls, use csv (still very complex and has lots of edge cases, but csv parsers are fairly good nowadays.)
  3. Testing. If you don't have tests, your code is broken. If your code is broken, your analysis is worse than useless.
  4. Maintainability.
  5. Accuracy. Numerical accuracy and accurate date parsing, among other things, are really lacking in Excel.
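As a rough illustration of the reproducibility and testing points, here is a minimal sketch of a scripted aggregation with a test next to it. The function `mean_by_group`, the column names and the sample data are all invented for this example; a real analysis would read a csv file kept under version control.

```python
import csv
import io
from collections import defaultdict


def mean_by_group(csv_text, group_col, value_col):
    """Return the mean of value_col for each distinct value of group_col."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[group_col]] += float(row[value_col])
        counts[row[group_col]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample data standing in for a csv file in the repository.
SAMPLE = "region,sales\nnorth,10\nnorth,20\nsouth,5\n"


def test_mean_by_group():
    # The expected result is written down once and re-checked on every run.
    expected = {"north": 15.0, "south": 5.0}
    assert mean_by_group(SAMPLE, "region", "sales") == expected


test_mean_by_group()
```

Because the whole analysis is text, it diffs cleanly in Git and anyone can rerun it and get the same numbers.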

More resources:

European Spreadsheet Risks Interest Group - Horror Stories

You shouldn’t use a spreadsheet for important work (I mean it)

Microsoft's Excel Might Be The Most Dangerous Software On The Planet

Destroy Your Data Using Excel With This One Weird Trick!

Excel spreadsheets are hard to get right


I teach a Business Analytics course that includes SQL and Excel. I teach in a business school so my students aren't the most technically capable, which is why I didn't use something like R, Pandas, or Weka. That being said, Excel is a powerful enough tool to use for some data analysis. It gets most of this power from its ability to act as a front end to SQL Server Analysis Services (a component in SQL Server for data analysis) using the Data Mining Add-In.

SSAS lets you construct decision trees, perform linear and logistic regressions, and even build Bayesian or neural networks. I've found that using Excel as a front end is a less threatening approach to doing these kinds of analyses, since my students have all used Excel before. The way to use SSAS without Excel is through a specialized version of Visual Studio, and that isn't the most user-friendly tool out there. When you combine it with a few other Excel tools like Power Query and Power Pivot, you're able to do some fairly sophisticated analysis of data.

Full Disclosure, I'm probably not going to use it again when I teach the new version of the course next year (we're splitting it into two courses so one can focus more heavily on data analysis). But that's just because the university was able to get enough licenses for Alteryx which is even easier to use and more powerful but is $4-85k/user/year if you can't get it free somehow. Say what you will about Excel, but it beats that price point.


In his book Data Smart, John Foreman solves common data science problems (clustering, naive Bayes, ensemble methods, ...) using Excel. It's always good to have some knowledge of Python or R, but I guess Excel can still get most of the job done!


I think most people are answering without having a good knowledge of Excel. Excel (since 2010) has an in-memory columnar [multi-table] database called Power Pivot (which allows input from csv files, databases, etc.), allowing it to store millions of rows (the data doesn't have to be loaded onto a spreadsheet). It also has an ETL tool called Power Query, which lets you read data from a variety of sources (including Hadoop). And it has visualisation tools (Power View and Power Map). A lot of data science is doing aggregation and top-n analysis, at which Power Pivot excels. Add to this the interactive nature of these tools - any user can easily drag and drop a dimension on which to break up the results - and I hope you can see the benefits. So yes, you can't do machine learning, but I would question how much machine learning is done by data scientists day to day: e.g. when I want to analyse the prediction errors made by a machine learning program, I find it easiest to slice and dice the errors with Excel.
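For readers who have not done this kind of slicing, here is a minimal Python sketch of what a pivot-style "top-n errors by dimension" breakdown computes. The record layout, the field names (`segment`, `actual`, `predicted`) and the numbers are assumptions made for illustration; in Power Pivot the same result comes from dragging a dimension onto a pivot table.

```python
from collections import defaultdict


def top_n_error_by(records, dimension, n=2):
    """Mean absolute prediction error per value of `dimension`, largest first."""
    errors = defaultdict(list)
    for record in records:
        errors[record[dimension]].append(abs(record["actual"] - record["predicted"]))
    means = {key: sum(vals) / len(vals) for key, vals in errors.items()}
    # Sort groups by mean error, descending, and keep the n worst.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical prediction log with one dimension to slice on.
records = [
    {"segment": "a", "actual": 10, "predicted": 12},
    {"segment": "a", "actual": 8, "predicted": 8},
    {"segment": "b", "actual": 5, "predicted": 10},
    {"segment": "c", "actual": 3, "predicted": 3.5},
]

worst = top_n_error_by(records, "segment")
print(worst)
```

Swapping `"segment"` for another field name is the scripted equivalent of dropping a different dimension onto the pivot.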


Excel allows only very small data and doesn't have anything that is sufficiently useful and flexible for machine learning or even just plotting. All I would do in Excel is look over a subset of the data for a first glance at the values, to make sure I don't miss anything visible by eye.

So, if his favourite tool is Excel, this might suggest he rarely deals with machine learning, statistics, larger data sizes or any advanced plotting. Someone like this I wouldn't call a Data Scientist. Of course titles don't matter and it depends a lot on your requirements.

In any case, don't make a judgement based on statements of experience or a CV. I've seen CVs and known the people behind them.

Don't assume. Test him! You should be good enough to set up a test. It has been shown that interviews alone are close to useless to determine skills (they only show personality). Set up a very simple supervised learning test and let him use any tool he wants.

And if you want to screen people at an interview first, then ask him about very basic but important insights in statistics or machine learning. Something that every single one of your current employees knows.


Let me first clarify that I am starting my journey into data science from a programmer and database developer standpoint. I am not a 10-year data science expert nor a statistical god. However, I do work with data scientists and large datasets for a company that works with rather large clients worldwide.

From my experience, data scientists use whatever tools they need to get the job done. Excel, R, SAS, Python and more are all tools in the toolbox of a good data scientist. The best can use a wide variety of tools to analyze and crunch data.

Therefore, if you find yourself comparing R to Python, then you're likely doing it all wrong in the data science world. Good data scientists use both, picking one over the other when it makes sense. This also applies to Excel.

I think that it's rather hard to find anyone who has experience in so many different tools and languages while being great at everything. I also think it's going to be hard to find data scientists specifically who can not only program complex algorithms but also know how to use them from a statistical standpoint.

Most of the data scientists I've worked with come in about 2 flavors: those that can program and those that can't. I rarely work with data scientists who can pull data in Python, manipulate it with something like Pandas, fit a model to the data in R and then present it to management at the end of the week.

I mean, I know they exist. I've read many data science blogs from guys developing web scrapers, pushing the data into Hadoop, pulling it back out in Python, programming complex things and running it through R to boot. They exist. They're out there. I just haven't run into too many that can do all of that. Maybe it's just my area though?

So, does that mean specializing in only one thing is bad? No. Plenty of my friends specialize in just one main language and kill it. I know plenty of data guys who only know R and kill it. I also know plenty of people who just use Excel to analyze data because that's the only thing most non-data scientists can open and use (especially in B2B companies). The question you really need to answer is whether this one thing is the ONE thing you need for this position. And most importantly, can they learn new things?

P.S.

Data Science is not just restricted to "BIG DATA" or NoSQL.


Do experienced data scientists use Excel?

I've seen some experienced data scientists who use Excel - either due to their preference, or due to the specifics of their workplace's business and IT environment (for example, many financial institutions use Excel as their major tool, at least for modeling). However, I think that most experienced data scientists recognize the need to use the tools that are optimal for particular tasks, and adhere to this approach.

Can you assume a lack of experience from someone who does primarily use Excel?

No, you cannot. This follows from my thoughts above. Data science does not automatically imply big data - there is plenty of data science work that Excel can handle quite well. Having said that, if a data scientist (even an experienced one) does not have at least basic knowledge of modern data science tools, including big data-focused ones, it is somewhat disturbing. This is because experimentation is deeply ingrained in the nature of data science, with exploratory data analysis being an essential, even crucial, part of it. Therefore, a person who does not have an urge to explore other tools within their domain could rank lower among candidates in overall fit for a data science position (of course, this is quite fuzzy, as some people are very quick at learning new material, and people might not have had an opportunity to satisfy their interest in other tools due to various personal or workplace reasons).

Therefore, in conclusion, I think that the best answer an experienced data scientist might have to a question in regard to their preferred tool is the following: My preferred tool is the optimal one, that is the one that best fits the task at hand.
