Over on ZDNet I’ve been doing a series looking at the issues we face incorporating Big Data into our digital civilization (see When Big Data is bad data, Lying scientists and the lying lies they tell, and Humans are the weak link in Big Data). I’m not done yet, but I wanted to share a couple of cautionary Excel tales.
The latest comes by way of a paper Gene name errors are widespread in the scientific literature. The researchers
. . . downloaded and screened supplementary files from 18 journals published between 2005 and 2015 using a suite of shell scripts. Excel files (.xls and .xlsx suffixes) were converted to tab separated files (tsv) with ssconvert (v1.12.9). Each sheet within the Excel file was converted to a separate tsv file. Each column of data in the tsv file was screened for the presence of gene symbols.
Result: 20% of the papers had errors. Specifically:
In total, we screened 35,175 supplementary Excel files, finding 7467 gene lists attached to 3597 published papers. We downloaded and opened each file with putative gene name errors. Ten false-positive cases were identified. We confirmed gene name errors in 987 supplementary files from 704 published articles.
The problem of Excel . . . inadvertently converting gene symbols to dates and floating-point numbers was originally described in 2004. For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’, respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession ‘2310009E13’ to ‘2.31E+13’). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. ‘SEPT2’ converted to ‘2006/09/02’).
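A minimal sketch of the kind of screening the researchers describe, flagging cell values that look like Excel’s date or scientific-notation mangling of gene symbols, might look like this (the regexes are illustrative assumptions on my part, not the paper’s actual shell scripts):

```python
import re

# Flag values resembling Excel's default conversions:
#   gene symbol SEPT2      -> "2-Sep"     (date-like)
#   RIKEN ID    2310009E13 -> "2.31E+13"  (scientific notation)
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)
SCI_NOTATION = re.compile(r"^\d+\.\d+E\+\d+$")

def looks_mangled(value: str) -> bool:
    """Return True if a cell value resembles a converted gene symbol."""
    v = value.strip()
    return bool(DATE_LIKE.match(v) or SCI_NOTATION.match(v))
```

Run over every column of every converted tsv file, a check like this would surface candidate errors for the manual confirmation step the authors describe.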
Another, older, Excel misadventure occurred in Carmen Reinhart and Ken Rogoff’s paper, Growth in a Time of Debt, which provided the intellectual justification for widespread national austerity over the last seven years. That austerity put millions of people out of work and slowed – and in some cases reversed – economic recovery after the Great Recession.
Too bad for the unemployed who lost homes, life savings, families, and self-respect, but the academics made some key Excel mistakes that weren’t uncovered until a grad student tried to replicate their results. As this piece in The Atlantic notes, the paper itself was suitably conservative, but the academics oversold their results to Congress and other policy-making bodies.
The StorageMojo take
Given that the gene-name issue was first identified in 2004, it is unsettling that Microsoft, with its vast resources and world-class research organization, hasn’t been proactive in helping Excel users avoid these issues. Word has a grammar checker, and helping users avoid common mistakes seems doubly applicable to numerical data that most readers assume is correct because, after all, the computer did it.
Perhaps a smarter Excel would have noted that Rogoff failed to include five countries in the data set in the final calculations – and maybe a neural-net data checker could flag problems like that – but it isn’t the Excel team’s fault that economists oversold their faulty results. Publishing the spreadsheets along with papers – as they do in genome research – would be a help.
But the larger takeaway is that while our computers are usually accurate our human brains are riddled with cognitive and logical bugs. While Computer-Assisted-Everything has enormous potential, we must remember to keep our BS detectors tuned up and running.
Courteous comments welcome, of course.
“Rogoff failed to include five countries in the data set in the final calculation” That was deliberate: of the five countries excluded, two of them, Japan and the USA, have shown that a higher debt/GDP ratio has no impact whatsoever on a nation’s government’s ability to spend.
To me, this only shows a complete lack of understanding of the tool’s basic controls by the researchers.
The paper is right that Excel will by default convert a string like “March-2” to its internal date representation for March 2 (it shows you Mar-2, but what’s in the cell is actually something else). That’s a convenience that is documented in the Excel documentation and is called out in many basic Excel classes (look at the many basic Excel videos on YouTube, for example). But it’s also true that you can turn that off with minimal effort (literally no more than 5 clicks) by formatting the cells the data is going into as “text” (or by entering the data as text by typing a ‘ in front of it, ‘march-2, when you enter it). So this is a huge lack of understanding of the tool.
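For data generated outside Excel, a related guard is possible at export time. A common community workaround, and it is a trick rather than official Microsoft guidance, is to wrap each symbol in a ="..." formula when writing the CSV, which Excel imports as literal text rather than reparsing it as a date (the helper name here is my own):

```python
import csv
import io

def as_excel_text(value: str) -> str:
    """Wrap a value in a ="..." formula so Excel imports it verbatim.

    Internal double quotes are doubled, per Excel's formula syntax.
    """
    return '="{}"'.format(value.replace('"', '""'))

# Write a small gene list that will survive a round trip through Excel.
buf = io.StringIO()
writer = csv.writer(buf)
for symbol in ["SEPT2", "MARCH1", "2310009E13"]:
    writer.writerow([as_excel_text(symbol)])
```

Formatting the target cells as Text inside Excel, as described above, accomplishes the same thing for hand-entered data.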
I can sort of understand a genetics researcher running into a problem like this, but guys, check your work! Spend a couple of hours with a how to use excel class (or get one for your grad students)!
Now, I’m no lover of Excel’s interface as a tool, but it is a powerful analysis tool when used carefully and with an understanding of what it can and can’t do. That understanding is easy to get, and cheap in terms of both time and money.
So the conclusion I draw from the article is that the researchers didn’t even bother to get the minimum understanding of the tools they were using, and just slammed stuff into cells without checking what happened to the data (or, more likely, they had overworked grad students or lab techs doing that, and didn’t bother to determine whether or not they understood what they were doing).
I agree that there is a threat to good science here, but it comes from sloppy researchers, not necessarily the tools they are using.
(The economics paper, though, is simply egregiously sloppy science – any independent verification of results should have found the problems with the research, and that should have been done before they went public with the results. Bad science, poorly done. Not the tool, the economists themselves.)
David, your point is well taken. To me, the troubling aspect about the biologists is that these guys are PhDs and they are screwing up. What about the rest of the world who aren’t as smart and educated – and whose spreadsheets don’t get any scrutiny at all?
We put safety features on many tools because while we can tell people what not to do, some inevitably will anyway, even after reading the warnings. How many Excel spreadsheets are giving their unwitting users bad results?
If I were in charge of Office, I’d ask a team of Microsoft Researchers to delve into the problem and propose solutions. Customer trust could be lost if these kinds of problems are widespread.
I agree completely that Excel could be easier (or at least friendlier) to use, and my wife (a UI designer) points out that changing data in any way on the fly should be flagged to the user – and that Excel doesn’t do that is a UI fail (she suggests that it should do the conversion when the cell with the date-like string in it is used in a calculation, which would be a much nicer way of munging user data).
And I agree that it’s troubling that senior economists and biologists with PhDs are screwing up the use of a major research tool like this. But I don’t think we can blame all of that on the tool – there is a responsibility for the users to learn the tool if they’re going to depend on it for critical results.
And I agree with you that MSFT should put some research into making the tool easier to use (or at least a little more vocal about what it’s doing when it changes data – or, in the case of the economists, when there is data below the area they’ve dragged into their formula).
And finally, thanks, Robin, for bringing this issue up – it is a huge problem (in several different ways) that really needs to be addressed. I had not seen the articles about the economists before, and it’s dismaying to think that a significant piece of economic policy has been based on flawed analysis… (It’s even more troubling that there is no note on the original paper that the analysis was flawed and that the results aren’t valid – it is presented as originally submitted, as far as I can tell.)
Just wait until all those IoT devices start doing all kinds of weird things because we didn’t pay attention to what the ‘default behavior’ is because we didn’t want to read through 1000 pages of instructions.
Your fridge orders 50 gallons of milk, your fire alarm starts calling 911 ten times a day, or your lights stay on all night long running up your electric bill.