Over on ZDNet I’ve been doing a series looking at the issues we face incorporating Big Data into our digital civilization (see When Big Data is bad data, Lying scientists and the lying lies they tell, and Humans are the weak link in Big Data. I’m not done yet, but I wanted to share a couple of cautionary Excel tales.
The latest comes by way of a paper Gene name errors are widespread in the scientific literature. The researchers
. . . downloaded and screened supplementary files from 18 journals published between 2005 and 2015 using a suite of shell scripts. Excel files (.xls and.xlsx suffixes) were converted to tabular separated files (tsv) with ssconvert (v1.12.9). Each sheet within the Excel file was converted to a separate tsv file. Each column of data in the tsv file was screened for the presence of gene symbols.
Result: 20% of the papers had errors. Specifically
In total, we screened 35,175 supplementary Excel files, finding 7467 gene lists attached to 3597 published papers. We downloaded and opened each file with putative gene name errors. Ten false-positive cases were identified. We confirmed gene name errors in 987 supplementary files from 704 published articles
The problem of Excel . . . inadvertently converting gene symbols to dates and floating-point numbers was originally described in 2004 . For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’, respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession ‘2310009E13’ to ‘2.31E+13’). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. ‘SEPT2’ converted to ‘2006/09/02’).
Another, older, Excel misadventure occurred in Ken Rogoff’s and Carmen Reinhart’s paper, Growth in a Time of Debt, which was the intellectual justification for widespread national austerity in the last 7 years. That austerity put millions of people out of work and slowed – and in some cases reversed – economic recovery after the Great Recession.
Too bad for the unemployed who lost homes, life savings, families, and self-respect, but the academics made some key Excel mistakes that weren’t uncovered until a grad student tried to replicate their results. As this piece in The Atlantic notes the paper itself was suitably conservative, but the academics oversold their results to Congress and other policy-making bodies.
The StorageMojo take
Given that the genetic issue was first identified in 2004, it is unsettling that Microsoft, with its vast resources and world-class research organization, hasn’t been proactive in helping Excel users avoid these issues. Word has a grammar checker, and helping users avoid common mistakes seems doubly applicable to numerical data that most readers assume is correct because, after all, the computer did it.
Perhaps a smarter Excel would have noted that Rogoff failed to include five countries in the data set in the final calculations – and maybe a neural-net data checker could flag problems like that – but it isn’t the Excel team’s fault that economists oversold their faulty results. Publishing the spreadsheets along with papers – as they do in genome research – would be a help.
But the larger takeaway is that while our computers are usually accurate our human brains are riddled with cognitive and logical bugs. While Computer-Assisted-Everything has enormous potential, we must remember to keep our BS detectors tuned up and running.
Courteous comments welcome, of course.