As someone who worked in the database field for many years and taught database theory and programming at the college level, I was initially enthusiastic about the Big Data phenomenon.
I am still a proponent but have developed some concerns about how Big Data is being used and misused. Cathy O’Neil’s book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy could not have come at a better time.
O’Neil is a data scientist and author who earned a PhD in mathematics with a thesis on algebraic number theory. She blogs about mathematics and politics on her site, mathbabe.org. She went from teaching at Barnard College and Columbia University to a job as a “quant” for a Wall Street hedge fund just a year before the 2008 meltdown. Her book provides an inside look at the computer algorithms that not only run Wall Street but increasingly run and/or ruin our lives. She analyzes and dissects software designed to process huge amounts of data and then spit out answers about everything from jail sentences to college rankings and recruitment, policing, hiring, employee management, teacher ratings, creditworthiness, insurance, advertising and political polls.
Originally designed to remove human error, improve efficiency and cut costs, the algorithms built into proprietary software have taken on a life of their own. As O’Neil demonstrates again and again, those algorithms are often faulty but can still deny people a job, a loan or insurance, trapping them in a cycle of poverty. The problem is that those algorithms are secret and cannot be questioned.
The Black Box
When I first started teaching programming, most development was focussed on desktop applications. The goal was to create a compiled executable program that could be sold for installation on the hard drive of a desktop computer. The compiled executable, along with the EULA warnings about reverse engineering the code within, protected that code from unscrupulous rival software developers. Students were taught to think of and design their applications as a series of “black boxes”. Instead of a giant, hard-to-maintain, monolithic block of code, the black boxes were code modules, each designed to perform a specific function and, ideally, to be reusable within the application. They were a means of breaking the program down into more manageable pieces that could be created by individual members of a team of developers, tested and then left alone. Specific inputs would produce expected outputs, and the code that produced those outputs was purposely hidden from view. All another developer on the project needed to know was how to “wire up” one of these black boxes of code and then “call” for its output.
A common problem with this approach to programming would come to light when the original application reached the end of its life cycle and was being replaced by a newer one. Users would test the new application and report it to be faulty. Developers went to work testing and re-testing and could find nothing wrong. Often it turned out that the original application had been wrong all along, but users had accepted it as correct for so long that they trusted its results and doubted the new, correct application. Something was amiss inside one of the old application’s “black boxes”, but it was difficult or impossible to see inside.
When I was working as a data analyst for a financial institution, we were converting a large block of management reports from Crystal Reports to SQL Server Reporting Services. I ran into a problem with one of the managers in the Loans Department. She insisted that one of the new reports I had created for her was incorrect. The report counted and totalled information about student loans. Loan officers in the branch offices had to manually enter the word “Student” in a LoanType field on the loan application screen. In the old report, the SQL code was not in a black box, so the filter was plainly visible as WHERE LoanType = 'Student'. When testing, I noticed a large number of variations entered in the LoanType field: “Student loan”, “School loan”, “ Student”, “Student ”, “Stdent”, etc. I modified the SQL query for the new report to account for these anomalies, so the totals for the faulty old report and the corrected new one did not match. It took a couple of heated meetings between me, the loan manager and my manager to get the loan manager to grudgingly accept that the data she had been relying on for years had been faulty all along.
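A minimal sketch of that kind of mismatch, not the original report code. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the Loans table, the Amount column, the sample amounts and the list of accepted variations are all hypothetical. The exact-match filter finds only the rows typed precisely as “Student”, while the tolerant filter trims whitespace, ignores case and accepts the known variations.

```python
import sqlite3

# Hypothetical sample rows; the LoanType values mirror the variations
# described above, and the amounts are made up for illustration.
ROWS = [
    ("Student", 1000.0),
    ("Student loan", 2000.0),
    ("School loan", 1500.0),
    (" Student", 500.0),
    ("Student ", 750.0),
    ("Stdent", 1250.0),
    ("Mortgage", 90000.0),
]

# The old report's filter: only an exact match counts.
EXACT = "SELECT COUNT(*), SUM(Amount) FROM Loans WHERE LoanType = 'Student'"

# One possible tolerant filter (an assumption, not the author's query):
# trim whitespace, ignore case, and accept a list of known variations.
TOLERANT = """
SELECT COUNT(*), SUM(Amount)
FROM Loans
WHERE LOWER(TRIM(LoanType)) IN
      ('student', 'student loan', 'school loan', 'stdent')
"""


def build_db():
    """Create an in-memory database holding the sample loans."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Loans (LoanType TEXT, Amount REAL)")
    conn.executemany("INSERT INTO Loans VALUES (?, ?)", ROWS)
    return conn


def student_totals(conn, query):
    """Return (row count, total amount) for the given student-loan query."""
    count, total = conn.execute(query).fetchone()
    return count, total


if __name__ == "__main__":
    conn = build_db()
    print("exact match:   ", student_totals(conn, EXACT))
    print("tolerant match:", student_totals(conn, TOLERANT))
```

Both queries run without error, which is the point: nothing in the old report signalled that rows were being missed. String comparison rules (for example, the handling of trailing spaces) vary between database engines, so the exact-match count could differ on another system.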
Big Data, Big Mistakes
O’Neil identifies similar problems in her book. Software used by police departments, insurance companies and financial institutions can be faulty, but there is no way to discover the faults, so users go on believing the results. The software code is a black box.
Even if the code itself is good, the premise behind the software may be to blame. The software used by police departments deploys more police in areas that Big Data tells them are high-crime areas. More arrests are made in those areas, so the next round of data shows that even more policing is needed there and seems to prove the original premise of the software. These areas are probably populated by poor minorities who become victims of the numbers game. People with poor credit (sometimes incorrectly reported) have their job applications rejected by HR software, so they fall even deeper into financial hell. Poor working-class people living in a “bad” postal code pay higher rates for the car insurance they need to drive to their low-wage jobs, again perpetuating a cycle of poverty.
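The policing feedback loop can be illustrated with a toy simulation. This is a sketch under loudly stated assumptions, not a model taken from the book: two areas have the same underlying crime rate, recorded arrests are simply proportional to the number of patrols present, and a hypothetical allocation rule shifts a fixed fraction of patrols toward whichever area recorded more arrests. All names and numbers are invented for illustration.

```python
def simulate_patrols(patrols_a, patrols_b, true_rate=0.1, shift=0.1, rounds=5):
    """Toy feedback loop between patrol allocation and recorded arrests.

    Assumptions (all hypothetical): both areas share the same true_rate,
    arrests recorded in an area are patrols * true_rate, and after each
    round a fraction `shift` of the patrols in the lower-arrest area is
    moved to the higher-arrest area.
    Returns a list of (patrols_a, patrols_b, arrests_a, arrests_b) tuples.
    """
    history = []
    for _ in range(rounds):
        # More patrols means more recorded arrests, even at equal crime rates.
        arrests_a = patrols_a * true_rate
        arrests_b = patrols_b * true_rate
        history.append((patrols_a, patrols_b, arrests_a, arrests_b))

        # The "data" now says one area has more crime, so patrols follow it.
        if arrests_a > arrests_b:
            moved = patrols_b * shift
            patrols_a, patrols_b = patrols_a + moved, patrols_b - moved
        elif arrests_b > arrests_a:
            moved = patrols_a * shift
            patrols_a, patrols_b = patrols_a - moved, patrols_b + moved
    return history


if __name__ == "__main__":
    # Area A starts with slightly more patrols for reasons unrelated to crime.
    for pa, pb, aa, ab in simulate_patrols(55, 45):
        print(f"patrols A={pa:6.2f} B={pb:6.2f}   arrests A={aa:5.2f} B={ab:5.2f}")
```

In this toy model a small initial imbalance grows every round even though the underlying crime rate is identical in both areas, while a perfectly even start never changes. The recorded data appears to confirm the premise that produced it.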
The book goes into a number of examples in detail.
O’Neil is not against the use of Big Data and is not a Luddite about programming. She is a data scientist, remember. What she warns about, and I have to agree with her, is the misuse of data and a profit-centric model for software that puts corporate interests ahead of human interests. I also feel that we need programmers and start-ups to throttle back the hubris a bit and software vendors to employ less snake oil. The 21st century may be an age of data, but it also needs to be a human age.