Viktor Mayer-Schönberger and Kenneth Cukier, authors of the just-published Big Data: A Revolution That Will Transform How We Live, Work, and Think, reacted sharply when I asked them if they are cheerleaders for big data, as one reviewer implied. “We are messengers of big data, not its evangelists,” said Cukier. Added Mayer-Schönberger: “The reviewer did not read the book.”
I did. Big Data is an excellent introduction for general audiences to what has become a topic of conversation everywhere, faster than any other technology-driven buzzword in recent memory. To those who may react to “big data” as today’s incarnation of “big brother,” Mayer-Schönberger and Cukier offer a comprehensive and highly readable overview of the benefits and risks associated with big data, which they define as “the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value.”
The most important part of the book is the authors’ discussion of potential risks and possible ways to address them, providing a launchpad for a much-needed conversation about what’s to be done about big data.
In addition to privacy (Mayer-Schönberger and Cukier point out that perfect anonymization is impossible in the age of big data), they are concerned about two other, less-discussed risks. One is what they call “propensity,” or using big data predictions to punish people even before they have acted.
Incidentally, the day I spoke with Mayer-Schönberger and Cukier, Senator Rand Paul was questioning the right of the U.S. government to target U.S. citizens with drones on American soil. As a Wall Street Journal editorial pointed out, the U.S. government could not have targeted Jane Fonda, as Paul hypothesized, but it can target an “enemy combatant” (including a U.S. citizen so designated) anywhere, including on U.S. soil.
But what if the U.S. government starts relying on big data analysis to predict who is an “enemy combatant”? Already, Mayer-Schönberger and Cukier tell us, a number of local police departments have adopted “predictive policing,” or the practice of “using big data analysis to select what streets, groups, and individuals to subject to extra scrutiny, simply because an algorithm pointed to them as more likely to commit crime.”
The possibility that our fascination with data may become a dangerous addiction is the third risk the authors of Big Data discuss, what they call “the dictatorship of data.” The potential for abuse of data by people with bad intentions, and for misuse by well-intentioned people who admire it blindly, is as big as the data itself.
As a remedy for the privacy risk, the authors suggest constructing a new privacy framework, moving from today’s focus on individual consent at the time of collection to holding data users accountable for what they do with the data. Following up on the advice captured in the title of his previous book, Delete: The Virtue of Forgetting in the Digital Age, Mayer-Schönberger told me: “We suggest that we reap the benefit and the hidden value that lies in secondary and tertiary uses but at the same time be cognizant of the fact that data has a life expectancy. For certain types of data, we suggest that society consider expiration dates.”
As for propensity, the authors are categorical: “The predictive state is the nanny state, and then some.” The more we rely on data-driven interventions to reduce risks in society, they argue, the more we devalue the ideal of individual responsibility.
To help with the monitoring and transparency required by big data, Mayer-Schönberger and Cukier suggest developing a new professional class that will be responsible inside and outside companies for the proper handling of data. “Just as the explosion of financial information got us to standardize on accountants,” Cukier told me, “we suggest the same sort of approach with [what we call] the ‘algorithmists’ of big data.” These new professionals will be experts in computer science, mathematics, and predictions and will be bound by a professional certification and a vow of impartiality and confidentiality.
No matter how successful we are in pursuing their suggested solutions, Mayer-Schönberger and Cukier make it clear in the book that big data does not equal the rise of the machines. Big data, they say, is not about trying to “teach” a computer to “think” like a human. Nor does it foretell the “end of theory,” or the abandonment of making and testing hypotheses, the bedrock of scientific progress for centuries. “In the world of big data,” Mayer-Schönberger and Cukier say, “it is our most human traits that will need to be fostered—our creativity, intuition, and intellectual ambition.”
One of the many fascinating examples of big data analysis recounted in the book is the story of how a crack team of young data scientists found a way to identify “illegal conversions” in New York, the practice of cutting up a dwelling into many smaller units so that it can house as many as ten times the number of people it was designed for. The 200 inspectors on hand used to follow up on the 25,000 illegal-conversion complaints they get each year by focusing on the ones they deemed most important. But in only 13 percent of cases did they find conditions severe enough to warrant a vacate order. After the team analyzed data that the city had collected but never before combined, the inspectors were issuing vacate orders on more than 70 percent of the buildings they inspected. By indicating which buildings most needed their attention, big data analysis improved their efficiency fivefold.
As Mayer-Schönberger and Cukier note, “the experience of New York City’s analytical alchemists highlights many of the themes” of their book. The team used a lot of data, and some of it was quite messy, supporting the authors’ repeated argument for the value of “good enough” over absolute accuracy. “Yet the most important reason for the program’s success,” they say, “was that it dispensed with a reliance on causation in favor of correlation.”
This indeed is one of the key themes in the book, that “society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality.”
Our obsession with causality is reflected in the authors seeing causality where none exists. What does a ranking problem, deciding which complaint has higher priority, have to do with causality or asking why? Before the big data analysis, the inspectors also relied on correlations, such as the number of calls to the city’s “311” complaint line. It’s just that these correlations were misleading because of where the calls came from or what the complaints were about. The big data team simply uncovered better, more meaningful correlations.
The authors correctly say, “For many everyday needs, knowing what not why is good enough.” The book is full of such examples, from making better diagnostic decisions when caring for premature babies to deciding which flavor of Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.” Big data analysis can be about correlations OR causation. It all depends, as it always has, on what question we are asking, what problem we are solving, and what goal we are trying to achieve.
I don’t think big data will do anything to our obsession with causation, and it has little to do with it. But as Big Data successfully demonstrates, this is one technology-driven phenomenon that can improve our lives and that requires all of us to pay attention and start engaging in a meaningful conversation about what to do about its potential risks.
I asked Cukier (after Mayer-Schönberger pointed out he is the one with the crystal ball) where he sees big data in five years. “I think it’s going to look a lot like how the Internet has unfolded,” he answered. “In five years we are going to see widespread adoption around all corners of society just like with the Internet when everyone had to have an Internet strategy. In the next 12 months we are going to see the same breakout moment when everyone has to have a big data strategy, doing an inventory of what data do we have, what data we should be collecting and we are not, and how we could use it.”
If so, what will be the breakthrough moment or event, the equivalent of when the Web hit that point of “everybody had to have a strategy”? Cukier thinks it may have already happened: “We define the birth of the Web at Netscape IPO in 1995 (although it happened before at CERN). Maybe the big data moment has already happened with the Facebook IPO. Maybe we just don’t know because Facebook’s IPO has fizzled out. It is a $67 billion company with very small revenues and small earnings and all the value of its shares is in the promise of what its data holds.”
Cukier is probably right. But I would venture to speculate that the Splunk IPO (April 2012, one month before the Facebook IPO) is what will be viewed in the future as the defining moment, the launch of the big data era. What I’m certain about is that Big Data will be the defining text in the discussion for some time to come.
Note: In a March 16th op-ed, Bill Keller at the New York Times echoes my concern about the U.S. government using big data analysis to predict who it should target with drones: “If you find the use of remotely piloted warrior drones troubling, imagine that the decision to kill a suspected enemy is not made by an operator in a distant control room, but by the machine itself.”
[Originally published on Forbes.com]