Big Data Self-Delusion

The most compelling story told in the new documentary “The Human Face of Big Data” (PBS, February 24) is about the collection and analysis of data to predict the onset of potentially deadly infection in premature babies.

By the time these babies physically show signs of infection they are very unwell, a condition a caregiver could not predict by looking at their chart, where their vital signs are recorded once an hour. “What shocked me was the amount of data loss,” says Dr. Carolyn McGregor of the University of Ontario. The solution was Project Artemis, in which computers collect all relevant data continuously and watch for certain changes in vital signs. “If something starts to go wrong with that baby we have the ability to intervene [before physical symptoms appear],” says Dr. McGregor.

It’s a story of how more data may lead to better outcomes and, in this case, even save lives. In recent years, the term “Big Data” has come to represent this promise, a potential that can only be realized if we clearly establish what we want to achieve by collecting more data and why more data is better than less data in each particular case.

Unfortunately, in our technology-obsessed world, new technologies and new technology applications sometimes become buzzwords that are hyped, celebrated and often discussed irresponsibly by technology vendors and the media. “The Human Face of Big Data” by and large falls into this trap, the fascination (self-delusion?) with the idea that we are living in a momentous time in history thanks to technology. Going beyond “big data,” it is a paean to information technology and computerization, as Jay Walker of TEDMED declares in the film:

Billions and billions of people who have been excluded from the discussion, who couldn’t afford to step into the world of being connected, step into the world of information, learn things they could never learn, are suddenly on the network… Suddenly the world has a lot more minds connected in the simplest, least expensive possible way to make the world better… I don’t think there’s any question that we are at a moment in human history that we will look back on in fifty or a hundred years and we’ll say right around 2000 or so it all changed.

I’ll go out on a limb and venture to predict that a hundred years from now, the time around 2000 will be marked by Americans’ loss of the security they have enjoyed since the end of the Second World War, not by the rise of the Internet. And it will be clear to most observers, as it is clear to many today, that some of the additional minds now connected to the Internet do not see it as a tool “to make the world better” (they may see it that way, but I’m guessing Walker probably doesn’t agree with their definition of a better world).

“The Human Face of Big Data” demonstrates that giving more people access to the Internet does not automatically include them in “the discussion.” China has more people connected to the Internet than any other country, but there is no one from China among the two dozen “experts” identified by name in the film; all are based in the U.S. There is no one from Russia, India, Japan, or Brazil, countries where one may find talking heads or, even better, data scientists who may represent a different point of view about the role of technology, the Internet, and big data. It would have enriched this documentary tremendously if we had heard their take on the pros and cons of big data, how they define it, what it means to them, and what specific types of data collection and analysis will make a difference in their countries. (In line with Anil Dash’s response to Mark Zuckerberg’s post regarding the recent decision by India’s telecom regulator: “What about pausing the Internet Basics effort and spending some time on a real effort to listen to Indian voices about what would help them have connectivity on their own terms, in a way they find acceptable?”).

The lack of diversity in the voices and opinions heard in the film and its relentless emphasis on accentuating the positive and the speculative (the two segments discussing the negative aspects of big data last a total of 7 minutes) are particularly astonishing given that there has been no shortage of intelligent discussions of the potential pitfalls in the rush to collect and analyze data.

Take, for example, Kate Crawford’s list of myths associated with big data, which includes the belief that bigger data is always better data,  that correlation is as good as causation, that big data eliminates biases, and that it doesn’t invade our privacy. Instead of using these and similar objections to prompt a rich discussion and debate, the documentary—promoted by PBS as an examination of “the promises and perils of this unstoppable force”—deals only with the issue of privacy, a post-Snowden requirement.

Here and there, the documentary almost gets into what could have been turned into an engaging and educational discussion of big data, only to stop in its tracks for fear of losing its shiny, sunny, positive packaging.

One example is the discussion of Google Flu Trends, which “accurately predicts flu outbreaks up to two weeks before the CDC,” based on flu-related Google searches. To its credit, the documentary then shows Stephen Downs of the Robert Wood Johnson Foundation talking about the “flip side,” the time Google Flu Trends “got it way wrong,” because media coverage of the flu season got people to search for “flu” even if they were not sick. But then the film moves on to the next topic. It is a missed opportunity to talk about what Google has learned from its failure and about the dangers of blind faith in big data and algorithms, to say nothing of raising the question of how “world changing” it is to find out about flu outbreaks two weeks before the federal government, and whether it justifies the generalized claim we hear from Rick Smolan that “now we can see in real-time what’s going on and respond to it.”

Another example of a missed opportunity for an intelligent discussion is when we hear from Tim O’Reilly “I am optimistic but not blindly foolishly optimistic. Remember, the financial crisis was brought to us by big data people.” Finally, you hope, we are going to get into an interesting discussion of the empirical—what has actually happened and why, not what may happen—and practical perils of big data. But you quickly find out that (at least in this regard) you are foolishly optimistic because all we get are platitudes such as “we have to earn our future… we have to make the right choices.”

The missed opportunities are compounded by outright fiction. Here is some of the data we discover in this documentary about big data:

  • All the data processing we did in the last 2 years is more than all the data processing we did in the last three thousand years;
  • We are now being exposed to as much information in a single day as our 15th century ancestors were exposed to in their entire lifetime;
  • Every two days the human race is now generating as much data as was generated from the dawn of humanity through the year 2003 (from the PBS website).

Really? What big data time-machine tells us exactly how much “information” and “data” and “data processing” there was in the last 3000 years or the 15th century or at the dawn of humanity?

The documentary provides a definition of big data, something that is often missing from discussions of the topic. While poetic, it is quite meaningless:  Big data is a nervous system for the planet. This global definition leads to discussions in the film that have more to do with the Internet than with big data.

For example, in the segment titled “Data: The future of revolution,” Joi Ito talks about how the “Arab Spring” started with a photo shared on Facebook and then picked up by Al Jazeera and broadcast on TV as an example of linking activists, social media and mainstream media. “Technology has fundamentally changed the way people interact with everything,” says Ito. If big data is the planet’s nervous system, then every interaction is big data. QED.

In the same segment, Ito also comments “that’s one of the challenges of big data—it has so much opportunity for both good and also for screwing up our system.” But it is not clear (at least to this viewer) why he says that in this context. As with the other voices we hear in the documentary, there may have been something else there that got lost in the editing process. The impression the filmmakers wanted to leave with the viewer is summarized by John Battelle and quoted in the press release: “The era of Big Data is an important inflection point in human history and represents a critical moment in our civilization’s development.”

The theme of we-are-living-in-a-historic-moment-because-of-technology-and-we-have-to-make-critical-decisions-because-it-may-turn-negative-but-let’s-accentuate-the-positive has been the hallmark of technology talk for a while, moving rapidly from one hype cycle to the next, with little connection to reality (big data has already been eclipsed as the buzzword of the day by the Internet of Things, Artificial Intelligence, and Virtual Reality). There’s no escape from this escapist, technology-centric, US-centric myth-making, shared and promoted by the global chattering classes. Here’s danah boyd reporting on last month’s meeting of the World Economic Forum in Davos, Switzerland:

I started to sense that what the tech sector was doing at Davos was putting on the happy smiling blinky story that they’ve been telling for so long, exuding a narrative of progress: everything that is happening, everything that is coming, is good for society, at least in the long run.

Shifting from “big data,” because it’s become code for “big brother,” tech deployed the language of “artificial intelligence” to mean all things tech, knowing full well that decades of Hollywood hype would prompt critics to ask about killer robots. So, weirdly enough, it was usually the tech actors who brought up killer robots, if only to encourage attendees not to think about them.

Not only did any nuance get lost in this conversation, but so did the messy reality of doing tech. It’s hard to explain to political actors why, just because tech can (poorly) target advertising, this doesn’t mean that it can find someone who is trying to recruit for ISIS. Just because advances in AI-driven computer vision are enabling new image detection capabilities, this doesn’t mean that precision medicine is around the corner. And no one seemed to realize that artificial intelligence in this context is just another word for “big data.” Ah, the hype cycle.

It’s going to be a complicated year geopolitically and economically. Somewhere deep down, everyone seemed to realize that. But somehow, it was easier to engage around the magnificent dreams of science fiction. And I was disappointed to watch as tech folks fueled that fire with narratives of tech that drive enthusiasm for it but are so disconnected from reality as to be a distraction on a global stage.

Similarly, veteran tech observer Steven Levy says that the virtual and augmented reality demos at TED 2016 were redundant because “At TED, you are already immersed in a kind of artificial reality.” Is there a tech backlash brewing? Are we finally going to have more sober and multi-dimensional discussions of technology?

I don’t think so. I don’t think we (especially in the U.S.) will let go of soothing escapism. Expect to see in a few years, when we will have already moved on to other buzzwords, a PBS documentary titled “The Human Face of Artificial Intelligence.”

Originally published on Forbes.com

The Hadoop Bubble Quivers As Hortonworks Misses

Last month, Hortonworks announced quarterly results for the first time as a public company, and they came in below expectations. It had revenues of $12.7 million (up 55% year-over-year), but the average Wall Street estimate was $13.42 million. Similarly, Wall Street expected a loss of $2.04 per share and Hortonworks reported a loss of $2.19 per share.
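To put the size of the miss in perspective, here is a minimal Python sketch that expresses each reported figure as a percentage gap from the Wall Street estimate. It uses only the four numbers quoted above; the variable names and the `percent_gap` helper are illustrative, not part of any analyst's methodology.

```python
# Figures as quoted in the paragraph above
# (revenue in USD millions, loss in USD per share).
reported_revenue = 12.7
expected_revenue = 13.42
reported_loss_per_share = 2.19
expected_loss_per_share = 2.04


def percent_gap(actual, expected):
    """Return the gap between actual and expected as a percentage of expected."""
    return (actual - expected) / expected * 100


revenue_gap = percent_gap(reported_revenue, expected_revenue)
loss_gap = percent_gap(reported_loss_per_share, expected_loss_per_share)

print(f"Revenue vs. estimate: {revenue_gap:+.1f}%")        # +/- sign shows direction
print(f"Loss per share vs. estimate: {loss_gap:+.1f}%")    # positive = larger loss
```

On these numbers, revenue came in roughly 5% under the estimate and the loss per share roughly 7% wider than expected.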

The results could be attributed to a company new to the game of providing guidance to Wall Street. But the company’s management had substantial experience in that department throughout their impressive careers, so we must look somewhere else for an explanation. What if November 10, 2014, the day Hortonworks filed the paperwork for its IPO, was the beginning of the end of the Hadoop bubble, to quote your humble correspondent? What if December 12, 2014, the day Hortonworks went public, surprising many by its swift action, was the day the bubble “began to quiver and shake preparatory to its bursting”? What if Hortonworks had decided to rush to the exit while expectations were high?

People who had over-inflated expectations—and may have grumbled yesterday “what were we thinking”—should have listened to Mike Stonebraker last August. Here’s what this foremost authority on databases (and serial entrepreneur) said about the new generation of Hadoop from Hortonworks competitor Cloudera:

Impala is architected exactly like all of the shared-nothing parallel SQL DBMSs, serving the data warehouse market. Specifically, notice clearly that the MapReduce layer has been removed, and for good reason. As some of us have been pointing out for years, MapReduce is not a useful internal interface inside a SQL (or Hive) DBMS. Impala was architected by savvy DBMS developers, who know the above pragma. In fact, development activity similar to Impala is being done by both HortonWorks and FaceBook. This, of course, presents the Hadoop vendors with a dilemma. Historically, “Hadoop” referred to the open source version of MapReduce written by Yahoo. However, Impala has thrown this layer out of the stack. How can one be a Hadoop vendor, when Hadoop is no longer in the mainstream stack? The answer is simple: redefine “Hadoop”, and that is exactly what the Hadoop vendors have done. The word “Hadoop” is now used to mean the entire stack.

In my post, I suggested “a few things to ponder when considering the potential success of the current leading Hadoop vendors and whether Hadoop in general is in the first stage of a rapid market expansion or the last stage of a bubble inflating.” One of them was the incorporation of Hadoop and similar tools by established software vendors into their traditional database and information management offerings.  Stonebraker is highlighting the opposite, the recasting of Hadoop into what looks like a traditional database technology. He says: “Meanwhile most of the data warehouse vendors support HDFS, and many offer features to support semi-structured data. Hence, the data warehouse market and the Hadoop market will quickly converge.”

Another argument I made was “Hadoop is so 2004 (at least at Google).” Here’s Stonebraker on the subject:

Google must be “laughing in their beer” about now. They invented MapReduce to support the web crawl for their search engine in 2004. A few years ago they replaced MapReduce in this application with BigTable, because they wanted an interactive storage system and MapReduce was batch-only. Hence, the driving application behind MapReduce moved to a better platform a while ago. Now Google is reporting that they see little-to-no future need for MapReduce. It is indeed ironic that Hadoop is picking up support in the general community about five years after Google moved on to better things. Hence, the rest of the world followed Google into Hadoop with a delay of most of a decade. Google has long since abandoned it. I wonder how long it will take the rest of the world to follow Google’s direction and do likewise…

No matter. Here’s what Matthew Hedberg, an analyst for RBC Capital Markets, wrote (according to Investor’s Business Daily) just before the Hortonworks quarterly earnings announcement:  “We remain bullish on Hortonworks’ opportunity as a pure play on Hadoop and believe it to be one of the better-positioned disruptive vendors in what could be a once-in-a-decade data replatforming opportunity.” An analyst with Cowen and Co, Jesse Hulsing, expressed a similar bullish sentiment: “The Hadoop market is in early stages of adoption. Our view is that most large enterprises (5,000-plus employees) will have adopted or piloted the technology by fiscal year 2020. The underlying driver of this adoption is the growth in analytic applications, which is driven by rapid growth in new data types and new user types. Hortonworks should benefit from this.”

Maybe the market is indeed going gangbusters and Hortonworks is simply losing to better-equipped competitors, primarily Cloudera?

Apparently anticipating this question, Cloudera issued last week a “momentum press release,” announcing that its 2014 revenues “surpassed $100 million,” calling the results “an indicator of Hadoop’s strong momentum.” Derrick Harris at GigaOm had this to say about the news: “That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.”  Similarly, Arik Hesseldahl at Re/Code noted that a “likely motivation for the press release is a battle of optics between Cloudera and its primary rival, Hortonworks… Cloudera may simply be seeking to remind the marketplace which Hadoop company is bigger.”

It is also reminding the marketplace that it’s not going to be subject to the scrutiny accorded public companies anytime soon. Cloudera co-founder and chief strategy officer Mike Olson told Re/Code: “We have no timeline for an IPO, period.” CEO Tom Reilly told Fortune: “We’re of the size and scale that we could be a successful public company right now. But we’re so well backed that we don’t need to go public to have access to financing.” Indeed, after riding a bubble and raising a cool $1 billion, who needs Wall Street?

But they need Main Street. Regardless of the close to $1.5 billion in venture capital the key Hadoop competitors left standing—Cloudera, Hortonworks and MapR—have raised, to survive and succeed they need enterprise customers to buy the current (and future) Hadoop incarnations they offer.

In my previous post on the Hadoop Bubble, I quoted a 2014 survey conducted by Wikibon which found that only 36% of the respondents were using Hadoop and the majority of those (64%) were using it in proof-of-concept environments.  Even more important to the financial future of Hadoop vendors, Wikibon found that “only 25% of Hadoop practitioners are paying customers of one or another Hadoop vendor. 24% use a free distribution provided by a vendor, but the majority, 51%, roll their own Hadoop downloaded from the Apache Software Foundation.” Don’t you think this has something to do with Hortonworks’ quarterly results?
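The Wikibon percentages compound, which a short Python sketch makes explicit. It rests on an assumption not stated in the survey excerpt: that the “Hadoop practitioners” in the quote are the same 36% of respondents who reported using Hadoop, so the shares can simply be multiplied. The variable names are illustrative.

```python
# Shares as quoted from the Wikibon survey in the paragraph above.
using_hadoop = 0.36          # respondents using Hadoop
proof_of_concept = 0.64      # of those users, share still in proof-of-concept

# How Hadoop practitioners obtain their distribution.
paying_customers = 0.25
free_vendor_distribution = 0.24
self_rolled_apache = 0.51

# ASSUMPTION: "Hadoop practitioners" == the 36% of respondents using Hadoop.
paying_share_of_all = using_hadoop * paying_customers
beyond_poc_share_of_all = using_hadoop * (1 - proof_of_concept)

print(f"Paying Hadoop customers among all respondents: {paying_share_of_all:.0%}")
print(f"Hadoop users beyond proof-of-concept among all respondents: {beyond_poc_share_of_all:.0%}")
```

Under that assumption, fewer than one in ten survey respondents was paying any Hadoop vendor.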

The author of the excellent Wikibon report, Jeff Kelly, gave a presentation last week, titled The Big Data Money Trail. About 23 minutes into the presentation, Kelly gets to a slide titled (surprise!) “Is this the beginning of the end of the bubble, or is there something next that matters?”

Kelly definitely thinks (or at least thought last week) that Hadoop still matters. He thinks Cloudera and Hortonworks will survive and doesn’t back down from his previous estimates of how big the big data market will get. He predicts three future developments, all helping accelerate big data adoption, but not necessarily (in my opinion) promising for Hadoop vendors: enterprises will overcome their process and culture obstacles for adopting big data technologies; innovation will continue to drive the market because it is based on open source software; and while Hadoop was the “low-hanging fruit,” offering cost saving opportunities, now enterprises will start building “data-driven applications.”

To illustrate the last point, specifically the value that can be created by all these new applications of big data, Kelly reproduced on the slide the results of previous work done by Wikibon which estimated the “spend and value delivered by industrial internet” to reach $1.2 trillion in 2020. Bert Latamore, his colleague at Silicon Angle, wrote in his summary of Kelly’s talk that “Vendors will do well in the Big Data market over the next decade, Kelly predicts, but the real winners will be the companies that harness the technology creatively. He estimates that practitioners will create $1.2 trillion in new value from Big Data over the coming decade.” (Italics mine)

So a 2013 report on the Industrial Internet has metamorphosed into current (and misattributed) estimates of how many dollars are swimming in the big data lake. This is how bubbles rise, and eventually, burst.

Or maybe not. Maybe I’m wrong and what we have is the beginning of a solid market for products from disruptive vendors going after a “once-in-a-decade data replatforming opportunity.” After all, Hortonworks provided above-consensus guidance for the current quarter and they are much closer than I am to what is really happening in the marketplace.

In an interview with Derrick Harris conducted last week, Hortonworks CEO Rob Bearden said that he is not backing off his 2014 prediction that Hadoop will soon become a multi-billion-dollar market and Hortonworks will be a billion-dollar company in terms of revenue. Hadoop is actually just a part — albeit a big one — of a major evolution in the data-infrastructure space, he explained to Harris. As companies start replacing the pieces of their data environments, they’ll do so with the open source options that now dominate new technologies. These include Hadoop, NoSQL databases, Storm, Kafka, Spark and the like. “Open source companies can be very successful in terms of revenue growth and in terms of profitability faster than the old proprietary platforms got there,” Bearden said.

 Originally posted on Forbes.com

 

Big Data Quotes: Disruptive Innovation?

“By definition, big data cannot yield complicated descriptions of causality. Especially in healthcare. Almost all of our diseases occur in the intersections of systems in the body. For example, there is a drug that is marketed by Elan BioNeurology called TYSABRI. It was developed for MS [multiple sclerosis]. It turns out that of the people who have MS a proportion respond magnificently to TYSABRI. And others don’t. So what do you conclude from this? Is it just a mediocre drug? No. It is that there is one disease but it manifests itself in different ways. How does big data figure out what is the core of what is going on?”–Clayton Christensen

The Economist’s Data Editor on Data Fetishism

Ken Cukier

“We fetishize data, we think that data is the answer. It’s far from the truth. In fact, it’s ridiculous, because the data is only a simulacrum of reality in the same way that a map is not a territory. And so while we need to use information and data to make decisions as we need to do, the data is always unfaithful, always unreliable, it always misleads, and you have to torture it until it confesses”–Kenneth Cukier, Data Editor, The Economist

Source: Economist Radio, “Arthur Miller and Modern-Day Witch-Hunts”

The End of Big Data and the Beginning of Big Data AI

In December 2014, I asked whether we were at the beginning of “the end of the Hadoop bubble.” I kept updating my Hadoop bubble watch through the much-hyped IPOs of Hortonworks and Cloudera. The question was whether an open-source distributed storage technology which Google invented (and quickly replaced with better tools) could survive as a business proposition at a time when enterprises have moved rapidly to adopting the cloud and “AI” (advanced machine learning or deep learning).

New Research Reports on Big Data

Two new research reports on big data flesh out its early impact on enterprise IT.

Big Data Events

Big Data and Data Science Events

August – November 2012

Last updated August 6, 2012

Highlights: Partner Events

Big Data Innovation Summit September 13-14, Boston

Predictive Analytics World–Government September 17-18, Washington DC

To get 15% off the 2 Day and Combo passes, use this code: WTBDBP12

Predictive Analytics World September 30-October 4, Boston

To get 15% off the 2 Day and Combo passes, use this code: WTBDBP12

Text Analytics World October 3-4, Boston

To get 15% off the 2 Day and Combo passes, use this code: WTBDBP12

Predictive Analytics World November 6-7, Düsseldorf

To get 15% off the 2 Day and Combo passes, use this code: WTBDBP12

Predictive Analytics World November 27-28, London

To get 15% off the 2 Day and Combo passes, use this code: WTBDBP12

The 13th Annual International Conference on Information Reuse and Integration August 8-10, Las Vegas

The 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining August 12-16, Beijing, China

Facebook’s IPO and the Laws of Big Data

Without using any predictive analytics tools, I confidently predict that Facebook’s IPO will give rise to more vocal demands for people to “get a cut” of its—and other social media companies’—profits. People deserve, so the argument goes, a share of any profits derived from mining the social data pool which they have so willingly helped create. Occupy Facebook, anyone?

But before you set up a tent in Menlo Park, consider this proposition: The value of personal data is zero. Personal data is not worth much if it’s kept personal, and a sample of one is good for answering only a very limited set of questions. Personal data gains value when it is shared, when it is combined with and compared to other data.

Machines vs. Models, Noise vs. Signal

An excerpt from Nassim Taleb’s forthcoming book, Antifragile, was posted yesterday on the Farnam Street blog. In “Noise and Signal,” Taleb says that “In business and economic decision-making, data causes severe side effects —data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well discussed property of data: it is toxic in large quantities—even in moderate quantities…. the best way… to mitigate interventionism is to ration the supply of information, as naturalistically as possible. This is hard to accept in the age of the internet. It has been very hard for me to explain that the more data you get, the less you know what’s going on, and the more iatrogenics you will cause.”
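One way to get a feel for how spuriousness can grow with the amount of data is a small simulation. The Python sketch below is not Taleb’s own argument; it illustrates a related, well-known effect (multiple comparisons): among purely random, unrelated series, the number of pairs that look strongly correlated rises as more series are collected. The series length, the 0.5 threshold, and the function names are all illustrative assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_spurious(num_series, num_points=30, threshold=0.5, seed=0):
    """Count pairs of independent random series whose |correlation| exceeds threshold."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(num_points)]
              for _ in range(num_series)]
    return sum(1 for a, b in combinations(series, 2)
               if abs(pearson(a, b)) > threshold)


# Every series is pure noise, so any "strong" correlation found is spurious.
for k in (10, 40, 160):
    pairs = k * (k - 1) // 2
    print(f"{k} series, {pairs} pairs, {count_spurious(k)} look strongly correlated")
```

Because the number of pairs grows roughly with the square of the number of series, a fixed small chance of a false hit per pair turns into a steadily growing pile of apparent findings.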

McKinsey Updates Estimates of Big Data Potential Value

[Exhibit: McKinsey’s 2011 estimates of the potential value of big data]

Source: McKinsey, 2011

[In 2011], we estimated the potential for big data and analytics to create value in five specific domains. Revisiting them today shows uneven progress and a great deal of that value still on the table (exhibit). The greatest advances have occurred in location-based services and in US retail, both areas with competitors that are digital natives. In contrast, manufacturing, the EU public sector, and healthcare have captured less than 30 percent of the potential value we highlighted five years ago. And new opportunities have arisen since 2011, further widening the gap between the leaders and laggards.

Source: McKinsey, 2016
