Skills for Big Data Jobs and Careers (Infographic)

Infographic - The Emerging Skillsets of the Data Revolution

Source: ObjectRocket

Posted in Big Data

What’s the Big Data? 12 Definitions

Last week I got an email from UC Berkeley’s Master of Information and Data Science program, asking me to respond to a survey of data science thought leaders on the question “What is big data?” I was especially delighted to be regarded as a “thought leader” by Berkeley’s School of Information, whose previous dean, Hal Varian (now chief economist at Google), answered my challenge fourteen years ago and produced the first study to estimate the amount of new information created in the world annually, a study I consider to be a major milestone in the evolution of our understanding of big data.

The Berkeley researchers estimated that the world had produced about 1.5 billion gigabytes of information in 1999, and in a 2003 replication of the study found that the amount had doubled in 3 years. Data was already getting bigger and bigger, and around that time, in 2001, industry analyst Doug Laney described the “3Vs” (volume, variety, and velocity) as the key “data management challenges” for enterprises, the same “3Vs” that have been used in the last four years by just about anyone attempting to define or describe big data.
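As a rough illustration of what “doubled in 3 years” means as a yearly rate, here is a minimal Python sketch. It assumes smooth compound growth, which is my simplification and not something the Berkeley studies claim, and the function name is hypothetical.

```python
def implied_annual_growth(growth_factor: float, years: float) -> float:
    """Compound annual growth rate implied by a total growth factor over a period.

    Assumes smooth compound growth (an illustrative simplification).
    """
    return growth_factor ** (1.0 / years) - 1.0


# Figures from the text: about 1.5 billion gigabytes in 1999, doubled 3 years later.
volume_1999_gb = 1.5e9
volume_after_3_years_gb = 2 * volume_1999_gb

rate = implied_annual_growth(volume_after_3_years_gb / volume_1999_gb, 3)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, a doubling over three years works out to roughly 26% growth per year.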

The first documented use of the term “big data” appeared in a 1997 paper by scientists at NASA, describing the problem they had with visualization (i.e. computer graphics) which “provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”

In 2008, a number of prominent American computer scientists popularized the term, predicting that “big-data computing” will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations.” The term “big-data computing,” however, is never defined in the paper.

The traditional database of authoritative definitions is, of course, the Oxford English Dictionary (OED). Here’s how the OED defines big data: (definition #1) “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

But this is 2014 and maybe the first place to look for definitions should be Wikipedia. Indeed, it looks like the OED followed its lead. Wikipedia defines big data (and it did it before the OED) as (#2) “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”

While a variation of this definition is what is used by most commentators on big data, its similarity to the 1997 definition by the NASA researchers reveals its weakness. “Large” and “traditional” are relative and ambiguous (and potentially self-serving for IT vendors selling either “more resources” of the “traditional” variety or new, non-“traditional” technologies).

The widely-quoted 2011 big data study by McKinsey highlighted that definitional challenge. Defining big data as (#3) “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze,” the McKinsey researchers acknowledged that “this definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data.” As a result, all the quantitative insights of the study, including the updating of the UC Berkeley numbers by estimating how much new data is stored by enterprises and consumers annually, relate to digital data, rather than just big data; i.e., no attempt was made to estimate how much of the data (or “datasets”) enterprises store is big data.

Another prominent source on big data is Viktor Mayer-Schönberger and Kenneth Cukier’s book on the subject. Noting that “there is no rigorous definition of big data,” they offer one that points to what can be done with the data and why its size matters:

(#4) “The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”

In Big Data@Work, Tom Davenport concludes that because of “the problems with the definition” of big data, “I (and other experts I have consulted) predict a relatively short life span for this unfortunate term.” Still, Davenport offers this definition:

(#5) “The broad range of new and massive data types that have appeared over the last decade or so.”

Let me offer a few other possible definitions:

(#6) The new tools helping us find relevant data and analyze its implications.

(#7) The convergence of enterprise and consumer IT.

(#8) The shift (for enterprises) from processing internal data to mining external data.

(#9) The shift (for individuals) from consuming data to creating data.

(#10) The merger of Madame Olympe Maxime and Lieutenant Commander Data.

(#11) The belief that the more data you have, the more insights and answers will rise automatically from the pool of ones and zeros.

(#12) A new attitude by businesses, non-profits, government agencies, and individuals that combining data from multiple sources could lead to better decisions.

I like the last two. #11 is a warning against blindly collecting more data for the sake of collecting more data (see NSA). #12 is an acknowledgment that storing data in “data silos” has been the key obstacle to getting the data to work for us, to improve our work and lives. It’s all about attitude, not technologies or quantities.

What’s your definition of big data?

See here for the compilation of big data definitions from 40+ thought leaders.

[Originally published on Forbes.com]

Posted in Big Data

The OED, Big Data, and Crowdsourcing

The term “big data” was included in the most recent quarterly online update of the Oxford English Dictionary (OED). So now we have a most authoritative definition of what recently became big news: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Beyond succinct definitions, the enchanting beauty of the OED, at least for those who love words and their history, lies in the collection of quotations illustrating the forms and uses of each word from the earliest known instance of its occurrence to more recent ones.

As someone who has been somewhat preoccupied with uncovering the historical antecedents for our present-day usage of the term big data (see A Very Short History of Big Data), I was delightfully surprised to find out that the OED team has discovered that the earliest use of the term happened in 1980, seventeen years before the publication of the first paper in the ACM digital library to use (and define) “big data.” Sociologist Charles Tilly wrote in a 1980 working paper surveying “The old new social history and the new old social history” that “none of the big questions has actually yielded to the bludgeoning of the big-data people.” While the context is the increasing use of computer technology and statistical methods by historians, it is clear that Tilly used the term not to describe specifically the magnitude of the data but as a flourish of the pen following the words “big questions.” The meaning of the sentence would not change if he had used only the word “data.”

While I’m quite sure that Tilly did not have in mind big data as it is defined by the OED itself, the context of his discussion is very relevant to today’s debates regarding big data and data science. In the section of the article from which the “big data” quote is taken, Tilly paraphrases the discussion in a 1979 paper by historian Lawrence Stone of the use of quantitative methods in historical research and attempts to make it a “science.”

Stone’s criticism of “cliometricians,” whose “special field is economic history,” reads like a description of the work of many “quants”—in Wall Street, academia, or government—in the forty-five years since he issued his warning: “[Their] great enterprises are necessarily the result of team-work, rather like building the pyramids: squads of diligent assistants assemble data, encode it, programme it, and pass it through the maw of the computer, all under the autocratic direction of a team-leader. The results cannot be tested by any of the traditional methods since the evidence is buried in private computer-tapes, not exposed in published footnotes. In any case the data are often expressed in so mathematically recondite a form that they are unintelligible to the majority of historical profession. The only reassurance to the bemused laity is that the members of this priestly order disagree fiercely and publicly about the validity of each other’s findings.”

Anticipating today’s doubts about the effectiveness of big data and concerns about the ratio of signal to noise, Stone concludes: “in general, the sophistication of the methodology has tended to exceed the reliability of the data, while the usefulness of the results seem—up to a point—to be in inverse correlation to the mathematical complexity of the methodology and the grandiose scale of data-collection.” (See also a recent enthusiastic embrace of the application of data science to the humanities, and a rebuttal.)

As Tilly hinted in the title to his paper, the new on many occasions is a very familiar old. Just scratch the surface and you find that the “revolution”—a word which we now tend to use liberally to describe any technological development—nicely delivers us to some place in the past while providing a soothing sense of moving forward. Indeed, the first sense of the word “revolution” in the OED is “The action or fact, on the part of celestial bodies, of moving around in an orbit or circular course” or simply “The return or recurrence of a point or period of time.”

Another word added to the OED online in the recent update affirms the notion that (almost) everything old is new again. While “crowdsourcing” was coined by Jeff Howe in 2006, this “new” (revolutionary?) practice launched the OED a century and a half ago:

In July 1857 a circular was issued by the ‘Unregistered Words Committee’ of the Philological Society of London, which had set up the Committee a few weeks earlier to organize the collection of material to supplement the best existing dictionaries. This circular, which was reprinted in various journals, asked for volunteers to undertake to read particular books and copy out quotations illustrating ‘unregistered’ words and meanings—items not recorded in other dictionaries—that could be included in the proposed supplement. Several dozen volunteers came forward, and the quotations began to pour in.

The volume of the “unregistered” material was such that in January 1858, the Philological Society decided that “efforts should be directed toward the compilation of a complete dictionary, and one of unprecedented comprehensiveness.” It took a while, but in April 1879, the newly-appointed editor James Murray issued an appeal to the public, asking for volunteers to read specific books in search of quotations to be included in the future dictionary. Within a year there were close to 800 volunteers, and over the next three years, 3,500,000 quotation slips were received and processed by the OED team.

Was this the first big-data-crowdsourcing project?

Posted in Big Data, Data Science

A Very Short History of Big Data

In addition to researching A Very Short History of Data Science, I have also been looking at the history of how data became big. Here I focus on the history of attempts to quantify the growth rate in the volume of data or what has popularly been known as the “information explosion” (a term first used in 1941, according to the OED). The following are the major milestones in the history of sizing data volumes plus other “firsts” or observations pertaining to the evolution of the idea of “big data.”

[An updated version of this timeline is at Forbes.com]

Posted in Big Data

3 Big Data Milestones

If you were asked to name the top three events in the history of the IT industry, which ones would you choose? Here’s my list:

June 30, 1945: John von Neumann published the First Draft of a Report on the EDVAC, the first documented discussion of the stored program concept and the blueprint for computer architecture to this day.

May 22, 1973: Bob Metcalfe “banged out the memo inventing Ethernet” at Xerox Palo Alto Research Center (PARC).

March 1989: Tim Berners-Lee circulated “Information Management: A Proposal” at CERN, in which he outlined a global hypertext system.

[Note: if round numbers are your passion, you may opt—without changing the substance of this condensed history—for the ENIAC proposal of April 1943, Ethernet in 1973, and CERN making the World Wide Web available to the world free of charge in April 1993, so that 2013 marks the 70th, 40th, and 20th anniversaries of these events.]

Why bother at all to look back? And why did I select these as the top three milestones in the evolution of information technology?

Most observers of the IT industry prefer and are expected to talk about what’s coming, not what’s happened. But to make educated guesses about the future of the IT industry, it helps to understand its past. Here I depart from most commentators who, if they talk at all about the industry’s past, divide it into hardware-defined “eras,” usually labeled “mainframes,” “PCs,” “Internet,” and “Post-PC.”

Another way of looking at the evolution of IT is to focus on the specific contributions of technological inventions and advances to the industry’s key growth driver: digitization and the resulting growth in the amount of digital data created, shared, and consumed. Each of these three events represents a leap forward, a quantitative and qualitative change in the growth trajectory of what we now call big data.

The industry was born with the first giant calculators digitally processing and manipulating numbers and then expanded to digitize other, mostly transaction-oriented activities, such as airline reservations. But until the 1980s, all computer-related activities revolved around interactions between a person and a computer. That did not change when the first PCs arrived on the scene.

The PC was simply a mainframe on your desk. Of course it unleashed a wonderful stream of personal productivity applications that in turn contributed greatly to the growth of enterprise data and the start of digitizing leisure-related, home-based activities. But I would argue that the major quantitative and qualitative leap occurred only when work PCs were connected to each other via Local Area Networks (LANs)—where Ethernet became the standard—and then long-distance via Wide Area Networks (WANs). With the PC, you could digitally create the memo you previously typed on a typewriter, but to distribute it, you still had to print it and make paper copies. Computer networks (and their “killer app,” email) made the entire process digital, ensuring the proliferation of the message, drastically increasing the amount of data created, stored, moved, and consumed.

Connecting people in a vast and distributed network of computers not only increased the amount of data generated but also led to numerous new ways of getting value out of it, unleashing many new enterprise applications and a new passion for “data mining.” This in turn changed the nature of competition and gave rise to new “horizontal” players, focused on one IT component as opposed to the vertically integrated, “end-to-end solution” business model that had dominated the industry until then. Intel in semiconductors, Microsoft in operating systems, Oracle in databases, Cisco in networking, Dell in PCs (or rather, build-to-order PCs), and EMC in storage made the 1990s the decade in which “best-of-breed” was what many IT buyers believed in, assembling their IT infrastructures from components sold by focused, specialized IT vendors.

The next phase in the evolution of the industry, the next quantitative and qualitative leap in the amount of data generated, came with the invention of the World Wide Web (commonly mislabeled as “the Internet”). It led to the proliferation of new applications which were no longer limited to enterprise-related activities but digitized almost any activity in our lives. Most important, it provided us with tools that greatly facilitated the creation and sharing of information by anyone with access to the Internet (the open and almost free wide area network only a few people cared or knew about before the invention of the World Wide Web). The work memo I typed on a typewriter, which became a digital document sent across the enterprise and beyond, now became my life journal, which I could discuss with others, including people on the other side of the globe I have never met. While computer networks took IT from the accounting department to all corners of the enterprise, the World Wide Web took IT to all corners of the globe, connecting millions of people. Interactive conversations and sharing of information among these millions replaced and augmented broadcasting and drastically increased (again) the amount of data created, stored, moved, and consumed. And just as in the previous phase, a bunch of new players emerged, all of them born on the Web, all of them regarding “IT” not as a specific function responsible for running the infrastructure but as the essence of their business, with data and its analysis becoming their competitive edge.

We are probably going to see soon—and maybe already are experiencing—a new phase in the evolution of IT and a new quantitative and qualitative leap in the growth of data. The cloud—a new way to deliver IT, big data—a new attitude towards data and its potential value, and The Internet of Things (including wearable computers such as Google Glass)—connecting billions of monitoring and measurement devices quantifying everything—combine to sketch for us the future of IT.

[Originally published on Forbes.com]

Posted in Big Data

Big Data Quotes of the Week

“Data is everywhere. It exists. We’re just pulling it into one place and our goal is to make it consumable for teachers”–Fahad Hassan, Always Prepped

“A lot of people are changing their title, but they’re not really data scientists, and there’s a lot of talk about the skills shortage. There just aren’t enough of them”–Amit Bendov, SiSense

“Engineering, I think you can pick up. [A data scientist’s] curiosity is built-in”–Scott Nicholson, Accretive Health

“The thought process is the most important ingredient in data science”–Catalin Ciobanu, Carlson Wagonlit Travel

“We run the company by questions, not by answers. So in the strategy process we’ve so far formulated 30 questions that we have to answer […] You ask it as a question, rather than a pithy answer, and that stimulates conversation. Out of the conversation comes innovation”–Eric Schmidt, Google

“We’re seeing the beginnings of bringing the collaboration models that have been vastly successful in open-source communities to data science… The future looks like this: The entire workflow from data to analysis to result to visualization will be social and collaborative”–Donnie Berkholz

“it’s not hard to imagine a day where [baseball] managers… have their locker room data scientist run real-time, in-game analytics using technologies like Cassandra, HBase, Drill, and Impala”–Barry Eggers, Lightspeed Venture Partners

“Measuring influence is hard, especially in the context of an online social network. We may not be able to explicitly model the process of persuading others to change their behavior, especially when we do not have all of the necessary data in one place. But it is crucial test of an influence measure’s realism that it recognize human attention as a scarce commodity, and that it be resistant to manipulation. In any case, influence matters too much for us not to try to measure it. Influence is ultimately about the battle for the scarce space in people’s minds–our most precious natural resource”–Daniel Tunkelang, LinkedIn

Posted in Statistics

Gartner on Big Data

In its just-published Hype Cycle for Cloud Computing 2012, Gartner predicts that “Big Data will deliver transformational benefits to enterprises within 2 to 5 years, and by 2015 will enable enterprises adopting this technology to outperform competitors by 20% in every available financial metric.” The “transformational benefits,” however, will be delivered to very few enterprises according to another Gartner prediction, from December 2011: “Through 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.”

Gartner currently positions Big Data just below “the peak of inflated expectations.”

Posted in Big Data

Big Data: A Revolution that Will Transform How We Live, Work, and Think

Viktor Mayer-Schönberger and Kenneth Cukier, authors of the just-published Big Data: A Revolution that Will Transform How We Live, Work, and Think, reacted sharply when I asked them if they are cheerleaders for big data, as one reviewer implied. “We are messengers of big data, not its evangelists,” said Cukier. Added Mayer-Schönberger: “The reviewer did not read the book.”

I did. Big Data is an excellent introduction for general audiences to what has become a topic of conversation everywhere, faster than any other technology-driven buzzword in recent memory. To those who may react to “big data” as today’s incarnation of “big brother,” Mayer-Schönberger and Cukier offer a comprehensive and highly readable overview of the benefits and risks associated with big data, which they define as “the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value.”

Posted in Big Data

Past Courses in Big Data Analytics and Data Science: Content Online

Analyzing Big Data with Twitter (UC Berkeley, School of Information) (Fall 2012)

Introduction to Data Science (Columbia University, Statistics Department) (Fall 2012)

Introduction to Data Science (UC Berkeley, Computer Science) (Spring 2011)

Posted in Big Data, Data Science

The Hadoop Bubble Quivers As Hortonworks Misses

Last month, Hortonworks announced quarterly results for the first time as a public company and they came in below expectations. It had revenues of $12.7 million (up 55% year-over-year), but average Wall Street estimates were $13.42 million. Similarly, Wall Street expected a loss of $2.04 per share and Hortonworks reported a loss of $2.19 per share.

The results could be attributed to a company new to the game of providing guidance to Wall Street. But the company’s management had substantial experience in that department throughout their impressive careers, so we must look somewhere else for an explanation. What if November 10, 2014, the day Hortonworks filed the paperwork for its IPO, was the beginning of the end of the Hadoop bubble, to quote your humble correspondent? What if on December 12, 2014, the day Hortonworks went public, surprising many by its swift action, the bubble “began to quiver and shake preparatory to its bursting”? What if Hortonworks had decided to rush to the exit while expectations were high?

People who had over-inflated expectations—and may have grumbled yesterday “what were we thinking”—should have listened to Mike Stonebraker last August. Here’s what this foremost authority on databases (and serial entrepreneur) said about the new generation of Hadoop from Hortonworks competitor Cloudera:

Impala is architected exactly like all of the shared-nothing parallel SQL DBMSs, serving the data warehouse market. Specifically, notice clearly that the MapReduce layer has been removed, and for good reason. As some of us have been pointing out for years, MapReduce is not a useful internal interface inside a SQL (or Hive) DBMS. Impala was architected by savvy DBMS developers, who know the above pragma. In fact, development activity similar to Impala is being done by both HortonWorks and FaceBook. This, of course, presents the Hadoop vendors with a dilemma. Historically, “Hadoop” referred to the open source version of MapReduce written by Yahoo. However, Impala has thrown this layer out of the stack. How can one be a Hadoop vendor, when Hadoop is no longer in the mainstream stack? The answer is simple: redefine “Hadoop”, and that is exactly what the Hadoop vendors have done. The word “Hadoop” is now used to mean the entire stack.

In my post, I suggested “a few things to ponder when considering the potential success of the current leading Hadoop vendors and whether Hadoop in general is in the first stage of a rapid market expansion or the last stage of a bubble inflating.” One of them was the incorporation of Hadoop and similar tools by established software vendors into their traditional database and information management offerings. Stonebraker is highlighting the opposite, the recasting of Hadoop into what looks like a traditional database technology. He says: “Meanwhile most of the data warehouse vendors support HDFS, and many offer features to support semi-structured data. Hence, the data warehouse market and the Hadoop market will quickly converge.”

Another argument I made was “Hadoop is so 2004 (at least at Google).” Here’s Stonebraker on the subject:

Google must be “laughing in their beer” about now. They invented MapReduce to support the web crawl for their search engine in 2004. A few years ago they replaced MapReduce in this application with BigTable, because they wanted an interactive storage system and MapReduce was batch-only. Hence, the driving application behind MapReduce moved to a better platform a while ago. Now Google is reporting that they see little-to-no future need for MapReduce. It is indeed ironic that Hadoop is picking up support in the general community about five years after Google moved on to better things. Hence, the rest of the world followed Google into Hadoop with a delay of most of a decade. Google has long since abandoned it. I wonder how long it will take the rest of the world to follow Google’s direction and do likewise…

No matter. Here’s what Matthew Hedberg, an analyst for RBC Capital Markets, wrote (according to Investor’s Business Daily) just before the Hortonworks quarterly earnings announcement: “We remain bullish on Hortonworks’ opportunity as a pure play on Hadoop and believe it to be one of the better-positioned disruptive vendors in what could be a once-in-a-decade data replatforming opportunity.” An analyst with Cowen and Co, Jesse Hulsing, expressed a similarly bullish sentiment: “The Hadoop market is in early stages of adoption. Our view is that most large enterprises (5,000-plus employees) will have adopted or piloted the technology by fiscal year 2020. The underlying driver of this adoption is the growth in analytic applications, which is driven by rapid growth in new data types and new user types. Hortonworks should benefit from this.”

Maybe the market is indeed going gangbusters and Hortonworks is simply losing to better-equipped competitors, primarily Cloudera?

Apparently anticipating this question, Cloudera last week issued a “momentum press release,” announcing that its 2014 revenues “surpassed $100 million,” calling the results “an indicator of Hadoop’s strong momentum.” Derrick Harris at GigaOm had this to say about the news: “That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.” Similarly, Arik Hesseldahl at Re/Code noted that a “likely motivation for the press release is a battle of optics between Cloudera and its primary rival, Hortonworks… Cloudera may simply be seeking to remind the marketplace which Hadoop company is bigger.”

It is also reminding the marketplace that it’s not going to be subject to the scrutiny accorded public companies anytime soon. Cloudera co-founder and chief strategy officer Mike Olson told Re/Code: “We have no timeline for an IPO, period.” CEO Tom Reilly told Fortune: “We’re of the size and scale that we could be a successful public company right now. But we’re so well backed that we don’t need to go public to have access to financing.” Indeed, after riding a bubble and raising a cool $1 billion, who needs Wall Street?

But they need Main Street. Regardless of the close to $1.5 billion in venture capital the key Hadoop competitors left standing—Cloudera, Hortonworks and MapR—have raised, to survive and succeed they need enterprise customers to buy the current (and future) Hadoop incarnations they offer.

In my previous post on the Hadoop Bubble, I quoted a 2014 survey conducted by Wikibon which found that only 36% of the respondents were using Hadoop and the majority of those (64%) were using it in proof-of-concept environments. Even more important to the financial future of Hadoop vendors, Wikibon found that “only 25% of Hadoop practitioners are paying customers of one or another Hadoop vendor. 24% use a free distribution provided by a vendor, but the majority, 51%, roll their own Hadoop downloaded from the Apache Software Foundation.” Don’t you think this has something to do with Hortonworks’ quarterly results?
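To see how small the paying base is, the two Wikibon percentages can be multiplied together. The sketch below is a back-of-the-envelope calculation only: it assumes that the “Hadoop practitioners” in the quote are the same 36% of respondents who reported using Hadoop, which the survey summary quoted here does not state explicitly. The variable names are mine.

```python
# Survey figures quoted in the text (Wikibon, 2014).
share_using_hadoop = 0.36  # respondents using Hadoop

# Breakdown of Hadoop practitioners by how they obtain Hadoop.
# Assumption: this breakdown applies to the same 36% of respondents.
practitioner_breakdown = {
    "paying vendor customers": 0.25,
    "free vendor distribution": 0.24,
    "self-built from Apache": 0.51,
}

# Convert each practitioner share into a share of all survey respondents.
share_of_all_respondents = {
    group: share_using_hadoop * share
    for group, share in practitioner_breakdown.items()
}

for group, share in share_of_all_respondents.items():
    print(f"{group}: {share:.1%} of all respondents")
```

Under that assumption, paying customers of any Hadoop vendor come to about 9% of all survey respondents.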

The author of the excellent Wikibon report, Jeff Kelly, gave a presentation last week, titled The Big Data Money Trail. About 23 minutes into the presentation, Kelly gets to a slide titled (surprise!) “Is this the beginning of the end of the bubble, or is there something next that matters?”

Kelly definitely thinks (or at least thought last week) that Hadoop still matters. He thinks Cloudera and Hortonworks will survive and doesn’t back down from his previous estimates of how big the big data market will get. He predicts three future developments, all helping accelerate big data adoption, but not necessarily (in my opinion) promising for Hadoop vendors: enterprises will overcome their process and culture obstacles for adopting big data technologies; innovation will continue to drive the market because it is based on open source software; and while Hadoop was the “low-hanging fruit,” offering cost saving opportunities, now enterprises will start building “data-driven applications.”

To illustrate the last point, specifically the value that can be created by all these new applications of big data, Kelly reproduced on the slide the results of previous work done by Wikibon which estimated the “spend and value delivered by industrial internet” to reach $1.2 trillion in 2020. Bert Latamore, his colleague at SiliconANGLE, wrote in his summary of Kelly’s talk that “Vendors will do well in the Big Data market over the next decade, Kelly predicts, but the real winners will be the companies that harness the technology creatively. He estimates that practitioners will create $1.2 trillion in new value from Big Data over the coming decade.” (Italics mine)

So a 2013 report on the Industrial Internet has metamorphosed into current (and misattributed) estimates of how many dollars are swimming in the big data lake. This is how bubbles rise, and eventually, burst.

Or maybe not. Maybe I’m wrong and what we have is the beginning of a solid market for products from disruptive vendors going after a “once-in-a-decade data replatforming opportunity.” After all, Hortonworks provided above-consensus guidance for the current quarter and they are much closer than I to what is really happening in the marketplace.

In an interview with Derrick Harris conducted last week, Hortonworks CEO Rob Bearden said that he is not backing off his 2014 prediction that Hadoop will soon become a multi-billion-dollar market and Hortonworks will be a billion-dollar company in terms of revenue. Hadoop is actually just a part — albeit a big one — of a major evolution in the data-infrastructure space, he explained to Harris. As companies start replacing the pieces of their data environments, they’ll do so with the open source options that now dominate new technologies. These include Hadoop, NoSQL databases, Storm, Kafka, Spark and the like. “Open source companies can be very successful in terms of revenue growth and in terms of profitability faster than the old proprietary platforms got there,” Bearden said.

[Originally published on Forbes.com]

 

Posted in Big Data