WhatsTheBigData

The Data Market to Nearly Double in Size by 2019

Posted on May 28, 2015 by GilPress

Consisting of data platforms, data management, analytics, and data mining, the Total Data Market is expected to nearly double in size, from $60bn in 2014 to $115bn in 2019. The forecast is based on 451 Research’s new Total Data Market Monitor service, which presents data, generated via a bottom-up analysis, on 202 vendors that participate across the nine Total Data segments the company tracks. Specifically, 451 Research tracks 56 Operational Database participants, 26 in the Analytic Database market, 72 within the Reporting and Analytics segment, 41 Data Management vendors, 11 Performance Management vendors, 11 Event/Stream Processing vendors, 9 Distributed Data Grid/Cache vendors, 25 Hadoop vendors and 15 Search vendors.
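As a rough check on these figures, here is a minimal sketch that derives the compound annual growth rate implied by the forecast and adds up the per-segment participant counts. The variable names are mine, not 451 Research’s, and the calculation assumes smooth year-over-year compounding, which the forecast itself does not state.

```python
# Figures taken from the post; the compounding assumption is illustrative only.
start_bn = 60.0          # Total Data Market size in 2014, in $bn
end_bn = 115.0           # forecast size in 2019, in $bn
years = 2019 - 2014      # five-year forecast horizon

# Compound annual growth rate implied by the two endpoints.
cagr = (end_bn / start_bn) ** (1 / years) - 1

# Participant counts per segment, as listed in the post.
segment_counts = {
    "Operational Database": 56,
    "Analytic Database": 26,
    "Reporting and Analytics": 72,
    "Data Management": 41,
    "Performance Management": 11,
    "Event/Stream Processing": 11,
    "Distributed Data Grid/Cache": 9,
    "Hadoop": 25,
    "Search": 15,
}
total_participations = sum(segment_counts.values())
unique_vendors = 202     # vendors tracked overall, per the post

print(f"Implied CAGR: {cagr:.1%}")
print(f"Segment participations: {total_participations}")
print(f"Participations beyond unique vendors: {total_participations - unique_vendors}")
```

The segment counts add up to more than 202, which is consistent with some vendors participating in more than one segment.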

Posted in Big Data

Jake Flomenberg from Accel Partners on the Big Data Market (Video)

Posted on February 26, 2014 by GilPress

[youtube http://www.youtube.com/watch?v=SHOw-2IWHZE]

From VentureBeat:

In a new video, Jake Flomenberg of Accel Partners lays out his view of the big data market and the investing opportunities he’s excited about. He’s talking with another data expert: Stefan Groschupf, the chief executive of well-funded big data startup Datameer.

Flomenberg knows what he’s talking about: He worked on sales, marketing, and product problems at hot data startup (and likely IPO candidate) Cloudera.

He’s one person who works with Accel’s big data fund. He managed to get in on hot data-transformation startup Trifacta, as well as marketing-focused Origami Logic and log-management company Sumo Logic.

Posted in Big Data

Big Data Landscape 2017: Big Data + AI = New IT Stack

Posted on April 11, 2017 by GilPress

Matt Turck:

We’re witnessing the emergence of a new stack, where Big Data technologies are used to handle core data engineering challenges, and machine learning is used to extract value from the data (in the form of analytical insights, or actions).

In other words: Big Data provides the pipes, and AI provides the smarts.

Of course, this symbiotic relationship has existed for years, but its implementation was only available to a privileged few.

The democratization of those technologies has now started in earnest. “Big Data + AI” is becoming the default stack upon which many modern applications (whether targeting consumers or enterprise) are being built. Both startups and some Fortune 1000 companies are leveraging this new stack…

Often, but not always, the cloud is the third leg of the stool. This trend is precipitated by all the efforts of the cloud giants, who are now in an open war to provide access to a machine learning cloud.

Posted in Artificial Intelligence, Big Data | Tagged matt turck

Skills for Big Data Jobs and Careers (Infographic)

Posted on May 27, 2015 by GilPress

Source: ObjectRocket

Posted in Big Data

3 Big Data Milestones

Posted on May 21, 2013 by GilPress

If you were asked to name the top three events in the history of the IT industry, which ones would you choose? Here’s my list:

June 30, 1945: John Von Neumann published the First Draft of a Report on the EDVAC, the first documented discussion of the stored program concept and the blueprint for computer architecture to this day.

May 22, 1973: Bob Metcalfe “banged out the memo inventing Ethernet” at Xerox Palo Alto Research Center (PARC).

March 1989: Tim Berners-Lee circulated “Information management: A proposal” at CERN in which he outlined a global hypertext system.

[Note: if round numbers are your passion, you may opt, without changing the substance of this condensed history, for the ENIAC proposal of April 1943, Ethernet in 1973, and CERN making the World Wide Web available to the world free of charge in April 1993, so that 2013 marks the 70th, 40th, and 20th anniversaries of these events.]

Why bother at all to look back? And why did I select these as the top three milestones in the evolution of information technology?

Most observers of the IT industry prefer and are expected to talk about what’s coming, not what’s happened. But to make educated guesses about the future of the IT industry, it helps to understand its past. Here I depart from most commentators who, if they talk at all about the industry’s past, divide it into hardware-defined “eras,” usually labeled “mainframes,” “PCs,” “Internet,” and “Post-PC.”

Another way of looking at the evolution of IT is to focus on the specific contributions of technological inventions and advances to the industry’s key growth driver: digitization and the resulting growth in the amount of digital data created, shared, and consumed. Each of these three events represents a leap forward, a quantitative and qualitative change in the growth trajectory of what we now call big data.

The industry was born with the first giant calculators digitally processing and manipulating numbers and then expanded to digitize other, mostly transaction-oriented activities, such as airline reservations. But until the 1980s, all computer-related activities revolved around interactions between a person and a computer. That did not change when the first PCs arrived on the scene.

The PC was simply a mainframe on your desk. Of course it unleashed a wonderful stream of personal productivity applications that in turn contributed greatly to the growth of enterprise data and the start of digitizing leisure-related, home-based activities. But I would argue that the major quantitative and qualitative leap occurred only when work PCs were connected to each other via Local Area Networks (LANs)—where Ethernet became the standard—and then long-distance via Wide Area Networks (WANs). With the PC, you could digitally create the memo you previously typed on a typewriter, but to distribute it, you still had to print it and make paper copies. Computer networks (and their “killer app,” email) made the entire process digital, ensuring the proliferation of the message, drastically increasing the amount of data created, stored, moved, and consumed.

Connecting people in a vast and distributed network of computers not only increased the amount of data generated but also led to numerous new ways of getting value out of it, unleashing many new enterprise applications and a new passion for “data mining.” This in turn changed the nature of competition and gave rise to new “horizontal” players, focused on one IT component as opposed to the vertically integrated, “end-to-end solution” business model that had dominated the industry until then. Intel in semiconductors, Microsoft in operating systems, Oracle in databases, Cisco in networking, Dell in PCs (or rather, build-to-order PCs), and EMC in storage made the 1990s the decade in which “best-of-breed” was what many IT buyers believed in, assembling their IT infrastructures from components sold by focused, specialized IT vendors.

The next phase in the evolution of the industry, the next quantitative and qualitative leap in the amount of data generated, came with the invention of the World Wide Web (commonly mislabeled as “the Internet”). It led to the proliferation of new applications which were no longer limited to enterprise-related activities but digitized almost any activity in our lives. Most important, it provided us with tools that greatly facilitated the creation and sharing of information by anyone with access to the Internet (the open and almost free wide area network that only a few people cared or knew about before the invention of the World Wide Web). The work memo I typed on a typewriter, which became a digital document sent across the enterprise and beyond, now became my life journal, which I could discuss with others, including people on the other side of the globe I have never met. While computer networks took IT from the accounting department to all corners of the enterprise, the World Wide Web took IT to all corners of the globe, connecting millions of people. Interactive conversations and sharing of information among these millions replaced and augmented broadcasting and drastically increased (again) the amount of data created, stored, moved, and consumed. And just as in the previous phase, a bunch of new players emerged, all of them born on the Web, all of them regarding “IT” not as a specific function responsible for running the infrastructure but as the essence of their business, data and its analysis becoming their competitive edge.

We are probably going to see soon (and maybe are already experiencing) a new phase in the evolution of IT and a new quantitative and qualitative leap in the growth of data. Three developments combine to sketch for us the future of IT: the cloud, a new way to deliver IT; big data, a new attitude towards data and its potential value; and the Internet of Things (including wearable computers such as Google Glass), connecting billions of monitoring and measurement devices quantifying everything.

[Originally published on Forbes.com]

Posted in Big Data

A Very Short History of Big Data

Posted on June 6, 2012 by GilPress

In addition to researching A Very Short History of Data Science, I have also been looking at the history of how data became big. Here I focus on the history of attempts to quantify the growth rate in the volume of data or what has popularly been known as the “information explosion” (a term first used in 1941, according to the OED). The following are the major milestones in the history of sizing data volumes plus other “firsts” or observations pertaining to the evolution of the idea of “big data.”

[An updated version of this timeline is at Forbes.com]

Posted in Big Data

The OED, Big Data, and Crowdsourcing

Posted on August 17, 2013 by GilPress

The term “big data” was included in the most recent quarterly online update of the Oxford English Dictionary (OED). So now we have a most authoritative definition of what recently became big news: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Beyond succinct definitions, the enchanting beauty of the OED, at least for those who love words and their history, lies in the collection of quotations illustrating the forms and uses of each word from the earliest known instance of its occurrence to more recent ones.

As someone who has been somewhat preoccupied with uncovering the historical antecedents for our present day usage of the term big data (see A Very Short History of Big Data), I was delightfully surprised to find out that the OED team has discovered that the earliest use of the term happened in 1980, seventeen years before the publication of the first paper in the ACM digital library to use (and define) “big data.” Sociologist Charles Tilly wrote in a 1980 working paper surveying “The old new social history and the new old social history” that “none of the big questions has actually yielded to the bludgeoning of the big-data people.” While the context is the increasing use of computer technology and statistical methods by historians, it is clear that Tilly used the term not to describe specifically the magnitude of the data but as a flourish of the pen following the words “big questions.” The meaning of the sentence would not change if he used only the word “data.”

While I’m quite sure that Tilly did not have in mind big data as it is defined by the OED itself, the context of his discussion is very relevant to today’s debates regarding big data and data science. In the section of the article from which the “big data” quote is taken, Tilly paraphrases the discussion in a 1979 paper by historian Lawrence Stone of the use of quantitative methods in historical research and attempts to make it a “science.”

Stone’s criticism of “cliometricians,” whose “special field is economic history,” reads like a description of the work of many “quants” (in Wall Street, academia, or government) in the nearly thirty-five years since he issued his warning: “[Their] great enterprises are necessarily the result of team-work, rather like building the pyramids: squads of diligent assistants assemble data, encode it, programme it, and pass it through the maw of the computer, all under the autocratic direction of a team-leader. The results cannot be tested by any of the traditional methods since the evidence is buried in private computer-tapes, not exposed in published footnotes. In any case the data are often expressed in so mathematically recondite a form that they are unintelligible to the majority of the historical profession. The only reassurance to the bemused laity is that the members of this priestly order disagree fiercely and publicly about the validity of each other’s findings.”

Anticipating today’s doubts about the effectiveness of big data and concerns about the ratio of signal to noise, Stone concludes “in general, the sophistication of the methodology has tended to exceed the reliability of the data, while the usefulness of the results seem, up to a point, to be in inverse correlation to the mathematical complexity of the methodology and the grandiose scale of data-collection.” (A recent enthusiastic embrace of the application of data science to the humanities has already drawn a rebuttal.)

As Tilly hinted in the title to his paper, the new on many occasions is a very familiar old. Just scratch the surface and you find that the “revolution”—a word which we now tend to use liberally to describe any technological development—nicely delivers us to some place in the past while providing a soothing sense of moving forward. Indeed, the first sense of the word “revolution” in the OED is “The action or fact, on the part of celestial bodies, of moving around in an orbit or circular course” or simply “The return or recurrence of a point or period of time.”

Another word added to the OED online in the recent update affirms the notion that (almost) everything old is new again. While “crowdsourcing” was coined by Jeff Howe in 2006, this “new” (revolutionary?) practice launched the OED a century and a half ago:

In July 1857 a circular was issued by the ‘Unregistered Words Committee’ of the Philological Society of London, which had set up the Committee a few weeks earlier to organize the collection of material to supplement the best existing dictionaries. This circular, which was reprinted in various journals, asked for volunteers to undertake to read particular books and copy out quotations illustrating ‘unregistered’ words and meanings—items not recorded in other dictionaries—that could be included in the proposed supplement. Several dozen volunteers came forward, and the quotations began to pour in.

The volume of the “unregistered” material was such that in January 1858, the Philological Society decided that “efforts should be directed toward the compilation of a complete dictionary, and one of unprecedented comprehensiveness.” It took a while, but in April 1879, the newly appointed editor James Murray issued an appeal to the public, asking for volunteers to read specific books in search of quotations to be included in the future dictionary. Within a year there were close to 800 volunteers, and over the next three years, 3,500,000 quotation slips were received and processed by the OED team.

Was this the first big-data-crowdsourcing project?

Posted in Big Data, Data Science

What’s the Big Data? 12 Definitions

Posted on September 8, 2014 by GilPress

Last week I got an email from UC Berkeley’s Master of Information and Data Science program, asking me to respond to a survey of data science thought leaders on the question “What is big data?” I was especially delighted to be regarded as a “thought leader” by Berkeley’s School of Information, whose previous dean, Hal Varian (now chief economist at Google), answered my challenge fourteen years ago and produced the first study to estimate the amount of new information created in the world annually, a study I consider to be a major milestone in the evolution of our understanding of big data.

The Berkeley researchers estimated that the world had produced about 1.5 billion gigabytes of information in 1999 and, in a 2003 replication of the study, found that the amount had doubled in three years. Data was already getting bigger and bigger, and around that time, in 2001, industry analyst Doug Laney described the “3Vs” (volume, variety, and velocity) as the key “data management challenges” for enterprises, the same “3Vs” that have been used in the last four years by just about anyone attempting to define or describe big data.
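To put the doubling in perspective, here is a minimal sketch that converts “doubled in three years” into an equivalent annual growth rate. It assumes steady exponential growth between the two measurements, which is my simplification and not a claim made by the Berkeley studies; the variable names are also mine.

```python
# Figures from the post; steady exponential growth is an illustrative assumption.
base_gigabytes = 1.5e9   # new information produced in 1999, per the Berkeley estimate
doubling_years = 3       # the replication found the amount had doubled in three years

# Annual growth rate that would double a quantity over the doubling period.
annual_growth = 2 ** (1 / doubling_years) - 1

# Volume three years later under that assumed rate (should equal twice the base).
after_three_years = base_gigabytes * (1 + annual_growth) ** doubling_years

print(f"Equivalent annual growth: {annual_growth:.1%}")
print(f"Volume after {doubling_years} years: {after_three_years:.2e} gigabytes")
```

Under this assumption, a three-year doubling corresponds to growth of roughly a quarter per year.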

The first documented use of the term “big data” appeared in a 1997 paper by scientists at NASA, describing the problem they had with visualization (i.e. computer graphics) which “provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”

In 2008, a number of prominent American computer scientists popularized the term, predicting that “big-data computing” will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations.” The term “big-data computing,” however, is never defined in the paper.

The traditional database of authoritative definitions is, of course, the Oxford English Dictionary (OED). Here’s how the OED defines big data: (definition #1) “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

But this is 2014 and maybe the first place to look for definitions should be Wikipedia. Indeed, it looks like the OED followed its lead. Wikipedia defines big data (and it did it before the OED) as (#2) “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”

While a variation of this definition is what is used by most commentators on big data, its similarity to the 1997 definition by the NASA researchers reveals its weakness. “Large” and “traditional” are relative and ambiguous (and potentially self-serving for IT vendors selling either “more resources” of the “traditional” variety or new, non-“traditional” technologies).

The widely-quoted 2011 big data study by McKinsey highlighted that definitional challenge. Defining big data as (#3) “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze,” the McKinsey researchers acknowledged that “this definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data.” As a result, all the quantitative insights of the study, including the updating of the UC Berkeley numbers by estimating how much new data is stored by enterprises and consumers annually, relate to digital data, rather than just big data, e.g., no attempt was made to estimate how much of the data (or “datasets”) enterprises store is big data.

Another prominent source on big data is Viktor Mayer-Schönberger and Kenneth Cukier’s book on the subject. Noting that “there is no rigorous definition of big data,” they offer one that points to what can be done with the data and why its size matters:

(#4) “The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”

In Big Data@Work, Tom Davenport concludes that because of “the problems with the definition” of big data, “I (and other experts I have consulted) predict a relatively short life span for this unfortunate term.” Still, Davenport offers this definition:

(#5) “The broad range of new and massive data types that have appeared over the last decade or so.”

Let me offer a few other possible definitions:

(#6) The new tools helping us find relevant data and analyze its implications.

(#7) The convergence of enterprise and consumer IT.

(#8) The shift (for enterprises) from processing internal data to mining external data.

(#9) The shift (for individuals) from consuming data to creating data.

(#10) The merger of Madame Olympe Maxime and Lieutenant Commander Data.

(#11) The belief that the more data you have the more insights and answers will rise automatically from the pool of ones and zeros.

(#12) A new attitude by businesses, non-profits, government agencies, and individuals that combining data from multiple sources could lead to better decisions.

I like the last two. #11 is a warning against blindly collecting more data for the sake of collecting more data (see NSA). #12 is an acknowledgment that storing data in “data silos” has been the key obstacle to getting the data to work for us, to improve our work and lives. It’s all about attitude, not technologies or quantities.

What’s your definition of big data?

See here for the compilation of Big data definitions from 40+ thought leaders.

[Originally published on Forbes.com]

Posted in Big Data

Gartner on Big Data

Posted on August 5, 2012 by GilPress

In its just-published Hype Cycle for Cloud Computing 2012, Gartner predicts that “Big Data will deliver transformational benefits to enterprises within 2 to 5 years, and by 2015 will enable enterprises adopting this technology to outperform competitors by 20% in every available financial metric.” The “transformational benefits,” however, will be delivered to very few enterprises according to another Gartner prediction, from December 2011: “Through 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.”

Gartner currently positions Big Data just below “the peak of inflated expectations.”

Posted in Big Data

Big Data Quotes of the Week

Posted on November 24, 2012 by GilPress

“Data is everywhere. It exists. We’re just pulling it into one place and our goal is to make it consumable for teachers”–Fahad Hassan, Always Prepped

“A lot of people are changing their title, but they’re not really data scientists, and there’s a lot of talk about the skills shortage. There just aren’t enough of them”–Amit Bendov, SiSense

“Engineering, I think you can pick up. [A data scientist’s] curiosity is built-in”–Scott Nicholson, Accretive Health

“The thought process is the most important ingredient in data science”–Catalin Ciobanu, Carlson Wagonlit Travel.

“We run the company by questions, not by answers. So in the strategy process we’ve so far formulated 30 questions that we have to answer […] You ask it as a question, rather than a pithy answer, and that stimulates conversation. Out of the conversation comes innovation”–Eric Schmidt, Google

“We’re seeing the beginnings of bringing the collaboration models that have been vastly successful in open-source communities to data science… The future looks like this: The entire workflow from data to analysis to result to visualization will be social and collaborative”–Donnie Berkholz

“it’s not hard to imagine a day where [baseball] managers… have their locker room data scientist run real-time, in-game analytics using technologies like Cassandra, Hbase, Drill, and Impala”–Barry Eggers, Lightspeed Venture Partners

“Measuring influence is hard, especially in the context of an online social network. We may not be able to explicitly model the process of persuading others to change their behavior, especially when we do not have all of the necessary data in one place. But it is a crucial test of an influence measure’s realism that it recognize human attention as a scarce commodity, and that it be resistant to manipulation. In any case, influence matters too much for us not to try to measure it. Influence is ultimately about the battle for the scarce space in people’s minds–our most precious natural resource”–Daniel Tunkelang, LinkedIn

Posted in Statistics
