Imagination and Data Science

Today in 1833, Ada Byron (later Countess Lovelace) met Charles Babbage when visiting his house to see a portion of the Difference Engine, or what her mother, Lady Byron, called his “thinking machine.” James Gleick writes in The Information: “Babbage saw a sparkling, self-possessed young woman with porcelain features and a notorious name, who managed to reveal that she knew more mathematics than most men graduating from university. She saw an imposing forty-one-year-old, authoritative eyebrows anchoring his strong-boned face, who possessed wit and charm and did not wear these qualities lightly. He seemed a kind of visionary–just what she was seeking. She admired the machine, too.”

With the Analytical Engine, Babbage imagined the modern computer. Gleick quotes Ada on imagination, from an essay she wrote in 1841: “It is that which penetrates into the unseen worlds around us, the worlds of Science. It is that which feels & discovers what is, the real which we see not, which exists not for our senses. Those who have learned to walk the threshold of the unknown worlds… may then with the fair white wings of Imagination hope to soar further into the unexplored amidst which we live.”

In this she anticipated Albert Einstein’s much-quoted observation: “Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.”

Note to Data Scientists (or more specifically, those making exaggerated claims about IBM’s Watson or the promise of “data-driven” science): Without our imagination, machines can’t learn.

Posted in Data Science

The Data Science Interview: Mok Oh, PayPal

To Do Data Science, You Need a Team of Specialists

Currently the Chief Scientist at PayPal, Mok Oh came on board when eBay acquired WHERE, where he was Chief Innovation Officer.  Prior to WHERE, Mok founded EveryScape, a data visualization company.  The following is an edited transcript of our recent phone conversation.

How do you define a data scientist?   Continue reading

Posted in Data Science

Data Scientists Spend Most of Their Time Cleaning Data

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data. Still, most are happy with having the sexiest job of the 21st century. The survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower, provider of a “data enrichment” platform for data scientists. Here are the highlights:

Data preparation accounts for about 80% of the work of data scientists

Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.

76% of data scientists view data preparation as the least enjoyable part of their work

57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.

These findings are yet another confirmation of a very widely known and lamented fact of the data scientist’s work experience. In 2009, data scientist Mike Driscoll popularized the term “data munging,” describing the “painful process of cleaning, parsing, and proofing one’s data” as one of the three sexy skills of data geeks. In 2013, Josh Wills (then director of Data Science at Cloudera, now Director of Data Engineering at Slack) told Technology Review “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” And Big Data Borat tweeted that “Data Science is 99% preparation, 1% misinterpretation.”

Given that the median annual base salary in the U.S. of the hard-to-find and much-in-demand data scientists was $104,000 last year, a number of startups have focused on automating a solution to this essential but boring task. In his 2016 Big Data Landscape, Matt Turck lists a number of them in the “data transformation” box plus companies (such as CrowdFlower) that are addressing this need with crowdsourcing (both in the “infrastructure” section).

Investment in solutions to messy data will continue. IDC has predicted that through 2020, spending on self-service visual discovery and data preparation tools will grow 2.5x faster than spending on traditional IT-controlled tools for similar functionality. Following the same trend, Forrester predicted that in 2016, machine learning will begin to replace manual “data wrangling” (another endearing term like “data munging”) and data governance dirty work, and that vendors will market these solutions as a way to make data ingestion, preparation, and discovery quicker.

Indeed, 55% of the respondents to the CrowdFlower survey agreed with Forrester, predicting that over the next year machine learning will be (or will continue to be) of significant importance to their companies and their departments.

Other findings:

35% of data scientists gave their job the highest mark possible.

Only 14% of data scientists felt they were being held back by their tools.

What data scientists want most is more support and direction from their management or executive team (27%).

Finally, CrowdFlower looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. Last year they found that the skills most in demand were programming and coding. This year, they looked for the more specific data science tools mentioned in job postings.

Here are the Top 10 in-demand skills for data scientists:

Skill        % of jobs with skill
SQL          56%
Hadoop       49%
Python       39%
Java         36%
R            32%
Hive         31%
MapReduce    22%
NoSQL        18%
Pig          16%
SAS          16%

 I’m sure it is relatively easy for employers to test prospective data scientists for their proficiency in any of the above tools and data platforms. But how do they test for their efficiency in removing commas?
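For readers curious what "removing commas" can look like in practice, here is a minimal Python sketch. It is not taken from the survey or from any CrowdFlower tool; the function names and the rule for what counts as a thousands separator are illustrative assumptions. The point it tries to show is that commas should be stripped only from values that look like numbers, and left alone in ordinary text.

```python
import re

# Assumption for illustration: a "number with thousands separators" is an
# optional dollar sign, 1 to 3 digits, one or more ",ddd" groups, and an
# optional decimal part. Real data sets may need a different rule.
_THOUSANDS = re.compile(r"^\$?\d{1,3}(?:,\d{3})+(?:\.\d+)?$")


def strip_thousands_commas(value: str) -> str:
    """Remove commas only when the value looks like a formatted number."""
    text = value.strip()
    if _THOUSANDS.match(text):
        return text.replace(",", "")
    # Anything else (for example "Boston, MA") is returned unchanged.
    return text


def clean_salary(value: str) -> float:
    """Hypothetical helper: turn a string such as '$104,000' into a float."""
    return float(strip_thousands_commas(value).lstrip("$"))


if __name__ == "__main__":
    for raw in ["$104,000", "4,000", "Boston, MA"]:
        print(repr(raw), "->", repr(strip_thousands_commas(raw)))
```

The conservative pattern is a deliberate choice: a blanket replacement of every comma would also damage free-text fields, which is the kind of mistake that makes cleaning data take so long.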

Originally published on Forbes.com

Posted in Data Science

6 Highlights of a New Survey on Big Data Analytics

A new survey of 316 executives from large global companies, conducted by Forbes Insights and sponsored by Teradata in partnership with McKinsey, provides a fresh look at the state of big data analytics implementations. Here are the highlights.

The hype gone, big data is alive and doing well

About 90% of organizations report medium to high levels of investment in big data analytics, and about a third call their investments “very significant.” Most important, about two-thirds of respondents report that big data and analytics initiatives have had a significant, measurable impact on revenues.

59% of the executives surveyed consider big data and analytics either a top five issue or the single most important way to achieve a competitive advantage. This attitude is slightly more prevalent in financial services and much more prevalent in Asia-Pacific, where 41% of executives (compared to the survey average of 21%) consider big data and analytics the single most important way for companies to gain a competitive advantage.

Figure 4

The right organizational culture is key to big data success

No matter how many times you say “data-driven,” decisions are still not based on data. Sound familiar? 51% of executives said that adapting and refining a data-driven strategy is the single biggest cultural barrier, and 47% reported putting big data learning into action as an operational challenge. 43% cited fostering a culture that rewards use of data and valuing creativity and experimentation with data as key challenges.

Companies that don’t get the data-driven culture right tend to fall behind their peers. 47% of executives surveyed do not think that their companies’ big data and analytics capabilities are above par or best of breed. And the survey found that the more the respondents know about big data and analytics, the less likely they are to judge the organization as above average or best of breed. For example, among data scientists, only 8% call their organizations best of breed and 10% think they are above average.

Big data is top of mind when the CEO loves data

If you take big data analytics seriously, you get results. 51% of organizations where big data is viewed as the single most important way to gain competitive advantage are led by CEOs who personally focus on big data initiatives. In organizations where big data is viewed as a top-five issue that gets significant time and attention from top leadership, the sponsor is typically one level below top leadership. Finally, companies that have established data and analytics positions at the CxO level are more likely to have above average data analytics capabilities.

Figure 5

Going from the right attitude to the right action is a long big data journey

Even if you have top leadership sponsorship and the right culture, getting data to drive action and strategy is a challenge.  48% of executives surveyed regard making fact-based business decisions based on data as a key strategic challenge, and 43% cite developing a corporate strategy as a significant hurdle. Other obstacles to realizing the benefits of big data analytics are focusing resources to get the most insights from data (43%) and viewing data as a valuable asset (41%).

Figure 2

There’s gold in them thar brontobyte data mountains

The survey found that big data is driving opportunities for innovation in three key areas: creating new business models (54%); discovering new product offers (52%); and monetizing data to external companies (40%). To pursue these opportunities, companies that are gaining the most traction are looking beyond transactional data and exploring a wide variety of data types.

The most-cited was location data (used to identify an electronic device’s physical location), collected by over half of the respondents, followed by text data (unstructured data like email messages, slides, Word documents, and instant messages). Social media is tracked and its unstructured data collected by 43% of companies surveyed, and about a third find golden nuggets in images, weblogs, videos, sensor data and speech files.

Big data miners still very much wanted

Realizing the business and innovation opportunities hidden in the mountains of data requires the right set of skills and experiences.  46% of the executives surveyed, however, reported that hiring the talent that can recognize innovations in data is a challenge.

Originally published on Forbes.com

Posted in Big Data, Data Science

The Data Science Interview: Yun Xiong, Fudan University

The Goal of Data Science is to Study the Phenomena and Laws of Datanature

Yun Xiong is an Associate Professor of Computer Science and the Associate Director of the Center for Data Science and Dataology at Fudan University, Shanghai, China. She received her Ph.D. in Computer and Software Theory from Fudan University in 2008. Her research interests include dataology and data science, data mining, big data analysis, developing effective and efficient data analysis techniques for various applications including finance, economics, insurance, bioinformatics, and sociology. The following is an edited version of our recent email exchange.

How has data science developed in China?    Continue reading

Posted in Data Science

The Data Scientist Will Be Replaced By Tools

We just started to use the term “data scientist” and the demise of this new profession is already predicted? Well, at least it’s not one more “rise of the machines” prophecy; it’s the provocative title of a proposed panel for the upcoming SXSW.

The organizer of the panel, Scott Hendrickson of Gnip, has provided a useful run-down of some of the arguments for and against the possible disappearance of data scientists. Supporting the proposition are the current scarcity of data science talent and a slew of startups providing “data science as a service.” As an example of the opposition to the “democratization of algorithms,” Hendrickson quotes Cathy (Mathbabe) O’Neil who wrote recently that “if your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is.” In other words, machines will never have the deep understanding of the tools of data science that is required to practice data science.   Continue reading

Posted in Data Science

Big Data Analytics and Data Science at Netflix (Video)

Chris Pouliot, the Director of Analytics and Algorithms at Netflix: “…my team does not only personalizations for movies, but we also deal with content demand prediction. Helping our buyer down in Beverly Hills figure out how much do we pay for a piece of content. The personalization recommendations for helping users find good movies and TV shows. Marketing analytics, how do we optimize our marketing spin. Streaming platform, how do we optimize the user experience once I press play. There’s a wide range of data, so there’s a lot of diversity. We have a lot of scale, a lot of challenging problems. The question then is, how do we attract great data scientists that can just see this as a playground, a sandbox of really exciting things. Challenging problems, challenging data, great tools, and then just the ability to have fun and create great products.”

Video: http://www.youtube.com/watch?v=pJd3PKm9XUk

Posted in Big Data, Data Science

Cool Data Scientists on Campus

Geek Chic

Hal Varian:  “Data availability is going to continue to grow. To make that data useful is a challenge. It’s generally going to require human beings to do it.”

Source: Carl Bialik, “Data Crunchers Now the Cool Kids on Campus,” The Wall Street Journal, March 1, 2013

See my list of graduate programs in data science and big data analytics

Posted in Big Data, Data Science, Statistics

5 Origins of Data Science

[Image: DataScience_History]

Source: Impact of Big Data on Analytics

Posted in Data Science, Misc

History of Data Science (Infographic)

[Image: DataScience_History]

Source: Capgemini

Posted in Data Science