The Data Scientist Will Be Replaced By Tools

We just started to use the term “data scientist” and the demise of this new profession is already predicted? Well, at least it’s not one more “rise of the machines” prophecy; it’s the provocative title of a proposed panel for the upcoming SXSW.

The organizer of the panel, Scott Hendrickson of Gnip, has provided a useful run-down of some of the arguments for and against the possible disappearance of data scientists. Supporting the proposition are the current scarcity of data science talent and a slew of startups providing “data science as a service.” As an example of the opposition to the “democratization of algorithms,” Hendrickson quotes Cathy (Mathbabe) O’Neil, who wrote recently that “if your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is.” In other words, machines will never have the deep understanding of the tools of data science that is required to practice data science.

Posted in Data Science

The Data Science Interview: Mingsheng Hong, Hadapt

Data scientists are data junkies: when they see a new data set they are just naturally excited and can’t wait to explore.

Mingsheng Hong is Chief Data Scientist at Hadapt, a Boston-based startup that offers an analytical platform that integrates structured and unstructured data in one cloud-optimized system. Before joining Hadapt, Mingsheng was Field CTO for Vertica. He holds a Ph.D. in Computer Science from Cornell and a BSc in Computer Science from Fudan University. Mingsheng is president of NECINA and is active in St. Baldrick’s Foundation, a volunteer-driven charity that funds research to find cures for childhood cancers. I talked to Mingsheng just before he shaved his head, a visual indicator and act of solidarity expected from successful St. Baldrick’s fundraisers.

As a graduate student, were you thinking of an academic career?

At Cornell, I explored both academic and private industry career tracks. I love research and innovation, and discovered my passion for explaining ideas to people from various backgrounds and getting them excited about these ideas. While that aligns with a more academic track, in the end I decided the private sector was a better fit for me. I’m driven by the challenge of taking an idea and carrying it end-to-end, from idea to product development to sales. During graduate school, I had the opportunity to visit Microsoft for a few summers, and I got a lot of exposure to database R&D and came away with a good feel for the industry. My research work there was commercialized in SQL Server 2008 and 2012, which was very exciting.

Posted in Data Science

The Big Data Interview: Sanjay Mirchandani, CIO, EMC

If data sits on a desk somewhere and is not being used, it’s an opportunity wasted

Sanjay Mirchandani believes IT has to take the lead in adding value to the business in the form of big data “addictive analytics.” Mirchandani is Chief Information Officer and COO, Global Centers of Excellence, at EMC Corporation. He has been recognized as one of Computerworld’s Premier 100 IT Leaders and Boston Business Journal’s CIOs of the Year. The following is an edited transcript of our recent phone conversation.

What would you say to a CIO who dismisses big data as just another buzzword?

I would say that for too long we have been trying to manage down information. The IT world that we have become comfortable with for many years was mostly within the enterprise, maybe connecting to some partners and customers. It was also mostly structured, basically revolving around transactional data. Today, the volume, variety, velocity and complexity of information have changed the IT landscape. These are the four things I challenge CIOs to really think about. We all know how to do structured information. But the moment you throw in unstructured and semi-structured information, life changes. This is where the value is for organizations today.

Does this also change the relationships between IT and the business?

Only IT has a complete picture of all the data in the enterprise. At the same time, IT today cannot have a monopoly on information. That changes the role and responsibilities of IT and the business. We in IT want to deliver more as a service and the business wants to consume more as a service. And IT and the business increasingly share tools and capabilities. For example, I can offer a tool like Greenplum Chorus, which is a community-based BI-data warehousing-analytics tool, where data scientists in IT work collaboratively with data scientists sitting in the business. If there’s something we can do better, we’ll take it on ourselves; if there’s something they can do better, like creating their own wrappers around the analytics, they will do it. What’s clear is that IT and the business have never been better aligned.

Posted in Big Data, Data Science

Mingsheng Hong: The Data Scientist is the New Product Manager

Boston’s new data science-related meetup, The Data Scientist, got off to a great start yesterday with a presentation titled “The Scientist, The Team and The Purpose,” entertainingly delivered by Mingsheng Hong, Chief Data Scientist at Hadapt.  

Posted in Big Data, Data Science

On Data Janitors, Engineers, and Statistics

Big Data Borat tweeted recently that “Data Science is 99% preparation, 1% misinterpretation.” Commenting on the 99% part, Cloudera’s Josh Wills says: “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” Kaggle, the data-science-as-sport startup, takes care of the “1% misinterpretation” part by providing a matchmaking service between the sexiest of the sexy data janitors and the organizations requiring their hard-to-find skills. It charges $300 per hour for the service, of which $200 go to the data janitor (at least in the case of Shashi Godbole, quoted in the Technology Review article). Kaggle justifies its mark-up by delivering “the best 0.5% of the 95,988 data scientists who compete in data mining competitions,” the top of its data science league table, the ranking of data scientists based on their performance in Kaggle’s competitions, presumably representing sound interpretation and top-notch productivity.

Kaggle’s co-founder Anthony Goldbloom tells The Atlantic’s Thomas Goetz that the ranking also represents a solution to a “market failure” in assessing the skills and relevant experience of the new breed of data scientists: “Kaggle represents a new sort of labor market, one where skills have been bifurcated from credentials.” Others see this as the creation of a new, $300 per hour, guild. In “Data Scientists Don’t Scale,” ZDnet’s Andrew Brust says that “’Data scientist’ is a title designed to be exclusive, standoffish and protective of a lucrative guild… The solution… isn’t legions of new data scientists. Instead, we need self-service tools that empower smart and tenacious business people to perform Big Data analysis themselves.”

Posted in Data Science, Statistics

The Data Science Interview: Yun Xiong, Fudan University

The Goal of Data Science is to Study the Phenomena and Laws of Datanature

Yun Xiong is an Associate Professor of Computer Science and the Associate Director of the Center for Data Science and Dataology at Fudan University, Shanghai, China. She received her Ph.D. in Computer and Software Theory from Fudan University in 2008. Her research interests include dataology and data science, data mining, big data analysis, developing effective and efficient data analysis techniques for various applications including finance, economics, insurance, bioinformatics, and sociology. The following is an edited version of our recent email exchange.

How has data science developed in China?

Posted in Data Science

6 Highlights of a New Survey on Big Data Analytics

A new survey of 316 executives from large global companies, conducted by Forbes Insights and sponsored by Teradata in partnership with McKinsey, provides a fresh look at the state of big data analytics implementations. Here are the highlights.

The hype gone, big data is alive and doing well

About 90% of organizations report medium to high levels of investment in big data analytics, and about a third call their investments “very significant.” Most important, about two-thirds of respondents report that big data and analytics initiatives have had a significant, measurable impact on revenues.

59% of the executives surveyed consider big data and analytics either a top five issue or the single most important way to achieve a competitive advantage. This attitude is slightly more prevalent in financial services and much more prevalent in Asia-Pacific, where 41% of executives (compared to the survey average of 21%) consider big data and analytics the single most important way for companies to gain a competitive advantage.

Figure 4

The right organizational culture is key to big data success

No matter how many times you say “data-driven,” decisions are still not based on data. Sound familiar? 51% of executives said that adapting and refining a data-driven strategy is the single biggest cultural barrier and 47% reported putting big data learning into action as an operational challenge. 43% cited fostering a culture that rewards use of data and valuing creativity and experimentation with data as key challenges.

Companies that don’t get the data-driven culture right tend to fall behind their peers. 47% of executives surveyed do not think that their companies’ big data and analytics capabilities are above par or best of breed. And the survey found that the more the respondents know about big data and analytics, the less likely they are to judge the organization as above average or best of breed. For example, among data scientists, only 8% call their organizations best of breed and 10% think they are above average.

Big data is top of mind when the CEO loves data

If you take big data analytics seriously, you get results. 51% of organizations where big data is viewed as the single most important way to gain competitive advantage are led by CEOs who personally focus on big data initiatives. In organizations where big data is viewed as a top-five issue that gets significant time and attention from top leadership, the sponsor is typically one level below top leadership. Finally, companies that have established data and analytics positions at the CxO level are more likely to have above average data analytics capabilities.

Figure 5

Going from the right attitude to the right action is a long big data journey

Even if you have top leadership sponsorship and the right culture, getting data to drive action and strategy is a challenge.  48% of executives surveyed regard making fact-based business decisions based on data as a key strategic challenge, and 43% cite developing a corporate strategy as a significant hurdle. Other obstacles to realizing the benefits of big data analytics are focusing resources to get the most insights from data (43%) and viewing data as a valuable asset (41%).

Figure 2

There’s gold in them thar brontobyte data mountains

The survey found that big data is driving opportunities for innovation in three key areas: creating new business models (54%); discovering new product offers (52%); and monetizing data to external companies (40%). To pursue these opportunities, companies that are gaining the most traction are looking beyond transactional data and exploring a wide variety of data types.

The most-cited was location data (used to identify an electronic device’s physical location), collected by over half of the respondents, followed by text data (unstructured data like email messages, slides, Word documents, and instant messages). Social media is tracked and its unstructured data collected by 43% of companies surveyed, and about a third find golden nuggets in images, weblogs, videos, sensor data and speech files.

Big data miners still very much wanted

Realizing the business and innovation opportunities hidden in the mountains of data requires the right set of skills and experiences.  46% of the executives surveyed, however, reported that hiring the talent that can recognize innovations in data is a challenge.

Originally published on Forbes.com

Posted in Big Data, Data Science

The Data Science Interview: Mok Oh, PayPal

To Do Data Science, You Need a Team of Specialists

Currently the Chief Scientist at PayPal, Mok Oh came on board when eBay acquired WHERE, where he was Chief Innovation Officer.  Prior to WHERE, Mok founded EveryScape, a data visualization company.  The following is an edited transcript of our recent phone conversation.

How do you define a data scientist?

Posted in Data Science

Data Scientists Spend Most of Their Time Cleaning Data

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data. Still, most are happy with having the sexiest job of the 21st century. The survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower, provider of a “data enrichment” platform for data scientists. Here are the highlights:

Data preparation accounts for about 80% of the work of data scientists

Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.

76% of data scientists view data preparation as the least enjoyable part of their work

57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.

These findings are yet another confirmation of a very widely known and lamented fact of the data scientist’s work experience. In 2009, data scientist Mike Driscoll popularized the term “data munging,” describing the “painful process of cleaning, parsing, and proofing one’s data” as one of the three sexy skills of data geeks. In 2013, Josh Wills (then director of Data Science at Cloudera, now Director of Data Engineering at Slack) told Technology Review “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” And Big Data Borat tweeted that “Data Science is 99% preparation, 1% misinterpretation.”
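To make the “cleaning, parsing, and proofing” chore concrete, here is a minimal Python sketch of one typical step: turning messy numeric strings into usable numbers. The function name `clean_number` and the sample values are hypothetical illustrations, not anything taken from the CrowdFlower survey or its tooling, and real pipelines would need rules specific to their own data.

```python
import re


def clean_number(raw):
    """Parse a messy numeric string such as ' $104,000 ' into a float.

    Returns None when nothing numeric can be recovered.
    (Hypothetical helper, written only for illustration.)
    """
    if raw is None:
        return None
    # Drop whitespace, dollar signs, and thousands-separator commas.
    stripped = re.sub(r"[,\s$]", "", str(raw))
    # Treat a trailing percent sign as a unit marker and discard it.
    if stripped.endswith("%"):
        stripped = stripped[:-1]
    try:
        return float(stripped)
    except ValueError:
        # Anything that still is not a number is flagged as missing.
        return None


# Made-up sample of the kind of values found in a raw export.
rows = [" $104,000 ", "56%", "n/a", "", "3,999.5"]
cleaned = [clean_number(r) for r in rows]
print(cleaned)
```

Even a sketch this small has to make judgment calls (is “n/a” missing or zero? is a comma a thousands separator or a decimal mark?), which hints at why the work is hard to automate away.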

Given that the median annual base salary in the U.S. of the hard-to-find and much-in-demand data scientists was $104,000 last year, a number of startups have focused on automating a solution to this essential but boring task. In his 2016 Big Data Landscape, Matt Turck lists a number of them in the “data transformation” box plus companies (such as CrowdFlower) that are addressing this need with crowdsourcing (both in the “infrastructure” section).

Investing in solutions to messy data will continue and IDC has predicted that through 2020, spending on self-service visual discovery and data preparation tools will grow 2.5x faster than traditional IT-controlled tools for similar functionality. Following the same trend, Forrester predicted that in 2016, machine learning will begin to replace manual “data wrangling” (another endearing term like “data munging”) and data governance dirty work, and that vendors will market these solutions as a way to make data ingestion, preparation, and discovery quicker.

Indeed, 55% of the respondents to the CrowdFlower survey agreed with Forrester, predicting that over the next year machine learning will have (or will continue to have) significant importance for their companies and their departments.

Other findings:

35% of data scientists gave their job the highest mark possible.

Only 14% of data scientists felt they were being held back by their tools.

What data scientists want most is more support and direction from their management or executive team (27%).

Finally, CrowdFlower looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. Last year they found that the skills most in demand were programming and coding. This year, they looked for more specific data science tools that are mentioned in job postings.

Here are the Top 10 in-demand skills for data scientists:

Skill        % of jobs with skill
SQL          56%
Hadoop       49%
Python       39%
Java         36%
R            32%
Hive         31%
MapReduce    22%
NoSQL        18%
Pig          16%
SAS          16%

I’m sure it is relatively easy for employers to test prospective data scientists for their proficiency in any of the above tools and data platforms. But how do they test for their efficiency in removing commas?

Originally published on Forbes.com

Posted in Data Science

Imagination and Data Science

Today in 1833, Ada Byron (later Countess Lovelace) met Charles Babbage when visiting his house to see a portion of the Difference Engine, or what her mother, Lady Byron, called his “thinking machine.” James Gleick writes in The Information: “Babbage saw a sparkling, self-possessed young woman with porcelain features and a notorious name, who managed to reveal that she knew more mathematics than most men graduating from university. She saw an imposing forty-one-year-old, authoritative eyebrows anchoring his strong-boned face, who possessed wit and charm and did not wear these qualities lightly. He seemed a kind of visionary–just what she was seeking. She admired the machine, too.”

With the Analytical Engine, Babbage imagined the modern computer. Gleick quotes Ada on imagination, from an essay she wrote in 1841: “It is that which penetrates into the unseen worlds around us, the worlds of Science. It is that which feels & discovers what is, the real which we see not, which exists not for our senses. Those who have learned to walk the threshold of the unknown worlds… may then with the fair white wings of Imagination hope to soar further into the unexplored amidst which we live.”

In this she anticipated Albert Einstein’s much-quoted observation: “Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.”

Note to Data Scientists (or more specifically, those making exaggerated claims about IBM’s Watson or the promise of “data-driven” science): Without our imagination, machines can’t learn.

Posted in Data Science