The New Data Scientist Venn Diagram

[Image: the new data scientist Venn diagram]

Stephan Kolassa on StackExchange:

I still think that Hacking Skills, Math & Statistics Knowledge and Substantive Expertise (shortened to “Programming”, “Statistics” and “Business” for legibility) are important… but I think that the role of Communication is important, too. All the insights you derive by leveraging your hacking, stats and business expertise won’t make a bit of a difference unless you can communicate them to people who may not have that unique blend of knowledge. You may need to explain your statistical insights to a business manager who needs to be convinced to spend money or change processes. Or to a programmer who doesn’t think statistically.

So here is the new data science Venn diagram, which also includes communication as one indispensable ingredient.

 on Teradata.com:

Davenport and Patil describe data scientists as curious, self-directed and innovative, i.e., they are not limited by the tools available and when needed fashion their own tools and even conduct academic-style research. Not surprisingly, people with this combination of skills and characteristics are rare, as rare and as much in demand as the computer programmers in the 1990s.

This rarity and high demand for data science skills has meant that statisticians, machine learners, data miners, data analysts, DBAs as well as quantitative analysts, i.e., people with any data or analytics skills have re-badged themselves as data scientists so that they are more marketable. This is not unlike the pre-Y2K hype when computer operators and users of PCs, re-badged themselves as computer programmers.

The term “data scientist” itself has become so diffuse that it represents anybody from data base administrators to analysts doing simplistic summaries on Excel spreadsheet to data engineers setting up Hadoop infrastructure to advanced analytics practitioners who discover valuable insights from data using existing tools as well as those like the data scientists in Google and Facebook who derive insights from data using their own enhanced toolkit.

So, is the name really relevant? Apparently not, since Google’s career pages advertise for Decision Support Analysts, Statisticians, Quantitative Analysts, and Data Scientists and they all mean the same thing. Over the last 50 years, many people have been working as the data scientists described by Davenport and Patil, discovering insights from large volumes of diverse data using existing tools as well as new tools that they fashioned. They have been labelled statisticians, artificial intelligence researchers, data miners, machine learners, advanced analytics experts and the list goes on.

What is relevant is to understand where an individual’s interest lies in the broad data science church and where the needs of the organisation are. The individual’s interest may be developing innovative algorithms to solve a new problem (the high-end data scientist described by Davenport and Patil), or identifying new business problems that can be solved with existing tools or distributed programming for Hadoop. The key is to match the organisation’s needs with an individual’s interest and not be bothered with the position title or the candidate’s label.

Finally, as for finding this rare species, let me point out that the characteristics of curiosity, self-direction and innovation are required in all scientific research. Fashioning tools to overcome a challenge has always been the hallmark of a research scientist. Didn’t Newton invent infinitesimal calculus when the mathematical tools at his disposal were insufficient to calculate the instantaneous speed? Furthermore, scientific research through PhD ensures that they are able to teach themselves new skills.

So, instead of looking to graduates from the newly designed data science majors, develop your own data scientists by first finding a PhD or Masters in a quantitative science such as physics, mathematics, statistics or computer science and then providing them data, time and autonomy. It worked for LinkedIn with Jonathan Goldman and for many other data-driven companies and it can work for you too!!

Posted in Data Science

Design Thinking for Dummies (Data Scientists)

[Embedded SlideShare presentation]

Data scientists often face ambiguous challenges and, as a group, should make use of the design process to address them. These slides briefly make the case for doing so.
Posted in Data Science

Most In-Demand Data Science Skills

[Chart: most in-demand data science skills, 2016]

Source: CrowdFlower, based on “3500 relevant job openings from LinkedIn.”

The folks at CrowdFlower excluded Excel from their list but noted that “that’s still something you see in myriad job listings. Old habits die hard.” Of course, data scientists don’t want to associate the “sexiest job of the 21st century” with old habits. Employers, however, want to cover all bases, sexy or not.

Posted in Data Science, Misc

Data Scientists Spend Most of Their Time Cleaning Data

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data. Still, most are happy with having the sexiest job of the 21st century. The survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower, provider of a “data enrichment” platform for data scientists. Here are the highlights:

Data preparation accounts for about 80% of the work of data scientists

Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.

76% of data scientists view data preparation as the least enjoyable part of their work

57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.
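The two headline figures above are simply sums of the component shares. As a minimal sketch of that arithmetic (the variable names are illustrative, and adding the "least enjoyable" shares assumes each respondent named a single task, which the summary does not state explicitly):

```python
# Shares of working time reported in the survey summary (percent).
time_spent = {
    "cleaning and organizing data": 60,
    "collecting data sets": 19,
}

# Shares of respondents naming each task as the least enjoyable (percent).
# Assumption: each respondent picked exactly one task, so the shares add up.
least_enjoyable = {
    "cleaning and organizing data": 57,
    "collecting data sets": 19,
}

prep_time = sum(time_spent.values())
prep_disliked = sum(least_enjoyable.values())

print(f"Time spent on data preparation: {prep_time}%")
print(f"Least enjoyable part of the job: {prep_disliked}%")
```

The first sum comes to 79%, which the survey rounds to "about 80%"; the second gives the 76% quoted in the heading.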

These findings are yet another confirmation of a very widely known and lamented fact of the data scientist’s work experience. In 2009, data scientist Mike Driscoll popularized the term “data munging,” describing the “painful process of cleaning, parsing, and proofing one’s data” as one of the three sexy skills of data geeks. In 2013, Josh Wills (then director of Data Science at Cloudera, now Director of Data Engineering at Slack) told Technology Review “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” And Big Data Borat tweeted that “Data Science is 99% preparation, 1% misinterpretation.”

Given that the median annual base salary in the U.S. of the hard-to-find and much-in-demand data scientists was $104,000 last year, a number of startups have focused on automating a solution to this essential but boring task. In his 2016 Big Data Landscape, Matt Turck lists a number of them in the “data transformation” box plus companies (such as CrowdFlower) that are addressing this need with crowdsourcing (both in the “infrastructure” section).

Investing in solutions to messy data will continue and IDC has predicted that through 2020, spending on self-service visual discovery and data preparation tools will grow 2.5x faster than traditional IT-controlled tools for similar functionality. Following the same trend, Forrester predicted that in 2016, machine learning will begin to replace manual “data wrangling” (another endearing term like “data munging”) and data governance dirty work, and that vendors will market these solutions as a way to make data ingestion, preparation, and discovery quicker.

Indeed, 55% of the respondents to the CrowdFlower survey agreed with Forrester, predicting that over the next year machine learning will have (or will continue to have) a significant importance for their companies and their departments.

Other findings:

35% of data scientists gave their job the highest mark possible.

Only 14% of data scientists felt they were being held back by their tools.

What data scientists want most is more support and direction from their management or executive team (27%).

Finally, CrowdFlower looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. Last year they found that the skills most in demand were programming and coding. This year, they looked for more specific data science tools that are mentioned in job postings.

Here are the Top 10 in-demand skills for data scientists:

Skill        % of jobs with skill
SQL          56%
Hadoop       49%
Python       39%
Java         36%
R            32%
Hive         31%
MapReduce    22%
NoSQL        18%
Pig          16%
SAS          16%

 I’m sure it is relatively easy for employers to test prospective data scientists for their proficiency in any of the above tools and data platforms. But how do they test for their efficiency in removing commas?

Originally published on Forbes.com

Posted in Data Science

Mingsheng Hong: The Data Scientist is the New Product Manager

Boston’s new data science-related meetup, The Data Scientist, got off to a great start yesterday with a presentation titled “The Scientist, The Team and The Purpose,” entertainingly delivered by Mingsheng Hong, Chief Data Scientist at Hadapt.  

Posted in Big Data, Data Science

The Big Data Interview: Sanjay Mirchandani, CIO, EMC

If data sits on a desk somewhere and is not being used, it’s an opportunity wasted

Sanjay Mirchandani believes IT has to take the lead in adding value to the business in the form of big data “addictive analytics.” Mirchandani is Chief Information Officer and COO, Global Centers of Excellence, at EMC Corporation. He has been recognized as one of Computerworld’s Premier 100 IT Leaders and Boston Business Journal’s CIOs of the Year. The following is an edited transcript of our recent phone conversation.

What would you say to a CIO who dismisses big data as just another buzzword?

I would say that for too long we have been trying to manage down information. The IT world that we have become comfortable with for many years was mostly within the enterprise, maybe connecting to some partners and customers. It was also mostly structured, basically revolving around transactional data. Today, the volume, variety, velocity and complexity of information have changed the IT landscape. These are the four things I challenge CIOs to really think about. We all know how to do structured information. But the moment you throw in unstructured and semi-structured information, life changes. This is where the value is for organizations today.

Does this also change the relationships between IT and the business?

Only IT has a complete picture of all the data in the enterprise. At the same time, IT today cannot have a monopoly on information. That changes the role and responsibilities of IT and the business. We in IT want to deliver more as a service and the business wants to consume more as a service. And IT and the business increasingly share tools and capabilities. For example, I can offer a tool like Greenplum Chorus, which is a community-based BI-data warehousing-analytics tool, where data scientists in IT work collaboratively with data scientists sitting in the business. If there’s something we can do better, we’ll take it on ourselves; if there’s something they can do better, like creating their own wrappers around the analytics, they will do it. What’s clear is that IT and the business have never been better aligned.

Posted in Big Data, Data Science

A Very Short History of Data Science

Source: http://compsocsci.blogspot.com/

I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments.  I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).

[An updated version of this timeline is at Forbes.com]

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing, which defines data as “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.” The Preface to the book tells the reader that a course plan was presented at the IFIP Congress in 1968, titled “Datalogy, the science of data and of data processes and its place in education,” and that in the text of the book, “the term ‘data science’ has been used freely.” Naur offers the following definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

Posted in Big Data, Data Science

The Data Scientist Will Be Replaced By Tools

We just started to use the term “data scientist” and the demise of this new profession is already predicted? Well, at least it’s not one more “rise of the machines” prophecy; it’s the provocative title of a proposed panel for the upcoming SXSW.

The organizer of the panel, Scott Hendrickson of Gnip, has provided a useful run-down of some of the arguments for and against the possible disappearance of data scientists. Supporting the proposition are the current scarcity of data science talent and a slew of startups providing “data science as a service.” As an example of the opposition to the “democratization of algorithms,” Hendrickson quotes Cathy (Mathbabe) O’Neil who wrote recently that “if your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is.” In other words, machines will never have the deep understanding of the tools of data science that is required to practice data science.

Posted in Data Science

The Data Science Interview: Mingsheng Hong, Hadapt

Data scientists are data junkies: when they see a new data set they are just naturally excited and can’t wait to explore.

Mingsheng Hong is Chief Data Scientist at Hadapt, a Boston-based startup that offers an analytical platform that integrates structured and unstructured data in one cloud-optimized system. Before joining Hadapt, Mingsheng was Field CTO for Vertica. He holds a Ph.D. in Computer Science from Cornell and a BSc in Computer Science from Fudan University. Mingsheng is president of NECINA and is active in St. Baldrick’s Foundation, a volunteer-driven charity that funds research to find cures for childhood cancers. I talked to Mingsheng just before he shaved his head, a visual indicator and act of solidarity expected from successful St. Baldrick’s fundraisers.

As a graduate student, were you thinking of an academic career?

At Cornell, I explored both academic and private industry career tracks. I love research and innovation, and discovered my passion for explaining ideas to people from various backgrounds and getting them excited about these ideas. While that aligns with a more academic track, in the end I decided the private sector was a better fit for me. I’m driven by the challenge of taking an idea and carrying it end-to-end, from idea to product development to sales. During graduate school, I had the opportunity to visit Microsoft for a few summers, and I got a lot of exposure to database R&D and came away with a good feel for the industry. My research work there was commercialized in SQL Server 2008 and 2012, which was very exciting.

Posted in Data Science

6 Highlights of a New Survey on Big Data Analytics

A new survey of 316 executives from large global companies, conducted by Forbes Insights and sponsored by Teradata in partnership with McKinsey, provides a fresh look at the state of big data analytics implementations. Here are the highlights.

The hype gone, big data is alive and doing well

About 90% of organizations report medium to high levels of investment in big data analytics, and about a third call their investments “very significant.” Most important, about two-thirds of respondents report that big data and analytics initiatives have had a significant, measurable impact on revenues.

59% of the executives surveyed consider big data and analytics either a top five issue or the single most important way to achieve a competitive advantage. This attitude is slightly more prevalent in financial services and much more prevalent in Asia-Pacific, where 41% of executives (compared to the survey average of 21%) consider big data and analytics the single most important way for companies to gain a competitive advantage.

Figure 4

The right organizational culture is key to big data success

No matter how many times you say “data-driven,” decisions are still not based on data. Sound familiar? 51% of executives said that adapting and refining a data-driven strategy is the single biggest cultural barrier and 47% reported putting big data learning into action as an operational challenge. 43% cited fostering a culture that rewards use of data and valuing creativity and experimentation with data as key challenges.

Companies that don’t get the data-driven culture right tend to fall behind their peers. 47% of executives surveyed do not think that their companies’ big data and analytics capabilities are above par or best of breed. And the survey found that the more the respondents know about big data and analytics, the less likely they are to judge the organization as above average or best of breed. For example, among data scientists, only 8% call their organizations best of breed and 10% think they are above average.

Big data is top of mind when the CEO loves data

If you take big data analytics seriously, you get results. 51% of organizations where big data is viewed as the single most important way to gain competitive advantage are led by CEOs who personally focus on big data initiatives. In organizations where big data is viewed as a top-five issue that gets significant time and attention from top leadership, the sponsor is typically one level below top leadership. Finally, companies that have established data and analytics positions at the CxO level are more likely to have above average data analytics capabilities.

Figure 5

Going from the right attitude to the right action is a long big data journey

Even if you have top leadership sponsorship and the right culture, getting data to drive action and strategy is a challenge.  48% of executives surveyed regard making fact-based business decisions based on data as a key strategic challenge, and 43% cite developing a corporate strategy as a significant hurdle. Other obstacles to realizing the benefits of big data analytics are focusing resources to get the most insights from data (43%) and viewing data as a valuable asset (41%).

Figure 2

There’s gold in them thar brontobyte data mountains

The survey found that big data is driving opportunities for innovation in three key areas: creating new business models (54%); discovering new product offers (52%); and monetizing data to external companies (40%). To pursue these opportunities, companies that are gaining the most traction are looking beyond transactional data and exploring a wide variety of data types.

The most cited was location data (used to identify an electronic device’s physical location), collected by over half of the respondents, followed by text data (unstructured data like email messages, slides, Word documents, and instant messages). Social media is tracked and its unstructured data collected by 43% of companies surveyed, and about a third find golden nuggets in images, weblogs, videos, sensor data and speech files.

Big data miners still very much wanted

Realizing the business and innovation opportunities hidden in the mountains of data requires the right set of skills and experiences.  46% of the executives surveyed, however, reported that hiring the talent that can recognize innovations in data is a challenge.

Originally published on Forbes.com

Posted in Big Data, Data Science