What makes a good data scientist? And if you are a good data scientist, how much should you expect to get paid?
Owen Zhang, ranked #1 on Kaggle, the online stadium for data science competitions, lists his skills on his Kaggle profile as “excessive effort,” “luck,” and “other people’s code.” An engineer by training, Zhang says in this ODSC interview that data science is finding “practical solutions to not very well-defined problems,” similar to engineering. He believes that good data scientists, “otherwise known as unicorn data scientists,” have three types of expertise. Since data science deals with practical problems, the first one is being familiar with a specific domain and knowing how to solve a problem in that domain. The second is the ability to distinguish signal from noise, or understanding statistics. The third skill is software engineering.
Lacking formal education in statistics or software engineering, Zhang explains, he acquired his data science skills by competing on Kaggle and learning from its community. No doubt being very good at learning on your own is a required skill, to say nothing of hanging out with the right people, preferably unicorn data scientists. Galit Shmueli, Professor of Business Analytics at NTHU, told RJMetrics that her one piece of advice for data scientists just getting started is to “attend a conference or two, see what people are working on, what are the challenges, and what’s the atmosphere.”
Recent data shows that unicorn data scientists can make more than $240,000 annually. This is according to the 2015 Data Science Salary Survey, in which O’Reilly Media’s John King and Roger Magoulas report the results of a survey of 600 “data practitioners” (reflecting the recency of the term, only one-quarter of the respondents have job titles that explicitly identify them as “data scientists”).
The median annual base salary of the survey sample is $91,000; among U.S. respondents it is $104,000, similar to last year’s results. In addition, 23% said that it would be “very easy” for them to find another position.
Keep in mind that “23% of the sample hold a doctorate degree,” and an additional 44% hold a master’s. The word “sample” here means, as it does in almost all other surveys today, “the people who wanted to answer our survey.” But unlike other survey report authors, King and Magoulas make sure to issue this warning: “We should be careful when making conclusions about survey data from a self-selecting sample—it is a major assumption to claim it is an unbiased representation of all data scientists and engineers… the O’Reilly audience tends to use more newer, open source tools, and underrepresents non-tech industries such as insurance and energy.”
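To see why that warning matters, here is a minimal sketch of how self-selection can skew a median. Everything in it is an assumption made for illustration: the salaries are synthetic draws from a log-normal distribution, the rule that higher earners are more likely to respond is hypothetical, and the function name `simulate_self_selection` is invented here. None of it comes from the O’Reilly survey data.

```python
import random
import statistics


def simulate_self_selection(n=100_000, seed=0):
    """Compare the median of a synthetic population with the median of
    the subset that 'chooses' to respond. All numbers are made up."""
    rng = random.Random(seed)

    # Hypothetical population of salaries (log-normal, NOT survey data).
    salaries = [rng.lognormvariate(11.3, 0.4) for _ in range(n)]

    # Assumed response rule: the chance of answering the survey grows
    # in proportion to salary, capped at 1.
    respondents = [
        s for s in salaries
        if rng.random() < min(1.0, s / 400_000)
    ]

    return statistics.median(salaries), statistics.median(respondents)


if __name__ == "__main__":
    population_median, respondent_median = simulate_self_selection()
    print(f"population median: {population_median:,.0f}")
    print(f"respondent median: {respondent_median:,.0f}")
```

Under this assumed response rule the respondents' median comes out higher than the population's, even though nobody misreported anything. A real survey's bias could run in either direction, which is exactly why the authors hedge.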
Still, we can learn quite a lot about the background and skills required for admission into this well-paid group of data masters. Two-thirds of respondents had academic backgrounds in computer science, mathematics, statistics, or physics.
Beyond the initial training, it is important to keep abreast of the ever-changing landscape of data science tools: “It seems likely that in the long run knowing the highest paying tools will increase your chances of joining the ranks of the highest paid,” say King and Magoulas. And the most recent additions to the data science tool pantheon provide the greatest boost to salaries: “…learning Spark could apparently have more of an impact on salary than getting a PhD. Scala is another bonus: those who use both are expected to earn over $15,000 more than an otherwise equivalent data professional.”
The bad news is that the more time spent in meetings (even for non-managers), the more money a data scientist makes. Another widely discussed unpleasant part of the job—data cleaning—is the #2 task on which data scientists spend the most time, with 39% of survey participants spending at least one hour per day on this task. The good news is that exploratory data analysis is what occupies them most, with 46% spending one to three hours per day on this task and 12% spending four hours or more.
More data on the skills employed by practicing data scientists comes from an AnalyticsWeek survey of 410 data professionals. In Optimizing Your Data Science Team, Bob E. Hayes reports that respondents were asked to indicate their level of proficiency for 25 different skills. “Solving problems with data,” says Hayes, “requires expertise across different skill areas: 1) Business, 2) Technology, 3) Programming, 4) Math & Modeling and 5) Statistics. Proficiency in each skill area is related to job role.”
A single data scientist may not have all of these skills, but it’s possible to assemble them by putting together a top-notch data science team. In “Tips for building a data science capability” from consulting firm Booz Allen Hamilton, we learn that “rather than illuminate a single data science rock star, it is important to highlight a diversity of talent at all levels to help others self-identify with the capability. It is also a more realistic version of the truth. Very rarely will you find ‘magical unicorns’ that embody the full breadth of math and computer science skills along with the requisite domain knowledge. More often, you will build diverse teams that when combined provide you with the ‘triple-threat’ (computer science, math/statistics, and domain expertise) model needed for the toughest data science problems.”
The concept of a data science team, combining various skills and educational backgrounds, is high on the agenda of the 175-year-old American Statistical Association (ASA), which is probably looking in dismay at the oodles of funds going to new data science programs and research centers at American universities, to say nothing of the gap between the salaries of data scientists and those of statisticians.
The ASA issued a “policy statement” on October 1, reminding the world that statistics is one of the three disciplines “foundational to data science” (the other two being database management and distributed and parallel systems, providing a “computational infrastructure”). The statement concludes with “The next generation [of statisticians] must include more researchers with skills that cross the traditional boundaries of statistics, databases and distributed systems; there will be an ever-increasing demand for such ‘multi-lingual’ experts.”
In other words, if you aspire to a $200,000+ salary, better call yourself a data scientist and start coding.