Vincent Granville’s 66 job interview questions for data scientists

 

  1. What is the biggest data set that you processed, and how did you process it, what were the results?
  2. Tell me two success stories about your analytic or computer science projects? How was lift (or success) measured?
  3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
  4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
  5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
  6. How would you come up with a solution to identify plagiarism?
  7. How to detect individual paid accounts shared by multiple users?
  8. Should click data be handled in real time? Why? In which contexts?
  9. What is better: good data or good models? And how do you define “good”? Is there a universal good model? Are there any models that are definitely not so good?
  10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?
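For readers unfamiliar with two of the terms in question 4, here is a minimal sketch of character n-grams and cosine distance, the kind of building block one might reach for in question 6 (plagiarism detection). The function names (`char_ngrams`, `cosine_similarity`, `cosine_distance`) and the choice of character trigrams are assumptions made for illustration; they do not come from Granville's list.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    # Counter returns 0 for missing keys, so only shared n-grams contribute.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty inputs
    return dot / (norm_a * norm_b)


def cosine_distance(a: Counter, b: Counter) -> float:
    """Cosine distance, defined as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)
```

Two texts that share most of their trigrams get a distance near 0, while texts with no trigram in common get a distance of 1. A real plagiarism detector would need much more than this (tokenization, paraphrase handling, scale), so treat it as a definition in code rather than a solution.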

To see the other 56 questions assessing “the technical horizontal knowledge of a senior candidate at a high level” go here 

Posted in Data Science | Leave a comment

2013 Data Science Salary Survey: Open source tools correlate with higher salary

“In our report, 2013 Data Science Salary Survey, we make our own data-driven contribution to the conversation. We collected a survey from attendees of the Strata Conference in New York and Santa Clara, California, about tool usage and salary…

What did we find?

In a sentence: those who use data tools make more.

More specifically, the tools that correlate with higher salary are scalable and generally open source; they are often script-based or built for machine learning. Those attendees who tend to use one such tool tend to use others; that is, these tools form a ‘cluster’ in terms of usage among our sample. Perhaps just as interesting is that some of the traditional, popular tools such as Excel and SAS were not used as widely as R and Python. This might be food for thought for those data analysts who have thus far resisted learning how to code or moving beyond query-based data tools.”

Source: 2013 Data Science Salary Survey 

Posted in Big Data, Data Science | Leave a comment

The Data Scientist Will Be Replaced By Tools

We just started to use the term “data scientist” and the demise of this new profession is already predicted? Well, at least it’s not one more “rise of the machines” prophecy; it’s the provocative title of a proposed panel for the upcoming SXSW.

The organizer of the panel, Scott Hendrickson of Gnip, has provided a useful run-down of some of the arguments for and against the possible disappearance of data scientists. Supporting the proposition are the current scarcity of data science talent and a slew of startups providing “data science as a service.” As an example of the opposition to the “democratization of algorithms,” Hendrickson quotes Cathy (Mathbabe) O’Neil who wrote recently that “if your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is.” In other words, machines will never have the deep understanding of the tools of data science that is required to practice data science.   Continue reading

Posted in Data Science | Leave a comment

The Data Science Interview: Mingsheng Hong, Hadapt

Data scientists are data junkies: when they see a new data set they are just naturally excited and can’t wait to explore.

Mingsheng Hong is Chief Data Scientist at Hadapt, a Boston-based startup that offers an analytical platform that integrates structured and unstructured data in one cloud-optimized system. Before joining Hadapt, Mingsheng was Field CTO for Vertica. He holds a Ph.D. in Computer Science from Cornell and a BSc in Computer Science from Fudan University. Mingsheng is president of NECINA and is active in St. Baldrick’s Foundation, a volunteer-driven charity that funds research to find cures for childhood cancers. I talked to Mingsheng just before he shaved his head, a visual indicator and act of solidarity expected from successful St. Baldrick’s fundraisers.

As a graduate student, were you thinking of an academic career?

At Cornell, I explored both academic and private industry career tracks. I love research and innovation, and discovered my passion for explaining ideas to people from various backgrounds and getting them excited about these ideas. While that aligns with a more academic track, in the end I decided the private sector was a better fit for me. I’m driven by the challenge of taking an idea and carrying it end-to-end, from idea to product development to sales. During graduate school, I had the opportunity to visit Microsoft for a few summers, and I got a lot of exposure to database R&D and came away with a good feel for the industry. My research work there was commercialized in SQL Server 2008 and 2012, which was very exciting.   Continue reading

Posted in Data Science | Leave a comment

Data Science is so 1996!

 

Source: A History of the International Federation of Classification Societies

Posted in Data Science | Leave a comment

A Very Short History of Data Science

Source: http://compsocsci.blogspot.com/

I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments.  I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).

[An updated version of this timeline is at Forbes.com]

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing, which defines data as “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.” The Preface to the book tells the reader that a course plan was presented at the IFIP Congress in 1968, titled “Datalogy, the science of data and of data processes and its place in education,” and that in the text of the book, “the term ‘data science’ has been used freely.” Naur offers the following definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

Continue reading
Posted in Big Data, Data Science | Leave a comment

Data Scientists In Growing Demand, Survey Says

73% of data science and analytics teams planned to hire in Q1/Q2 of 2021 and 81% planned to hire in Q3/Q4 of 2021. 

Read more

Posted in Data Science | Leave a comment

6 Highlights of a New Survey on Big Data Analytics

A new survey of 316 executives from large global companies, conducted by Forbes Insights and sponsored by Teradata in partnership with McKinsey, provides a fresh look at the state of big data analytics implementations. Here are the highlights.

The hype gone, big data is alive and doing well

About 90% of organizations report medium to high levels of investment in big data analytics, and about a third call their investments “very significant.” Most important, about two-thirds of respondents report that big data and analytics initiatives have had a significant, measurable impact on revenues.

59% of the executives surveyed consider big data and analytics either a top five issue or the single most important way to achieve a competitive advantage. This attitude is slightly more prevalent in financial services and much more prevalent in Asia-Pacific, where 41% of executives (compared to the survey average of 21%) consider big data and analytics the single most important way for companies to gain a competitive advantage.

Figure 4

The right organizational culture is key to big data success

No matter how many times you say “data-driven,” decisions are still not based on data. Sound familiar? 51% of executives said that adapting and refining a data-driven strategy is the single biggest cultural barrier and 47% reported putting big data learning into action as an operational challenge. 43% cited fostering a culture that rewards use of data and valuing creativity and experimentation with data as key challenges.

Companies that don’t get the data-driven culture right tend to fall behind their peers. 47% of executives surveyed do not think that their companies’ big data and analytics capabilities are above par or best of breed. And the survey found that the more the respondents know about big data and analytics, the less likely they are to judge the organization as above average or best of breed. For example, among data scientists, only 8% call their organizations best of breed and 10% think they are above average.

Big data is top of mind when the CEO loves data

If you take big data analytics seriously, you get results. 51% of organizations where big data is viewed as the single most important way to gain competitive advantage are led by CEOs who personally focus on big data initiatives. In organizations where big data is viewed as a top-five issue that gets significant time and attention from top leadership, the sponsor is typically one level below top leadership. Finally, companies that have established data and analytics positions at the CxO level are more likely to have above average data analytics capabilities.

Figure 5

Going from the right attitude to the right action is a long big data journey

Even if you have top leadership sponsorship and the right culture, getting data to drive action and strategy is a challenge.  48% of executives surveyed regard making fact-based business decisions based on data as a key strategic challenge, and 43% cite developing a corporate strategy as a significant hurdle. Other obstacles to realizing the benefits of big data analytics are focusing resources to get the most insights from data (43%) and viewing data as a valuable asset (41%).

Figure 2

There’s gold in them thar brontobyte data mountains

The survey found that big data is driving opportunities for innovation in three key areas: creating new business models (54%); discovering new product offers (52%); and monetizing data to external companies (40%). To pursue these opportunities, companies that are gaining the most traction are looking beyond transactional data and exploring a wide variety of data types.

The most-cited was location data (used to identify an electronic device’s physical location), collected by over half of the respondents, followed by text data (unstructured data like email messages, slides, Word documents, and instant messages). Social media is tracked and its unstructured data collected by 43% of companies surveyed, and about a third find golden nuggets in images, weblogs, videos, sensor data and speech files.

Big data miners still very much wanted

Realizing the business and innovation opportunities hidden in the mountains of data requires the right set of skills and experiences.  46% of the executives surveyed, however, reported that hiring the talent that can recognize innovations in data is a challenge.

Originally published on Forbes.com

Posted in Big Data, Data Science | Leave a comment

The Data Science Interview: Yun Xiong, Fudan University

The Goal of Data Science is to Study the Phenomena and Laws of Datanature

Yun Xiong is an Associate Professor of Computer Science and the Associate Director of the Center for Data Science and Dataology at Fudan University, Shanghai, China. She received her Ph.D. in Computer and Software Theory from Fudan University in 2008. Her research interests include dataology and data science, data mining, big data analysis, developing effective and efficient data analysis techniques for various applications including finance, economics, insurance, bioinformatics, and sociology. The following is an edited version of our recent email exchange.

How has data science developed in China?    Continue reading

Posted in Data Science | Leave a comment

The Big Data Interview: Sanjay Mirchandani, CIO, EMC

If data sits on a desk somewhere and is not being used, it’s an opportunity wasted

Sanjay Mirchandani believes IT has to take the lead in adding value to the business in the form of big data “addictive analytics.” Mirchandani is Chief Information Officer and COO, Global Centers of Excellence, at EMC Corporation. He has been recognized as one of Computerworld’s Premier 100 IT Leaders and Boston Business Journal’s CIOs of the Year. The following is an edited transcript of our recent phone conversation.

What would you say to a CIO who dismisses big data as just another buzzword?

I would say that for too long we have been trying to manage down information. The IT world that we have become comfortable with for many years was mostly within the enterprise, maybe connecting to some partners and customers. It was also mostly structured, basically revolving around transactional data. Today, the volume, variety, velocity and complexity of information have changed the IT landscape. These are the four things I challenge CIOs to really think about. We all know how to do structured information. But the moment you throw in unstructured and semi-structured information, life changes. This is where the value is for organizations today.

Does this also change the relationships between IT and the business?

Only IT has a complete picture of all the data in the enterprise. At the same time, IT today cannot have a monopoly on information. That changes the role and responsibilities of IT and the business. We in IT want to deliver more as a service and the business wants to consume more as a service.  And IT and the business increasingly share tools and capabilities. For example, I can offer a tool like Greenplum Chorus, which is a community-based BI-data warehousing-analytics tool, where data scientists in IT work collaboratively with data scientists sitting in the business. If there’s something we can do better, we’ll take it on ourselves; if there’s something they can do better, like creating their own wrappers around the analytics, they will do it. What’s clear is that IT and the business have never been better aligned.    Continue reading

Posted in Big Data, Data Science | Leave a comment