Data Scientists Spend Most of Their Time Cleaning Data

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data. Still, most are happy with having the sexiest job of the 21st century. The survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower, provider of a “data enrichment” platform for data scientists. Here are the highlights:

Data preparation accounts for about 80% of the work of data scientists

Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.

76% of data scientists view data preparation as the least enjoyable part of their work

57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.

These findings are yet another confirmation of a very widely known and lamented fact of the data scientist’s work experience. In 2009, data scientist Mike Driscoll popularized the term “data munging,” describing the “painful process of cleaning, parsing, and proofing one’s data” as one of the three sexy skills of data geeks. In 2013, Josh Wills (then director of Data Science at Cloudera, now Director of Data Engineering at Slack) told Technology Review “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” And Big Data Borat tweeted that “Data Science is 99% preparation, 1% misinterpretation.”

Given that the median annual base salary in the U.S. of the hard-to-find and much-in-demand data scientists was $104,000 last year, a number of startups have focused on automating a solution to this essential but boring task. In his 2016 Big Data Landscape, Matt Turck lists a number of them in the “data transformation” box plus companies (such as CrowdFlower) that are addressing this need with crowdsourcing (both in the “infrastructure” section).

Investing in solutions to messy data will continue and IDC has predicted that through 2020, spending on self-service visual discovery and data preparation tools will grow 2.5x faster than traditional IT-controlled tools for similar functionality. Following the same trend, Forrester predicted that in 2016, machine learning will begin to replace manual “data wrangling” (another endearing term like “data munging”) and data governance dirty work, and that vendors will market these solutions as a way to make data ingestion, preparation, and discovery quicker.

Indeed, 55% of the respondents to the CrowdFlower survey agreed with Forrester, predicting that over the next year machine learning will have (or will continue to have) a significant importance for their companies and their departments.

Other findings:

35% of data scientists gave their job the highest mark possible.

Only 14% of data scientists felt they were being held back by their tools.

What data scientists want most is more support and direction from their management or executive team (27%).

Finally, CrowdFlower looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. Last year they found that the skills most in demand were programming and coding. This year, they looked for the more specific data science tools mentioned in job postings.

Here are the Top 10 in-demand skills for data scientists:

Skill        % of jobs with skill
SQL          56%
Hadoop       49%
Python       39%
Java         36%
R            32%
Hive         31%
MapReduce    22%
NoSQL        18%
Pig          16%
SAS          16%

I’m sure it is relatively easy for employers to test prospective data scientists for their proficiency in any of the above tools and data platforms. But how do they test for their efficiency in removing commas?

Originally published on Forbes.com

Posted in Data Science, Data Science Careers, Data Scientists

IoT: The Explosion of Connected Things

[vimeo 94011734 w=640 h=360]

See also A Very Short History of the Internet of Things

Posted in Internet of Things

Video Surveillance Market to Reach $71.28 Billion by 2022

MarketsAndMarkets:

The video surveillance market is expected to be worth $71.28 billion by 2022, growing at an estimated CAGR of 16.56%.

The market for the service segment is expected to grow at the highest CAGR between 2016 and 2022. Cloud services and video surveillance as a service (VSaaS) play an important role in the video surveillance system.

Software components include video analytics and video management software. The use of neural networks and algorithms in biometric surveillance systems is also part of the software component. Advances in software technologies and networking services are expected to drive the video surveillance market.

Posted in Misc

A Growing Share of IoT Investment Goes to Industrial IoT

CB Insights:

A growing slice of deals to Internet of Things startups is going to applications relevant to asset-heavy industries, including manufacturing, logistics, mining, oil, utilities and agriculture.

Q1’16, for example, saw financings to enterprise drone developer Airware and industrial augmented-reality headset maker Daqri.

We used CB Insights data to compare quarterly financing to the IoT and industrial IoT (IIoT), in order to visualize the industrial share of overall IoT funding.

IIoT companies have taken an increasingly large piece of the overall IoT pie. In 2011, IIoT accounted for 17% of all funding dollars. Fast-forward to 2015, and IIoT accounted for 40% of investment in the year.

Most recently, Q1’16 saw more than one-third of IoT funding going to industrial-focused startups.

 

Posted in Internet of Things

How Americans Spend Their Time (Infographic)

How Americans Spend Their Time

Posted in Misc

10 Most Successful Big Data Technologies

Forrester graphic

As the big data analytics market rapidly expands to include mainstream customers, which technologies are most in demand and promise the most growth potential? The answers can be found in TechRadar: Big Data, Q1 2016, a new Forrester Research report evaluating the maturity and trajectory of 22 technologies across the entire data life cycle. The winners all contribute to real-time, predictive, and integrated insights, what big data customers want now.

Here is my take on the 10 hottest big data technologies based on Forrester’s analysis:

  1. Predictive analytics: software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources to improve business performance or mitigate risk.
  2. NoSQL databases: key-value, document, and graph databases.
  3. Search and knowledge discovery: tools and technologies to support self-service extraction of information and new insights from large repositories of unstructured and structured data that resides in multiple sources such as file systems, databases, streams, APIs, and other platforms and applications.
  4. Stream analytics: software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format.
  5. In-memory data fabric: provides low-latency access and processing of large quantities of data by distributing data across the dynamic random access memory (DRAM), Flash, or SSD of a distributed computer system.
  6. Distributed file stores: a computer network where data is stored on more than one node, often in a replicated fashion, for redundancy and performance.
  7. Data virtualization: a technology that delivers information from various data sources, including big data sources such as Hadoop and distributed data stores, in real time and near-real time.
  8. Data integration: tools for data orchestration across solutions such as Amazon Elastic MapReduce (EMR), Apache Hive, Apache Pig, Apache Spark, MapReduce, Couchbase, Hadoop, and MongoDB.
  9. Data preparation: software that eases the burden of sourcing, shaping, cleansing, and sharing diverse and messy data sets to accelerate data’s usefulness for analytics.
  10. Data quality: products that conduct data cleansing and enrichment on large, high-velocity data sets, using parallel operations on distributed data stores and databases.

Forrester’s TechRadar methodology evaluates the potential success of each technology and all 10 above are projected to have “significant success.” In addition, each technology is placed in a specific maturity phase—from creation to decline—based on the level of development of its technology ecosystem. The first 8 technologies above are considered to be in the Growth stage and the last 2 in the Survival stage.

Forrester also estimates the time it will take the technology to get to the next stage and predictive analytics is the only one with a “>10 years” designation, expected to “deliver high business value in late Growth through Equilibrium phase for a long time.” Technologies #2 to #8 above are all expected to reach the next phase in 3 to 5 years and the last 2 technologies are expected to move from the Survival to the Growth phase in 1-3 years.

Finally, Forrester provides for each technology an assessment of its business value-add, adjusted for uncertainty. This is based not only on potential impact but also on feedback and evidence from implementations and market reputation. Says Forrester: “If the technology and its ecosystem are at an early stage of development, we have to assume that its potential for damage and disruption is higher than that of a better-known technology.” The first 2 technologies in the list above are rated as “high” business value-add, the next 2 as “medium,” and all the rest “low,” no doubt because of their emerging status and lack of maturity.

Why did I add to the list of hottest technologies two that are still in the Survival phase—data preparation and data quality? In the same report, Forrester also provides the following data from its Q4 2015 survey of 63 big data vendors:

What is the level of customer interest in each of the following capabilities? (% answering “very high”)

Data preparation and discovery    52%
Data integration                  48%
Advanced analytics                46%
Customer analytics                46%
Data security                     38%
In-memory computing               37%

While Forrester predicts that a few standalone vendors of data preparation will survive, it believes this is “an essential capability for achieving democratization of data,” or rather, its analysis, letting data scientists spend more time on modeling and discovering insights and allowing more business users to have fun with data mining.  Data Quality includes data security from the table above, in addition to other features ensuring decisions are based on reliable and accurate data. Forrester “expects that data quality will have significant success in the coming years as firms formalize a data certification process. Data certification efforts seek to guarantee that data meets expected standards for quality; security; and regulatory compliance supporting business decision-making, business performance, and business processes.”

“Big Data” as a topic of conversation has reached mainstream audiences probably far more than any other technology buzzword before it. That did not help the discussion of this amorphous term, defined for the masses as “the planet’s nervous system” (see my rant here) or as “Hadoop” for technical audiences.  Forrester’s report helps clarify the term, defining big data as the ecosystem of 22 technologies, each with its specific benefits for enterprises and, through them, consumers.

Big data, specifically one of its attributes, big volume, has recently given rise to a new general topic of discussion, Artificial Intelligence. The availability of very large data sets is one of the reasons Deep Learning, a subset of AI, has been in the limelight, from identifying Internet cats to beating a Go champion. In its turn, AI may lead to the emergence of new tools for collecting and analyzing data.

Says Forrester: “In addition to more data and more computing power, we now have expanded analytic techniques like deep learning and semantic services for context that make artificial intelligence an ideal tool to solve a wider array of business problems. As a result, Forrester is seeing a number of new companies offering tools and services that attempt to support applications and processes with machines that mimic some aspects of human intelligence.”

Prediction is difficult, especially about the future, but it’s a (relatively) safe bet that the race to mimic elements of human intelligence, led by Google, Facebook, Baidu, Amazon, IBM, and Microsoft, all with very deep pockets, will change what we mean by “big data” in the very near future.

Originally posted on Forbes.com

Posted in Big Data Analytics

What Happens on the Internet in 60 Seconds

Excelacom:

So, what’s happened since 2015? The vast majority of these numbers have increased significantly from what happened in an Internet minute last year. This goes to show that consumers are using the Internet more and more each day, pressuring Internet speeds to increase as well. Here are some key differences between the 2016 and 2015 Internet minute:

  • Uber – 695 more rides per minute (100% increase)
  • Amazon – $83,836 more in sales per minute (70% increase)
  • Spotify – 24,752 more hours of music uploaded per minute (186% increase)

Posted in Data Growth, Misc

Top 10 Data Science Influencers on Twitter

Twitter_influencers

ODSC:

To build a network and find the most influential data science Twitter users, we will use the NetworkX package to create a directed graph and to calculate eigenvector centrality (a measure of network influence) among the nodes (Twitter users)…

Nodes represent Twitter handles and the edges between the nodes represent user mentions. The size and color of the nodes correspond to eigenvector centrality values, which, again, are one measure of network influence. Let’s take a quick peek at the top 10 influencers (who are also plotted above):

  1. GilPress
  2. KirkDBorne
  3. Forbes
  4. BernardMarr
  5. bobehayes
  6. kdnuggets
  7. Ronald_vanLoon
  8. LinkedIn
  9. DataScienceCtrl
  10. BoozAllen

The top 10 influencers include some of the most respected individuals and organizations in data science, and so their influence among data scientists on Twitter is not at all surprising.
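
For readers curious about what eigenvector centrality measures, here is a minimal sketch in plain Python. It is not ODSC's code and it deliberately does not call NetworkX (which ships its own `eigenvector_centrality` function); it uses simple power iteration instead. The handles `user_a` to `user_d` and the mention edges are made up for illustration. The idea is that a handle scores highly when it is mentioned by other handles that themselves score highly.

```python
from collections import defaultdict


def eigenvector_centrality(edges, max_iter=200, tol=1e-10):
    """Power iteration on a directed mention graph.

    edges: (source, target) pairs meaning "source mentions target".
    Returns a dict of scores normalized to unit Euclidean length.
    """
    nodes = sorted({node for edge in edges for node in edge})
    mentioned_by = defaultdict(list)
    for source, target in edges:
        mentioned_by[target].append(source)

    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iter):
        # New score = own previous score + scores of everyone mentioning us.
        # Keeping the previous score (a shift by the identity matrix)
        # stabilizes the iteration without changing the final ranking.
        updated = {
            node: scores[node] + sum(scores[m] for m in mentioned_by[node])
            for node in nodes
        }
        norm = sum(value * value for value in updated.values()) ** 0.5
        updated = {node: value / norm for node, value in updated.items()}
        change = sum(abs(updated[node] - scores[node]) for node in nodes)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical mention graph: three handles mention user_a.
edges = [
    ("user_b", "user_a"),
    ("user_c", "user_a"),
    ("user_d", "user_a"),
    ("user_a", "user_b"),
    ("user_b", "user_c"),
    ("user_c", "user_d"),
]
scores = eigenvector_centrality(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph `user_a` ranks first because every other handle mentions it, and `user_b` ranks second because its only mention comes from the top-ranked handle.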

Posted in Misc

Market for securing IoT devices will increase 5X over next 5 years

Business Insider:

A new report from Argus Insights analyzed more than 2.3 million social media comments about the IoT since the start of 2016, and “concerns” and “real world applications” were the two biggest topics of conversation…

The conversations on social media were broad, as less than 10% of comments analyzed mentioned a specific company. This indicates that the IoT is still in its early stages because no company has been able to solve these problems and truly open the doors for widespread IoT implementation. Argus noted the market is more focused on how to solve the problems than who will solve them…

Jonathan Camhi of BI Intelligence, Business Insider’s premium research service, has compiled a detailed report on IoT Security that examines how vulnerable IoT devices will create new opportunities for different types of hackers.

  • Research has repeatedly shown that many IoT device manufacturers and service providers are failing to implement common security measures in their products.
  • Hackers could exploit these new devices to conduct data breaches, corporate or government espionage, and damage critical infrastructure like electrical grids.
  • Investment in securing IoT devices will increase five-fold over the next five years as adoption of these devices picks up.
  • Traditional IT security practices like network monitoring and segmentation will become even more critical as businesses and governments deploy IoT devices.
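
As a rough illustration that is not part of the BI Intelligence report, a five-fold increase over five years can be converted into an implied compound annual growth rate. The sketch below assumes steady year-over-year compounding, and the function name `implied_cagr` is hypothetical.

```python
def implied_cagr(multiple: float, years: float) -> float:
    """Constant annual growth rate that turns 1.0 into `multiple` after `years` years."""
    if multiple <= 0 or years <= 0:
        raise ValueError("multiple and years must be positive")
    return multiple ** (1.0 / years) - 1.0


# Assumption: investment grows by the same percentage every year.
rate = implied_cagr(5.0, 5)

# Sanity check: compounding the rate for five years recovers the 5x multiple.
value = 1.0
for _ in range(5):
    value *= 1.0 + rate

print(f"Implied annual growth: {rate:.1%}")  # about 38% per year
```

Under that assumption, "five-fold over five years" corresponds to growth of roughly 38% per year.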

Posted in Internet of Things

5 Origins of Data Science

DataScience_History

Source: Impact of Big Data on Analytics

Posted in Data Science History, Misc