2013 Data Science Salary Survey: Open source tools correlate with higher salary

“In our report, 2013 Data Science Salary Survey, we make our own data-driven contribution to the conversation. We collected a survey from attendees of the Strata Conference in New York and Santa Clara, California, about tool usage and salary…

What did we find?

In a sentence: those who use data tools make more.

More specifically, the tools that correlate with higher salary are scalable and generally open source; they are often script-based or built for machine learning. Attendees who use one such tool tend to use others; that is, these tools form a ‘cluster’ in terms of usage among our sample. Perhaps just as interesting is that some of the traditional, popular tools such as Excel and SAS were not used as widely as R and Python. This might be food for thought for those data analysts who have thus far resisted learning how to code or moving beyond query-based data tools.”

Source: 2013 Data Science Salary Survey 
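
The “cluster” finding is, at bottom, a claim about correlated tool usage across respondents. A minimal sketch of that kind of co-usage analysis might look like the following; the data is fabricated and this is an illustration of the idea, not the report’s actual method:

```python
# Toy co-usage analysis: do open source tools cluster together?
# Rows are respondents, columns are tools; 1 = uses the tool.
import numpy as np

rng = np.random.default_rng(0)
n = 200
tools = ["R", "Python", "Hadoop", "Excel", "SAS"]

# Latent trait: does the respondent lean toward open source tooling?
open_source_leaning = rng.random(n) < 0.5

# Fabricated usage: R/Python/Hadoop track the latent trait; Excel/SAS don't.
usage = np.column_stack([
    rng.random(n) < np.where(open_source_leaning, 0.8, 0.2),  # R
    rng.random(n) < np.where(open_source_leaning, 0.8, 0.2),  # Python
    rng.random(n) < np.where(open_source_leaning, 0.6, 0.1),  # Hadoop
    rng.random(n) < 0.6,                                      # Excel
    rng.random(n) < 0.3,                                      # SAS
]).astype(float)

corr = np.corrcoef(usage, rowvar=False)  # tool-by-tool correlation matrix
for i in range(len(tools)):
    for j in range(i + 1, len(tools)):
        print(f"{tools[i]:>6} ~ {tools[j]:<6} r = {corr[i, j]:+.2f}")
```

Run on data like this, the R/Python/Hadoop pairs show strongly positive correlations while the Excel and SAS pairs hover near zero, which is exactly the “cluster” pattern the survey describes.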

Posted in Big Data Analytics, Data Science, Data Science Careers, Data Scientists

The New Apple Wristop Computer: Not Designed for the Internet of Things

MIT Media Lab cofounder Nicholas Negroponte observed at a recent TED event: “I look today at some of the work being done around the Internet of Things and it’s kind of tragically pathetic.”

The “tragically pathetic” label has been especially fitting for wearables, considered the hottest segment of the Internet of Things.  Lauren Goode at Re/Code wrote back in March: “Let me guess: Your activity-tracking wristband is sitting on your dresser or in a drawer somewhere right now, while it seems that every day there’s a news report out about an upcoming wearable product that’s going to be better, cooler, smarter.”

All of this was going to change when Apple finally entered the category with its smart watch. Many observers hoped that Apple’s design principles, obsession with simplicity, and track record of delighting users with easy-to-use products would finally give the world a useful and fun wearable.

Instead, we got a good-looking wristop computer. Not a simple, intuitive, and focused device but a generic, complex product with too many functions and options. Kevin McCullagh wrote on fastcodesign.com: “I can’t help but think Steve Jobs would have stopped the kitchen sink being thrown in like this. Do we really need photos and maps on a stamp-sized screen, when our phones are rarely out of reach? For all the claims of a ‘thousand no’s for every yes,’ the post-Jobs era is shaping up to be defined by less ruthless focus.” Back in June, Adam Lashinsky already made this general observation about the potential loss of the famed product development discipline: “Apple, once the epitome of simplicity, is becoming the unlikely poster child for complexity.”

“Complexity,” however, does not tell the whole story. By introducing a watch that is basically a computer on your wrist, Apple missed an opportunity not just to reorient the wearables market to something much better than “tragically pathetic,” but also to define the design and usability principles for the Internet of Things.

In his TED talk, Negroponte highlighted what he called “not a particularly enlightened view of the Internet of Things.” This is the tendency to move the intelligence (or functionality of many devices) into the cell phone (or the wearable), instead of building the intelligence into the “thing,” whatever the thing is – the oven, the refrigerator, the road, the walls, all the physical things around us. More generally, it is the tendency to continue evolving the current computer paradigm—from the mainframe to the laptop to the wristop computer—instead of developing a completely new Internet of Things paradigm.

The new paradigm should embrace and evolve the principles of what was once called “ubiquitous computing.” The history of that vision over the last two decades may help illuminate where the Internet of Things is today and where it may or may not go.

In 1991, Mark Weiser, then head of the Computer Science Lab at Xerox PARC, published an article in Scientific American titled “The Computer for the 21st Century.” The article opens with what should be the rallying cry for the Internet of Things today: “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.”

Weiser went on to explain what was wrong with the personal computing revolution brought on by Apple and others: “The arcane aura that surrounds personal computers is not just a ‘user interface’ problem. My colleagues and I at the Xerox Palo Alto Research Center think that the idea of a ‘personal’ computer itself is misplaced and that the vision of laptop machines, dynabooks and ‘knowledge navigators’ is only a transitional step toward achieving the real potential of information technology. Such machines cannot truly make computing an integral, invisible part of people’s lives.”

Weiser understood that, conceptually, the PC was simply a mainframe on a desk, albeit with easier-to-use applications.  He misjudged, however, the powerful and long-lasting impact that this new productivity and life-enhancing tool would exert on millions of users worldwide. Weiser wrote: “My colleagues and I at PARC believe that what we call ubiquitous computing will gradually emerge as the dominant mode of computer access over the next 20 years. … [B]y making everything faster and easier to do, with less strain and fewer mental gymnastics, it will transform what is apparently possible. … [M]achines that fit the human environment instead of forcing humans to enter theirs will make using a computer as refreshing as taking a walk in the woods.”

Ubiquitous computing has not become the “dominant mode of computer access” mostly because of Steve Jobs’ Apple. It successfully invented variations on the theme of the Internet of Computers: The iPod, the iPhone, the iPad. All of them beautifully designed, easy-to-use, and useful. All of them cementing and enlarging the dominance of the Internet of Computers paradigm. Now Apple has extended the paradigm by inventing a wristop computer. That the Apple Watch is more complex and less focused than Apple’s previous successful inventions matters less than the fact that it continues in their well-trodden path.

While the dominant paradigm has been reinforced and expanded by the successful innovations of Apple and others, the vision of ubiquitous computing has not died. Today, when we are adding intelligence to things at an accelerating rate, it is more important than ever. Earlier this year, I asked Bob Metcalfe what is required to make us happy with our Internet of Things experience. “Not so much good UX, but no UX at all,” he said. “The IoT should disappear into the woodwork, even faster than Ethernet has.” Metcalfe invented Ethernet at Xerox PARC, the same lab where Weiser and others later worked on making computers disappear.

Besides ubiquity, there are at least two other dimensions to the new paradigm of the Internet of Things. One is seamless connectivity. In response to the same question, Google’s Hal Varian told me, “I think that the big challenge now is interoperability. Given the fact that there will be an explosion of new devices, it is important that they talk to each other. For example, I want my smoke alarm to talk to my bedroom lights, and my garden moisture detector to talk to my lawn sprinkler.” No more islands of computing, a hallmark of the Internet of (isolated) Computers.
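
Varian’s smoke-alarm example is essentially a publish/subscribe problem: each device announces events on a shared bus, and any other device can subscribe and react, with no vendor-specific pairing. Here is a minimal, dependency-free sketch of the pattern; the device names and topics are hypothetical, and a real deployment would use a network protocol such as MQTT rather than an in-process bus:

```python
# Toy in-process event bus illustrating device-to-device interoperability.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a handler to be called for every event on a topic."""
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        """Deliver an event to every handler subscribed to the topic."""
        for handler in self.subscribers[topic]:
            handler(payload)

bus = EventBus()

# Hypothetical devices wired to each other through topics, not vendor apps.
bus.subscribe("home/smoke_alarm",
              lambda e: print("bedroom lights: ON", e))
bus.subscribe("garden/moisture",
              lambda e: print("sprinkler: skip watering" if e["wet"]
                              else "sprinkler: water now"))

bus.publish("home/smoke_alarm", {"smoke": True})
bus.publish("garden/moisture", {"wet": True})
```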

Another important dimension of the new paradigm is useful data. Not big or small, nor irrelevant or trapped in a silo, just useful. The value of the “things” in the Internet of Things paradigm is measured by how well the data they collect is analyzed and how quickly useful feedback based on this analysis is delivered to the user.

Disappearing into the woodwork. All things talking to all things. Useful data. It may not be Apple, but the company or companies that master these will usher in the new era of the Internet of Things, where we finally get over our mainframe/PC/wristop computer habit.

[Originally published on Forbes.com]

Posted in Internet of Things

Big Data Quotes: Einstein, Come Back When You’ve Got Data

“Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away”—George Dyson (quoted by Tim O’Reilly)

“If the engineers have their way, every idea, memory, and feeling—the recorded consciousness of a single lifetime—will be stored in the cloud… ‘Information overload’ once referred to the difficulty of absorbing intelligently the data produced by others. Now we face the peril of choking on our own…By remembering everything, we may become haunted by our pasts and immobilized by digital distractions—or we may gain new powers to prevent the bad and promote the good”—G. Pascal Zachary

“[I]n a world where massive datasets can be analysed to identify patterns not easily identified using simpler analogue methods, what happens to genius of the Einstein variety?

Genius is about big ideas, not big data. Analysing the attributes and characteristics of anything is guaranteed to find some patterns. It is inherently an atheoretical exercise, one that requires minimal thought once you’ve figured out what you want to measure. If you’re not sure, just measure everything you can get your hands on. Since the number of observations — the size of the sample — is by definition huge, the laws of statistics kick in quickly to ensure that significant relationships will be identified. And who could argue with the data?

Unfortunately, analysing data to identify patterns requires you to have the data. That means that big data is, by necessity, backward-looking; you can only analyse what has happened in the past, not what you can imagine happening in the future. In fact, there is no room for imagination, for serendipitous connections to be made, for learning new things that go beyond the data. Big data gives you the answer to whatever problem you might have (as long as you can collect enough relevant information to plug into your handy supercomputer). In that world, there is nothing to learn; the right answer is given…

What if Albert Einstein lived today and not 100 years ago? What would big data say about the general theory of relativity, about quantum theory? There was no empirical support for his ideas at the time — that’s why we call them breakthroughs.

Today, Einstein might be looked at as a curiosity, an ‘interesting’ man whose ideas were so out of the mainstream that a blogger would barely pay attention. Come back when you’ve got some data to support your point”—Sidney Finkelstein
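
Finkelstein’s point that analysing anything is “guaranteed to find some patterns” is the classic multiple-comparisons problem, and it is easy to demonstrate on pure noise. A small sketch with fabricated data:

```python
# Search pure noise for "significant" correlations: with enough variables,
# some pairs always clear the conventional p < 0.05 bar by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 50))  # 1000 observations, 50 random variables

hits = 0
tests = 0
for i in range(50):
    for j in range(i + 1, 50):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        tests += 1
        hits += p < 0.05

print(f"{hits} of {tests} pairs 'significant' at p < 0.05, in pure noise")
# Expect roughly 5% of the 1,225 tests (~61 pairs) to look "significant."
```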

Posted in Big Data Analytics, Quotes

The Data Science Interview: Mingsheng Hong, Hadapt

Data scientists are data junkies: when they see a new data set they are just naturally excited and can’t wait to explore.

Mingsheng Hong is Chief Data Scientist at Hadapt, a Boston-based startup that offers an analytical platform that integrates structured and unstructured data in one cloud-optimized system. Before joining Hadapt, Mingsheng was Field CTO for Vertica. He holds a Ph.D. in Computer Science from Cornell and a BSc in Computer Science from Fudan University. Mingsheng is president of NECINA and is active in St. Baldrick’s Foundation, a volunteer-driven charity that funds research to find cures for childhood cancers. I talked to Mingsheng just before he shaved his head, a visual indicator and act of solidarity expected from successful St. Baldrick’s fundraisers.

As a graduate student, were you thinking of an academic career?

At Cornell, I explored both academic and private industry career tracks. I love research and innovation, and discovered my passion for explaining ideas to people from various backgrounds and getting them excited about these ideas. While that aligns with a more academic track, in the end I decided the private sector was a better fit for me. I’m driven by the challenge of taking an idea and carrying it end-to-end, from idea to product development to sales. During graduate school, I had the opportunity to visit Microsoft for a few summers, and I got a lot of exposure to database R&D and came away with a good feel for the industry. My research work there was commercialized in SQL Server 2008 and 2012, which was very exciting.   Continue reading

Posted in Data Scientists

The CIO Interview: Annabelle Bexiga, TIAA-CREF

“Innovation is everyone’s job,” Annabelle Bexiga, EVP and CIO at TIAA-CREF, told me recently. “The most mundane thing,” says Bexiga, “even stacking servers in the data center, can be innovative if you can think of a different way of doing it.”

Contrary to repeated predictions heralding the end of IT innovation, IT is now synonymous with the ever-changing technology that touches all aspects of our lives. It is also synonymous, for the most part, with business innovation, as IT transforms all business activities from operations to manufacturing to customer relations.

At TIAA-CREF, the IT organization is innovating in support of the growth and expansion of the business. Founded in 1918 to provide retirement services to university faculty, TIAA-CREF is expanding to provide a wider range of financial services and establish a growing presence in other not-for-profit sectors, including health care, research, cultural organizations, and the public sector.  It is already one of the largest pension funds in the U.S., with $520 billion of assets under management, serving 3.9 million active and retired individuals, in addition to institutional investors, retirement plan sponsors, and financial planners.   Continue reading

Posted in Digitization

The Data Scientist Will Be Replaced By Tools

We have barely started to use the term “data scientist” and the demise of this new profession is already predicted? Well, at least it’s not one more “rise of the machines” prophecy; it’s the provocative title of a proposed panel for the upcoming SXSW.

The organizer of the panel, Scott Hendrickson of Gnip, has provided a useful run-down of some of the arguments for and against the possible disappearance of data scientists. Supporting the proposition are the current scarcity of data science talent and a slew of startups providing “data science as a service.” As an example of the opposition to the “democratization of algorithms,” Hendrickson quotes Cathy (Mathbabe) O’Neil, who wrote recently that “if your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is.” In other words, machines will never have the deep understanding of the tools of data science that is required to practice data science. Continue reading

Posted in Data Science, Data Scientists

Doing Data Science at Manheim

As ones and zeros eat the world, data is the new product and data science is the new process of innovation.

The International Institute for Analytics predicts that in 2014 companies in a variety of industries will increasingly use analytics on the data they have accumulated to develop new products and services. NewVantage Partners’ most recent Big Data Survey reports that 68% of executives felt that “new product innovations” was the greatest value to their organization from big data. In releasing the Accenture Technology Vision 2014, Accenture’s CTO Paul Daugherty said that “Digital is rapidly becoming part of the fabric of [large enterprises’] operating DNA and they are poised to become the digital power brokers of tomorrow.”

The best example of this trend I’ve encountered recently came from an industry one does not necessarily associate with data crunching and analysis—the vehicle remarketing industry, better known as used-car auctions. In 2012, Manheim, a subsidiary of Cox Enterprises, handled nearly 8 million used vehicles, facilitating transactions representing more than $50 billion in value. With annual revenues of more than $2.5 billion, Manheim offers its services in 14 countries, from physical and online auction channels to financing, transportation, and mobile solutions. Manheim’s research and consulting arm, Manheim Consulting, provides market intelligence and publishes the monthly Used Vehicle Value Index and the annual Used Car Market Report (see here for the 2014 version).

Manheim has provided this type of analysis for free, seeing it as part of the value it offers to the auto dealers who are members of its network. But now it has moved into using its deep knowledge of the used-car market and its analytics expertise to offer a new, fee-based service. Shifting the analytics team from supporting the business to generating revenues, “we’ve decided to look at how we can help dealers in managing the risk associated with their inventory,” T. Glenn Bailey told me.

Bailey is Senior Director of Enterprise Product Planning at Manheim, and his responsibilities include market segmentation, forecasting, and optimization. Last year, he and his team started testing a new service called DealShield. The idea came from the financial markets, specifically put option contracts. Just as a put option protects its holder from a decline in a stock’s price below a specified level, DealShield guarantees that Manheim will buy a car back from the dealer, within a certain time frame, for what the dealer paid for it plus the fee. “It is as if they never bought the car,” says Bailey.
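
The put-option analogy can be made concrete with a little arithmetic: a put struck at K pays max(K − S, 0) when the price falls to S, and a buy-back guarantee at purchase price plus fee plays the same role for the dealer. A toy example with made-up numbers, not Manheim’s actual pricing:

```python
# Toy comparison of a dealer's downside with and without a buy-back
# guarantee, modeled as a put option struck at the purchase price.
# All numbers are fabricated for illustration.

purchase_price = 20_000   # what the dealer paid at auction
guarantee_fee = 300       # hypothetical DealShield-style fee
resale_value = 18_500     # what the car turns out to be worth

# Without the guarantee: the dealer eats any decline in value.
net_without = resale_value - purchase_price                         # -1,500

# With the guarantee: Manheim buys the car back for price + fee, so
# the dealer's worst case is breaking even net of the fee --
# "as if they never bought the car."
buyback = purchase_price + guarantee_fee
net_with = max(resale_value, buyback) - purchase_price - guarantee_fee  # 0

print(f"without guarantee: {net_without:+,}")
print(f"with guarantee:    {net_with:+,}")
```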

Manheim’s market knowledge and analytics skills give it confidence in its estimates of what a car is worth and what it could get for it if it comes back. “We see a lot of value in it,” says Bailey, “because one of the things dealers like to have is liquidity. They use wholesale financing to buy used cars and typically repay the loan within seven to fourteen days. The inventory that’s sitting out there is money that is tied up. DealShield allows them to get out of that car and get their money back in a certain period of time.”

To do their analysis, the Manheim team uses tools that have served this purpose for years, demonstrating that for certain types of analysis and data you can do data science without using any of the new big data technologies. The data is collected and stored in an IBM DB2 database and the analysis is done using a variety of SAS analytics tools.  “The need to combine data from different sources is why we moved into a SAS cloud,” says Bailey. “I wanted our analyst team to be focused on the analytics and not worry about the administrative side.”
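
The cross-source combination Bailey describes is, at its core, a join on a shared key. The Manheim team does this in SAS against DB2; purely as an illustration of the idea, the same operation in Python with hypothetical tables:

```python
# Illustrative only: joining auction transactions to an external
# monthly index on a shared key. Table contents are made up.
import pandas as pd

sales = pd.DataFrame({
    "vin":   ["1A", "2B", "3C"],
    "month": ["2014-01", "2014-01", "2014-02"],
    "price": [21_000, 18_500, 22_750],
})
index = pd.DataFrame({
    "month": ["2014-01", "2014-02"],
    "value_index": [122.9, 121.5],  # fabricated index values
})

# Left join: every sale keeps its row, enriched with the month's index.
combined = sales.merge(index, on="month", how="left")
print(combined)
```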

Speaking of the analyst team, Bailey says that “we are in the same market for analytics and data science talent with everybody.” In the competition for these hard-to-find professionals, Bailey looks for creativity, communication skills, and a willingness to learn the business. “In my experience,” he says, “it is fairly easy to tell if you have the technical chops.” He spends most of his interview time trying to determine if candidates are creative and can come up with new ideas on how to apply analytics tools to the data to find new insights. “Reversing the flow of cause and effect,” Bailey calls it. “Maybe optimization can tell us where to send a vehicle to maximize value.”
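
Bailey’s closing example reduces to an argmax over candidate auction sites of expected sale price net of transport and holding costs. A minimal sketch, with hypothetical sites and numbers; a real model would estimate the expected prices from historical auction data:

```python
# Pick the auction site that maximizes expected net value for one vehicle.
# Sites, prices, and costs are hypothetical.

sites = {
    # site:         (expected_price, transport_cost, days_to_sale)
    "Atlanta":      (21_200, 150, 7),
    "Dallas":       (21_900, 450, 10),
    "Philadelphia": (21_500, 600, 5),
}

daily_holding_cost = 25  # financing cost of inventory sitting unsold

def net_value(price, transport, days):
    """Expected sale price minus transport and holding costs."""
    return price - transport - daily_holding_cost * days

best = max(sites, key=lambda s: net_value(*sites[s]))
print(best, net_value(*sites[best]))  # -> Dallas 21200
```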

In addition to looking for “people that can bring technology to the business,” Bailey also looks for people who are comfortable with “getting with the business itself.” He calls it “putting on the polo shirt”: spending time with the dealers and getting engaged with them to understand their business first-hand. This practical bent does not stop with the hiring of the right people but continues with establishing the right work environment and a “fail-fast” culture. “In some sense,” says Bailey, “failure is rewarded because it means you are testing this thing out.” When they developed DealShield, “we had a chart that over a 2-month period showed all the things that failed. If it doesn’t work, kill it.”

In addition to being the first knowledge-based service that is expected to bring in a new revenue stream, DealShield breaks new ground for Manheim because it is the first time the company actually owns cars (when they come back from the dealer) rather than just acting as a middleman. That became an opportunity for an analyst on Bailey’s team to further hone her knowledge of the business. “She is now responsible for selling the cars. She is setting the auction, the floor price, where to run the auction,” says Bailey.

Doing data science means engaging with the business, inventing new data-based products, even becoming an integral part of the business’s revenue stream.

[Originally published on Forbes.com]

Posted in Data Science, Data Scientists

Sources and Types of Big Data (Infographic)

Posted in Big Data

Top 10 Predictions for $2.14 Trillion IT Market in 2014: IDC

IDC recently issued its top 10 predictions for 2014. IDC’s Frank Gens predicted that 2014 “will be about pitched battles” and a coming IT industry consolidation around a small number of big “winners.” The industry landscape will change as “incumbents will no longer be foolish enough to say we don’t compete with Amazon.”

Here’s my edited version of the predictions in the IDC press release and webcast:

Overall IT spending to grow 5.1% to $2.14 trillion, PC revenues to decline 6%

Worldwide sales of smartphones (12% growth) and tablets (18%) will continue at a “torrid pace” (accounting for over 60% of total IT market growth) at the expense of PC sales, which will continue to decline. Spending on servers, storage, networks, software, and services will “fare better” than in 2013.

Android vs Apple, round 6

The Samsung-led Android community “will maintain its volume advantage over Apple,” but Apple will continue to enjoy “higher average selling prices and an established ecosystem of apps.” Google Play (Android) app downloads and revenues, however, “are making dramatic gains.” IDC advises Microsoft to “quickly double mobile developer interest in Windows.” Or else?

Amazon (and possibly Google) to take on traditional IT suppliers

Amazon Web Services’ “avalanche of platform-as-a-service offerings for developers and higher value services for businesses” will force traditional IT suppliers to “urgently reconfigure themselves.” Google, IDC predicts, will join in the fight, as it realizes “it is at risk of being boxed out of a market where it should be vying for leadership.”***

Emerging markets will return to double-digit growth of 10%

Emerging markets will account for 35% of worldwide IT revenues and, for the first time, more than 60% of worldwide IT spending growth. “In dollar terms,” IDC says, “China’s IT spending growth will match that of the United States, even though the Chinese market is only one third the size of the U.S. market.” In 2014, the number of smart connected devices shipped in emerging markets will be almost double that shipped in developed markets, and emerging markets will be a hotbed of Internet of Things market development.

There’s a $100 billion cloud in our future

Spending on cloud services and the technology to enable these services “will surge by 25% in 2014, reaching over $100 billion.” IDC predicts “a dramatic increase in the number of datacenters as cloud players race to achieve global scale.”

Cloud service providers will increasingly drive the IT market

As cloud-dedicated datacenters grow in number and importance, the market for server, storage, and networking components “will increasingly be driven by cloud service providers, who have traditionally favored highly componentized and commoditized designs.” The incumbent IT hardware vendors will be forced to adopt a “cloud-first” strategy, IDC predicts. In 2014, 25–30% of server shipments will go to datacenters managed by service providers, growing to 43% by 2017.

Bigger big data spending

IDC predicts spending of more than $14 billion on big data technologies and services, 30% growth year-over-year, “as demand for big data analytics skills continues to outstrip supply.” The cloud will play a bigger role, with IDC predicting a race to develop cloud-based platforms capable of streaming data in real time. Enterprises will make increased use of externally-sourced data and applications, and “data brokers will proliferate.” IDC predicts explosive growth in big data analytics services, with the number of providers tripling in three years. 2014 spending on these services will exceed $4.5 billion, growing by 21%.

Here comes the social enterprise

IDC predicts increased integration of social technologies into existing enterprise applications. “In addition to being a strategic component in virtually all customer engagement and marketing strategies,” IDC says, “data from social applications will feed the product and service development process.” By 2017, 80% of Fortune 500 companies will have an active customer community, up from 30% today.

Here comes the Internet of Things

By 2020, the Internet of Things will generate 30 billion autonomously connected end points and $8.9 trillion in revenues. IDC predicts that in 2014 we will see new partnerships among IT vendors, service providers, and semiconductor vendors that will address this market. Again, China will be a key player:  The average Chinese home in 2030 will have 40–50 intelligent devices/sensors, generating 200TB of data annually.

The digitization of all industries

By 2018, one-third of share leaders in virtually all industries will be “Amazoned” by new and incumbent players. “A key to competing in these disrupted and reinvented industries,” IDC says, “will be to create industry-focused innovation platforms (like GE’s Predix) that attract and enable large communities of innovators – dozens to hundreds will emerge in the next several years.” Concomitant with this digitization-of-everything trend, “the IT buyer profile continues to shift to business executives. In 2014, and through 2017, IT spending by groups outside of IT departments will grow at more than 6% per year.”

***Can’t resist quoting my August 2011 post: “Consumer vs. enterprise is an old and soon-to-be obsolete distinction. If Google will not take away some of Microsoft’s (and IBM’s, etc. for that matter) “enterprise” revenues, someone else will. At stake are the $1.5 trillion spent annually by enterprises on hardware, software, and services. If you include what enterprises spend on IT internally (staff, etc.), you get at least $3 trillion. A big chunk of that will move to the cloud over the next fifteen years. Compare this $3 trillion to the $400 billion spent annually on all types of advertising worldwide.  Why leave money on the table?”

[Originally published on Forbes.com]

Posted in Big Data Analytics, Internet of Things

On Data Janitors, Engineers, and Statistics

Big Data Borat tweeted recently that “Data Science is 99% preparation, 1% misinterpretation.” Commenting on the 99% part, Cloudera’s Josh Wills says: “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” Kaggle, the data-science-as-sport startup, takes care of the “1% misinterpretation” part by providing a matchmaking service between the sexiest of the sexy data janitors and the organizations requiring their hard-to-find skills. It charges $300 per hour for the service, of which $200 goes to the data janitor (at least in the case of Shashi Godbole, quoted in the Technology Review article). Kaggle justifies its mark-up by delivering “the best 0.5% of the 95,988 data scientists who compete in data mining competitions,” the top of its data science league table, the ranking of data scientists based on their performance in Kaggle’s competitions, presumably representing sound interpretation and top-notch productivity.

Kaggle’s co-founder Anthony Goldbloom tells The Atlantic’s Thomas Goetz that the ranking also represents a solution to a “market failure” in assessing the skills and relevant experience of the new breed of data scientists: “Kaggle represents a new sort of labor market, one where skills have been bifurcated from credentials.” Others see this as the creation of a new, $300-per-hour guild. In “Data Scientists Don’t Scale,” ZDNet’s Andrew Brust says that “‘Data scientist’ is a title designed to be exclusive, standoffish and protective of a lucrative guild… The solution… isn’t legions of new data scientists. Instead, we need self-service tools that empower smart and tenacious business people to perform Big Data analysis themselves.”

Continue reading

Posted in Data Science Careers, Data Scientists, Statistics