How to evaluate a data scientist

Jerry Overton:

What’s commonly expected from a data scientist is a combination of subject matter expertise, mathematics, and computer science. This is a tall order and it makes sense that there would be a shortage of people who fit the description. The more knowledge you have, the better, however, I’ve found that the skillset you need to be effective, in practice, tends to be more specific and much more attainable. This approach changes both what you look for from data science and what you look for in a data scientist.

A background in computer science helps with understanding software engineering, but writing working data products requires specific techniques for writing solid data science code. Subject matter expertise is needed to pose interesting questions and interpret results, but this is often done in collaboration between the data scientist and subject matter experts (SMEs). In practice, it is much more important for data scientists to be skilled at engaging SMEs in agile experimentation. A background in mathematics and statistics is necessary to understand the details of most machine learning algorithms, but to be effective at applying those algorithms requires a more specific understanding of how to evaluate hypotheses…

We tend to judge data scientists by how much they’ve stored in their heads. We look for detailed knowledge of machine learning algorithms, a history of experiences in a particular domain, and an all-around understanding of computers. I believe it’s better, however, to judge the skill of a data scientist based on their track record of shepherding ideas through funnels of evidence and arriving at insights that are useful in the real world.


Posted in Data Science, Data Science Careers, Data Scientists, Misc

65% growth in mobile data traffic over last 12 months driven by video


Ericsson Mobility Report:

  • Video dominates data traffic: Global mobile data traffic is forecast to grow ten-fold by 2021, and video is forecast to account for 70 percent of total mobile traffic in the same year. In many networks today, YouTube accounts for up to 70 percent of all video traffic, while Netflix’s share of video traffic can reach as high as 20 percent in markets where it is available.
  • Mainland China overtakes the US as world’s largest LTE market: By the end of 2015, Mainland China will have 350 million LTE subscriptions – nearly 35 percent of the world’s total LTE subscriptions. The market is predicted to have 1.2 billion LTE subscriptions by 2021.
  • Africa becomes an increasingly connected continent: Five years ago (2010) there were 500 million mobile subscriptions across Africa; by the end of 2015 this number will double to 1 billion. Increased connectivity improves the prospect of financial inclusion for the 70 percent unbanked through mobile money services starting to take form across Africa.
Posted in Data Growth, Misc

Only 16% of IT budgets are allocated to investments in innovation and growth

While 45% of CIOs identify “innovation” and 44% point to “growth” as their organizations’ most important priorities, only 15% are investing in emerging technologies. Only 16% of IT budgets are allocated to investments in innovation and growth, with the balance spent on running day-to-day operations and incremental change.

This gap between aspirations and reality is one of the key findings of the just-published Deloitte 2015 Global CIO Survey. The study is based on interviews with 1,271 CIOs from 43 countries, with the majority of the participants working for organizations with revenues of more than $1 billion.

Figure 7 Deloitte CIO survey

“Every company today is a technology company,” says Khalid Kark, U.S. CIO research director at Deloitte Services, and the business priorities reported by CIOs are the same regardless of industry, geography, and company size.  All companies are embracing digital technologies and see them as critical for their future.

The study uncovered, however, differences in how CIOs see themselves, what is expected of them, and how they would like their career to progress.  In contrast to other surveys that focus on spending priorities, the Deloitte study is mostly about different roles CIOs play today in their organizations and the impact they would like to make in the coming years.

The CIOs were asked detailed questions along the four dimensions that frame the impact of a CIO—the organization’s priorities, competencies and strengths of the CIO, building relationships internally and externally, and technology investments. Here are some of the more interesting results of the survey:

Only 9% of CIOs say they have all the skills they need to succeed

CIOs have to be ambidextrous, mixing business strategy skills with operations management: Out of 12 leadership capabilities, CIOs selected six as the most important for success in their role—influence with internal stakeholders, communication skills, understanding strategic business priorities, talent management, technology vision and leadership, and the ability to lead complex, fast-changing environments.

CIOs think they need especially to improve their leadership skills

The CIOs were asked to select the top five competencies that a successful technology leader needs and to identify their own top five strengths. The skills with the largest gaps were the ability to influence internal stakeholders, talent management, and technology vision and leadership. Conversely, CIOs think they are strong in operations and execution, ability to run large-scale projects, and leverage with external partners, but do not consider these to be differentiating skills for successful technology leaders.

Figure 3 Deloitte CIO survey

Strong relationships with other executives do not necessarily mean strong influence on the business

48% of CIOs report “strong relationships” with their CEO and interaction at least once a week, and an additional 17% report daily interactions.  But only 42% of the CIOs were co-leaders in business strategy decisions and only 19% in M&A activities.

No common definition for “digital”

Digital (71% of respondents) and analytics and business intelligence (77%) are expected to have the most impact on the business over the next two years. But when asked to describe their digital initiatives further, the answers ranged from analyzing customer data and developing new products and services to improving customer experience and enabling the workforce to better collaborate or be more productive. The lack of a common definition, says the study’s report, “is often confusing for business leaders and can lead to misunderstandings and conflicting goals.”

Figure 8 Deloitte CIO survey

Analyzing the answers to the questions about CIO performance and impact, Deloitte uncovered three distinct CIO “archetypes,” describing how CIOs are delivering value today—and how they are preparing for what comes next:

Trusted Operators keep the lights on. They focus on cost, operational efficiency, and performance reliability. They also provide enabling technologies, support business transformation efforts, and align to business strategy. Their core competency is to drive down costs by rationalizing, renewing, and consolidating technology, and they focus on internal customers. 42% of the CIOs surveyed fall into this category.

Change Instigators drive transformation. They take the lead on technology-enabled business transformation and change initiatives. They look outside the organization for partners and are focused on the end-customer of the business.  They are 21% more likely than other CIOs to call technology vision a strength. 22% of the CIOs surveyed fall into this category.

Business Co-Creators perform a balancing act, handling both business strategy and efficient operations. They operate across multiple dimensions of creating and delivering value, and were 24% more likely to cite ability to influence internal stakeholders as a top-five strength. They invest in emerging technologies as a way to drive new sources of revenue or to transform the way they deliver value to customers. 36% of the CIOs surveyed fall into this category.

“Change Instigators try to bring enhancements to existing business models, while Business Co-Creators often have the mandate to find new business opportunities and define new business models,” says Deloitte’s Kark. It’s the Business Co-Creators that tend to invest more in emerging technologies and co-create new business models with internal business partners.

Kark thinks about the three archetypes as a self-diagnostic tool for CIOs to examine where they are and how they fit the needs of their organization. It can also help identify shifting business needs and with them, an emerging shift in how the business defines the CIO role.

Many of the CIOs surveyed indeed see, or would like to see, a transformation in their roles in the near future. The proportion of Change Instigators is expected to remain the same at 22%. A big shift will occur, however, in the other two roles: the proportion of Trusted Operators will go down from 42% to 12% and the proportion of Business Co-Creators will expand from 36% to 66%.

Almost a third of the CIOs surveyed aspire to shift their role into a business leadership position, working with other business executives to define and pursue new business opportunities, while maintaining their reputation as top-notch IT operators.  But, says Kark, “if they don’t build the right skill set, if they don’t build the relationships, it’s going to be hard for them to make that transition. CIOs have to drive technology into the core of the business and if they are not able to do it, someone else will.”

The good news is that for those making the transition, career opportunities abound. “Over the next 3 years, more than half of all businesses will need CIOs of the Business Co-Creators type,” predicts Kark.

Originally published on Forbes.com


Posted in digital transformation, Misc

Artificial Intelligence Machines to Replace Physicians and Transform Healthcare

Dilbert_WatsonHealth

Posted in Misc

10 New Big Data Observations from Tom Davenport

[youtube https://www.youtube.com/watch?v=DdHhD4n3iFE?rel=0]

The term “big data” has become nearly ubiquitous. Indeed, it seems that every day we hear new reports of how some company is using big data and sophisticated analytics to become increasingly competitive. The topic first began to take off in late 2010 (at least according to search results from Google Trends) and, now that we’re approaching a five-year anniversary, perhaps it’s a good time to take a step back and reflect on this major approach to doing business. This article describes 10 of my observations about big data.

See also Tom Davenport’s Guide to Big Data

Posted in Big Data Analytics, Misc

Google’s RankBrain Outranks the Best Brains in the Industry

Bloomberg recently broke the news that Google is “turning its lucrative Web search over to AI machines.” Google revealed to the reporter that for the past few months, a very large fraction of the millions of search queries Google responds to every second have been “interpreted by an artificial intelligence system, nicknamed RankBrain.”

The company that has tried hard to automate its mission to organize the world’s information was happy to report that its machines have again triumphed over humans. When Google search engineers “were asked to eyeball some pages and guess which they thought Google’s search engine technology would rank on top,” RankBrain had an 80% success rate compared to “the humans [who] guessed correctly 70 percent of the time.”

There you have it. Google’s AI machine RankBrain, after only a few months on the job, already outranks the best brains in the industry, the elite engineers that Google typically hires.

Or maybe not. Is RankBrain really “smarter than your average engineer” and already “living up to its AI hype,” as the Bloomberg article informs us, or is this all just, well, hype?

Desperate to find out how far our future machine overlords are already ahead of the best and the brightest (certainly not “average”), I asked Google to shed more light on the test, e.g., how do they determine the “success rate”?

Here’s the answer I got from a Google spokesperson:

“That test was fairly informal, but it was some of our top search engineers looking at search queries and potential search results and guessing which would be favored by users. (We don’t have more detail to share on how that’s determined; our evaluations are a pretty complex process).”

I guess both RankBrain and Google search engineers were given possible search results to a given query and RankBrain outperformed humans in guessing which are the “better” results, according to some undisclosed criteria.

I don’t know about you, but my TinyBrain is still confused. Wouldn’t Google’s search engine, with or without RankBrain, outperform any human being, including the smartest people on earth, in terms of “guessing” which search results “would be favored by users”? Haven’t they been mining the entire corpus of human knowledge for more than fifteen years and, by definition, produced a search engine that “understands” relevance better than any individual human being?

The key to the competition, I guess, is that the “search queries” used in it were not just any search queries but complex queries containing words that have different meanings in different contexts. These are the kind of queries that will stump most human beings, and it’s quite surprising that Google engineers scored 70% on search queries that presumably require deep domain knowledge in all human endeavors, in addition to search expertise.

The only example of a complex query given in the Bloomberg article is “What’s the title of the consumer at the highest level of a food chain?” The word “consumer” in this context is a scientific term for something that consumes food and the label (the “title”) at the highest level of the food chain is “predator.”

This explanation comes from search guru Danny Sullivan, who has come to the rescue of perplexed humans like me, providing a detailed RankBrain FAQ, up to the limits imposed by Google’s legitimate reluctance to fully share its secrets. Sullivan: “From emailing with Google, I gather RankBrain is mainly used as a way to interpret the searches that people submit to find pages that might not have the exact words that were searched for.”

Sullivan points out that a lot of work done by humans is behind Google’s outstanding search results (e.g., creating a synonym list or a database with connections between “entities”—places, people, ideas, objects, etc.). But Google needs now to respond to some 450 million new queries per day, queries that have never been entered before into its search engine.

RankBrain “can see patterns between seemingly unconnected complex searches to understand how they’re actually similar to each other,” writes Sullivan. In addition, “RankBrain might be able to better summarize what a page is about than Google’s existing systems have done.”

Finding out the “unknown unknowns,” discovering previously unknown (to humans) links between words and concepts is the marriage of search technology with the hottest trend in big data analysis—deep learning. The real news about RankBrain is that it is the first time Google applied deep learning, the latest incarnation of “neural networks” and a specific type of machine learning, to its most prized asset—its search engine.

Google has been doing machine learning since its inception (the first published paper listed in the AI and machine learning section of its research page is from 2001 and, to use just one example, Gmail is so good at detecting spam because of machine learning). But Google hadn’t applied machine learning to search. That there has been internal opposition to doing so we learn from a summary of a 2008 conversation between Anand Rajaraman and Peter Norvig, co-author of the most popular AI textbook and leader of Google search R&D since 2001. Here’s the most relevant excerpt:

The big surprise is that Google still uses the manually-crafted formula for its search results. They haven’t cut over to the machine learned model yet. Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google’s search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.

This was written three years after Microsoft had applied machine learning to its search technology. But now Google has gotten over its hubris. 450 million unforeseen query types per day are probably too much for “manually crafted models” and Google has decided that a “deep learning” system such as RankBrain provides good enough protection against “catastrophic errors.”

Deep learning has taken the computer science community by storm since it was used to win an image recognition competition in 2012, performing better than traditional approaches to teaching computers to identify images.

With deep learning, the computer “learns” by putting together the pieces of a puzzle (e.g., an image of a cat), moving up a hierarchy created by the computer scientist, from simple concepts to more complex ones. Decades ago this idea got the unfortunate name “neural networks” under the misguided (and hype-generating) notion that the computer networks were “mimicking the brain” (what they were mimicking were speculations about how neurons work in the human brain). The hype did not produce the promised results, but starting about ten years ago, with the availability of greater computing power, much larger sets of data, and more sophisticated algorithms, neural networks have been reincarnated as deep learning.

In 2012, Google engineers made their first deep learning splash when they announced that Google computers have detected the image of a cat after processing zillions of unlabeled still frames from YouTube videos.

In their post on this deep learning experiment, Jeff Dean, a Google Fellow, and Andrew Ng, a Stanford professor on leave at Google at the time, wrote:

“And this isn’t just about images—we’re actively working with other groups within Google on applying this artificial neural network approach to other areas such as speech recognition and natural language modeling.”

And in 2013, Google engineers announced an open source toolkit called word2vec “that aims to learn the meaning behind words.” They wrote: “Now we apply neural networks to understanding words by having them ‘read’ vast quantities of text on the web. We’re scaling this approach to datasets thousands of times larger than what has been possible before, and we’ve seen a dramatic improvement of performance — but we think it could be even better.”

2013 was also the year Google hired Geoffrey Hinton of the University of Toronto, “widely known as the godfather of neural networks,” according to Wired.  But the two other widely known members of the (self-labeled) “deep learning conspiracy” went to Google’s competitors: Yann LeCun to Facebook (leading a new AI research lab) and Yoshua Bengio to IBM (teaching Watson a few deep learning tricks).

Then there’s Apple, Yelp, Twitter and others—all of Google’s competitors are rushing to adopt deep learning.

This creates serious competition for talent, for all the graduate students who three or four years ago switched the topic of their dissertations to something related to deep learning and all the others who have recently joined this “computers can learn on their own” movement. Hence the need to tell the world via Bloomberg that Google is in the game and for Google’s CEO to insist on the company’s latest earnings call that “machine learning is a core transformative way by which we are rethinking everything we are doing.”

But beyond PR and prestige, future profits could be the most important incentive for Google to add deep learning to its search technology. It’s not only reducing costs by reducing the need to rely on humans and their “manually crafted models.” It’s also search quality, the reason Google has become the dominant search engine and a verb.

A Search Engine Land columnist, Kristine Schachinger, sheds further light on RankBrain in the context of search quality and Google’s shift in 2013 (the “Hummingbird” overhaul of their search algorithms) from providing search results based on words (strings of letters) to search results based on its knowledge of “things” (entities, facts):

Google has become really excellent at telling you all about the weather, the movie, the restaurant and what the score of last night’s game happened to be. It can give you definitions and related terms and even act like a digital encyclopedia. It is great at pulling back data points based around entity understanding.

Therein lies the rub. Things Google returns well are known and have known, mapped or inferred relationships. However, if the item is not easily mapped or the items are not mapped to each other, Google has difficulty in understanding the query…

While Google has been experimenting with RankBrain, they have lost market share — not a lot, but still, their US numbers are down. In fact, Google has lost approximately three percent of share since Hummingbird launched, so it seems these results were not received as more relevant or improved (and in some cases, you could say they are worse)…

Google might have to decide whether it is an answer engine or a search engine, or maybe it will separate these and do both.

I will go even further and speculate that Google is seeing the end of search as we know it (and they perfected), the possibility that in the future we will not enter search queries into search boxes but will rely on “knowledge navigators” (to use the term Apple coined in 1986), going beyond the current answer engines to communicating with us, providing relevant information and news, and anticipating our needs by linking things in our past, present, and future.

Now, is it possible that with Facebook’s investment in AI and deep learning, it will be the first to provide us with a futuristic knowledge navigator? What will happen to Google’s advertising revenues if the social network will consist not only of people but also deep learning machines?

Given its past performance and the competitive people running it (and its parent company), it’s obvious that RankBrain is just one of the many investments Google is making in “disrupting itself before others do” (I’m pretty sure that’s how they talk about it). Google will continue to provide outstanding, free, advertising-supported service to its users, no matter what form this service will take in the future.

Or maybe not. Being a devoted and admiring Google search user, I was a bit skeptical when I read Schachinger’s words quoted above that Google’s search results “were not received as more relevant or improved (and in some cases, you could say they are worse).” But one very surprising search result I recently got from Google led me to think that, indeed, sometimes when you invest in the future, you sacrifice the present.

I Googled the address “75 Amherst Street, Cambridge, MA 02139.” What I got (a number of times, over three days) at the top of the search results was a map of 75 Amherst Alley, Cambridge, MA 02139.

There is such a place, but I have never heard of it or been there. What’s more, 75 Amherst Street is the home of MIT’s Media Lab, so this is not only a very simple query but also one that probably has been entered into Google numerous times (the Media Lab’s contact page appears as the second result, just under the erroneous map).

Time to invest in more humans working diligently on “manually crafted models”?

Posted in AI, Machine Learning

Cloud traffic will grow at an annual rate of 33% over the next 5 years, Cisco predicts

The new version of the Cisco Cloud Index computes the rapid expansion of today’s stampede to the cloud. “We have never seen anything like this in terms of speed of customer adoption,” Oracle Co-CEO Mark Hurd said recently, describing how his corporate customers have enthusiastically embraced the cloud.

One of them, General Electric, has moved, in just the last 18 months, 10% more of its computing load into the cloud, and expects to run 70% of its applications in the cloud by 2020. In their latest quarterly financial reports, Amazon reported that its cloud business has surged 79% year-over-year and Microsoft announced that its cloud business has “more than doubled.”

Here are the highlights of Cisco’s ongoing study of the growth of global data center and cloud-based data traffic.

Almost all of the work of IT will be done in cloud data centers

Based on its hands-on knowledge of the movement of data over global computer networks, Cisco predicts that cloud traffic will grow at an annual rate of 33% over the next 5 years, quadrupling from 2.1 zettabytes (2.1 trillion gigabytes) in 2014 to 8.6 zettabytes by the end of 2019. 86% of workloads will be processed by cloud data centers in 2019  and only 14% will be processed by traditional data centers.
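As a rough check on how the growth rate and the "quadrupling" fit together, the compound annual growth rate can be recomputed from the two traffic figures quoted above. This is a minimal sketch, not Cisco's methodology: the function name `cagr` and the variable names are illustrative, and the only inputs are the 2.1 and 8.6 zettabyte figures and the five-year span from the text.

```python
def cagr(start, end, years):
    """Compound annual growth rate between a start and an end value."""
    return (end / start) ** (1 / years) - 1

# Figures quoted in the text: cloud traffic in zettabytes, 2014 and 2019.
start_zb = 2.1
end_zb = 8.6
years = 5

rate = cagr(start_zb, end_zb, years)
multiple = end_zb / start_zb

print(f"Implied annual growth rate: {rate:.1%}")
print(f"Growth multiple over {years} years: {multiple:.1f}x")
```

The implied rate rounds to the 33% Cisco reports, and the multiple is a little above four, which is where "quadrupling" comes from.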

Cisco Figure 3 DC and Cloud Growth

Source: Cisco Global Cloud Index, 2014–2019

Cloud traffic is expected to account for 83% of total data center traffic by 2019. Cloud traffic is a subset of data center traffic and is generated by cloud services accessible through the Internet from scalable, virtualized cloud data centers. Total data center traffic, which Cisco projects will reach 10.4 zettabytes by the end of 2019, is comprised of all traffic traversing within and between data centers as well as to end users.

10.4 trillion gigabytes is the equivalent of 144 trillion hours of streaming music or 6.8 trillion high-definition (HD) movies viewed online. Ones and zeros are eating the world and the companies providing consumers with digital entertainment and other services have been at the forefront of the migration to the cloud. Indeed, The Wall Street Journal has reported recently that Netflix has shut down the last of its data centers, moving the last piece of its IT infrastructure to the public cloud.

The public cloud will grow faster than the private cloud

Source: Cisco Global Cloud Index, 2014–2019

While overall cloud workloads will grow at a CAGR of 27% from 2014 to 2019, the public cloud workloads are going to grow at 44% CAGR over that period, and private cloud (where cloud services are  delivered to corporate users by their IT department) workloads will grow at a slower pace of 16%. By 2019, there will be more workloads (56%) in the public cloud than in private clouds (44%).

New sources of data, especially the Internet of Things, will keep the clouds very busy

Source: Cisco Global Cloud Index, 2014–2019

The total volume of stored data on client devices and in data centers will more than double to reach 3.5 zettabytes by 2019. Most stored data resides in client devices today and will continue to do so over the next 5 years, but more data will move to the data center over time, representing 18% of all data in 2019, up from 12% in 2014.

In addition to larger volumes of stored data, the stored data will be coming from a wider range of devices by 2019. Currently, 73% of data stored on client devices resides on PCs. By 2019, stored data on PCs will go down to 49%, with a greater portion of data on smartphones, tablets, and machine-to-machine (M2M) modules. Stored data associated with M2M will grow at a faster rate than any other device category at an 89% CAGR.

A broad range of Internet of Things (IoT) applications are generating large volumes of data that could reach, Cisco estimates, 507.5 zettabytes annually by 2019. That’s 49 times greater than the projected data center traffic for 2019 (10.4 zettabytes). Today, only a small portion of this content is stored in data centers, but that could change as big data analytics tools are applied to greater volumes of the data collected and transmitted by IoT applications.
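The ratios quoted in this study are easy to reproduce from the headline numbers. The sketch below uses only figures that appear in the text (507.5, 10.4, and 8.6 zettabytes); the variable names are illustrative and nothing here comes from Cisco's own model.

```python
# Figures quoted in the text, all in zettabytes for 2019.
iot_data_zb = 507.5        # data generated by IoT applications
dc_traffic_zb = 10.4       # total data center traffic
cloud_traffic_zb = 8.6     # cloud traffic, a subset of data center traffic

# How many times larger IoT-generated data is than data center traffic.
iot_to_dc_ratio = iot_data_zb / dc_traffic_zb

# Cloud traffic as a share of total data center traffic.
cloud_share = cloud_traffic_zb / dc_traffic_zb

print(f"IoT data vs. data center traffic: {iot_to_dc_ratio:.1f}x")
print(f"Cloud share of data center traffic: {cloud_share:.1%}")
```

Both results round to the figures in the text: roughly 49 times, and roughly 83%.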

The figure below maps several M2M applications for their frequency of network communications, average traffic per connection, and data analytic needs. Applications such as smart metering can benefit from real-time analytics of aggregated data that can optimize the usage of resources such as electricity, gas, and water. On the other hand, applications such as emergency services and environment and public safety can be much enhanced through distributed real-time analytics that can help make real-time decisions that affect entire communities. Although some other applications such as manufacturing and processing can have potential efficiencies from real-time analytics, their need is not very imminent.

Source: Cisco Global Cloud Index, 2014–2019

More consumers will keep their data in the cloud

Cisco estimates that by 2019, 55% (2 billion) of the Internet-connected consumer population will use personal cloud storage, up from 42% (1.1 billion users) in 2014.

Source: Cisco Global Cloud Index, 2014–2019

Global consumer cloud storage traffic will grow from 14 exabytes (14 billion gigabytes) annually in 2014 to 39 exabytes by 2019 at a 23% CAGR. This growth translates to per-user traffic of 1.6 gigabytes per month by 2019, compared to 992 megabytes per month in 2014.
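The per-user figure can be approximated from the annual traffic and the user counts given earlier. This is a back-of-the-envelope sketch: the helper `monthly_gb_per_user` is a hypothetical name, and it assumes the decimal conversion the text itself uses (1 exabyte = 1 billion gigabytes). The user counts in the text are rounded, so exact agreement with Cisco's per-user numbers should not be expected, particularly for 2014.

```python
def monthly_gb_per_user(annual_exabytes, users_billions):
    """Average monthly traffic per user in gigabytes (1 EB = 1 billion GB)."""
    gigabytes = annual_exabytes * 1e9
    users = users_billions * 1e9
    return gigabytes / users / 12

# Figures quoted in the text.
per_user_2019 = monthly_gb_per_user(39, 2.0)   # 39 EB, 2 billion users
per_user_2014 = monthly_gb_per_user(14, 1.1)   # 14 EB, 1.1 billion users

# Annual growth rate implied by 14 EB growing to 39 EB over five years.
implied_cagr = (39 / 14) ** (1 / 5) - 1

print(f"2019: {per_user_2019:.2f} GB per user per month")
print(f"2014: {per_user_2014:.2f} GB per user per month")
print(f"Implied traffic growth rate: {implied_cagr:.1%}")
```

The 2019 result lands on the 1.6 gigabytes quoted above and the growth rate rounds to 23%. The 2014 result comes out slightly above the reported 992 megabytes, which is consistent with the rounded inputs.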

Source: Cisco Global Cloud Index, 2014–2019

Ones and zeros are eating the world and today we got fresh insights into how much, how fast, and how their movement changes the way IT services are delivered to businesses and consumers.  For more data and the study’s methodology, go to the Cisco Global Cloud Index webpage.

Originally published on Forbes.com

Posted in Data Growth

Google Knows Everything (#IoT)

Iot_Marketing

Posted in Internet of Things

Who Does What in Data Science (Infographic)

DataScientists_roles

Posted in Data Scientists

Hunting Unicorns and Rapidly Becoming the Master of the Startup Universe

Anand Sanwal, co-founder and CEO, CB Insights

CB Insights is riding the Unicorn Boom, doubling its headcount since the beginning of the year, propelled by its unique database of companies and investors and everything there is to know about them. The frequency of “according to CB Insights” appearing in a wide range of media outlets has gone up dramatically this year. The fast-growing subscriber list for its engaging newsletter, bursting with visually appealing data nuggets and topical analysis, is now at more than 100,000. Recently, it supplied the New York Times with a list of the “50 Companies That May Be the Next Start-Up Unicorns.”

This master of the startup universe got its start as Chubby Brain. “In our early days we were talking to an investment bank and they said they really liked our product but they will never buy something called Chubby Brain,” recalls Anand Sanwal, CB Insights’ co-founder and CEO. “At that moment we understood we needed to lose some of our edgy internet entrepreneur desire, given the market we were going after,” he adds.

The market they were going after consisted of everyone who needs to understand the health of private companies. While doing M&A at American Express and managing, among other things, investments in companies trying to disrupt AmEx, Sanwal found out how difficult it was to use traditional information providers such as Dow Jones and Thomson (“their products, in one word, are terrible,” he says). To find out what was going on with startups and other private companies, people were spending a lot of time manually gathering data by calling investors and VCs. Besides, the scope of this data collection was severely limited by the fact that private companies do their best to keep their financial performance private.

The answer to this need was in the explosion of publicly available data on the Web. “Better understanding private companies by using public information was the germ of the idea for CB Insights,” says Sanwal.

So a new digital business was born. CB Insights uses big data tools to automate the data collection, crawling about 100,000 sources daily, and big data algorithms to analyze the data about investors, companies, and industries. Most important, it identifies and tracks the publicly available signals that serve as good indicators of the health of private companies, e.g., hiring statistics from job boards, news and sentiment about the news, and information about new partners and customers. “I don’t think any of these [signals] is going to be independently a smoking gun,” says Sanwal. “We build this mosaic of a private company that’s instructive in understanding its health.” Having done this since 2009, CB Insights has amassed a large historical record that allows it to pinpoint which signals are strong (serving as valid indicators of a company’s success or failure) and which are weak.
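One way to picture the “mosaic” idea is as a weighted combination of several signals, none of which is decisive on its own. The sketch below is purely illustrative and is not CB Insights’ method: the `Signal` class, the `health_score` function, the signal names, and every value and weight are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    value: float   # assumed normalized to [0, 1]; 1 is the healthiest reading
    weight: float  # hypothetical strength of the signal; larger means more predictive


def health_score(signals: list[Signal]) -> float:
    """Weighted average of signal values, so no single signal decides the result."""
    total_weight = sum(s.weight for s in signals)
    if total_weight <= 0:
        raise ValueError("at least one signal must have a positive weight")
    return sum(s.value * s.weight for s in signals) / total_weight


# Made-up readings for one hypothetical private company.
example = [
    Signal("hiring_activity", 0.8, 3.0),
    Signal("news_sentiment", 0.5, 1.0),
    Signal("new_partners_and_customers", 0.6, 2.0),
]
score = health_score(example)
print(f"Illustrative health score: {score:.2f}")
```

In a setup like this, the historical record described above would be what informs the weights: a signal that has proved to be a strong indicator gets a larger weight, a weak one a smaller weight.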

Ironically for a startup that started up by providing recommendations to other entrepreneurs about the best funding sources for their startups, the founders of CB Insights did not seek angel or VC investment. Instead, they applied for a grant from the Small Business Innovation Research (SBIR) program of the National Science Foundation (NSF). The timing was right, as banks had stopped lending to small businesses after the financial crisis. “Banks think about private companies as one monolithic entity,” says Sanwal, “and when times are tough they see all small businesses as a risk. Our thesis was: can we give lenders data that will help them make better decisions?”

They got an initial $150,000 grant to prove their thesis. When they did, they received a $500,000 grant, and when they started generating revenues, an additional $500,000, for a total of $1.15 million. “I don’t think we needed the NSF money from a survival perspective,” says Sanwal, “but it let us pursue some of the more moonshot ideas.”

In addition to this unusual funding mechanism, CB Insights also stands out in this Unicorn Boom era in that it has been revenue-funded from the beginning. “We’ve been very disciplined, always making more revenues than we spend every month,” says Sanwal. That’s a lesson he learned working for Kozmo.com, one of the poster boys of the dot-com bubble, which shut down after raising about $250 million. “I saw the perils of growth at all costs,” he says.

On the flip side, Sanwal probably also saw the benefits of free publicity, generated by the media’s obsession with dot-com startups. The SBIR grants helped in marketing the company as “a National Science Foundation-backed big data company” to potential customers and employees, but CB Insights needed more than the prestige of government-backed research to reach its targeted audience.

“We had zero marketing dollars,” says Sanwal, “and unlike Dow Jones or Thomson we could not take [prospects] to a dinner or a Yankees game.” Instead, their “weapon of choice” was their excellence at Excel. They started building a “content marketing engine,” providing potential customers, and the media, with a taste of what can be done with their data and analysis, via a newsletter and on their research blog. This marketing effort has showcased their data visualization skills, knack for knowing what will be quoted in the media, and an engaging combination of far-from-suppressed edgy humor, “data geeks” passion, and maverick attitude (Sanwal signs all newsletters with “I love you” or, most recently, with “even if you never say it back, I still love you”).

The Unicorn Boom has provided a lot of opportunities for CB Insights to demonstrate their predictive analytics skills and get lots of free publicity, although hunting unicorns is a very small part of the business. But, by popular demand, Sanwal has been happy to offer an opinion in the press and at public speaking engagements regarding the perennial question: are we in a bubble? No, he says, “the mechanism that’s going to force valuations down isn’t there as the public markets are closed to private companies right now. If companies start to IPO that have no business going public, then we will start to worry. A unicorn might fail and this will generate headlines but it will not cause any systemic risk to anybody. Right now, it’s only a private market euphoria, but no doubt it’s a little crazy.” (In this presentation, Sanwal explains in more detail why there is no bubble right now.)

Sanwal says he has always wanted to be an entrepreneur: “I grew up in a family that was entrepreneurial. My father is a chemical engineer and started his own chemical manufacturing firm a long time ago. I always wanted to be my own boss.” Sanwal earned a chemical engineering degree and a finance/accounting degree at Wharton, so I asked him what his father thought about his not pursuing an engineering career. “I think he knew he is a much better engineer than I’ll ever be and that the world is a much safer place because I’m not an engineer,” Sanwal answered.

Like other successful entrepreneurs, Sanwal has a larger vision, going beyond the specific business opportunity he has spotted. By providing lenders, investors, and others with a risk assessment tool akin to a FICO score for private companies, CB Insights makes private markets work faster and speeds up decision-making. Correcting the inefficiencies he discovered in the market for information on private companies leads to smoothing the inefficiencies in a variety of economic decisions, activities, and endeavors.

That vision was behind the development of a predictive analytics platform on top of a high-quality database, serving as the foundation from which to launch a variety of applications or services targeted at specific audiences and needs. In addition to subscription-based access to its database, CB Insights has so far offered applications and tools for assessing the health of private companies and investors, mapping the links between investors and companies, tracking valuation and valuation multiples data, monitoring the health and growth potential of markets, and industry analytics.

About a month ago, the company launched CB Insights for Sales, “helping sales teams fill the top of their funnel with more prospects,” says Sanwal. It is targeted at companies selling “high-value products, $10,000 and above,” and corrects yet another inefficiency: the business-to-business selling process, which is “hopelessly antiquated.” Salespeople need not only new leads but also ways to nurture their prospects. CB Insights’ database, which Sanwal argues is a competitive differentiator in the crowded sales analytics market, alerts them to news about a prospect, which provides them with a reason to call. A company signing up for CB Insights for Sales uploads a list of its existing clients, which helps the application provide a similar list of companies to target. This is a big step for CB Insights towards customizing its database for the needs of a specific customer.

Other recent and potential applications include recommending the likely acquirers of a private company, identifying the industries and markets that are hot, indicating for accounts receivable departments when they should tighten up credit terms for specific companies, and identifying for recruiters companies that are not doing too well so they can poach their talent. The long-term goal is to provide “a predictive analytics API that other people can pull into their own use cases and platforms,” says Sanwal.

CB Insights aims to be “the Bloomberg for private companies,” Sanwal tells his public audiences. But it’s more than that. “Our mantra internally is that probability trumps punditry,” he says. “We want to take on all of those people who make bold prognostications of where the world is going but they completely pull it out of [thin air]. We want to use data to inform the conversation about what’s next.”

Update: On November 9, 2015, CB Insights announced it has raised a $10 million Series A and provided details regarding its business metrics.

Originally published on Forbes.com

Posted in Misc