Chris Pouliot, the Director of Analytics and Algorithms at Netflix: “…my team does not only do personalization for movies, but we also deal with content demand prediction: helping our buyer down in Beverly Hills figure out how much we pay for a piece of content. The personalization recommendations for helping users find good movies and TV shows. Marketing analytics: how do we optimize our marketing spend. Streaming platform: how do we optimize the user experience once I press play. There’s a wide range of data, so there’s a lot of diversity. We have a lot of scale, a lot of challenging problems. The question then is, how do we attract great data scientists who can just see this as a playground, a sandbox of really exciting things. Challenging problems, challenging data, great tools, and then just the ability to have fun and create great products.”
[youtube http://www.youtube.com/watch?v=pJd3PKm9XUk]
Big Data Analytics and Data Science at Netflix (Video)
The Data Science Interview: Yun Xiong, Fudan University
The Goal of Data Science is to Study the Phenomena and Laws of Datanature
Yun Xiong is an Associate Professor of Computer Science and the Associate Director of the Center for Data Science and Dataology at Fudan University, Shanghai, China. She received her Ph.D. in Computer and Software Theory from Fudan University in 2008. Her research interests include dataology and data science, data mining, big data analysis, and developing effective and efficient data analysis techniques for various applications, including finance, economics, insurance, bioinformatics, and sociology. The following is an edited version of our recent email exchange.
How has data science developed in China?
Big Data Observations: The Science of Asking Questions
“I am a firm believer that without speculation there is no good and original observation”—Charles Darwin
“It is the theory that determines what we can observe”—Albert Einstein
“I suspect, however, like as it is happening in many academic fields, the NSA is sorely tempted by all the data at its fingertips and is adjusting its methods to the data rather than to its research questions. That’s called looking for your keys under the light”—Zeynep Tufekci
“Large open-access data sets offer unprecedented opportunities for scientific discovery—the current global collapse of bee and frog populations are classic examples. However, we must resist the temptation to do science backwards by posing questions after, rather than before, data analysis. A scant understanding of the context in which data sets were collected can lead to poorly framed questions and results, and to conclusions that are plain wrong. Scientists intending to make use of large composite data sets need to work closely with those responsible for gathering the data. Standard scientific principles and practice then demand that they first frame the important questions, then design and execute the data analyses needed to answer them”—David B. Lindenmayer and Gene E. Likens
“The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries… I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study… I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism”—Pete Warden
The Big Data Debate: Correlation vs. Causation
In the first quarter of 2013, the stock of big data has experienced sudden declines followed by sporadic bouts of enthusiasm. The volatility—a new big data “V”—continues this month and Ted Cuzzillo summed up the recent negative sentiment in “Big data, big hype, big danger” on SmartDataCollective:
“A remarkable thing happened in Big Data last week. One of Big Data’s best friends poked fun at one of its cornerstones: the Three V’s. The well-networked and alert observer Shawn Rogers, vice president of research at Enterprise Management Associates, tweeted his eight V’s: ‘…Vast, Volumes of Vigorously, Verified, Vexingly Variable Verbose yet Valuable Visualized high Velocity Data.’ He was quick to explain to me that this is no comment on Gartner analyst Doug Laney’s three-V definition. Shawn’s just tired of people getting stuck on V’s.”
Indeed, all the people who “got stuck” on Laney’s “definition” conveniently forgot that he first used the “three Vs” to describe data management challenges in 2001. Yes, 2001. If big data is a “revolution,” how come its widely used “definition” is based on a dozen-year-old analyst note?
Coursera: Data Science in Education
Monica Rogati in The LinkedIn Blog on Coursera, which offers free online courses: “Coursera’s approach to feedback and assessment is a very interesting application of data science. Tests are either computer-graded or peer-graded — the latter following industry-tested crowdsourcing best practices (clear instructions, gold standards, training, qualification tasks, assessor agreement monitoring, etc.). Peer grading isn’t just treated as a means for scaling — it is part of the learning process. One of Daphne’s charts showed that students significantly improved on subsequent tests after peer- and self-grading. Interestingly, the better students learned even more from self-grading than from grading others…”
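The post doesn’t spell out the mechanics, but the gold-standard calibration and grade aggregation it alludes to can be sketched in a few lines. The function names and the median-based aggregation below are illustrative assumptions, not Coursera’s actual algorithm.

```python
# Illustrative sketch of peer-grade aggregation with gold-standard calibration.
# The weighting and names are assumptions, not Coursera's actual method.
from statistics import median

def grader_error(grader_scores, gold_scores):
    """Mean absolute error of a grader on instructor-graded (gold) items."""
    diffs = [abs(grader_scores[item] - gold)
             for item, gold in gold_scores.items() if item in grader_scores]
    return sum(diffs) / len(diffs) if diffs else None

def aggregate_peer_grades(grades):
    """Combine several peer grades for one submission; the median resists outliers."""
    return median(grades)

# A grader is calibrated on two gold items, then three peers grade one essay (0-10 scale).
print(grader_error({"q1": 7, "q2": 5}, {"q1": 8, "q2": 5}))  # 0.5
print(aggregate_peer_grades([7, 8, 3]))                      # 7
```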
The Web at 25: Tim Berners-Lee on the Web of Data
In 2009, on the occasion of the 20th anniversary of the Web, Jason Rubin and I talked to Tim Berners-Lee about his invention and its future, the Semantic Web, which he described as “the Web of data.”
Twenty years on, the World Wide Web has proven itself both ubiquitous and indispensable. Did you anticipate it would reach this status, and in this time frame?
Tim Berners-Lee: I think while it’s very tempting for us to look at the Web and say, “Well, here it is, and this is what it is,” it has, of course, been constantly growing and changing—and it will continue to do so. So to think of this as a static “This is how the Web is” sort of thing is, I think, unwise. In fact, it’s changed in the last few years faster than it changed before, and it’s crazy for us to imagine this acceleration will suddenly stop. So yes, the 20-year point goes by in a flash, but we should realize that we are constantly changing it, and it’s very important that we do so.
I believe that 20 years from now, people will look back at where we are today as being a time when the Web of documents was fairly well established, such that if someone wanted to find a document, there’s a pretty good chance it could be found on the Web. The Web of data, though, which we call the Semantic Web, would be seen as just starting to take off. We have the standards but still just a small community of true believers who recognize the value of putting data on the Web for people to share and mash up and use at will. And there are other aspects of the online world that are still fairly “pre-Web.” Social networking sites, for example, are still siloed; you can’t share your information from one site with a contact on another site. Hopefully, in a few years’ time, we’ll see that quite large category of social information truly Web-ized, rather than being held in individual lockdown applications.
You mentioned a “small community” of people who see the value of the Semantic Web. Is that a repeat occurrence of the struggle 20 years ago to get people to understand the scope and potential impact of the World Wide Web?
It’s remarkably similar. It’s very funny. You’d think that once people had seen the effect of Web-izing documents to produce the World Wide Web, doing likewise with their data would seem the next logical step. But for one thing, the Web was a paradigm shift. A paradigm shift is when you don’t have in your vocabulary the concepts and the ideas with which to understand the new world. Today, the idea that a web link could connect to a document that originates anywhere on the planet is completely second nature, but back then it took a very strong imagination for somebody to understand it.
Now, with data, almost all the data you come across is locked in a database. The idea that you could access and combine data anywhere in the world and immediately make it part of your spreadsheet is another paradigm shift. It’s difficult to get people to buy into it. But in the same way as before, those who do get it become tremendously fired up. Once somebody has realized what it would be like to have linked data across the world, then they become very enthusiastic, and so we now have this corps of people in many countries all working together to make it happen.
Do you see the Semantic Web as enabling greater collaboration between and among parties, as opposed to the point-to-point or point-to-many communication that seems more prevalent in the current Web?
The original web browser was a browser-editor, and it was supposed to be a collaborative tool, but it only ran on the NeXT workstation on which it was developed. However, the idea that the Web should be a collaborative place has always been a very important goal for me. I think harnessing the creative energy of people is really important. When you get people who are trying to solve big problems like curing AIDS, fighting cancer, and understanding Alzheimer’s disease, there are a huge number of people involved, all of them with half-formed ideas in their minds. How do we get them communicating so that the half of an idea in one person’s head will connect with the half of an idea in somebody else’s head, and they’ll come up with the solution?
That’s been a goal for the Web of documents, and it’s certainly a goal for the Web of data, where different pieces of data can be used for all kinds of different things. For example, a genomicist may suspect that a particular protein is connected to a certain syndrome in a cell line, search for and find data relating to each area, and then suddenly put together the different strains of data and discover something new. And this is something he can do with the owners of the respective pieces of data, who might never have found each other or known that their data was connected. So the Web of data will absolutely lead to greater collaboration.
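To make that genomics scenario concrete, here is a minimal sketch using rdflib, a Python library for RDF: two datasets published independently but reusing the same URI for the protein can be loaded into one graph and queried together. The URIs, property names, and data are invented for illustration.

```python
# A minimal linked-data sketch with rdflib; all URIs and properties are made up.
from rdflib import Graph

# Two tiny datasets, published independently, that share a URI for the same protein.
lab_a = """
@prefix ex: <http://example.org/> .
ex:proteinX ex:associatedWith ex:syndromeY .
"""
lab_b = """
@prefix ex: <http://example.org/> .
ex:proteinX ex:observedIn ex:cellLineZ .
"""

g = Graph()
g.parse(data=lab_a, format="turtle")
g.parse(data=lab_b, format="turtle")   # merging is just loading into the same graph

# Because both datasets use the same URI, a single query spans them.
q = """
SELECT ?syndrome ?cellline WHERE {
  <http://example.org/proteinX> <http://example.org/associatedWith> ?syndrome ;
                                <http://example.org/observedIn>     ?cellline .
}
"""
for row in g.query(q):
    print(row.syndrome, row.cellline)
```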
Is your vision of the Semantic Web one in which data is freely available, or are there access rights attached to it?
A lot of information is already public, so one of the simple things to do in building the new Web of data is to start with that information. And recently, I’ve been working with both the U.K. government and the U.S. government in trying not only to get more information on the Web, but also to make it linked data. But it’s also very important that systems are aware of the social aspects of data. And it’s not just access control, because an authorized user can still use the right data for the wrong purpose. So we need to focus on what the purposes are for accessing different kinds of data, and for that we’ve been looking at accountable systems.
Accountable systems are aware of the appropriate use of data, and they allow you to make sure that certain kinds of information that you are comfortable sharing with people in a social context, for example, are not able to be accessed and considered by people looking to hire you. For example, I have a GPS trail that I took on vacation. Certainly, I want to give it to my friends and my family, but I don’t necessarily wish to license people I don’t know who are curious about me and my work and let them see where I’ve been. Companies may want to do the same thing. They might say, “We’re going to give you access to certain product information because you’re part of our supply chain and you can use it to fine-tune your manufacturing schedule to meet our demand. However, we do not license you to use it to give to our competition to modify their pricing.”
You need to be able to ask the system to show you just the data that you can use for a given task, because how you wish to use it determines whether you can use it. So we need systems for recording what the appropriate use of data is, and we need systems for helping people use data in an appropriate way so they can meet an ethical standard.
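As a toy illustration of that distinction between access control and accountability, the sketch below (policy vocabulary and names entirely invented) makes the decision depend on the stated purpose, not just on who is asking.

```python
# Hypothetical sketch of a purpose-aware ("accountable") data check.
# Dataset names and purposes are invented for illustration.
from dataclasses import dataclass

@dataclass
class UsagePolicy:
    dataset: str
    allowed_purposes: set

POLICIES = [
    UsagePolicy("gps_trail", {"share_with_friends"}),
    UsagePolicy("product_specs", {"supply_chain_planning"}),
]

def may_use(dataset: str, purpose: str) -> bool:
    """True only if the stated purpose is licensed for this dataset."""
    return any(p.dataset == dataset and purpose in p.allowed_purposes
               for p in POLICIES)

print(may_use("gps_trail", "share_with_friends"))      # True
print(may_use("product_specs", "competitor_pricing"))  # False: right data, wrong purpose
```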
Ultimately, what is one of the most significant things the Semantic Web will enable?
One thing I think we’ll be able to do is to write intelligent programs that run across the Web of data looking for patterns when something went wrong—like when a company failed, or when a product turned out to be dangerous, or when an ecological catastrophe happened. We can then identify patterns in a broad range of data types that resulted in something serious happening, and that will allow us to identify when these patterns recur, and we’ll be better able to prepare for or prevent the situation.
I think when we have a lot of data available on the Web about the world, including social data, ecological data, meteorological data, and financial data, we’ll be able to make much better models. It’s been quite evident over the last year, for example, that we have a really bad grasp of the financial system. Part of the reason for that might be that we have insufficient data from which to draw conclusions, or that the experts are too selective in which data they use. The more data we have, the more accurate our models will be.
After 20 years, what about the Web—either its current or future capabilities—excites you the most?
One of the things that gets me most excited is mash-ups, where there’s one market of people providing data and a second layer of people mashing up the data, picking from a rich variety of data sources to create a useful new application or service. A classic example of a mash-up is when I find a seminar I want to go to, and the web page has information about the sponsor, the presenter, the topic, and the logistics. I have to write all that down on the back of an envelope and then go and put it in my address book; I have to put it in my calendar; I have to enter the address in my GPS—basically, I have to copy this information into every device I use to manage my life, which is inefficient and time-consuming. This is because there is no common format for this data to become integrated into my devices.
Now, the vision of the Semantic Web is that the seminar’s web page points at data about the event. So I just tell my computer I’m going to be attending that seminar and then, automatically, there is a calendar that shows the things I’m attending. And automatically, an address book appears that I define as containing the people who have given seminars I’ve attended within the last six months, with a link to the presenter’s public profile. And automatically, my PDA starts pointing towards where I need to be at an appropriate time to get me there. All I need to do is say, “I’m going to that seminar,” and then the rest should follow.
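A small sketch of what that might look like in practice, assuming the event page exposes schema.org-style structured data (the field names and values below are illustrative): each “device” extracts only the piece it needs, so nothing has to be copied by hand.

```python
# Illustrative event record, roughly in the shape of a schema.org Event.
# Field names and values are made up; the point is one shared, machine-readable source.
event = {
    "@type": "Event",
    "name": "Linked Data in Practice",
    "startDate": "2013-04-18T16:00",
    "location": {"name": "Stata Center", "geo": {"lat": 42.3617, "lon": -71.0906}},
    "performer": {"name": "A. Presenter", "url": "http://example.org/presenter"},
}

def add_to_calendar(ev):
    """The calendar cares only about when and what."""
    return ev["startDate"], ev["name"]

def add_to_address_book(ev):
    """The address book cares only about who."""
    return ev["performer"]["name"], ev["performer"]["url"]

def send_to_gps(ev):
    """The GPS cares only about where."""
    geo = ev["location"]["geo"]
    return geo["lat"], geo["lon"]

print(add_to_calendar(event))
print(add_to_address_book(event))
print(send_to_gps(event))
```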
The Web is such a mélange of useful, noble content and stuff that runs the gamut from the mundane to the grotesque. Do you think humanity is using this incredible invention of yours appropriately?
Yes. The Web, after all, is just a tool. It’s a powerful one, and it reconfigures what we can do, but it’s just a tool, a piece of white paper, if you will. So what you see on it reflects humanity—or at least the 20 percent of humanity that currently has access to the Web.
As a standards body, the W3C is not interested in policing the Web or in censoring content, nor should we be. No one owns the World Wide Web, no one has a copyright for it, and no one collects royalties from it. It belongs to humanity, and when it comes to humanity, I’m tremendously optimistic. After 20 years, I’m still very excited and extremely hopeful.
[First published in ON magazine]
What is the Internet of Things? (Infographic)
Cool Data Scientists on Campus
Hal Varian: “Data availability is going to continue to grow. To make that data useful is a challenge. It’s generally going to require human beings to do it.”
Source: Carl Bialik, “Data Crunchers Now the Cool Kids on Campus,” The Wall Street Journal, March 1, 2013
See my list of graduate programs in data science and big data analytics