Monica Rogati in The LinkedIn Blog on Coursera, which offers free online courses: “Coursera’s approach to feedback and assessment is a very interesting application of data science. Tests are either computer-graded or peer-graded — the latter following industry-tested crowdsourcing best practices (clear instructions, gold standards, training, qualification tasks, assessor agreement monitoring, etc.). Peer grading isn’t just treated as a means for scaling — it is part of the learning process. One of Daphne’s charts showed that students significantly improved on subsequent tests after peer- and self-grading. Interestingly, the better students learned even more from self-grading than from grading others…”
The Web at 25: Tim Berners-Lee on the Web of Data
In 2009, on the occasion of the 20th anniversary of the Web, Jason Rubin and I talked to Tim Berners-Lee about his invention and its future, the Semantic Web, which he described as “the Web of data.”
Twenty years on, the World Wide Web has proven itself both ubiquitous and indispensable. Did you anticipate it would reach this status, and in this time frame?
Tim Berners-Lee: I think while it’s very tempting for us to look at the Web and say, “Well, here it is, and this is what it is,” it has, of course, been constantly growing and changing—and it will continue to do so. So to think of this as a static “This is how the Web is” sort of thing is, I think, unwise. In fact, it’s changed in the last few years faster than it changed before, and it’s crazy for us to imagine this acceleration will suddenly stop. So yes, the 20-year point goes by in a flash, but we should realize that, and we are constantly changing it, and it’s very important that we do so.
I believe that 20 years from now, people will look back at where we are today as being a time when the Web of documents was fairly well established, such that if someone wanted to find a document, there’s a pretty good chance it could be found on the Web. The Web of data, though, which we call the Semantic Web, would be seen as just starting to take off. We have the standards but still just a small community of true believers who recognize the value of putting data on the Web for people to share and mash up and use at will. And there are other aspects of the online world that are still fairly “pre-Web.” Social networking sites, for example, are still siloed; you can’t share your information from one site with a contact on another site. Hopefully, in a few years’ time, we’ll see that quite large category of social information truly Web-ized, rather than being held in individual lockdown applications.
You mentioned a “small community” of people who see the value of the Semantic Web. Is that a repeat occurrence of the struggle 20 years ago to get people to understand the scope and potential impact of the World Wide Web?
It’s remarkably similar. It’s very funny. You’d think that once people had seen the effect of Web-izing documents to produce the World Wide Web, doing likewise with their data would seem the next logical step. But for one thing, the Web was a paradigm shift. A paradigm shift is when you don’t have in your vocabulary the concepts and the ideas with which to understand the new world. Today, the idea that a web link could connect to a document that originates anywhere on the planet is completely second nature, but back then it took a very strong imagination for somebody to understand it.
Now, with data, almost all the data you come across is locked in a database. The idea that you could access and combine data anywhere in the world and immediately make it part of your spreadsheet is another paradigm shift. It’s difficult to get people to buy into it. But in the same way as before, those who do get it become tremendously fired up. Once somebody has realized what it would be like to have linked data across the world, then they become very enthusiastic, and so we now have this corps of people in many countries all working together to make it happen.
Do you see the Semantic Web as enabling greater collaboration between and among parties, as opposed to the point-to-point or point-to-many communication that seems more prevalent in the current Web?
The original web browser was a browser editor and it was supposed to be a collaborative tool, but it only ran on the NeXT workstation on which it was developed. However, the idea that the Web should be a collaborative place has always been a very important goal for me. I think harnessing the creative energy of people is really important. When you get people who are trying to solve big problems like curing AIDS, fighting cancer, and understanding Alzheimer’s disease, there are a huge number of people involved, all of them with half-formed ideas in their minds. How do we get them communicating so that the half of an idea in one person’s head will connect with half of an idea in somebody else’s head, and they’ll come up with the solution?
That’s been a goal for the Web of documents, and it’s certainly a goal for the Web of data, where different pieces of data can be used for all kinds of different things. For example, a genomist may suspect that a particular protein is connected to a certain syndrome in a cell line, search for and find data relating to each area, and then suddenly put together the different strains of data and discover something new. And this is something he can do with the owners of the respective pieces of data, who might never have found each other or known that their data was connected. So the Web of data will absolutely lead to greater collaboration.
Is your vision of the Semantic Web one in which data is freely available, or are there access rights attached to it?
A lot of information is already public, so one of the simple things to do in building the new Web of data is to start with that information. And recently, I’ve been working with both the U.K. government and the U.S. government in trying not only to get more information on the Web, but also to make it linked data. But it’s also very important that systems are aware of the social aspects of data. And it’s not just access control, because an authorized user can still use the right data for the wrong purpose. So we need to focus on what are the purposes for accessing different kinds of data, and for that we’ve been looking at accountable systems.
Accountable systems are aware of the appropriate use of data, and they allow you to make sure that certain kinds of information that you are comfortable sharing with people in a social context, for example, are not able to be accessed and considered by people looking to hire you. For example, I have a GPS trail that I took on vacation. Certainly, I want to give it to my friends and my family, but I don’t necessarily wish to license people I don’t know who are curious about me and my work and let them see where I’ve been. Companies may want to do the same thing. They might say, “We’re going to give you access to certain product information because you’re part of our supply chain and you can use it to fine-tune your manufacturing schedule to meet our demand. However, we do not license you to use it to give to our competition to modify their pricing.”
You need to be able to ask the system to show you just the data that you can use for a given task, because how you wish to use it will be the difference in whether you can use it. So we need systems for recording what the appropriate use of data is, and we need systems for helping people use data in an appropriate way so they can meet an ethical standard.
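As an illustration only (the interview does not describe an implementation), the idea of asking a system for "just the data that you can use for a given task" can be sketched as a filter on the purposes each owner has allowed. The record names, the `allowed_purposes` field, and the purpose labels below are all hypothetical.

```python
# Hypothetical records: each one lists the purposes its owner has allowed.
# Field names and purpose labels are invented for this sketch.
records = [
    {"name": "vacation GPS trail", "allowed_purposes": {"social"}},
    {"name": "product demand forecast", "allowed_purposes": {"supply-chain planning"}},
    {"name": "public seminar schedule",
     "allowed_purposes": {"social", "hiring", "supply-chain planning"}},
]


def usable_for(records, purpose):
    """Return the names of records whose owners allow the stated purpose."""
    return [r["name"] for r in records if purpose in r["allowed_purposes"]]


# The same user sees different data depending on the declared purpose.
social_view = usable_for(records, "social")
hiring_view = usable_for(records, "hiring")
```

A filter like this only covers selecting data by declared purpose. Recording what was actually done with the data, which accountability also requires, is outside the sketch.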
Ultimately, what is one of the most significant things the Semantic Web will enable?
One thing I think we’ll be able to do is to write intelligent programs that run across the Web of data looking for patterns when something went wrong—like when a company failed, or when a product turned out to be dangerous, or when an ecological catastrophe happened. We can then identify patterns in a broad range of data types that resulted in something serious happening, and that will allow us to identify when these patterns recur, and we’ll be better able to prepare for or prevent the situation.
I think when we have a lot of data available on the Web about the world, including social data, ecological data, meteorological data, and financial data, we’ll be able to make much better models. It’s been quite evident over the last year, for example, that we have a really bad grasp of the financial system. Part of the reason for that might be that we have insufficient data from which to draw conclusions, or that the experts are too selective in which data they use. The more data we have, the more accurate our models will be.
After 20 years, what about the Web—either its current or future capabilities—excites you the most?
One of the things that gets me the most excited is the mash-ups, where there’s one market of people providing data and there’s a second layer of people mashing up the data, picking from a rich variety of data sources to create a useful new application or service. A classic example of a mash-up is when I find a seminar I want to go to, and the web page has information about the sponsor, the presenter, the topic, and the logistics. I have to write all that down on the back of an envelope and then go and put it in my address book; I have to put it in my calendar; I have to enter the address in my GPS—basically, I have to copy this information into every device I use to manage my life, which is inefficient and time-consuming. This is because there is no common format for this data to become integrated into my devices.
Now, the vision of the Semantic Web is that the seminar’s web page has information pointed at data about the event. So I just tell my computer I’m going to be attending that seminar and then, automatically, there is a calendar that shows things that I’m attending. And automatically, an address book I define as having in it the people who have given seminars that I’ve attended within the last six months appears, with a link to the presenter’s public profile. And automatically, my PDA starts pointing towards somewhere I need to be at an appropriate time to get me there. All I need to do is say, “I’m going to that seminar,” and then the rest should follow.
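The seminar scenario can be made concrete with a small sketch, offered as an illustration rather than anything specified in the interview. It assumes the seminar page exposes its details as structured data. The `Seminar` fields, the example URLs, and the 183-day approximation of "six months" are assumptions made here. The calendar and the address book are then just two views derived from the same list of events the user has said they are attending.

```python
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical machine-readable description of a seminar page.
# Field names are illustrative, not a real vocabulary.
@dataclass(frozen=True)
class Seminar:
    title: str
    presenter: str
    presenter_profile: str  # link to the presenter's public profile
    venue: str
    day: date


def calendar_for(attending):
    """Calendar view: every event marked as attending, ordered by date."""
    return sorted((s.day, s.title, s.venue) for s in attending)


def recent_presenters(attending, today, window_days=183):
    """Address-book view: presenters of seminars attended in roughly
    the last six months, mapped to their public profile links."""
    cutoff = today - timedelta(days=window_days)
    return {s.presenter: s.presenter_profile
            for s in attending if cutoff <= s.day <= today}


# Saying "I'm going to that seminar" amounts to adding one item here;
# both views follow from the same data without re-entering anything.
attending = [
    Seminar("Linked Data Basics", "A. Presenter",
            "https://example.org/people/a", "Room 1", date(2009, 3, 10)),
    Seminar("Old Talk", "B. Presenter",
            "https://example.org/people/b", "Room 2", date(2008, 1, 15)),
]
today = date(2009, 6, 1)
calendar = calendar_for(attending)
address_book = recent_presenters(attending, today)
```

Here the older seminar still appears in the calendar view but drops out of the address book, because it falls outside the six-month window.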
The Web is such a mélange of useful, noble content and stuff that runs the gamut from the mundane to the grotesque. Do you think humanity is using this incredible invention of yours appropriately?
Yes. The Web, after all, is just a tool. It’s a powerful one, and it reconfigures what we can do, but it’s just a tool, a piece of white paper, if you will. So what you see on it reflects humanity—or at least the 20 percent of humanity that currently has access to the Web.
As a standards body, the W3C is not interested in policing the Web or in censoring content, nor should we be. No one owns the World Wide Web, no one has a copyright for it, and no one collects royalties from it. It belongs to humanity, and when it comes to humanity, I’m tremendously optimistic. After 20 years, I’m still very excited and extremely hopeful.
[First published in ON magazine]
The Web at 25: The Value of Open
The Internet started as a network for linking research centers. The World Wide Web started as a way to share information among researchers at CERN. Both have expanded to touch a third of the world’s population today because they have been based on open standards.
Creating a closed and proprietary system has been the business model of choice for many great inventors and some of the greatest inventions of the computer age. That’s where we were headed in the early 1990s: the establishment of global proprietary networks owned by a few computer and telecommunications companies, whether old (IBM, AT&T) or new (AOL). Tim Berners-Lee’s invention and CERN’s decision to offer it to the world for free in 1993 changed the course of this proprietary march, giving a new, and much expanded, life to the Internet (itself a response to proprietary systems that did not inter-communicate) and establishing a new, open platform for a seemingly infinite number of applications and services.
As Bob Metcalfe told me in 2009: “Tim Berners-Lee invented the URL, HTTP, and HTML standards… three adequate standards that, when used together, ignited the explosive growth of the Web… What this has demonstrated is the efficacy of the layered architecture of the Internet. The Web demonstrates how powerful that is, both by being layered on top of things that were invented 17 years before, and by giving rise to amazing new functions in the following decades.”
Metcalfe also touched on the power and potential of an open platform: “Tim Berners-Lee tells this joke, which I hasten to retell because it’s so good. He was introduced at a conference as the inventor of the World Wide Web. As often happens when someone is introduced that way, there are at least three people in the audience who want to fight about that, because they invented it or a friend of theirs invented it. Someone said, ‘You didn’t. You can’t have invented it. There’s just not enough time in the day for you to have typed in all that information.’ That poor schlemiel completely missed the point that Tim didn’t create the World Wide Web. He created the mechanism by which many, many people could create the World Wide Web.”
“All that information” was what the Web gave us (and what was also on the mind of one of the Internet’s many parents, J.C.R. Licklider, who envisioned it as a giant library). But this information comes in the form of ones and zeros; it is digital information. In 2007, 94% of storage capacity in the world was digital, a complete reversal from 1986, when 99.2% of all storage capacity was analog. The Web was the glue and the catalyst that would speed up the spread of digitization to all analog devices and channels for the creation, communication, and consumption of information. It has been breaking down, one by one, proprietary and closed systems with the force of its ones and zeros.
Metcalfe’s comments were first published in ON magazine, which I created and published for my employer at the time, EMC Corporation. For a special issue (PDF) commemorating the 20th anniversary of the invention of the Web, we asked some 20 members of the Inforati how the Web has changed their and our lives and what it will look like in the future. Here’s a sample of their answers:
Guy Kawasaki: “With the Web, I’ve become a lot more digital… I have gone from three or four meetings a day to zero meetings per day… Truly the best will be when there is a 3-D hologram of Guy giving a speech. You can pass your hand through him. That’s ultimate.”
Chris Brogan: “We look at the Web as this set of tools that allow people to try any idea without a whole lot of expense… Anyone can start anything with very little money, and then it’s just a meritocracy in terms of winning the attention wars.”
Tim O’Reilly: “This next stage of the Web is being driven by devices other than computers. Our phones have six or seven sensors. The applications that are coming will take data from our devices and the data that is being built up in these big user-contributed databases and mash them together in new kinds of services.”
John Seely Brown: “When I ran Xerox PARC, I had access to one of the world’s best intellectual infrastructures: 250 researchers, probably another 50 craftspeople, and six reference librarians all in the same building. Then one day to go cold turkey—when I did my first retirement—was a complete shock. But with the Web, in a year or two, I had managed to hone a new kind of intellectual infrastructure that in many ways matched what I already had. That’s obviously the power of the Web, the power to connect and interact at a distance.”
Jimmy Wales: “One of the things I would like to see in the future is large-scale, collaborative video projects. Imagine what the expense would be with traditional methods if you wanted to do a documentary film where you go to 90 different countries… with the Web, a large community online could easily make that happen.”
Paul Saffo: “I love that story of when Tim Berners-Lee took his proposal to his boss, who scribbled on it, ‘Sounds exciting, though a little vague.’ But Tim was allowed to do it. I’m alarmed because at this moment in time, I don’t think there are any institutions out there where people are still allowed to think so big.”
Dany Levy (founder of DailyCandy): “With the Web, everything comes so easily. I wonder about the future and the human ability to research and to seek and to find, which is really an important skill. I wonder, will human beings lose their ability to navigate?”
Howard Rheingold: “The Web allows people to do things together that they weren’t allowed to do before. But… I think we are in danger of drowning in a sea of misinformation, disinformation, spam, porn, urban legends, and hoaxes.”
Paul Graham: “[With the Web] you don’t just have to use whatever information is local. You can ship information to anyone anywhere. The key is to have the right filter. This is often what startups make.”
How many startups and grown-up companies today are entirely based on an idea first fleshed out in a modest proposal 25 years ago? And there is no end in sight for the expanding membership in this club, now also increasingly including the analogs of the world. All businesses, all governments, all non-profits, all activities are being eaten by ones and zeros. Tim Berners-Lee has unleashed an open, ever-expanding system for the digitization of everything.
We also interviewed Berners-Lee in 2009. He said that the Web has “changed in the last few years faster than it changed before, and it’s crazy for us to imagine this acceleration will suddenly stop.” He pointed out the ongoing tendency to lock what we do with computers in a proprietary jail: “…there are aspects of the online world that are still fairly ‘pre-Web.’ Social networking sites, for example, are still siloed; you can’t share your information from one site with a contact on another site.” But he remained both realistic and optimistic, the hallmarks of an entrepreneur: “The Web, after all, is just a tool…. What you see on it reflects humanity—or at least the 20 percent of humanity that currently has access to the Web… No one owns the World Wide Web, no one has a copyright for it, and no one collects royalties from it. It belongs to humanity, and when it comes to humanity, I’m tremendously optimistic.”
The Pew Research Center is marking the 25th anniversary of the Web in a series of reports. Berners-Lee says in a press release issued today by the World Wide Web Consortium: “I hope this anniversary will spark a global conversation about our need to defend principles that have made the Web successful, and to unlock the Web’s untapped potential. I believe we can build a Web that truly is for everyone: one that is accessible to all, from any device, and one that empowers all of us to achieve our dignity, rights and potential as humans.”
See also Berners-Lee’s post on Google’s official blog: “…today is a day to celebrate. But it’s also an occasion to think, discuss—and do. Key decisions on the governance and future of the Internet are looming, and it’s vital for all of us to speak up for the web’s future. How can we ensure that the other 60 percent around the world who are not connected get online fast? How can we make sure that the web supports all languages and cultures, not just the dominant ones? How do we build consensus around open standards to link the coming Internet of Things? Will we allow others to package and restrict our online experience, or will we protect the magic of the open web and the power it gives us to say, discover, and create anything? How can we build systems of checks and balances to hold the groups that can spy on the net accountable to the public? These are some of my questions—what are yours?”
The Big Data Landscape Revisited
Bruce Reading, CEO of VoltDB, has an interesting and original take on the big data landscape.
Last year, Dave Feinleib published the Big Data Landscape, “to organize this rapidly growing technology sector.” One prominent data scientist told me “it’s just a bunch of logos on a slide,” but it has become a popular reference point for categorizing the different players in this bustling market. Sqrrl, a big data start-up, recently published its own version of Feinleib’s chart, its “take on the big data ecosystem.” Sqrrl’s eleven big data “buckets” are somewhat different from Feinleib’s, demonstrating a lack of agreement, understandable at this stage, on what exactly the different segments of the big data market are and what to call them. Furthermore, Sqrrl positions itself “at the intersection of four of these boxes,” which raises questions about the accuracy of its positioning of other big data companies inside just one or two boxes.
Another interesting recent attempt to make sense of the big data landscape comes from The 451’s Matt Aslett in the form of a “Database Landscape Map.” Taking its inspiration from the map of the London Underground and a content technology map from the Real Story Group, it charts the links between an ever-expanding database market and the data storing/organizing/mining technologies and tools (Hadoop, NoSQL, NewSQL…) that now form the core of the big data market.
Which brings me to Bruce Reading, VoltDB, and their take on the big data landscape. “It’s a very noisy market,” Bruce told a packed room at a recent VoltDB event. “It’s like shopping in a mall at Christmas time when there’s a lot of noise and a lot of information about a lot of technologies. We are trying to work with the marketplace to understand what you are trying to accomplish. Instead of using market maps based on technologies, we are looking at use cases.”
What is the Internet of Things? (Infographic)
Cool Data Scientists on Campus
Hal Varian: “Data availability is going to continue to grow. To make that data useful is a challenge. It’s generally going to require human beings to do it.”
Source: Carl Bialik, “Data Crunchers Now the Cool Kids on Campus,” The Wall Street Journal, March 1, 2013
See my list of graduate programs in data science and big data analytics
What Happens on the Web in 60 Seconds (Infographic)
Source: Qmee
SAS CTO on Big Data and Big Compute
“One of my biggest challenges,” Keith Collins told me recently, “is helping SAS understand how to communicate to IT organizations. We present workloads which look odd and different. IT does not know how to have an SLA (Service Level Agreement) around them. We take all of the compute and I/O capacity that they can give us.”
SAS, the largest independent vendor in the business intelligence market, used to be a prime example of “shadow IT,” the purchasing of information technology tools by business users without the knowledge and approval of the central IT organization. But this is changing in the era of big data. The collection and analysis of data are becoming a very large part of many business activities and the IT organization is asked to provide support, even leadership, in tying together these disparate efforts.
Collins is SVP and CTO at SAS, where he has spent almost 30 years, helping the company grow with the market through a number of phases (and buzzwords)—statistical analysis, decision-support, data mining, knowledge and risk management, business intelligence, and business analytics. Now SAS is helping its customers, including CIOs and their IT teams, address the challenges of big data. Collins has seen this movie before: “People are all hyped up about Hadoop. But what is it, really? It is big and wide record sizes, big block sizes, designed specifically for high-volume, sequential processing. Just like a SAS data set in 1968… The only difference between a SAS data set and Hadoop is that now the disks are cheap enough that you can do replication.” The following is an edited transcript of our conversation.
Gil Press: Indeed, many people talk about Hadoop as a replacement for tape.
Keith Collins: We love that people get that as a pattern now, because it really helps them understand SAS. So it is a really good time for us to have the conversation with IT about it. But they are still struggling. They see it as “what is my next big data repository?” They do not see it as “this is my next big way to answer questions.”