I don’t do pure research—my analysis enables real-world functionality
Currently mining terabytes of tweets as a data scientist at Twitter, Edwin Chen studied math and linguistics at MIT and then crunched numbers at Peter Thiel’s hedge fund, Clarium Capital Management. He blogs on topics of interest to data scientists, such as crowdsourcing text analysis with Amazon’s Mechanical Turk and ggplot2, a data visualization tool. The following is an edited transcript of our recent phone conversation.
When you went to MIT, what were your future plans?
I always thought I was going to stay in academia. I was interested in theoretical computer science and linguistics, and I thought I’d become a professor working in one of these fields, or hopefully at the intersection of both.
But you haven’t stayed in academia. Why?
If I stayed in academia, I felt like I’d be stuck down a particular path, and not necessarily pursuing my exact interests if they changed. It seems like a great thing to be a tenured professor—you get to do the research you want, with students helping you. But instead of waiting until then, I just thought I’d have more control of my life in the real world.
In the real world you became a data scientist. What advice would you give to undergraduates today regarding what skills they should invest in if they want to become data scientists?
A data scientist is someone with a good mix of quantitative skills. These can be developed by studying subjects like machine learning, math, and physics. A data scientist also needs a good mix of software engineering skills. If you can’t code, it’s not easy to do what you want to do and to turn your insights into a working system. I spend a lot of time building systems for collecting data and for feeding my data into other systems. And then there is a third set of skills, a bit fuzzier and sometimes neglected in discussions of the required skills for data scientists: the ability to visualize and explain to other people what you found in the data.
Is this something that can be learned?
I think I got better at it just by explaining a lot of things. You find out what kind of things you need to explain and what kind of things you need to explain better. Explaining the data is the fuzzier ability—it’s like being a good writer or a good teacher. I do think you can learn it, but I’m not sure how exactly, besides just doing it and getting feedback.
The ability to explain your work is important because as a data scientist you need to interact not only with business executives but also with people in other functions.
In any job, it’s obviously important to be able to explain well. You’re making decisions all the time that you’ll often have to defend, and a big reason people want to work with you is so that they can learn from you.
But it’s perhaps particularly important for the work I do because my main job is often to find interesting insights in the data that can then be applied. Maybe a less technical person in the company has asked me about the growth of a product in different countries, or maybe I’m looking into where another team’s algorithm could be improved—I’m often helping to answer other people’s questions and not just my own. The better I make someone understand a piece of data, the better they can connect it to their own context (which I may not always fully have), and the more likely it is that my findings will have helped out in some way. It’s even better if I can showcase the data in a way that encourages people to explore it for themselves, and answer questions that I haven’t thought of.
I don’t do pure research. The objective of all the analysis I do is to enable some kind of functionality, some kind of real-world functionality. If all I do is collect data without presenting it in a compelling way that inspires further action, then I’ve only done half my job.
Your blog is a good place to see the complexity and sophistication of your work and the way you visualize and explain the results. How do you come up with topics for your blog?
I sometimes come across an interesting set of data that I want to look at, and then I’ll play around with it and blog about it. Also, people reading my blog sometimes email me and ask if I can explain a certain topic. So it’s a mix of things I’ve played around with, things I find interesting, and things I think need to be better explained.
Your blog post about Amazon’s Mechanical Turk highlights an interesting dimension of the data scientist’s work—the management of crowdsourcing.
Whenever people talk about data science they usually talk about big data, like mining data from blogs or clicks. But this is a totally different dimension, where you are working with small data sets that can’t necessarily be automatically mined. You need to crowdsource this kind of work, sentiment analysis for example, to get help from human judgment. I think crowdsourcing is extremely useful in the areas where computers still perform pretty poorly—things like computer vision or determining relevance—or for gathering training data that your algorithms can build upon.
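To give a concrete sense of the workflow Chen describes, here is a minimal illustrative sketch, not drawn from his own pipeline: it assumes a handful of Mechanical Turk workers have each labeled a few tweets for sentiment, resolves their disagreements by majority vote, and uses the result as training data for a simple bag-of-words classifier built with scikit-learn. The tweets, labels, and variable names are all hypothetical.

```python
# Illustrative sketch only: aggregate crowdsourced sentiment labels by
# majority vote and train a simple bag-of-words classifier.
# The tweets, worker labels, and variable names below are hypothetical.
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Each tweet was labeled "pos" or "neg" by several (hypothetical) workers.
worker_labels = {
    "I love this new phone": ["pos", "pos", "neg"],
    "Worst customer service ever": ["neg", "neg", "neg"],
    "Pretty happy with the update": ["pos", "pos", "pos"],
    "This app keeps crashing": ["neg", "pos", "neg"],
}

# Resolve disagreements between workers with a simple majority vote.
texts, labels = [], []
for text, votes in worker_labels.items():
    majority, _ = Counter(votes).most_common(1)[0]
    texts.append(text)
    labels.append(majority)

# Use the crowdsourced labels as training data for a basic classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)

# Score a new, unlabeled tweet.
new_tweet = vectorizer.transform(["really enjoying the new feature"])
print(model.predict(new_tweet))
```

In practice, majority voting is only the simplest way to combine worker judgments; weighting workers by their agreement with known answers is a common refinement, but the basic idea of turning human labels into training data stays the same.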
I guess your blog is a learning experience for you—sometimes the best way to learn is by teaching other people. How do you keep up with the latest in data science?
Two ways: either on Twitter or through an RSS feed on Google Reader. I subscribe to some arXiv aggregators, so if there is new academic work I’ll find it there. I also go to meet-ups and tech talks, but I don’t do that as much as I’d like to.
Of course, it’s all a matter of finding the time. What do you do to relax when you find the time to relax?
I like to read.
Read what?
Science fiction.
Do you think science fiction is popular among data scientists?
[Laughs] I guess it’s slightly more popular among data scientists than in the general population.
And I guess the general population would see some of what you do as science fiction. “Machine learning” sounds to the general population like machines that can think. That’s certainly a theme in science fiction.
We are starting to build more interactive data analysis platforms lately, so I guess I sometimes feel that we are seeing the future that science fiction writers imagined in the past.
So is the “Singularity” fast approaching?
I’d like to believe that, but I’m not as sure about it as some people are.
Let’s close with a more immediate and certain future—where do you see data science in three to five years?
I see two important trends continuing over the next few years: Better open-source tools and the increasing availability of open datasets. If you’re a data lover interested in hacking on incredible datasets in your spare time, the range of tools freely available to you and the amount of interesting data publicly available are getting better and better. Every day, it seems like a new piece of free software gets released that makes it even easier to dig into more and more data.