[English] Cathy O'Neil, author of Doing Data Science: Big Data Is Not Magic (iTuring Interview)

Posted by 盼盼姐 on 2015-04-08

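Cathy O'Neil is a senior data scientist at Johnson Research Labs. She holds a PhD in mathematics from Harvard University, was a postdoc in the mathematics department at MIT and a professor at Barnard College, and has published many papers on arithmetic algebraic geometry. She worked as a hedge fund quant at D.E. Shaw, the well-known global investment management firm, and later joined RiskMetrics, a software company that specializes in assessing risk for banks and hedge funds. A mathematician turned data scientist, Cathy writes the popular blog http://mathbabe.org/. She and Rachel Schutt, an adjunct professor of statistics at Columbia University, wrote Doing Data Science based on a course called "Introduction to Data Science".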


iTuring: What attracts you most about being a data scientist? Have you found any clues to this question: what can a non-academic mathematician do to make the world a better place?

I love data! I love seeing how we can learn about the way things work by measuring them. I particularly enjoy figuring out how to quantify something we only vaguely knew, or to compare the effects of two things that until that moment seemed incomparable.

My biggest clue is that I need to spend more time and energy making sure data scientists think before they act. Data science is powerful and influential and can be used for evil or good. We need to recognize that fact.

iTuring: Many readers are inspired by your blog Mathbabe. Have you been inspired by them in turn through your interactions?

Absolutely! My readers have brought me on many strange and intellectually stimulating journeys. I am thankful every day to them for doing so.

iTuring: In your view, what qualities and educational background make a person best suited to data science work?

It really depends. I've written the book Doing Data Science with a mathematical background in mind, but honestly a data science team should also have people with backgrounds in philosophy and ethics who learn more scientific approaches. We need diversity of thought to solve a problem well.

iTuring: Some people believe that applications based on big data actually reinforce people's reliance on old habits, which keeps them from trying out new experiences. Do you agree?

That can be true. For example, a resume or application sorting algorithm that simply learns from historical data and then regenerates that old-fashioned decision-making is merely codifying all the biases that the system had, whether it's sexism or an over-reliance on certain college degrees. I suggest that people try to figure out what it is that they are actually looking for, and how they can locate those skills while staying as unbiased as possible. We should at least attempt this.

iTuring: Many companies have benefited greatly from big data analysis, but other companies that use big data to formulate policies and strategies have benefited little or even failed. What mistakes do they make in the process?

Often they think that big data is magical. Of course it's not. You need good questions, and moreover you don't just need big data, you need the right data, which often hasn't been collected.

iTuring: To a large extent, big data is used for prediction. Do you think accidental events can be predicted from deterministic data?

This question is a bit vague, but if I understand it correctly it's asking whether something that is fundamentally unpredictable can be predicted. I guess not! However, it's of course true that even stochastic processes have some underlying characteristics. For example, if you have a waiting time process, you can talk about when you'd be "surprised" that the event hasn't occurred, after defining what surprises you.
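The waiting-time remark can be made concrete with a small sketch. It assumes, purely for illustration, that the waiting time is exponentially distributed with a known rate, and that "surprise" means the chance of still waiting has fallen below a chosen level alpha. Neither the distribution nor the names surprise_threshold, survival_probability, rate, and alpha come from the interview.

```python
import math


def survival_probability(rate: float, t: float) -> float:
    """P(T > t) for an exponential waiting time T with the given rate."""
    return math.exp(-rate * t)


def surprise_threshold(rate: float, alpha: float = 0.05) -> float:
    """Waiting time t at which P(T > t) drops to alpha.

    Assumption: "surprised" means the event had less than an alpha chance
    of taking this long. Solving exp(-rate * t) = alpha gives
    t = -ln(alpha) / rate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return -math.log(alpha) / rate


# Hypothetical numbers: events arrive on average twice per hour.
threshold_hours = surprise_threshold(rate=2.0, alpha=0.05)
print(f"Surprised after waiting {threshold_hours:.2f} hours")
```

With these made-up numbers the threshold is ln(20) / 2, roughly 1.5 hours. The individual event stays unpredictable, but the point at which its absence becomes surprising is well defined once alpha is fixed.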

iTuring: NoSQL emerged to provide better access to web data, while traditional databases have come up with the concept of the Data Space, in which the data comes first and the model comes later. How is this technology applied today? Are there similar topics that most people are not familiar with?

Generally speaking, big data uses unstructured and dirty data, at least to develop models. After the models go into production, there's sometimes a standard database being used, and by the time the results and daily reports are being made, standard databases are definitely in use.

I tend to ignore the details of this kind of data storage question, not because it's uninteresting but because it is rapidly changing. When I need to work on a new project, I go figure out what the current best technology is.

iTuring: In machine learning, the training data is usually given. From an engineering standpoint, what is the most important (or trickiest) thing when extracting training data from a database: the data's characteristics, the data size, or the way the data is extracted?

Really hard to say in general! Of course, sometimes you need just a huge amount of training data to train your model, and other times not so much but you need to be careful you are pulling a representative sample.

For my part I almost always split my data according to timestamps, when possible. I train my model on the earlier data, then test it on later data.
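A minimal sketch of that kind of timestamp-based split, assuming each record is a dict carrying a "timestamp" field and that a single cutoff date separates training from test data. The function name time_split, the field names, and the sample records are all hypothetical.

```python
from datetime import datetime


def time_split(records, cutoff, key="timestamp"):
    """Split records into (train, test) by time.

    Records strictly before the cutoff go to train; records at or after
    the cutoff go to test, so the model is never evaluated on data older
    than anything it was trained on.
    """
    train = [r for r in records if r[key] < cutoff]
    test = [r for r in records if r[key] >= cutoff]
    return train, test


# Hypothetical records, deliberately out of chronological order.
records = [
    {"timestamp": datetime(2015, 1, 5), "value": 1.0},
    {"timestamp": datetime(2015, 3, 20), "value": 3.0},
    {"timestamp": datetime(2015, 2, 10), "value": 2.0},
    {"timestamp": datetime(2015, 4, 1), "value": 4.0},
]

train, test = time_split(records, cutoff=datetime(2015, 3, 1))
print(len(train), "training records,", len(test), "test records")
```

Splitting on time rather than at random mimics how a model is used in production, where it only ever sees the past and is judged on the future.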

iTuring: To extract the key factors of a model, data analysts often need a good understanding of a particular business. Is there an easy way to gain it, or is it an inevitable part of the job?

It is truly inevitable; only domain experts will be able to guide the modeling, at least near the beginning, when there are still easily achieved goals. Later on, when all the domain expertise has been included, it might become less domain specific.

iTuring: As data science has advanced greatly in recent years, do you think any of the content in your book needs to be updated? And what content will remain unchanged for a long time?

Certainly! This is a fast-moving field which I wanted to explain as an overview. If I rewrote this book today, every chapter would be different. Even so, the overall approach of learning what you need, and being technical without losing sight of the human impact, will remain. As things progress, the techniques will get better and more mathematically complex, so in some sense this is the best time to be a data scientist.

