[English] Cathy O'Neil, author of Doing Data Science: Big Data Is Not Magical (iTuring Interview)
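Cathy O'Neil is a senior data scientist at Johnson Research Labs. She holds a PhD in mathematics from Harvard University, was a postdoc in the mathematics department at MIT and a professor at Barnard College, and has published many papers in arithmetic algebraic geometry. She worked as a hedge fund quant at D.E. Shaw, the well-known global investment management firm, and later joined RiskMetrics, a software company that specializes in assessing risk for banks and hedge funds. A mathematician turned data scientist, Cathy writes the popular blog http://mathbabe.org/. She and Rachel Schutt, an adjunct professor in the Department of Statistics at Columbia University, wrote Doing Data Science based on a course called "Introduction to Data Science".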
iTuring: What attracts you most about being a data scientist? Have you found any clues to this question: what can a non-academic mathematician do to make the world a better place?
I love data! I love seeing how we can learn about the way things work by measuring them. I particularly enjoy figuring out how to quantify something we only vaguely knew, or to compare the effects of two things that until that moment seemed incomparable.
My biggest clue is that I need to spend more time and energy making sure data scientists think before they act. Data science is powerful and influential and can be used for evil or good. We need to recognize that fact.
iTuring: Many readers are inspired by your blog, mathbabe. Have they inspired you in turn through your interactions with them?
Absolutely! My readers have brought me on many strange and intellectually stimulating journeys. I am thankful every day to them for doing so.
iTuring: In your view, what qualities and educational background make a person best suited to data science work?
It really depends. I've written the book Doing Data Science with a mathematical background in mind, but honestly a data science team should also have people with backgrounds in philosophy and ethics who learn more scientific approaches. We need diversity of thought to solve a problem well.
iTuring: Some people believe that applications based on big data actually reinforce people's reliance on old habits, which keeps them from trying new experiences. Do you agree?
That can be true. For example, a resume or application sorting algorithm that simply learns from historical data and then regenerates that old-fashioned decision-making is merely codifying all the biases that the system had, whether it's sexism or an over-reliance on certain college degrees. I suggest that people try to figure out what it is that they are actually looking for, and how they can locate those skills while staying as unbiased as possible. We should at least attempt this.
iTuring: Many companies have benefited enormously from big data analysis, but others have used big data to formulate policies and strategies and gained little, or even failed. What mistakes do they make along the way?
Often they think that big data is magical. Of course it's not. You need good questions, and moreover you don't just need big data, you need the right data, which often hasn't been collected.
iTuring: Big data is used largely for prediction. Do you think an accidental incident could be predicted from determinate data?
This question is a bit vague, but if I understand it correctly it's asking whether something that is fundamentally unpredictable can be predicted. I guess not! However, it's of course true that even stochastic processes have some underlying characteristics. For example, if you have a waiting time process, you can talk about when you'd be "surprised" that the event hasn't occurred, after defining what surprises you.
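The waiting-time remark can be made concrete with a small sketch. It rests on assumptions that are not in the interview: waiting times are taken to be exponentially distributed with a known mean, and "surprising" is defined as the moment the probability of still waiting falls to a chosen threshold `alpha`. The function names and the 10-minute example are hypothetical.

```python
import math


def prob_still_waiting(t, mean_wait):
    """P(T > t): chance the event has not happened by time t.

    Assumption: T is exponentially distributed with mean `mean_wait`.
    """
    return math.exp(-t / mean_wait)


def surprise_time(mean_wait, alpha=0.05):
    """Wait t at which P(T > t) has dropped to `alpha`.

    `alpha` is our own definition of "surprising", chosen up front;
    it is not a value taken from the interview.
    """
    if mean_wait <= 0 or not 0 < alpha < 1:
        raise ValueError("mean_wait must be > 0 and alpha in (0, 1)")
    return -mean_wait * math.log(alpha)


# Hypothetical example: events arrive every 10 minutes on average.
threshold = surprise_time(mean_wait=10.0, alpha=0.05)
print(f"Surprised after about {threshold:.1f} minutes without an event")
```

The process stays random throughout. The only thing that changes is that "surprised" now has a definition that can be checked against the clock.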
iTuring: NoSQL emerged to give better access to web data, while traditional databases have come up with the concept of the Data Space, in which the data comes first and the model comes later. How is this technology applied today? Are there similar topics that most people are not familiar with?
Generally speaking, big data uses unstructured and dirty data, at least to develop models. After the models go into production, there's sometimes a standard database being used, and by the time the results and daily reports are being produced, the work is definitely running on standard databases.
I tend to ignore the details of this kind of data storage question, not because it's uninteresting but because it is rapidly changing. When I need to work on a new project, I go figure out what the current best technology is.
iTuring: In machine learning, the training data is usually given. From an engineering standpoint, what is the most important (or trickiest) thing when extracting training data from a database: the characteristics of the data, its size, or the way it is extracted?
Really hard to say in general! Of course, sometimes you need just a huge amount of training data to train your model, and other times not so much but you need to be careful you are pulling a representative sample.
For my part I almost always split my data according to timestamps, when possible. I train my model on the earlier data, then I test it on later data.
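A timestamp-based split of this kind might look like the sketch below. It is only an illustration: the row layout `(timestamp, feature, label)`, the function name `split_by_time`, and the cutoff date are hypothetical and do not describe Cathy's actual tooling.

```python
from datetime import datetime


def split_by_time(rows, cutoff):
    """Split rows into (train, test) around a cutoff timestamp.

    Each row is a tuple whose first element is a datetime.
    Rows strictly before `cutoff` go to train; the rest go to test,
    so the model is never evaluated on data older than what it saw.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    train = [row for row in ordered if row[0] < cutoff]
    test = [row for row in ordered if row[0] >= cutoff]
    return train, test


# Hypothetical rows: (timestamp, feature, label), deliberately unordered.
rows = [
    (datetime(2014, 3, 1), 0.2, 0),
    (datetime(2014, 1, 1), 0.5, 1),
    (datetime(2014, 4, 1), 0.9, 1),
    (datetime(2014, 2, 1), 0.1, 0),
]

train, test = split_by_time(rows, cutoff=datetime(2014, 3, 1))
print(len(train), "training rows,", len(test), "test rows")
```

Splitting on time rather than at random keeps information from the future out of the training set, which is closer to how the model will be used once it is deployed.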
iTuring: To extract the key factors of a model, data analysts often need a good understanding of the business in question. Is there an easy way to get it, or is it an inevitable part of the job?
It is truly inevitable; only domain experts will be able to guide the modeling, at least near the beginning, when there are still easily achieved goals. Later on, when all the domain expertise has been included, it might become less domain specific.
iTuring: Data science has advanced greatly in recent years. Do you think any of the content in your book needs to be updated? And what content will remain unchanged for a long time?
Certainly! This is a fast moving field which I wanted to explain as an overview. If I rewrote this book today every chapter would be different. Even so, the overall approach of learning what you need, and being technical without losing sight of the human impact, will remain. As things progress, the techniques will get better and more mathematically complex, so in some sense this is the best time to be a data scientist.