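[English] IBM Engineer Holden Karau: Look for Friendly People to Work With (Turing Interview)
Holden Karau is a principal software engineer at IBM, where she works on improving Apache Spark and helps developers contribute code to Spark. Holden was previously a software development engineer at Databricks, working on the backend of Spark and Databricks Cloud. She has also worked as a software developer at Google and Amazon, on the Google+ backend and on Amazon's intelligent classification system, respectively. She has extensive experience in big data and search, and is proficient in Scala, Scheme, Java, Perl, C, C++, Ruby, and other languages. Holden is the author of Fast Data Processing with Spark and a co-author of Learning Spark.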
iTuring: You are the author of Fast Data Processing with Spark and a co-author of Learning Spark. What are the differences between these two books? What was your writing experience like?
Fast Data Processing with Spark was the first book written on Apache Spark and was very focused on just getting people started. Learning Spark was written much later, after Spark SQL and other important components were added to Spark, and is a bit more detail-oriented while still targeted at individuals new to Spark. My writing practice changed a lot between the two, for a mixture of reasons. Learning Spark is a much more collaborative book, and we had early releases along with technical reviewers involved from a very early stage, so it was much easier to make changes, and the feedback we got was quite helpful in shaping Learning Spark. I was also working at Databricks while writing Learning Spark, so it was much easier to fact-check and get feedback from the committers, since many of them worked in the same office.
iTuring: What is the biggest difference between your job at Databricks and your job at IBM? Did you have to make any adjustments in your work?
Probably the biggest difference in my day-to-day at IBM is that I have more time to focus on working on Spark; while I was at Databricks I had to spend a lot of my time working on Databricks Cloud (the commercial offering). Another change is that Databricks has most of the Spark committers, so getting questions answered and code reviewed was faster there. There are also the usual small company versus big company differences, but within our group things are surprisingly flexible.
iTuring: As a developer who has spent much time building Spark, and considering the popularity of the R language in the open source world, do you think Spark will provide an interface for R in the future?
It already does! The SparkR project is now part of Spark and offers an R API, although as the newest component it is far from done and quite a way from feature parity with Scala.
iTuring: Many enterprises have difficulty moving from relational databases to modern big data processing tools such as Spark. What are your suggestions for these companies?
I think moving from traditional relational systems to more distributed systems involves a lot of changes for the developers. Spark SQL can bridge some of the gap for analytics, but I think an important part is gaining an understanding of how distributed systems work in practice. Rather than trying to start by rewriting an existing complex system, starting with a new project from scratch (perhaps on a new data source) can help build the institutional knowledge.
iTuring: Many people believe that Spark will overthrow Hadoop with its superb performance. Do you agree? What will the ecosystem of big data processing technologies (such as Hadoop, Pig, Tez, Hive, and Spark) look like in the future?
It's difficult to predict what's going to happen with the big data ecosystem over time, especially with so many people involved in the open source community. I believe Spark will replace much of MapReduce and many specialized systems over time, and other systems may use Spark as an execution engine. There will still be use cases where specialized systems will be a better fit.
iTuring: How do you choose between the command line and Spark for different data analysis environments?
Generally I tend to be more comfortable in the command line, although for exploratory work rather than debugging, using things like notebooks is really quite useful. There is of course Databricks Cloud, but I've also had good experiences with Jupyter and Zeppelin. For production jobs, though, I find notebooks to be too limiting and difficult to test, so more traditional packaged JARs are what I use when moving beyond the exploratory phase.
iTuring: What is the relationship between Hive on Spark and Spark SQL? Which one do you believe will have a more promising future?
Spark SQL is an important component of Spark, and the introduction of Datasets brings functional-style programming to Spark SQL in addition to the existing relational APIs. I'm very excited about the future for Spark SQL.
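To make the contrast between relational and functional styles concrete, here is a minimal sketch in plain Python rather than Spark itself (an assumption made so that it runs without a cluster). It is only an analogy, not the Spark SQL or Dataset API: the sample data, the threshold of 30, and the function names `totals_relational` and `totals_functional` are all invented for illustration. The first function declares the result in SQL using the standard-library `sqlite3` module, and the second computes the same per-user totals by filtering and aggregating in code.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user, bytes_sent) pairs, invented for illustration.
EVENTS = [("ana", 120), ("bo", 40), ("ana", 30), ("cy", 75), ("bo", 10)]


def totals_relational(events):
    """Relational style: describe the result with SQL (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (user TEXT, bytes_sent INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", events)
        rows = conn.execute(
            "SELECT user, SUM(bytes_sent) FROM events "
            "WHERE bytes_sent >= 30 GROUP BY user ORDER BY user"
        ).fetchall()
    finally:
        conn.close()
    return rows


def totals_functional(events):
    """Functional style: filter the records, then aggregate them by key."""
    kept = filter(lambda event: event[1] >= 30, events)
    totals = defaultdict(int)
    for user, sent in kept:
        totals[user] += sent
    return sorted(totals.items())


if __name__ == "__main__":
    print(totals_relational(EVENTS))
    print(totals_functional(EVENTS))
```

Both functions return the same list of `(user, total)` pairs. The point of the sketch is only that one computation can be written either as a declarative query or as a chain of transformations.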
iTuring: For someone who has already mastered Hadoop, what is a good roadmap for learning Spark? Is reading the source code a recommended way to learn Spark?
I'm obviously a little biased and think Learning Spark would be a great book, but doing exploratory work in the Spark shell can also be a great way to get up to speed. I think Spark is at the point where reading the code makes sense for people who are going to be developers on Spark itself, but for end users hopefully it's not necessary unless you want to use the latest features.
iTuring: How can one effectively read the source code of giant open source projects like Spark and Hadoop? Are there any tools that would help in the process?
I think reading the source code of Spark is an excellent activity for people interested in contributing to Spark. Since I'm an Emacs user I tend to use Magit, but I've also used ENSIME. A lot of other developers find IntelliJ to be quite useful.
iTuring: Female developers are rarely seen in China, especially in the field of “big data”. What suggestions do you have for the girls and women in China who want to be developers or software engineers?
I wish I had better advice, and obviously what advice I do have comes from my own experiences, which may be different. That being said, I've found joining groups like Women Who Code and Double Union (a local women's hackerspace in San Francisco) really useful, both for learning and for having a network. I think getting involved in open source can be a good way to gain experience and build a portfolio when getting started, and it can help when interviewing. That said, open source communities can sometimes have a lot of infighting depending on the project, so I always try to look for friendly people or work with my friends when possible. I also think giving talks can be helpful as a way to showcase your work and meet interesting people in the field.