Hadley Wickham, author of R Packages: Data Structure Prodigy (Turing Interview)

Posted by Liu Min (ituring) on 2016-10-31

Hadley Wickham  
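Chief Scientist at RStudio, assistant professor at Rice University, and a long-standing member of the R community who has developed more than 30 R packages. For his outstanding contributions to tools for data manipulation and visualization, he received the John Chambers Award, which was established to recognize work in statistical computing.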


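Hadley was born in Hamilton, New Zealand, into a family of statisticians. His father, Brian Wickham, holds a PhD in statistics applied to animal breeding from Cornell University, and his younger sister holds a PhD in statistics from the University of California, Berkeley.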

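If there is such a thing as a prodigy in data structures, Hadley would count as one. He once proudly described his own experience:

“When I was 15, my first job was developing Microsoft Access databases, and it was a lot of fun. I wrote some database documentation at the time, and people are still using the databases I wrote.”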

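Hadley first encountered R in a statistics course at the University of Auckland in New Zealand. He describes R as “a programming language for understanding data.” Alongside SQL and Python, R is among the most popular programming languages for data scientists.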

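Like Hadley, the R programming language comes from New Zealand. R was created in 1993 by the statisticians Ross Ihaka and Robert Gentleman at the University of Auckland. It is used mainly for data analysis, but it has some quirks (such as the way data structures are indexed and the way data is held in physical memory), so users of other languages tend to find R strange. After using Java, VBA, and PHP, Hadley found R “different.” “(Many programmers) think R is absurd and clumsy. I don't think so,” he said. “I think R is very interesting.”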

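After moving to Iowa State University in the United States for his PhD, Hadley began developing R packages. In his own words, building a package involves “code that helps people solve a problem, and then you have to document that code so that others can understand how to use it.” The first package he built, as part of a class project, visualized bioinformatics data. Although that package was never released publicly, this did nothing to dampen his enthusiasm for sharing.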

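In 2005 he released the reshape package, which drew wide attention and marked the starting point of his R package development. The package has been downloaded many thousands of times. The goal of reshape is to reduce the “tedium and pain” of aggregating and manipulating data. Simplifying data transformation may not sound difficult, but for data scientists and statisticians it is often the most time-consuming part of the work.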
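The kind of reshaping and aggregation described above can be sketched in base R. This is only an illustration: the `wide` table and its column names are made up, and the code deliberately avoids the reshape package itself, which offers a more convenient interface for the same idea.

```r
# Hypothetical wide table: one row per subject, one column per measurement.
wide <- data.frame(
  subject = c("a", "b", "c"),
  height  = c(1.70, 1.60, 1.80),
  weight  = c(65, 55, 80),
  stringsAsFactors = FALSE
)

# "Melt" to long form: one row per (subject, variable) pair.
measure_vars <- c("height", "weight")
long <- do.call(rbind, lapply(measure_vars, function(v) {
  data.frame(
    subject  = wide$subject,
    variable = v,
    value    = wide[[v]],
    stringsAsFactors = FALSE
  )
}))

# Aggregate the long table: mean value per variable.
means <- aggregate(value ~ variable, data = long, FUN = mean)

print(long)
print(means)
```

Once the data is in long form, the same `aggregate()` call works no matter how many measurement columns the original table had, which is where much of the saved effort comes from.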

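Clearly, Hadley enjoys the success of the reshape package. He felt that the existing approaches were imperfect, so new packages needed to be built. This is not bragging; he simply has ample confidence. “I firmly believe I have the right approach to development,” he stressed, “for better or worse.”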

--------------

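His latest book, R Packages, aims to turn readers from users of R packages into developers of R packages, and presents the philosophy of package development. It explains in detail how to bundle reusable R functions, sample data, and documentation together so that you can share code with others, save development time, organize data analyses, and automate as much of your work as possible.

  • Learn the most useful components of an R package, including vignettes and unit tests
  • Automate tasks with devtools
  • Master good coding style practices, such as organizing functions into files
  • Streamline the development workflow with devtools
  • Discover the best way to submit a package to CRAN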
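
As a rough sketch of what these components look like in practice, the hypothetical function below carries roxygen2-style documentation comments and is followed by a check of the kind a unit test would contain. The name `rescale01` and the file locations mentioned in the comments are assumptions for illustration, not taken from the book.

```r
# In a package, this function would typically live in its own file under R/,
# for example R/rescale01.R (a hypothetical file name).

#' Rescale a numeric vector to the range [0, 1]
#'
#' @param x A numeric vector with at least two distinct finite values.
#' @return A numeric vector of the same length as `x`.
#' @examples
#' rescale01(c(1, 2, 3))
#' @export
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# A check like this would normally go in a test file under tests/testthat/;
# plain stopifnot() is used here so the sketch runs without extra packages.
stopifnot(isTRUE(all.equal(rescale01(c(1, 2, 3)), c(0, 0.5, 1))))
```

Keeping the documentation next to the code, as the comments above do, lets the help pages be regenerated automatically whenever the function changes.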

--------


Having seen your picture, some followers ask, “With such a good-looking face, why make a living by coding rather than by your looks?” Fans often praise their idols this way, saying, for example, that with such a face an actor like Johnny Depp could get by on appearance alone, yet he still works hard at his acting. So what is your reason for coding?

I love coding for two main reasons. Firstly, I really enjoy figuring out the underlying structure behind problems that on the surface seem very different. For example, I found it very satisfying to develop the ideas behind tidy data and the tidyr package because I enjoyed figuring out the deeper underlying theory.

Secondly, I really enjoy programming because it helps other people. Producing R packages is a great way to turn my ideas into tools that other people can take advantage of, and I enjoy all the feedback that I get from the R community. Hearing that people are using my code and finding it useful is one of the things that keeps me motivated.

R Packages is available for free online. Aren't you afraid that this will reduce sales of the print edition? And why publish the book in print at all, given that there is little financial incentive?

My goal in writing books is not to make money, but to reach as many people as possible. I think making the book available in both forms achieves this goal well. Younger people who don’t have a lot of money to spend on books can use the website. People who enjoy reading physical books can still buy one, and the marketing around a physical book is more likely to reach people who aren’t as active on the internet.

I know R Packages was written in the open. Could you describe your experience of this crowd-sourced approach?

I think writing a book in the open is a truly excellent way to write. One of the challenges of writing a book is that it is a large project that can take one or more years. It’s hard to maintain excitement and motivation about such a big project. However, when you write in the open, you constantly get feedback. This makes it much easier to stay motivated!

I’m also quite bad at proofreading, and I really enjoy that the R community can contribute through GitHub pull requests to fix all of my silly mistakes! People also contribute larger fixes, and point out other problems with the text. All together, writing in the open makes the book much better than it would otherwise be!

Could we compare R package development to API design? In addition to encapsulation, robustness, and usability, is there anything else that needs special attention?

I think there are some general principles that make my packages work together particularly smoothly. Currently, those principles are mostly intuitive to me: I know what to do, but I can’t explain it well so that other people can learn. I am trying to change that by writing up the principles that underlie the “tidyverse”, and you can find my first attempt at https://github.com/hadley/tidyverse/blob/master/vignettes/manifesto.Rmd. I think these are important principles for the design of R packages because they make an API feel like R, and help packages work together naturally.

R was designed for data analysis, but it has some quirks, such as the way data structures are indexed and the need to hold data in physical memory. Do you think R might borrow from the memory management approaches of C++ and Spark?

R is not perfect, but I think it does a really good job of making the human data analyst as effective as possible. R is a very flexible language, which means that it’s possible to design domain-specific languages like ggplot2 and dplyr that help solve certain subdomains of the data analysis problem. That flexibility has its downsides: generally slower performance. I think it’s worthwhile to have different languages for different domains: R is great for making humans efficient at doing data analysis; C++ is great for making computers calculate as efficiently as possible. I personally don’t believe it’s possible to have one language that does both. (In other words, I believe in Ousterhout’s dichotomy, https://en.wikipedia.org/wiki/Ousterhout%27s_dichotomy)

Statistics and data analysis in R have unique advantages, but efficiency is low. Could C interfaces be used in R package development to build components easily and efficiently?

Yes, and many, many packages now use Rcpp and C++ to do exactly that. As we see more experienced programmers learn R, and more R users become experienced programmers, I think we will see more and more packages that are designed for high efficiency.

Microsoft and IBM have adopted R. There are also commercial companies, such as H2O, that provide R packages with better performance. What do you think about the influence of companies on R development?

I think it’s a great sign of R’s continued evolution and its growing up as a programming language. R is now a critical part of many companies, and that means that there will be more resources to work on R generally. One particularly exciting initiative that I’m involved with is the R Consortium (https://www.r-consortium.org). This is a way for companies to give back to the R community, and have their money be spent to make R better for everyone.

According to you, RStudio is the best development environment for R users. A few readers are concerned that your books may be too focused on RStudio, and suggest that the material would be better decoupled from it.

There are other ways to use R apart from RStudio, and the most popular tool after RStudio is ESS, or Emacs Speaks Statistics. These tools are powerful, but because they’re more tailored for advanced users, I’ve chosen to focus on RStudio in my books. I think that’s a reasonable trade-off: if you don’t use RStudio, you can just ignore the bits that don’t apply (and you’re probably a more experienced R programmer, so you are able to figure out the equivalents yourself).

You've contributed so much to R, particularly in R packages. How could you be so productive?

Here are a few more thoughts from a personal perspective.

Writing. I have worked really hard to build a solid writing habit - I try and write for 60-90 minutes every morning. It's the first thing I do after I get out of bed. I think writing is really helpful to me for a few reasons. First, I often use my writing as a reference - I don't program in C++ every day, so I'm constantly referring to @Rcpp every time I do. Writing also makes me aware of gaps in my knowledge and my tools, and filling in those gaps tends to make me more efficient at tackling new problems.

Reading. I read a lot. I follow about 300 blogs, and keep a pretty close eye on the R tags on Twitter and Stack Overflow. I don't read most things deeply - the majority of content I only briefly skim. But this wide exposure helps me keep up with changes in technology, interesting new programming languages, and what others are doing with data. It's also helpful if, when you're tackling a new problem, you can recognise its basic name - then googling for it will suggest possible solutions. If you don't know the name of a problem, it's very hard to research it.

Chunking. Context-switching is expensive, so if I worked on many packages at the same time, I'd never get anything done. Instead, at any point in time, most of my packages are lying fallow, steadily accumulating issues and ideas for new features. Once a critical mass has accumulated, I'll spend a couple of days on the package.

Finally, it's hard to over-emphasise the impact that working full-time on R makes. Since I left Rice, I have spent well over 90% of my work time thinking about and programming in R. This has a compounding effect because as I build better tools (cognitive and computational) it becomes even easier to build new tools. I can create a new package in seconds, and I have many techniques on-hand (in-brain) for solving new problems.

