Hadley Wickham, author of R Packages: Data Structure Prodigy (Turing Interview)
Hadley Wickham
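Hadley was born in Hamilton, New Zealand, into a family of statisticians. His father, Brian Wickham, holds a PhD from Cornell University in statistics applied to animal breeding, and his sister earned a PhD in statistics from the University of California, Berkeley.
"When I was 15, my first job was developing Microsoft Access databases, which was a lot of fun. I wrote some database documentation at the time, and people are still using the databases I wrote."
Like Hadley, the R programming language comes from New Zealand. R was created in 1993 by the statisticians Ross Ihaka and Robert Gentleman at the University of Auckland. It is used mainly for data analysis, but it has some quirks (such as the way data structures are indexed and the way data is held in physical memory), so most users of other programming languages find R strange. After using Java, VBA and PHP, Hadley found R "distinctive". "(Many programmers) think R is absurd and clumsy. I don't," he says. "I think R is very interesting."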
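- Learn about the most useful components of an R package, including vignettes and unit tests
- Automate tasks with devtools
- Get tips on good coding style, such as how to organize functions into files
- Streamline your development process with devtools
- Find the best way to submit your package to CRAN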
Having seen your picture, some followers ask, “With such good looks, why don't you make a living from your appearance instead of coding?” Fans often praise their stars this way: “With a face like that, (Johnny Depp, etc.) could live comfortably, yet he works hard to improve his acting.” So what is your reason for coding?
I love coding for two main reasons. Firstly, I really enjoy figuring out the underlying structure behind problems that on the surface seem very different. For example, I found it very satisfying to develop the ideas behind tidy data and the tidyr package because I enjoyed figuring out the deeper underlying theory.
Secondly, I really enjoy programming because it helps other people. Producing R packages is a great way to turn my ideas into tools that other people can take advantage of, and I enjoy all the feedback that I get from the R community. Hearing that people are using my code and finding it useful is one of the things that keeps me motivated.
R Packages is available for free online. Don't you fear that this may reduce sales of the print version? And why did you choose to publish the book at all, given that there is little financial incentive?
My goal in writing books is not to make money, but to reach as many people as possible. I think making the book available in both forms achieves this goal well. Younger people who don’t have a lot of money to spend on books can use the website. People who enjoy reading physical books can still buy one, and the marketing around a physical book is more likely to reach people who aren’t as active on the internet.
I know R Packages was written in the open. Could you describe your experience with this kind of crowd-sourcing?
I think writing a book in the open is a truly excellent way to write. One of the challenges of writing a book is that it is a large project that can take one or more years. It’s hard to maintain excitement and motivation about such a big project. However, when you write in the open, you constantly get feedback. This makes it much easier to stay motivated!
I’m also quite bad at proofreading, and I really enjoy that the R community can contribute through GitHub pull requests to fix all of my silly mistakes! People also contribute larger fixes, and point out other problems with the text. Altogether, writing in the open makes the book much better than it would otherwise be!
Could we compare R package development to API design? In addition to encapsulation, robustness and usability, is there anything special that needs attention?
I think there are some general principles that make my packages work together particularly smoothly. Currently, those principles are mostly intuitive to me: I know what to do, but I can’t explain it well enough for other people to learn. I am trying to change that by writing up the principles that underlie the “tidyverse”, and you can find my first attempt at https://github.com/hadley/tidyverse/blob/master/vignettes/manifesto.Rmd. I think these are important principles for the design of R packages because they make an API feel like R, and help packages work together naturally.
R was designed for data analysis, but it has some quirks, such as the way data structures are indexed and the fact that data has to be stored in physical memory. Do you think R could borrow from the memory management approaches of C++ and Spark?
R is not perfect, but I think it does a really good job of making the human data analyst as effective as possible. R is a very flexible language, which means that it’s possible to design domain specific languages like ggplot2 and dplyr that help solve certain subdomains of the data analysis problem. That flexibility has its downsides: generally slower performance. I think it’s worthwhile to have different languages for different domains: R is great for making humans efficient at doing data analysis; C++ is great for making computers calculate as efficiently as possible. I personally don’t believe it’s possible to have one language that does both. (In other words, I believe in Ousterhout’s dichotomy, https://en.wikipedia.org/wiki/Ousterhout%27s_dichotomy).
Statistics and data analysis with R has unique advantages, but efficiency is low. Could C interfaces be used in R package development to build components easily and efficiently?
Yes, and many, many packages now use Rcpp and C++ to do exactly that. As we see more experienced programmers learn R, and more R users become experienced programmers, I think we will see more and more packages that are designed for high efficiency.
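As a rough illustration of the kind of code that tends to move from R into C++, here is a minimal sketch of a tight numeric loop. It is not taken from the interview or from any particular package: the function name `mean_cpp` is hypothetical, and the code uses only the standard library so that it compiles on its own. In a real package the file would normally include `<Rcpp.h>` and mark the function with the `// [[Rcpp::export]]` attribute so that Rcpp generates the R wrapper.

```cpp
#include <cstddef>
#include <vector>

// Arithmetic mean of a numeric vector, written as an explicit loop.
// Hypothetical example: in an R package this would typically be compiled
// through Rcpp; it is plain standard C++ here so it builds by itself.
double mean_cpp(const std::vector<double>& x) {
    // Assumption for this sketch: an empty input returns 0 rather than NaN.
    if (x.empty()) {
        return 0.0;
    }
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        total += x[i];  // accumulate element by element
    }
    return total / static_cast<double>(x.size());
}
```

Once exported through Rcpp, a function like this is called from R like any other R function, while the loop itself runs as compiled code.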
Microsoft and IBM have adopted R. There are also commercial companies, such as H2O, that provide R packages with better performance. What do you think about the influence of companies on R development?
I think it’s a great sign of R’s continued evolution and its growing up as a programming language. R is now a critical part of many companies, and that means that there will be more resources to work on R generally. One particularly exciting initiative that I’m involved with is the R Consortium (https://www.r-consortium.org). This is a way for companies to give back to the R community, and have their money be spent to make R better for everyone.
According to you, RStudio is the best development environment for R users. A few readers are concerned that your books might be too focused on RStudio. They suggest it would be better to decouple the books from RStudio.
There are other ways to use R apart from RStudio, and the most popular tool after RStudio is ESS, or Emacs Speaks Statistics. These tools are powerful, but because they’re more tailored for advanced users, I’ve chosen to focus on RStudio in my books. I think that’s a reasonable trade-off: if you don’t use RStudio, you can just ignore the bits that don’t apply (and you’re probably a more experienced R programmer, so you are able to figure out the equivalents yourself).
You've contributed so much to R, particularly through R packages. How do you manage to be so productive?
Here are a few more thoughts from a personal perspective.
Writing. I have worked really hard to build a solid writing habit - I try and write for 60-90 minutes every morning. It's the first thing I do after I get out of bed. I think writing is really helpful to me for a few reasons. First, I often use my writing as a reference - I don't program in C++ every day, so I'm constantly referring to @Rcpp every time I do. Writing also makes me aware of gaps in my knowledge and my tools, and filling in those gaps tends to make me more efficient at tackling new problems.
Reading. I read a lot. I follow about 300 blogs, and keep a pretty close eye on the R tags on Twitter and Stack Overflow. I don't read most things deeply - the majority of content I only briefly skim. But this wide exposure helps me keep up with changes in technology, interesting new programming languages, and what others are doing with data. It's also helpful if, when you're tackling a new problem, you can recognise its basic name - then googling for it will suggest possible solutions. If you don't know the name of a problem, it's very hard to research it.
Chunking. Context-switching is expensive, so if I worked on many packages at the same time, I'd never get anything done. Instead, at any point in time, most of my packages are lying fallow, steadily accumulating issues and ideas for new features. Once a critical mass has accumulated, I'll spend a couple of days on the package.
Finally, it's hard to over-emphasise the impact that working full-time on R makes. Since I left Rice, I have spent well over 90% of my work time thinking about and programming in R. This has a compounding effect because as I build better tools (cognitive and computational) it becomes even easier to build new tools. I can create a new package in seconds, and I have many techniques on-hand (in-brain) for solving new problems.