Netflix的全週期開發者—執行您構建的內容(中英雙語)

靜好先森發表於2019-04-08

原文

The year was 2012 and operating a critical service at Netflix was laborious. Deployments felt like walking through wet sand. Canarying was devolving into verifying endurance (“nothing broke after one week of canarying, let’s push it”) rather than correct functionality. Researching issues felt like bouncing a rubber ball between teams, hard to catch the root cause and harder yet to stop from bouncing between one another. All of these were signs that changes were needed.

2012年的時候在Netflix運營一項關鍵服務很費力的。部署就像在潮溼的沙地上行走。金絲雀部署變成了對耐心的測試(“一週的金絲雀測試都沒問題發生所以我們繼續推進吧”)而不是正確的功能。研究問題就好像皮球一樣在團隊之間被踢來踢去,很難抓住根源。所有這些跡象表明需要做一些改變了。

Fast forward to 2018. Netflix has grown to 125M global members enjoying 140M+ hours of viewing per day. We’ve invested significantly in improving the development and operations story for our engineering teams. Along the way we’ve experimented with many approaches to building and operating our services. We’d like to share one approach, including its pros and cons, that is relatively common within Netflix. We hope that sharing our experiences inspires others to debate the alternatives and learn from our journey.

時間快進到2018年。Netflix的全球使用者已增至1.25億,每天觀看時長超過1.4億小時。我們已經在改善我們的工程團隊的開發和運營方面投入了大量的資金。在此過程中,我們嘗試了許多方法來構建和執行我們的服務。在此,我們願意把其中一種我們內部用得相對普遍的方案,包括它的優缺點拿出來跟大家分享。我們希望我們的經驗分享能給大家一點啟發並且討論出可能的替代方案。

一個團隊的旅程

Edge Engineering is responsible for the first layer of AWS services that must be up for Netflix streaming to work. In the past, Edge Engineering had ops-focused teams and SRE specialists who owned the deploy+operate+support parts of the software life cycle. Releasing a new feature meant devs coordinating with the ops team on things like metrics, alerts, and capacity considerations, and then handing off code for the ops team to deploy and operate. To be effective at running the code and supporting partners, the ops teams needed ongoing training on new features and bug fixes. The primary upside of having a separate ops team was less developer interrupts when things were going well.

Edge Engineering(邊緣工程)負責AWS服務的第一層,Netflix流媒體必須依靠這些服務才能正常工作。在過去,Edge Engineering有專注運維的團隊以及SRE(網站可靠性工程師)專家,他們負責軟體生命週期的部署+運營+支撐這部分。釋出一個新特性意味著開發人員要在度量、警報和容量考慮事項等方面與運維團隊協調,然後將程式碼交給運維團隊進行部署和操作。為了有效地執行程式碼並支援合作伙伴,運維團隊需要持續接受新特性和bug修復方面的培訓。擁有一個單獨的運維團隊的主要好處是,當事情進展順利時,開發人員的干擾更少。

When things didn’t go well, the costs added up. Communication and knowledge transfers between devs and ops/SREs were lossy, requiring additional round trips to debug problems or answer partner questions. Deployment problems had a higher time-to-detect and time-to-resolve due to the ops teams having less direct knowledge of the changes being deployed. The gap between code complete and deployed was much longer than today, with releases happening on the order of weeks rather than days. Feedback went from ops, who directly experienced pains such as lack of alerting/monitoring or performance issues and increased latencies, to devs, who were hearing about those problems second-hand.

當事情進展不順時,成本就會增加。開發人員和ops/SREs之間的交流和資訊交換是有損耗的,需要額外的往返除錯問題或回答合作伙伴的問題。部署問題因為運維團隊對所部署內容的更改了解較少。所以檢測和解決問題需要很長的時間。程式碼完成與部署之間的鴻溝變得更大,釋出往往是以周為量級而不是日。反饋從運維發起,這幫人直接經歷了缺少告警/監控或者效能問題及時延增加這樣的痛苦,然後再傳遞到開發人員這裡時問題已經是二手了。

To improve on this, Edge Engineering experimented with a hybrid model where devs could push code themselves when needed, and also were responsible for off-hours production issues and support requests. This improved the feedback and learning cycles for developers. But, having only partial responsibility left gaps. For example, even though devs could do their own deployments and debug pipeline breakages, they would often defer to the ops release specialist. For the ops-focused people, they were motivated to do the day to day work but found it hard to prioritize automation so that others didn’t need to rely on them.

為了改進這一點,Edge Engineering嘗試了一種混合模型,在這種模型中,開發人員可以在需要的時候自己推送程式碼,同時還負責非工作時間的生產問題和支援請求。這改進了開發人員的反饋和學習週期。但這會出現部分的責任不到位的問題。例如,即使開發人員可以執行他們自己的部署和除錯管道中斷,他們往往也會交給運維處理。對於那些專注於運維的人來說,他們有動力去做每天的工作,但是很難會把無需別人依賴自己的自動化放在優先考慮的位置

In search of a better way, we took a step back and decided to start from first principles. What were we trying to accomplish and why weren’t we being successful?

為了尋找更好的方法,我們退了一步,決定從第一性原理開始。我們想要完成什麼,為什麼我們沒有成功?

軟體生命週期

The purpose of the software life cycle is to optimize “time to value”; to effectively convert ideas into working products and services for customers. Developing and running a software service involves a full set of responsibilities:

軟體生命週期的目的是優化“價值時間”;為客戶有效地將想法轉化為工作產品和服務。開發和執行軟體服務涉及一系列職責:

Netflix的全週期開發者—執行您構建的內容(中英雙語)

We had been segmenting these responsibilities. At an extreme, this means each functional area is owned by a different person/role:

我們一直在劃分這些責任。在極端情況下,這意味著每個功能區由不同的人/角色擁有:

Netflix的全週期開發者—執行您構建的內容(中英雙語)

These specialized roles create efficiencies within each segment while potentially creating inefficiencies across the entire life cycle. Specialists develop expertise in a focused area and optimize what’s needed for that area. They get more effective at solving their piece of the puzzle. But software requires the entire life cycle to deliver value to customers. Having teams of specialists who each own a slice of the life cycle can create silos that slow down end-to-end progress. Grouping differing specialists together into one team can reduce silos, but having different people do each role adds communication overhead, introduces bottlenecks, and inhibits the effectiveness of feedback loops.

這些專門的角色在每一個細分領域內創造出了效能,但是卻有可能造成整個生命週期的低效。專家在其聚焦的領域發展專業知識並針對該領域的需要進行優化。他們在解決特定領域的難題上變得越來越高效。但是軟體需要整個生命週期來為客戶提供價值。各自精通生命週期的一小塊的專家團隊反而可能會製造出資訊孤島,從而減慢端到端的進度。將不同的專家分組到一個團隊中可以減少資訊孤島,但是讓不同的人擔任每個角色會增加溝通開銷,引入瓶頸,並抑制反饋迴環的有效性。

執行構建的內容

To rethink our approach, we drew inspiration from the principles of the devops movement. We could optimize for learning and feedback by breaking down silos and encouraging shared ownership of the full software life cycle:

為了重新思考我們的方法,我們從devops運動的原則中獲得了靈感。我們可以通過打破資訊孤島和鼓勵共享整個軟體生命週期的所有權來優化學習和反饋:

Netflix的全週期開發者—執行您構建的內容(中英雙語)

“Operate what you build” puts the devops principles in action by having the team that develops a system also be responsible for operating and supporting that system. Distributing this responsibility to each development team, rather than externalizing it, creates direct feedback loops and aligns incentives. Teams that feel operational pain are empowered to remediate the pain by changing their system design or code; they are responsible and accountable for both functions. Each development team owns deployment issues, performance bugs, capacity planning, alerting gaps, partner support, and so on.

“運營你開發的東西”通過讓開發系統的團隊也負責系統的運營和支援來踐行devops原則。把這個責任分攤給每一支開發團隊,而不是外化它,這樣就建立直接反饋迴環並且把激勵給統一起來。感受到運維痛苦的團隊被賦權通過改變系統設計或程式碼來治療這種痛苦;他們要負責這兩種職能。每一支開發團隊都要負責部署問題、效能bug、能力規劃、告警差異、夥伴支援等等。

通過開發工具進行擴充套件

Ownership of the full development life cycle adds significantly to what software developers are expected to do. Tooling that simplifies and automates common development needs helps to balance this out. For example, if software developers are expected to manage rollbacks of their services, rich tooling is needed that can both detect and alert them of the problems as well as to aid in the rollback.

對整個開發生命週期的所有權給軟體開發者顯著增加了負擔。這就需要有簡化和自動化共同開發需求的工具來減輕負擔。比方說,如果軟體開發者預期要管理服務的回滾的話,就要有豐富的工具既能檢測到問題並予以告警,又能輔助進行回滾才行。

Netflix created centralized teams (e.g., Cloud Platform, Performance & Reliability Engineering, Engineering Tools) with the mission of developing common tooling and infrastructure to solve problems that every development team has. Those centralized teams act as force multipliers by turning their specialized knowledge into reusable building blocks. For example:

Netflix建立了集中的團隊(例如,雲平臺、效能和可靠性工程、工程工具),其任務是開發通用的工具和基礎設施來解決每個開發團隊都有的問題。這些集中的團隊將他們的專業知識轉化為可重用的構建塊,從而起到了力量倍增器的作用。例如:

Netflix的全週期開發者—執行您構建的內容(中英雙語)

Empowered with these tools in hand, development teams can focus on solving problems within their specific product domain. As additional tooling needs arise, centralized teams assess whether the needs are common across multiple dev teams. When they are, collaborations ensue. Sometimes these local needs are too specific to warrant centralized investment. In that case the development team decides if their need is important enough for them to solve on their own.

有了這些工具在手,開發團隊可以專注於解決他們特定產品領域中的問題。隨著其他工具需求的出現,集中化團隊會評估多個開發團隊是否也有這些需求。如果有,接著就要協作。有時,這些地方需求過於具體,無法保證集中投資。在這種情況下,開發團隊決定他們的需求是否重要到足以讓他們自己解決問題。

Balancing local versus central investment in similar problems is one of the toughest aspects of our approach. In our experience the benefits of finding novel solutions to developer needs are worth the risk of multiple groups creating parallel solutions that will need to converge down the road. Communication and alignment are the keys to success. By starting well-aligned on the needs and how common they are likely to be, we can better match the investment to the benefits to dev teams across Netflix.

對類似問題在區域性與集中投資間進行平衡是我們的方案當中最棘手的地方。按照我們的經驗尋找開發需求的新穎解決方案的好處,是值得冒險讓多支團隊同時開發在將來殊途同歸的解決方案的。溝通與協調是成功的關鍵。通過協調好需求及其共性,我們就能更好地將投資與跨開發團隊的好處進行匹配。

全週期開發者

By combining all of these ideas together, we arrived at a model where a development team, equipped with amazing developer productivity tools, is responsible for the full software life cycle: design, development, test, deploy, operate, and support.

把所有這些想法湊到一起,我們就得出了這麼一個模式,在配備了出色的開發者生產力工具之後,開發團隊將負責整個軟體生命週期:包括設計、開發、測試、部署、運營以及支援。

Netflix的全週期開發者—執行您構建的內容(中英雙語)

Full cycle developers are expected to be knowledgeable and effective in all areas of the software life cycle. For many new-to-Netflix developers, this means ramping up on areas they haven’t focused on before. We run dev bootcamps and other forms of ongoing training to impart this knowledge and build up these skills. Knowledge is necessary but not sufficient; easy-to-use tools for deployment pipelines (e.g., Spinnaker) and monitoring (e.g., Atlas) are also needed for effective full cycle ownership.

全週期開發者需要熟悉軟體生命週期各個領域並且高效。對於很多不熟悉Netflix的開發者來說,這意味著要在自己之前不怎麼關注的領域加把勁。我們開設有dev新兵訓練營及其他持續培訓形式來灌輸這種知識並培養技能。知識是必要非充分條件;部署管道和監控還需要有易用的工具才能支撐高效的全週期開發運營。

Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like “how can I automate what is needed to operate this system?” and “what self-service tool will enable my partners to answer their questions without needing me to be involved?” This helps our teams scale by favoring systems-focused rather than humans-focused thinking and automation over manual approaches.

全週期開發者把工程規範應用到生命週期的各個領域。他們從開發者的角度去評估問題,會提出類似“我如何才能自動化該系統運營所需的東西?”以及“什麼樣的自服務工具能讓我的夥伴回答他們的問題而不需要我的參與?”優先考慮聚焦系統的辦法而不是聚焦於人的辦法,優先考慮自動化而不是手工,這幫助了我們團隊實現伸縮性。

Moving to a full cycle developer model requires a mindset shift. Some developers view design+development, and sometimes testing, as the primary way that they create value. This leads to the anti-pattern of viewing operations as a distraction, favoring short term fixes to operational and support issues so that they can get back to their “real job”. But the “real job” of full cycle developers is to use their software development expertise to solve problems across the full life cycle. A full cycle developer thinks and acts like an SWE, SDET, and SRE. At times they create software that solves business problems, at other times they write test cases for that, and still other times they automate operational aspects of that system.

轉向全週期開發者模式需要理念的轉變。一些開發者認為設計+開發,或者有時候測試才是創造價值的主要手段。這會導致一種反模式,認為運營是分心的事情,更喜歡對運營和支援問題進行短期性質的修補以便能夠回到自己“真正的工作”上去。但是全週期開發者這項“真正的工作”是利用他們的軟體開發知識去解決全生命週期的問題。全週期開發者要像SWE、SDET以及SRE一樣思考和行動。有時候他們要建立軟體去解決商業問題,有時候他們寫相應的測試用例,還有些時候他們會對系統的運營方面進行自動化。

For this model to succeed, teams must be committed to the value it brings and be cognizant of the costs. Teams need to be staffed appropriately with enough headroom to manage builds and deployments, handle production issues, and respond to partner support requests. Time needs to be devoted to training. Tools need to be leveraged and invested in. Partnerships need to be fostered with centralized teams to create reusable components and solutions. All areas of the life cycle need to be considered during planning and retrospectives. Investments like automating alert responses and building self-service partner support tools need to be prioritized alongside business projects. With appropriate staffing, prioritization, and partnerships, teams can be successful at operating what they build. Without these, teams risk overload and burnout.

這一模式要想取得成功,團隊必須為它所帶來的價值做奉獻並且要認識到所需的成本。團隊需要預留合理的人手去管理開發和部署,處理生產問題,並且對夥伴的支援請求作出響應。需要投入時間到培訓上。要利用好工具並且投資於工具。需要跟集中化團隊培養合作關係來建立出可重用的元件和解決方案。規劃和回顧階段要考慮到生命週期的各個領域。除了商業專案以外,像自動化告警響應和開發自服務夥伴支援工具這樣的投資需要優先考慮。有了合適的人力、恰當的優先次序,再加上合作關係,團隊就能成功地運營自己開發的東西。沒有這些,團隊就會有負擔過重精疲力竭的風險。

To apply this model outside of Netflix, adaptations are necessary. The common problems across your dev teams are likely similar — from the need for continuous delivery pipelines, monitoring/observability, and so on. But many companies won’t have the staffing to invest in centralized teams like at Netflix, nor will they need the complexity that Netflix’s scale requires. Netflix’s tools are often open source, and it may be compelling to try them as a first pass. However, other open source and SaaS solutions to these problems can meet most companies needs. Start with analysis of the potential value and count the costs, followed by the mindset-shift. Evaluate what you need and be mindful of bringing in the least complexity necessary.

在Netflix之外的地方應用這一模式需要進行必要的調整。開發團隊之間的共同問題可能是類似的——比如持續交付管道的需求,比如監控/可觀察性等等。但很多公司並沒有像Netflix這樣有足夠的人力投資到集中化團隊上,或者也不需要Netflix這種規模導致的複雜性。Netflix的工具往往是開源的,所以一開始你想嘗試一下也正常。不過這些問題其他的開源和SaaS解決方案也能滿足大多數公司的需求。先從分析潛在價值和計算成本開始沒然後再進行觀念轉變。評估你需要什麼,小心不要引入不必要的複雜性。

權衡

The tech industry has a wide range of ways to solve development and operations needs (see devops topologies for an extensive list). The full cycle model described here is common at Netflix, but has its downsides. Knowing the trade-offs before choosing a model can increase the chance of success.

技術圈有很豐富的手段來解決開放和運營需求(延伸閱讀:devops拓撲)。這裡描述的全週期模型在Netflix很普遍,但這種模式也有缺點。在選擇一種模式前先了解其中的利弊可以提高成功的機率。

With the full cycle model, priority is given to a larger area of ownership and effectiveness in those broader domains through tools. Breadth requires both interest and aptitude in a diverse range of technologies. Some developers prefer focusing on becoming world class experts in a narrow field and our industry needs those types of specialists for some areas. For those experts, the need to be broad, with reasonable depth in each area, may be uncomfortable and sometimes unfulfilling. Some at Netflix prefer to be in an area that needs deep expertise without requiring ongoing breadth and we support them in finding those roles; others enjoy and welcome the broader responsibilities.

在全週期模式下,一個人要管的事情變寬了變多了。而一些開發者偏向於專注成為比較狹窄的領域的世界級專家,在一些領域我們也是需要那種型別的專家的。對於那些專家來說,需要一專多能,對每個領域都懂一些的要求可能會感覺不太舒服而且有時候勉為其難。有些人寧願呆在需要深厚知識不需要持續擴充套件廣度的領域,我們也支援他們去找到這樣的角色;有的則享受並且歡迎承擔更廣的責任。

In our experience with building and operating cloud-based systems, we’ve seen effectiveness with developers who value the breadth that owning the full cycle requires. But that breadth increases each developer’s cognitive load and means a team will balance more priorities every week than if they just focused on one area. We mitigate this by having an on-call rotation where developers take turns handling the deployment + operations + support responsibilities. When done well, that creates space for the others to do the focused, flow-state type work. When not done well, teams devolve into everyone jumping in on high-interrupt work like production issues, which can lead to burnout.

根據我們開發和運營基於雲的系統的經驗,我們見識過哪些重視擁有全週期所需的廣度的開發者的效能。但是這種廣度增加了每一位開發者的認知負荷,這意味著團隊每週將比僅關注一個領域要平衡更多的優先事項。為此我們採取了隨時待命的輪轉來緩解這一點:即讓開發者輪流分擔部署+運營+支援責任。做得不好的情況下,就會出現人人都在當救火隊員去處理生產問題等高中斷的情況,導致所有人精疲力竭。

Tooling and automation help to scale expertise, but no tool will solve every problem in the developer productivity and operations space. Netflix has a “paved road” set of tools and practices that are formally supported by centralized teams. We don’t mandate adoption of those paved roads but encourage adoption by ensuring that development and operations using those technologies is a far better experience than not using them. The downside of our approach is that the ideal of “every team using every feature in every tool for their most important needs” is near impossible to achieve. Realizing the returns on investment for our centralized teams’ solutions requires effort, alignment, and ongoing adaptations.

工具和自動化有助於擴充套件專業知識,但沒有一項工具能解決開發者生產力和運營領域的每一個問題。Netflix有集中化團隊支撐的現成的一套工具和實踐。我們不強求其他團隊一定要用這些,但是通過確保開發和運營採用這些技術的體驗要比不用好得多來鼓勵他們採用。我們的辦法不好之處在於“每一支團隊將每一項工具的每一個功能用到其最重要的需求”上這個理想幾乎是不可能實現的。需要意識到我們集中化團隊解決方案的投資回報需要努力、協調以及持續適配。

結論

The path from 2012 to today has been full of experiments, learning, and adaptations. Edge Engineering, whose earlier experiences motivated finding a better model, is actively applying the full cycle developer model today. Deployments are routine and frequent, canaries take hours instead of days, and developers can quickly research issues and make changes rather than bouncing the responsibilities across teams. Other groups are seeing similar benefits. However, we’re cognizant that we got here by applying and learning from alternate approaches. We expect tomorrow’s needs to motivate further evolution.

從2012年走到今天經歷了種種實驗、學習和適配的過程。Edge Engineering的早期經歷刺激了尋找更好模式的需求,從此全週期開發者模式就被我們積極地應用到今天。部署是日常,進行得很頻繁,金絲雀行動只需要數小時而不是數日了,開發者可以迅速調研問題作出變更而不是在團隊之間踢皮球。其他的團隊也看到了類似的好處。然而,我們認識到我們是通過應用替代方案並從中學習才走到今天的。我們預期將來的需求還會推動進一步的演進。

Interested in seeing this model in action? Want to be a part of exploring how we evolve our approaches for the future? Consider joining us.

相關文章