The AWS Well-Architected Framework Module 2 - Operational Excellence Pillar
This post covers the Operational Excellence pillar, the second of the nine training modules on the AWS Well-Architected Framework. The framework is abbreviated WAF below; here WAF does not mean Web Application Firewall.
source: https://explore.skillbuilder.aws/learn/course/2045/play/27515/aws-well-architected-module-2-operational-excellence-pillar
1. Module objectives
- Describe the features of the Operational Excellence pillar.
- Describe the design principles of the Operational Excellence pillar.
- Describe the best practices of the Operational Excellence pillar.
- Describe the common uses of the WAF.
2. Operational Excellence pillar (overview)
The Operational Excellence pillar focuses on how your organization supports your business objectives, your ability to run and monitor systems to deliver business value, and the continual improvement of supporting processes and procedures.
The focus areas include:
- Organization - Customers need to understand their organization’s priorities, its structure, and how it supports team members so that they can support business outcomes.
- Prepare - Customers need to design their architecture for operations. They need to review the readiness of their workloads and teams in order to make informed decisions.
- Operate - Customers need to know how to operate their workloads and understand the health of their workloads and operations activities. With this understanding, they can identify when organizational and business outcomes are at risk and respond appropriately.
- Evolve - Customers need a process for continuous improvement of both their workloads and their operations activities. This includes implementing feedback loops, learning from experience, making improvements, and sharing what is learned to benefit the entire organization.
3. Operational Excellence design principles
In a traditional environment: (the operational pain points)
- Manual changes: Changes were frequently made by human beings, following rulebooks that were often out of date.
- Batch changes: Because making changes was difficult and risky, customers tended not to do it often and therefore batched changes into large releases. (Either nothing changes, or everything changes at once.)
- Rarely run Game Days: Customers rarely simulated failures or events because they did not have the systems or human capacity to do so.
- No time to learn from mistakes: Things often moved so fast that customers went from one reactive situation to the next with no time to learn.
- Stale documentation: Because of how changes were implemented, it was difficult to keep information current.
In the cloud, these constraints are removed, and customers (you) can: (in the AWS cloud, the pain points above are addressed)
- Perform operations as code: Apply the same engineering discipline used for application code to the entire environment and to operations activities. (standardized operations)
- Make frequent, small, reversible changes and refine operations procedures frequently: Through automation, customers can reduce the effort needed to make changes and adopt a standard of frequent, incremental, reversible changes, which reduces risk. (iterate in small steps)
- Anticipate and learn from failure: Customers can build organizational muscle memory by running game days that simulate failures to test recovery processes, and can learn from these exercises and from all operational events to improve their responses. (failure drills)
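The "frequent, small, reversible changes" principle can be illustrated with a minimal sketch. Everything named here (`Change`, `roll_out`, `set_value`, the configuration keys, and the health check) is hypothetical and exists only for illustration; it is not an AWS API. The idea is that each change carries its own undo step, so a change that fails validation is reverted on its own without affecting the others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Config = Dict[str, int]


@dataclass
class Change:
    """One small change plus the step that undoes it (hypothetical model)."""
    name: str
    apply: Callable[[Config], None]
    revert: Callable[[Config], None]


def roll_out(config: Config, changes: List[Change],
             healthy: Callable[[Config], bool]) -> List[str]:
    """Apply changes one at a time and revert any that fails the health check.

    Returns the names of the changes that were kept.
    """
    kept = []
    for change in changes:
        change.apply(config)
        if healthy(config):
            kept.append(change.name)
        else:
            change.revert(config)  # the change is small, so the rollback is too
    return kept


def set_value(key: str, new: int, old: int) -> Change:
    """Build a reversible change that sets one configuration key."""
    return Change(
        name=f"{key}={new}",
        apply=lambda cfg: cfg.__setitem__(key, new),
        revert=lambda cfg: cfg.__setitem__(key, old),
    )


if __name__ == "__main__":
    config = {"workers": 2, "timeout_s": 30}
    changes = [set_value("workers", 4, 2), set_value("timeout_s", 0, 30)]
    # Assumed health check: a zero timeout counts as unhealthy.
    print(roll_out(config, changes, lambda cfg: cfg["timeout_s"] > 0))
    print(config)
```

Because each change is validated separately, the bad timeout is rolled back while the worker change stays in place, which is harder to achieve when both are bundled into one large release.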
4. Operational Excellence: Organization
- Operational priorities: There needs to be a common understanding across the customer's organization of the business value of the workload and the role of teams in supporting it. Teams need shared business goals in order to set the priorities that will enable business success.
- Operating model: Teams need to understand their role in the success of other teams, and the role of other teams in their own success. Understanding responsibility and ownership helps focus efforts and maximize their benefit.
- Organizational culture: Customers need to support their team members so that they can be more effective in taking action and supporting business outcomes.
An example of an Operational Excellence best practice in the focus area of Organization is understanding business needs.
- Business and development teams: Businesses exist to serve the needs of the customer. Development and operations exist to serve the needs of the business. Priorities are shaped by business and customer needs.
- Internal and external requirements: External factors such as regulatory compliance requirements or industry best practices may also influence priorities. When setting priorities, make informed decisions that consider both benefits and risks.
- Trade-offs: If speed to market is a priority, reliability may be a near-term trade-off. If performance is a priority, cost optimization may be a trade-off. Priorities should be updated as needs change.
5. Operational Excellence: Prepare
- Design telemetry focuses on ensuring that the workload provides the information the customer needs to understand its internal state across all its components.
- Improve flow emphasizes practices that accelerate the movement of beneficial changes into production, limit the issues that move forward, and make it possible to identify and fix issues before they reach production.
- Mitigate deployment risks focuses on practices that let the customer quickly identify when changes have not had the desired outcome, and that enable rapid recovery from those issues.
- Operational readiness focuses on knowing how ready the workload is to enter production and how ready customer teams are to support it. By understanding the risks of operating the workload in production, customers can make an informed decision about whether to do so.
An example of an Operational Excellence best practice in the focus area of Prepare is to use multiple environments.
- Customers should provide their developers with individual sandboxes for experimentation and individual developer environments. This lets developers work in parallel without affecting each other’s efforts.
- They should use infrastructure as code and configuration management systems to deploy environments that are as consistent with production as possible but scaled for purpose.
- Security controls should increase in environments as they get closer to production. This helps ensure that behaviors and test results are indicative of what should be expected in production.
- When environments are not in use, customers should turn them off to avoid the cost of idle resources.
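The last point, turning off environments that are not in use, can be sketched as a simple selection rule. This is an illustrative sketch only: the `Environment` record, the stage labels, and the eight-hour idle limit are assumptions, and the code only decides which environments to stop; it does not call any cloud API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Environment:
    name: str
    stage: str            # assumed labels: "sandbox", "test", "production"
    last_used: datetime


def environments_to_stop(envs: List[Environment], now: datetime,
                         max_idle: timedelta) -> List[str]:
    """Return names of non-production environments idle longer than max_idle."""
    return [
        env.name
        for env in envs
        if env.stage != "production" and now - env.last_used > max_idle
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    envs = [
        Environment("sandbox-alice", "sandbox", datetime(2024, 1, 10, 9, 0)),
        Environment("test-1", "test", datetime(2024, 1, 8, 12, 0)),
        Environment("prod", "production", datetime(2024, 1, 1, 0, 0)),
    ]
    # Assumed policy: stop anything outside production idle for over 8 hours.
    print(environments_to_stop(envs, now, timedelta(hours=8)))
```

Production is excluded by stage rather than by idle time, so a quiet production system is never selected by this rule.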
6. Operational Excellence: Operate
Operate focuses on understanding the health of the customer's workload and operations activities, deriving useful business and technical insights, and responding to operational events.
- Customers must define, capture, and analyze both workload and operations metrics that provide visibility into workload health, the success of operations activities, and the achievement of business outcomes.
- Deriving technical and business insights: An example of a technical insight from workload metrics is the impact on the health of a workload component following a change. An example of a business insight from operations metrics is the frequency of successful feature updates to a customer-facing system.
- Customers should establish baseline expected metric values and thresholds for improvement, investigation, and intervention. When an event occurs and a threshold is breached, this enables customers to respond appropriately. Events may be planned, such as sales promotions, deployments, and Game Days, or unplanned, such as surges in utilization or component failures. Customers should use runbooks and playbooks to enable consistent responses to events.
(PS: A runbook documents a single process or task as a set of instructions to carry out. In the general management sense, a playbook documents broader goals and strategy as a set of guidelines, so runbooks have a narrow focus while playbooks cover org-wide aspects. AWS operational guidance uses "playbook" more narrowly, for a documented process for investigating an issue.)
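The baseline-and-thresholds idea above can be sketched as follows. The `Thresholds` class, the `assess` function, and the latency figures are all made up for illustration, and the sketch assumes a metric where higher values are worse. A reading is mapped to the strongest action whose threshold it has reached.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Thresholds:
    """Baseline and action thresholds for one metric (higher is worse here)."""
    baseline: float
    improve: float
    investigate: float
    intervene: float


def assess(value: float, t: Thresholds) -> Tuple[str, float]:
    """Return the action a reading calls for and its deviation from baseline."""
    if value >= t.intervene:
        action = "intervene"
    elif value >= t.investigate:
        action = "investigate"
    elif value >= t.improve:
        action = "improve"
    else:
        action = "ok"
    return action, value - t.baseline


if __name__ == "__main__":
    # Assumed example: request latency in milliseconds.
    latency = Thresholds(baseline=120.0, improve=150.0,
                         investigate=200.0, intervene=300.0)
    for reading in (130.0, 160.0, 250.0, 400.0):
        print(reading, assess(reading, latency))
```

In practice each action would point to the runbook or playbook that describes the response; that link is left out here to keep the sketch short.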
An example of an Operational Excellence best practice in Operate is having a process to manage events, incidents, and problems.
Customers should have defined processes to respond to:
- Observed events
- Incidents - Events that require intervention
- Problems - Events that require intervention and either could happen again or cannot currently be resolved
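The three categories above follow a simple decision rule, sketched here under assumed names. `Observation`, its fields, and `triage` are hypothetical; only the classification logic comes from the definitions in the list.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    EVENT = "event"
    INCIDENT = "incident"
    PROBLEM = "problem"


@dataclass
class Observation:
    summary: str
    needs_intervention: bool
    may_recur: bool = False
    resolvable_now: bool = True


def triage(obs: Observation) -> Kind:
    """Classify an observation using the definitions in the list above."""
    if not obs.needs_intervention:
        return Kind.EVENT
    if obs.may_recur or not obs.resolvable_now:
        return Kind.PROBLEM
    return Kind.INCIDENT


if __name__ == "__main__":
    samples = [
        Observation("nightly deployment finished", needs_intervention=False),
        Observation("disk full on one host", needs_intervention=True),
        Observation("memory leak after each release", needs_intervention=True,
                    may_recur=True),
    ]
    for obs in samples:
        print(obs.summary, "->", triage(obs).value)
```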
Any event for which you raise an alert should have a well-defined response in the form of a runbook or playbook.
Defined alerts should be owned by a role or team that is accountable for defining the response and owning escalation.
The defined response should state what initiates escalation, describe the escalation process, and identify a specific owner for each action. Escalations may include external parties, such as third-party vendors or AWS Support.
Users should be notified when the services they consume are impacted and when those services return to normal operation. This enables them to take action if necessary.
Customers should communicate status through dashboards tailored to their target audience, and through push notifications such as email.
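The requirements above for an alert (an owner, a runbook or playbook, and an escalation path with a trigger and an owner per step) can be checked mechanically. This is a minimal sketch with hypothetical names (`AlertDefinition`, `EscalationStep`, `gaps`) and example values; it is not the schema of any real alerting tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationStep:
    trigger: str   # what initiates this escalation step
    owner: str     # who is responsible for acting on it


@dataclass
class AlertDefinition:
    name: str
    owner: str = ""          # role or team accountable for the alert
    response_doc: str = ""   # reference to a runbook or playbook
    escalation: List[EscalationStep] = field(default_factory=list)


def gaps(alert: AlertDefinition) -> List[str]:
    """List what is missing before the alert has a well-defined response."""
    problems = []
    if not alert.owner:
        problems.append("no owning role or team")
    if not alert.response_doc:
        problems.append("no runbook or playbook")
    if not alert.escalation:
        problems.append("no escalation path")
    for i, step in enumerate(alert.escalation, start=1):
        if not step.trigger:
            problems.append(f"escalation step {i} has no trigger")
        if not step.owner:
            problems.append(f"escalation step {i} has no owner")
    return problems


if __name__ == "__main__":
    alert = AlertDefinition(
        name="checkout-latency-high",
        owner="payments-on-call",
        response_doc="runbooks/checkout-latency",
        escalation=[EscalationStep("unresolved after 30 minutes", "")],
    )
    print(gaps(alert))
```

A check like this could run whenever alert definitions change, so that an alert without an owner or a response is caught before it ever fires.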
7. Operational Excellence: Evolve
- Learn from experience: Customers should evaluate opportunities for improvement captured from the analysis of customer-impacting events, feedback loops, lessons learned, metrics analysis, and cross-team reviews.
- Make improvements: Customers need to validate and prioritize their opportunities, make changes, and evaluate outcomes. If the changes do not yield the desired results, they should consider other approaches.
- Share learning: To maximize the benefit of improvement efforts, customers should share both what they learn and the artifacts they create across their teams. (Here, artifacts are the work products a team produces, such as documents, code, and runbooks, not artifacts in the experimental sense.)
An example of an Operational Excellence best practice in the area of Evolve is incorporating feedback loops.
- Activities should include feedback loops to identify areas for improvement.
- Feedback should be used to prioritize and drive improvements.
- Feedback should come from operations activities, from user experience, and from business and development teams. An example feedback mechanism is a regular review of operations and workload metrics with members of the business, development, and operations teams. Use these meetings to identify changes in business needs, validate insights, and determine opportunities and methods for improvement.
- Customers can evaluate feedback over time to recognize the success of improvements.
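Evaluating feedback over time can be as simple as comparing the same metric before and after an improvement. The sketch below uses the share of successful feature updates as the metric; the function names and the sample outcomes are invented for illustration.

```python
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Share of updates that succeeded; outcomes must not be empty."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def improvement(before: List[bool], after: List[bool]) -> float:
    """Change in success rate between two review periods."""
    return success_rate(after) - success_rate(before)


if __name__ == "__main__":
    # Assumed data: True means the feature update succeeded.
    before = [True, False, True, False]
    after = [True, True, True, False]
    print(f"before: {success_rate(before):.0%}")
    print(f"after:  {success_rate(after):.0%}")
    print(f"change: {improvement(before, after):+.0%}")
```

A positive change suggests the improvement worked; a flat or negative one is a signal to consider another approach.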
8. Summary of key points
The key points of the Operational Excellence pillar include:
- Understand business priorities.
- Design for operations, architecting for insight into the health of the workload and the success of operations activities.
- Evaluate operational readiness, for both the workload and the team.
- Understand the health of the workload and operations activities, gaining business and operational insights.
- Prepare for and respond to events using runbooks and playbooks.
- Learn from experience and make improvements. To maximize the benefit of that experience, share what is learned across the organization.