Read "DevOps Practice Guide" Note 3

Part V The Third Way: Technical Practices of Continuous Learning and Experimentation

Chapter 19 Integrating Learning into Your Daily Work 180

Foreword

Netflix had a specific design goal: to keep its service running even if an entire AWS availability zone failed, as happened in the US East region outage. To achieve this, the system architecture must be loosely coupled, with aggressive timeouts on every component, so that a failed component cannot drag down the whole system. Furthermore, every Netflix feature and component is designed to degrade gracefully. For example, when CPU usage spikes under a surge in traffic, the service shows users cached static content instead of a personalized movie recommendation list, reducing the demand on computing resources.

It's called "Trouble Monkey". It continuously and randomly deletes production servers to simulate failures in the AWS environment. They do this in the hope that all "engineering teams will be conditioned to work at regular failure levels" so that the service can "automatically recover without any human intervention". Trouble Monkey continuously injects faults into pre-production and production environments, thereby achieving operational recovery goals.

By continually surfacing and addressing these issues during normal working hours, we create organizational learning at the same time.

Chaos Monkey is just one example of integrating learning into daily work. The story also shows how learning organizations think about failures, accidents, and mistakes: as opportunities for learning, not opportunities for punishment.

19.1 Building a just, learning-based culture 181

Because problems are inevitable in the complex systems we build, there is no need to "name, blame, and shame" the people involved in failures. The goal should always be to maximize opportunities for organizational learning and to view errors, mistakes, slips, and lapses from a learning perspective.

When engineers who make mistakes can give a detailed account of them while feeling safe, they are not only willing to be held accountable, but also eager to help others avoid the same mistakes. This is what creates organizational learning.

Two practices are effective at creating a just, learning-based culture: blameless post-mortems, and the controlled injection of failures into production to create opportunities to practice for the problems that inevitably occur in complex systems.

19.2 Hold a blame-free postmortem meeting 182

To help create a just culture, whenever an accident or significant incident occurs (e.g., a failed deployment, a production incident that affects customers), a blameless post-mortem should be held after the problem has been resolved.

In a blameless post-mortem meeting, we do the following:
- construct a timeline and gather all the details about the failure from multiple perspectives;
- allow and encourage the people who made mistakes to become the experts who educate others on how to avoid the same mistakes in the future;
- accept that there is always a discretionary space in which people decide whether or not to act, and that those decisions can only be judged in hindsight;
- develop countermeasures to prevent similar incidents, and ensure that these countermeasures, their target dates, and their owners are documented so they can be tracked.

To gain a sufficient understanding of what happened, the following stakeholders need to be present at the meeting:
- the people involved in the decisions that contributed to the problem;
- the people who identified the problem;
- the people who responded to the problem;
- the people who diagnosed the problem;
- anyone else who is interested in attending.


It is helpful to have a trained facilitator who was not involved in the incident organize and lead the meeting, especially for the first few post-mortems. During the meeting and in the written resolution, the use of phrases such as "could have" and "should have" ought to be explicitly banned, because they are counterfactual statements.

The meeting must allow enough time for brainstorming and for deciding on countermeasures. Once countermeasures have been identified, they must be prioritized, assigned an owner, and given an implementation timeline.

19.3 Make the results of the postmortem meeting publicly available as widely as possible 184

After the blameless post-mortem meeting, the minutes and all related documentation should be made widely available. Ideally, the published information should live in a centralized location that is easy for everyone in the organization to access, so that the whole organization can learn from past incidents.

Making these postmortem documents widely available and encouraging others in the organization to read them can enhance organizational learning.

19.4 Reduce incident tolerance and look for ever-weaker failure signals 185

When working with complex systems, amplifying weak failure signals is critical to averting catastrophic failures. In 2003, the space shuttle Columbia disintegrated on re-entry into the Earth's atmosphere on the sixteenth day of its mission. We now know that a piece of insulating foam had broken off the external fuel tank during launch and struck the orbiter's wing.

Some mid-level NASA engineers had reported the foam strike before Columbia's return, but their concerns were not taken seriously. Foam problems were nothing new: foam shedding had damaged orbiters in previous launches but had never caused a major accident. NASA classified the event as a maintenance issue and took no action. By the time the accident happened, it was too late.

19.5 Redefining failure to encourage risk assessment 186

19.6 Injecting failures in production to recover and learn 186

This section describes rehearsing failures and injecting faults into the system to verify that it has been designed and built so that failures happen in specific, controlled ways. By running these tests regularly (or even continuously), we make sure the system fails gracefully.

Resilience requires that we first define our failure modes and then test that those failure modes behave as designed. One approach is to inject faults into the production environment and rehearse large-scale failures. This gives us confidence that the system will recover from incidents on its own, ideally without even affecting customers.
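
A minimal sketch of what a defined failure mode plus a test for it could look like; the recommendation functions and the simulated failure are illustrative, not from the book:

```python
import logging

# Pre-cached static content served when the personalized path fails.
CACHED_DEFAULT_LIST = ["Popular title 1", "Popular title 2", "Popular title 3"]

def personalized_recommendations(user_id: str) -> list[str]:
    """Placeholder for the expensive personalized computation; may fail or time out."""
    raise TimeoutError("recommendation service overloaded")  # simulated failure for the drill

def recommendations_with_fallback(user_id: str) -> list[str]:
    """Defined failure mode: degrade to cached static content instead of failing the page."""
    try:
        return personalized_recommendations(user_id)
    except Exception:
        logging.warning("personalization failed; serving cached defaults")
        return CACHED_DEFAULT_LIST

# The test for the failure mode: inject the fault and confirm graceful degradation.
assert recommendations_with_fallback("user-42") == CACHED_DEFAULT_LIST
```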

19.7 Creating a failure drill day 187

The goal of the drill day is to help teams simulate and rehearse incidents so that they can actually handle them. First, plan a catastrophic event, such as the simulated destruction of an entire data center, at some point in the future. Then give the teams time to eliminate all single points of failure and to create the necessary monitoring procedures, failover procedures, and so on.

On the drill day, teams define and execute their drills, for example performing a database failover or cutting off an important network connection to expose problems in the defined processes.

By incrementally building more resilient services and greater confidence through drill days, we become able to restore normal operations when the unexpected happens, while also creating more learning opportunities and a more resilient organization.

Through these exercises we can practice and build the runbooks we need. Another outcome of the drill day is that people actually know whom to call and whom to talk to.

19.8 Summary 189

The only sustainable competitive advantage an organization has is the ability to learn faster than its rivals.

Chapter 20 Turning Local Experience into Global Improvement 190

20.1 Using chat rooms and chatbots to automatically accumulate organizational knowledge 190

20.2 Automated, standardized processes in software to facilitate reuse 192

20.3 Creating a single source code repository shared across the organization 192

What we keep in the shared source code repository is not only source code, but also artifacts that capture other learning and knowledge:
- configuration standards for libraries, infrastructure, and environments (Chef recipes, Puppet class files, etc.);
- deployment tools;
- testing standards and tools, including for security;
- deployment pipeline tools;
- monitoring and analysis tools;
- tutorials and standards.

20.4 Using Automated Tests as Documentation and Communities of Practice to Transfer Knowledge 194

A benefit of adopting test-driven development (TDD), in which automated tests are written before the code, is that nearly all of the system's behavior is covered by tests. This turns the test suite into a living, up-to-date specification of the system. Any engineer who wants to know how to use the system can look at the test suite for working examples of how to call its APIs.
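
As a small illustration (the `slugify` function is hypothetical, not from the book), a TDD-style test class doubles as a working usage example of the API it specifies:

```python
import re
import unittest

def slugify(title: str) -> str:
    """Turn an article title into a URL-safe slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

class SlugifyContract(unittest.TestCase):
    """Written before the code (TDD); now documents exactly how the API behaves."""

    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Continuous Learning"), "continuous-learning")

    def test_punctuation_is_dropped(self):
        self.assertEqual(slugify("DevOps: Day 2!"), "devops-day-2")

if __name__ == "__main__":
    unittest.main()
```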

20.5 Designing for Operations by Identifying Non-Functional Requirements 194

The following are examples of non-functional requirements that should be in place:
- sufficient production telemetry for all applications and environments;
- the ability to accurately track dependencies;
- services that are resilient and degrade gracefully;
- backward compatibility between versions;
- the ability to archive data so that production data sets stay manageable;
- the ability to easily search and understand log messages across services;
- the ability to trace user requests across multiple services;
- simple, centralized runtime configuration using feature switches or other methods (a minimal sketch follows this list).

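The last item can be sketched as follows; the flag file path and flag names are assumptions for illustration:

```python
import json
import os

# Hypothetical path to centrally managed runtime configuration,
# e.g. distributed by the configuration management system.
FEATURE_FLAG_FILE = os.environ.get("FEATURE_FLAG_FILE", "/etc/myapp/feature_flags.json")

def feature_enabled(flag_name: str, default: bool = False) -> bool:
    """Return the current value of a feature switch without redeploying the service."""
    try:
        with open(FEATURE_FLAG_FILE) as f:
            flags = json.load(f)
        return bool(flags.get(flag_name, default))
    except OSError:
        return default  # fail closed if the configuration cannot be read

if feature_enabled("new_recommendation_engine"):
    pass  # route traffic to the new code path
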
20.6 Incorporating Reusable Operations User Stories into Development 195

20.7 Ensuring Technology Selection Helps Achieve Organizational Goals 195

For example, when only one team has expertise in a critical service, only that team can make changes or fix problems, which creates a bottleneck. In other words, we may have optimized for that team's productivity while inadvertently impeding the achievement of organizational goals.

To make sure specific technologies get the scrutiny they need, the operations team should participate in technology selection decisions for the production environment.

All the advantages of a schemaless database were outweighed by the operational problems it caused, including logging, graphing, monitoring, production telemetry, backup and restore, and a host of other issues that developers normally do not have to care about. In the end, we gave up on MongoDB.

20.8 Summary 197

Chapter 21 Setting aside time for organizational learning and improvement 198

21.1 Institutionalized Practices for Repaying Technical Debt 199

One of the easiest ways to do this is to schedule and run improvement blitzes (kaizen blitzes) lasting days or weeks, during which everyone on the team (or in the whole organization) self-organizes to fix problems they care about, with no feature work allowed. The focus can be a problem area in the code, the environment, the architecture, the tooling, and so on. These teams often span the entire value stream, combining development, operations, and information security engineers.

The goal of these blitzes is not simply to experiment with and evaluate new technologies, but to improve daily work itself, for example by eliminating the workarounds that have crept into it. While experimentation does lead to improvements, the focus of an improvement blitz is on solving specific problems encountered in daily work.

A hackathon is held every few months, during which everyone prototypes their new ideas. At the end, the whole team gets together and reviews everything that was built. Many of our most successful products came out of hackathons, including Timeline, chat, video, the mobile development framework, and some of our most important infrastructure, such as the HipHop compiler. HipHop converted all of Facebook's production services from interpreted PHP files into compiled C++ binaries, enabling Facebook's platform to handle six times the production load of native PHP.

Through regular improvement blitzes and hack weeks, everyone in the value stream takes pride and ownership in their innovations, and improvements are continually integrated into the system, further increasing safety, reliability, and learning.

21.2 Teaching and Learning for All 200

A weekly learning time is scheduled for peers. During the two-hour session, everyone has to both learn something themselves and teach others. The topics are whatever people want to learn: some are about technology, some about new software development or process improvement methods, and some even about how to better manage their careers.

21.3 Sharing experiences at DevOps conferences 201

21.4 Internal consultants and coaches for spreading practices 203

Google has a "20% innovation time" policy that allows engineers to spend roughly one day a week on Google-related projects outside their primary area of responsibility. Some like-minded engineers have pooled this 20% time into self-organized groups that focus on improvement blitzes.

21.5 Summary 204

This chapter describes how to establish routines that help reinforce a culture of lifelong learning and of valuing the improvement of daily work. This can be achieved by setting aside time to pay down technical debt; by creating forums where people can learn from and mentor each other, both inside and outside the organization; and by having experts help internal teams through coaching, consulting, or simply setting up face-to-face time.

21.6 Summary of Part V 204

Part VI Technical Practices for Integrating Information Security, Change Management, and Compliance

Chapter 22 Making information security part of everyone's day-to-day work 207

22.1 Integrating security into development iteration demonstrations 207

22.2 Integrating Security into Bug Tracking and Postmortem Sessions 208

All known security issues should be tracked in the same issue tracking system that development and operations teams use for their other work, so that security work is visible and can be prioritized alongside everything else.

We put all security issues into JIRA, the system all engineers already use in their daily work, and label them "P1" or "P2".

Whenever a security issue arises, we hold a post-mortem, because it helps engineers learn how to prevent the issue from recurring and is an excellent mechanism for transferring security knowledge to the engineering teams.

22.3 Integrating preventive security controls into shared source code repositories and shared services 208

Any mechanism or tool that helps secure applications and environments should be added to the shared source code repository, including vetted libraries that meet specific information security objectives, such as authentication and encryption libraries and services.
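
As one example of the kind of vetted helper such a repository might contain, here is a password-hashing wrapper built only on the Python standard library; the function names and iteration count are illustrative choices, not prescriptions from the book:

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # deliberately slow; tuned by the security team, not by each project

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    salt = secrets.token_bytes(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, key

def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Constant-time comparison, so callers cannot get this subtly wrong."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(key, expected_key)
```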

We can also provide security training to development and operations teams and help review their deliverables to ensure that security objectives are implemented correctly, especially when a team is using these tools for the first time.

22.4 Integrating Security into the Deployment Pipeline 209

At this stage, information security testing should be automated as much as possible, so that security tests run alongside all the other tests in the deployment pipeline whenever a developer or operations engineer commits code, even in the earliest stages of a software project.

The goal is to give Dev and Ops fast feedback so that they are notified whenever they commit a change that introduces a security risk, allowing security issues to be detected and fixed quickly.
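
A minimal sketch of such a pipeline stage; the two scanner commands are placeholders for whatever tools the organization has standardized on, not real CLI names:

```python
import subprocess
import sys

# Placeholder commands; substitute the organization's standard scanners.
SECURITY_CHECKS = [
    ["static-analyzer", "--scan", "src/"],      # hypothetical static analysis step
    ["dependency-audit", "requirements.txt"],   # hypothetical dependency vulnerability check
]

def run_security_stage() -> int:
    """Run every check; any failure fails the stage so Dev and Ops get feedback immediately."""
    exit_code = 0
    for command in SECURITY_CHECKS:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"security check failed: {' '.join(command)}", file=sys.stderr)
            exit_code = 1
    return exit_code

if __name__ == "__main__":
    sys.exit(run_security_stage())
```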

22.5 Ensuring Application Security 210

These tests run continuously in the deployment pipeline. We expect them to include the following.
- Static analysis: inspects the program code, without running it, for all possible runtime behaviors, looking for coding flaws, backdoors, and potentially malicious code.
- Dynamic analysis: dynamic tests run against the executing program and monitor things such as system memory, functional behavior, response time, and overall system performance.
- Dependent component scanning: inventories all the packages and libraries that our binaries and executables depend on and verifies that these components (which are often outside our control) contain no known vulnerabilities or malicious binaries (a sketch of such a check follows this list).
- Source code integrity and code signing: every developer should have their own PGP key, and everything committed to version control should be signed.

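The dependent-component scanning item can be sketched roughly as follows; the advisory data here is a hard-coded assumption standing in for a real vulnerability feed:

```python
import json
from importlib import metadata

# Hypothetical advisory data; real pipelines would pull this from a vulnerability feed.
KNOWN_VULNERABLE = {"examplelib": ["1.0.0", "1.0.1"]}

def scan_installed_packages() -> list[str]:
    """Inventory every installed distribution and flag versions with known vulnerabilities."""
    findings = []
    for dist in metadata.distributions():
        name = (dist.metadata["Name"] or "").lower()
        if dist.version in KNOWN_VULNERABLE.get(name, []):
            findings.append(f"{name}=={dist.version}")
    return findings

if __name__ == "__main__":
    print(json.dumps({"vulnerable_packages": scan_installed_packages()}, indent=2))
```
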
We should also define design patterns that help developers write code that resists abuse, such as putting rate limits on services and graying out the submit button once it has been pressed (a rate-limiting sketch follows this list). OWASP publishes a great deal of useful guidance, including guidance on the following:
- how to store passwords;
- how to handle forgotten passwords;
- how to handle logging;
- how to prevent cross-site scripting (XSS) vulnerabilities.

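As a sketch of the rate-limiting pattern mentioned above, here is a simple in-process sliding window; the class and parameter names are illustrative:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` calls per client within a sliding `window` of seconds."""

    def __init__(self, limit: int = 5, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.calls = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        queue = self.calls[client_id]
        while queue and now - queue[0] > self.window:
            queue.popleft()          # drop timestamps outside the window
        if len(queue) >= self.limit:
            return False             # reject: the caller is over the limit
        queue.append(now)
        return True

limiter = RateLimiter(limit=5, window=60.0)
if not limiter.allow("user-42"):
    raise PermissionError("too many attempts, slow down")
```
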
22.6 Securing the software supply chain 214

When selecting software, we check whether a project depends on components or libraries with known vulnerabilities, and we help developers choose the components they use deliberately and carefully.

22.7 Securing the environment 215

We use automated tests to ensure that all necessary settings have been applied correctly, including security hardening configurations, database security settings, key lengths, and so on. In addition, we use tests to scan the environment for known vulnerabilities.
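
A sketch of such automated checks, written as ordinary unit tests against a host; the specific hardening values asserted here (disabled root SSH login, restrictive /etc/shadow permissions) are illustrative policy choices:

```python
import stat
import unittest
from pathlib import Path

class EnvironmentHardeningTest(unittest.TestCase):
    """Run against every environment to verify hardening settings were actually applied."""

    def test_root_login_over_ssh_is_disabled(self):
        config = Path("/etc/ssh/sshd_config").read_text()
        self.assertIn("PermitRootLogin no", config)

    def test_shadow_file_is_not_world_readable(self):
        mode = Path("/etc/shadow").stat().st_mode
        self.assertFalse(mode & (stat.S_IROTH | stat.S_IWOTH))

if __name__ == "__main__":
    unittest.main()
```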

22.8 Integrating information security into production telemetry 216

We deploy the necessary monitoring, logging, and alerting, and by integrating security telemetry into the tools that development, QA, and operations already use, everyone in the value stream can see how applications and environments behave in the face of malicious threats, including attackers constantly trying to exploit vulnerabilities, gain unauthorized access, plant backdoors, commit fraud, and carry out denial-of-service and other destructive attacks.

22.9 Building a secure telemetry system into the application 217

For detection to be possible, the application must emit the relevant security telemetry. Examples include:
- successful and unsuccessful user logins;
- user password resets;
- user email address changes;
- user credit card changes.

For example, brute-force login attempts are an early indicator of someone trying to gain unauthorized access, so we can graph the ratio of failed to successful logins. And of course we should set up alerting on these important events so that problems are detected and corrected quickly.
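
A minimal sketch of emitting this telemetry from application code; the counter-based approach and metric names are illustrative, standing in for whatever metrics library the organization uses:

```python
import logging
from collections import Counter

security_events = Counter()

def record_login_attempt(user_id: str, success: bool) -> None:
    """Emit a security telemetry event for every login attempt."""
    key = "login.success" if success else "login.failure"
    security_events[key] += 1
    logging.info("security_event=%s user=%s", key, user_id)

def failed_login_ratio() -> float:
    """Graphed and alerted on: a spike suggests a brute-force attempt in progress."""
    failures = security_events["login.failure"]
    successes = security_events["login.success"]
    return failures / max(successes, 1)
```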

22.10 Establishing a secure telemetry system in the environment 217

We also create comprehensive telemetry in our environments, especially for components running on infrastructure we do not manage ourselves (e.g., hosted environments, the cloud). Certain events need to be monitored and alerted on, including:
- operating system changes (e.g., in production or in the build infrastructure);
- security group changes;
- configuration changes (e.g., OSSEC, Puppet, Chef, Tripwire);
- cloud infrastructure changes (e.g., VPCs, security groups, users, and permissions);
- XSS (cross-site scripting) attempts;
- SQLi (SQL injection) attempts;
- web server errors (e.g., 4XX and 5XX errors).

We should also confirm that logging is configured correctly, so that all telemetry is sent to the right place. When attacks are detected, in addition to recording the event, we can also choose to block access.

Rather than forming a separate anti-fraud or information security department to achieve these goals, those responsibilities were integrated into the DevOps value stream. Security-related telemetry was created and displayed alongside the development- and operations-oriented metrics that every Etsy engineer looks at daily, including the following.
- Abnormal termination of production programs (e.g., segmentation faults, core dumps): "Of particular interest was why certain processes kept dumping core across the entire production environment, triggered repeatedly by traffic. Just as interesting were the HTTP '500 Internal Server Error' responses. These are indicators that a vulnerability is being exploited, that someone is trying to gain unauthorized access to our systems, and that the application urgently needs patching."
- Database syntax errors: "We always look for database syntax errors in our code: they either leave the code open to SQL injection attacks or indicate an attack already in progress. We therefore cannot tolerate database syntax errors in our code, since SQL injection is still one of the primary attack vectors used to compromise systems."
- Signs of SQL injection attacks: "There is an absurdly simple test: we simply alert whenever the keyword UNION ALL appears in a user input field, because it almost always indicates a SQL injection attack. We also added unit tests to make sure this kind of uncontrolled user input can never make it into a database query." (A detection sketch follows this list.)

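A rough sketch of the last two ideas combined: flagging "UNION ALL" in user input for alerting, and keeping user input out of query text by always binding parameters. The schema and function names are hypothetical:

```python
import logging
import sqlite3

def flag_possible_sqli(user_input: str) -> bool:
    """Alert-worthy signal: 'UNION ALL' in a user-supplied field almost always means an attack."""
    suspicious = "union all" in user_input.lower()
    if suspicious:
        logging.warning("possible SQL injection attempt: %r", user_input)
    return suspicious

def find_user(conn: sqlite3.Connection, username: str):
    """User input never reaches the query text: parameters are always bound separately."""
    flag_possible_sqli(username)
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (username,)).fetchone()
```
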
22.11 Securing the deployment pipeline 219

To protect the continuous build, integration, and deployment pipeline, mitigations may include:
- hardening the continuous build and integration servers, and making sure they can be rebuilt in an automated way in case those servers are compromised;
- reviewing all changes committed to the version control system, either through pair programming at commit time or through a code review process between commit and merge into trunk, to keep continuous integration servers from running uncontrolled code;
- detecting when test code containing suspicious API calls (for example, a unit test that accesses the file system or the network) is checked into the repository (see the sketch after this list);
- ensuring that every CI process runs in its own isolated container or virtual machine;
- ensuring that the version control credentials used by the CI system are read-only.

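A sketch of the "suspicious API calls in test code" check, implemented as a simple AST scan over checked-in test files; the module list and file pattern are assumptions:

```python
import ast
import pathlib

# Modules whose presence in test code warrants review before the CI system runs it.
SUSPICIOUS_IMPORTS = {"socket", "subprocess", "ftplib", "urllib"}

def suspicious_test_files(test_dir: str = "tests") -> list[str]:
    """Flag checked-in test files that import modules able to touch the network or OS."""
    flagged = []
    for path in pathlib.Path(test_dir).rglob("test_*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom):
                names = {(node.module or "").split(".")[0]}
            else:
                continue
            if names & SUSPICIOUS_IMPORTS:
                flagged.append(str(path))
                break
    return flagged

if __name__ == "__main__":
    print(suspicious_test_files())
```
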
22.12 Summary 219

Chapter 23 Securing the Deployment Pipeline 220

23.1 Integrating security and compliance into the change approval process 220

23.2 Reclassification of numerous low-risk changes as standard changes 221

23.3 How routine changes are handled 222

23.4 Reducing reliance on separation of duties 224

Separation of duties slows engineers down and reduces the feedback they get on their work, which can hinder the requirements described above. It prevents engineers from taking full responsibility for the quality of their work and diminishes the organization's ability to create learning.

Where possible, we should avoid using separation of duties as a control, and instead choose practices such as pair programming, continuous inspection of code check-ins, and code review, which provide the necessary assurance about the quality of the work.

23.5 Ensure documentation and evidence are retained for auditors and compliance staff 226

23.6 Summary 228

23.7 Summary of Part VI 228

Origin blog.csdn.net/lihuayong/article/details/120245709