Service Management Automation X (SMAX): An Enterprise Application's Road to Microservices

Compared with rapidly changing Internet applications, how can enterprise applications quickly fulfill promises to customers and respond to users and the market in a timely manner? Technological transformation is inevitable, and the rise of microservices gives enterprises an excellent opportunity for it. The transformation, however, has not been smooth sailing: it involves not only changes at the technical level but also changes to management models and ways of working. This article shares some personal views and practical experience from Micro Focus.
Before the main text begins, let me briefly introduce our product. Service Management Automation X (SMAX) comes from Micro Focus, a leading pure-software company, and is based on ITIL (Information Technology Infrastructure Library), a globally recognized set of best practices for IT service management. The basic structure of ITIL is shown in the figure below:

[Figure: the basic structure of ITIL]


Predicament?



In 2017, our entire project team was still developing and releasing products using the waterfall model, with a release cycle of one year per product. Compared with rapidly changing Internet applications, which may ship a new version every month, our product updates were far too slow. Why release so frequently? Frequent releases and updates let you fulfill promises to customers quickly, respond to users and the market in a timely manner, maximize commercial value, and capture the market in small, fast steps. The same holds for enterprise applications: we need to realize business value quickly so that customers can use new features as soon as possible. So, at the company's strategic level, it was decided to quadruple the release frequency, shortening the release cycle from one year to three months.
Achieving this strategic goal posed huge challenges to the existing organizational and technical structures. We needed new tactics to meet the business need. Microservices address exactly the problem of rapid product iteration and offer the following advantages:
  • Better development communication: each service is cohesive and small enough that its code is easy to understand;

  • Each service can be tested, deployed, upgraded, and released independently;

  • The development organization scales easily: a dedicated team can be formed for each service;

  • Better fault isolation: a memory leak in one service will not paralyze the entire system.


In the end, we chose microservices to help us iterate the product and realize business value quickly. At the same time, we adopted agile development and advocated a DevOps culture, hoping that each small team could become autonomous and respond quickly to the market.


Catastrophe?



Once the strategic goal was set, what did we actually do? It can be discussed from two perspectives:
  • The management level

  • The technical level



The microservices transformation at the management level has to be driven from the top down. From management to individual developers, everyone must reach a consensus: without breaking the old, nothing new gets built. We should dare to try new technologies, but never adopt them blindly or pursue technology for technology's sake; an appropriate small-scale verification phase early on is essential, and we must also be decisive about abandoning outdated ideas.
Our development organization has grown to several hundred people, and there is no industry best practice for managing an agile team of that size, so we can only cross the river by feeling for the stones. We are still exploring how to do agile development in the way that works best for us. For large enterprises, the industry does now offer a complete agile framework, SAFe (Scaled Agile Framework), and we are trying it as well.
The technical level
Coming back to the technical level, I will walk through our transformation from the following aspects:
  • Technology selection

  • Service splitting

  • Security

    • Container security

    • Sensitive information protection


  • Enterprise Ready

    • Health checks

    • Zero-downtime upgrades


  • DevOps

    • Pipeline

    • Performance testing

    • Monitoring

    • GitOps


Technology selection



Nowadays, when people talk about microservices, many immediately think of container technology such as Docker.
So what is the relationship between microservices and containers? My answer: microservices do not depend on containers at all.
The idea behind microservices appeared well before container technology, whereas containers as we know them only emerged in 2013, when the company dotCloud open-sourced Docker (and later renamed itself Docker, Inc.). It is entirely possible to build applications on microservice principles without containers; popular frameworks such as Spring Cloud and Dubbo realize the microservices idea without relying on container technology.
Why, then, are microservices and container technology always mentioned together? Mainly for two reasons:
  • Following the microservices philosophy, using containers as the infrastructure enables rapid deployment, rapid iteration, and independent operation;

  • In cloud computing, containers have drawn increasing attention as an infrastructure layer that replaces virtual machines.


Applying container technology to microservices gets you twice the result with half the effort. A microservices architecture inevitably produces many small services, and how to manage them became a very real problem for us. In 2017, container orchestration platforms were flourishing, and we hesitated over which to choose. Many Internet companies were building their own orchestration platforms at the time; self-development gives you full independence, but it also requires a large R&D investment that we could not afford, so we turned to the open-source community. The well-known orchestration platforms then were Kubernetes and Docker Swarm.



It may surprise many people that we started with Docker Swarm. The reason was simple: Docker Swarm is quick to get started with and has a gentle learning curve. But as we used it we realized we had underestimated the complexity of our product, and Docker Swarm fell far short of our needs. In the end we decided to move to Kubernetes. Fortunately, we had not spent too much time on Docker Swarm, and the decision to switch platforms was firm. This is critical during a microservices transformation: when you make a mistake, have the courage to admit it and keep adjusting rather than procrastinating. A certain amount of trial and error is inevitable; don't be afraid of it.
Looking back now, choosing Kubernetes turned out to be the right bet. Kubernetes is a complete distributed-system solution in itself, supporting service discovery, service registration, scaling, cross-host deployment, self-healing, automatic restarts, and other enterprise-ready features, which saved us a great deal of in-house development.
Kubernetes grew out of Google's internal Borg system and, after years of real-world operation, is mature and stable. Thanks to its active community, the backing of the CNCF, and the continuous development of cloud-native technology, we can easily get help from the community, and many excellent projects have solved problems for us, for example HashiCorp's Vault for storing sensitive information and Prometheus for monitoring.


Service splitting



With the technology chosen (Kubernetes and Docker), we could move on to splitting the services. How should the services be split, and to what granularity?
When it comes to service splitting, I have to mention the well-known Conway's Law: a system's architecture tends to mirror the communication structure of the organization that builds it, and the two influence each other.
To obtain greater communication benefits, the transformation to a microservices architecture inevitably drives changes to the organizational structure. We kept adjusting both the technical architecture and the organizational structure, balancing the two, and finally split the product into four modules:
  • Incident management, problem management, change management, service asset/configuration management

  • Tenant management

  • Insight

  • General service


Each business module has a dedicated team of about 3-5 people responsible for its development and maintenance, and each team continues to split its module further as appropriate until it lands as the actual Pods that provide the service.
Personally, I think that for a large enterprise application, given its existing architecture, technical debt, organizational structure, and so on, there is no need to rush into a very fine-grained split. First, the finer the split, the higher the cost and the longer it takes; if it brings no business value in the short term, it is hard to convince the business side to approve the plan. Second, looking ahead, there is no guarantee that today's split will still be reasonable tomorrow. Follow the principle of small steps and quick trial and error.
Each component must be able to operate independently; that is the industry's recognized standard for microservices. As for what counts as a component here, I personally think the definition can be extended to a group of Pods that provide a function together, rather than requiring every single Pod to provide a function on its own. When transforming an existing application into microservices, imposing excessively high standards regardless of the status quo is unwise. We can iterate gradually, letting the architecture evolve and continuously improving the application.
In summary, our splitting principle is to first divide the product into large modules by business function, and then let each team keep splitting further, combining business and technical considerations. In addition, an independent team can be set up to own the functions that are common across microservices, such as the routing module and the authentication and authorization module.



Security



Container Security
We have a dedicated security team in Israel responsible for security scanning of our products. It checks and ensures container security along the following dimensions.
Malicious and vulnerable images. There are thousands of free images on Docker Hub that can be pulled into a container at any time, yet one study found a large number of security vulnerabilities across more than 2,500 Docker images it tested. Choose official or trusted images to avoid introducing vulnerable components or even malicious code. Docker Hub also offers a paid plan that runs a security scan on images to check for known vulnerabilities.
Excessive resource usage. By default, Docker containers run with no resource limits, so an unconstrained container can severely degrade host performance. Set limits on memory, bandwidth, and disk I/O to keep overall performance stable.
Container breakout. If the Dockerfile does not explicitly specify a user and handle permissions, the container runs as root by default. Running containers as root is dangerous: although root inside a container does not necessarily have the same privileges as root on the host, it shares the same UID (0), and if the container runs in privileged mode the two are effectively the same, creating a huge security risk. Follow the principle of least privilege to avoid this.
Image signing and verification. Sign and verify Docker images so that customers can be sure they received the images we released and any malicious tampering is detected.
No sensitive information in images. Do not store sensitive information in Docker images. There are many tools on the market for checking container security; the ones we currently use are Anchore (https://anchore.com/) and Aujas (https://www.aujas.com/).
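To make the resource-limit and least-privilege points above concrete, here is a minimal Kubernetes sketch (the Pod name, image, and limit values are hypothetical) that caps a container's resources and forbids running as root or in privileged mode:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: smax-example                      # hypothetical Pod name
spec:
  containers:
    - name: app
      image: example/smax-service:1.0     # hypothetical image
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "500m"                     # cap CPU to protect the host
          memory: "512Mi"                 # cap memory; the container is OOM-killed beyond this
      securityContext:
        runAsNonRoot: true                # refuse to start if the image would run as UID 0
        runAsUser: 1000                   # run as an unprivileged user
        allowPrivilegeEscalation: false
        privileged: false                 # never run this container in privileged mode
```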
Sensitive information protection
Communication between the Pods of different microservices requires authentication, which means certificates, usernames, and passwords. How can a certificate be distributed to each Pod securely? How should a Pod obtain its password? Thanks to the powerful Kubernetes ecosystem, HashiCorp's Vault meets our needs here. For more on Vault, see the official site: https://www.hashicorp.com/products/vault
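As a hedged sketch of how this can look with Vault's Kubernetes sidecar injector (the role name, secret path, and image below are hypothetical, and the injector is assumed to be installed and configured for the cluster):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smax-api                    # hypothetical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: smax-api
  template:
    metadata:
      labels:
        app: smax-api
      annotations:
        vault.hashicorp.com/agent-inject: "true"          # ask the injector to add a Vault agent sidecar
        vault.hashicorp.com/role: "smax-api"              # Vault Kubernetes-auth role (assumed to exist)
        vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/smax/db"  # assumed secret path
    spec:
      serviceAccountName: smax-api
      containers:
        - name: api
          image: example/smax-api:1.0                     # hypothetical image
```

The injected agent renders the secret into the Pod's filesystem (by default under /vault/secrets/), so the certificate or password never has to be baked into the image or the manifest.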


Enterprise Ready



Health checks
As enterprise-level software, SMAX serves many multinational customers with employees around the world. Operating across time zones means the application must provide stable service around the clock. How do we ensure high availability? In the software industry you often hear the question, "Have you tried restarting it?" Indeed, many problems can be solved by a restart. Using the self-healing capability provided by Kubernetes, we configured reasonable liveness probes and readiness probes for all Pods to improve the availability of the application as a whole.
Kubernetes uses the liveness probe to decide when to restart a container. For example, a Java program may leak memory until it can no longer work properly, while the JVM process keeps running; for this kind of application-level failure, a health check mechanism that detects whether the container still responds normally and restarts it when it does not is very important.
Kubernetes uses the readiness probe to decide whether a container is ready to accept traffic. Only when all containers in a Pod are ready does Kubernetes consider the Pod ready. This signal controls which Pods serve as backends of a Service: Pods that are not ready are removed from the Service's endpoints.
Configuring an effective liveness probe
What should the liveness probe check?
A good liveness probe checks whether the key parts of the application are healthy, exposed through a dedicated URL such as /health that runs the checks and returns the result. Note that the endpoint must not require authentication; otherwise the probe will keep failing and the container will fall into an endless restart loop.
In addition, the check should be limited to the application itself and must not cover external dependencies. For example, a web server that cannot currently reach its database should not be reported as not live.
The liveness probe must be lightweight.
It should not consume significant resources or take too long; otherwise the application would spend its resources on health checks, which defeats the purpose. For Java applications, an HTTP GET probe is best; an exec probe that has to start a JVM consumes far too many resources.
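A minimal sketch of the probe configuration described above (the image, port, paths, and timings are illustrative, not our actual values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: smax-web                     # hypothetical Pod
spec:
  containers:
    - name: web
      image: example/smax-web:1.0    # hypothetical image
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health              # lightweight, unauthenticated, checks the app itself only
          port: 8080
        initialDelaySeconds: 60      # give the JVM time to start before probing
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3          # restart the container after 3 consecutive failures
      readinessProbe:
        httpGet:
          path: /ready               # may also verify that dependencies are reachable
          port: 8080
        periodSeconds: 5
        failureThreshold: 2          # remove the Pod from Service endpoints while failing
```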
Zero-downtime upgrades
Products are continuously iterated; SMAX will never have just one version. In a Kubernetes cluster, how do we upgrade the application with minimal downtime?
Here I have to admire the power of Kubernetes again: it provides a ready-made solution, the rolling update. A Kubernetes rolling update is controlled by two parameters, maxSurge and maxUnavailable, which set the pace of the rollout, and different settings lead to different behavior.
Create an additional new Pod first, then delete an old one (maxUnavailable = 0, maxSurge = 1).
Suppose we have 3 Pods. This configuration allows one extra Pod on top of the original ones (maxSurge = 1), and the number of available Pods may never drop below 3 (maxUnavailable = 0). It guarantees that while a new Pod is not yet serving traffic, old Pods are still there to serve external requests. However, while this avoids downtime, it also poses a new requirement on the application: old and new Pods must be compatible with each other. Another drawback is that the rollout is slow.
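In a Deployment manifest, this first configuration looks roughly as follows (names and image are hypothetical); the other two configurations differ only in the two strategy values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smax-frontend                # hypothetical Deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                    # at most one extra Pod during the rollout
      maxUnavailable: 0              # never drop below the desired 3 available Pods
  selector:
    matchLabels:
      app: smax-frontend
  template:
    metadata:
      labels:
        app: smax-frontend
    spec:
      containers:
        - name: frontend
          image: example/smax-frontend:2.0   # hypothetical new version being rolled out
```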



Delete a Pod first, then add a new one (maxUnavailable = 1, maxSurge = 0).
This configuration does not allow any extra Pods to be created (maxSurge = 0) but allows one Pod to be unavailable (maxUnavailable = 1). Kubernetes first stops a Pod and then creates a new one. The main advantage is that no additional compute resources are needed; after all, creating an extra Pod consumes extra resources.



Update Pods as quickly as possible (maxUnavailable = 1, maxSurge = 1).
This configuration allows one extra Pod to be created (maxSurge = 1) while also allowing one Pod to be unavailable (maxUnavailable = 1). It combines the advantages and disadvantages of the two configurations above and greatly reduces the rollout time.



All three configurations are used in our SMAX product; the key is to choose the one that best suits each service.
That is the microservices migration path of our product, SMAX. Every company's path will be different, and copying someone else's wholesale is not feasible; it has to be combined with your own industry, organizational structure, staffing, and existing technical architecture, constantly making trade-offs and setting priorities. A perfect architecture cannot be reached in one step; it needs continuous evolution and iteration.
With the move to microservices, cluster deployment, testing, and so on became far more complicated, so a new way of working was needed to stay efficient, and this is where DevOps comes in. DevOps is an approach that lets developers and operations staff collaborate more closely to deliver high-quality software faster.


DevOps



Pipeline
Building a microservices cluster involves a huge and complex set of integration and delivery steps. If these steps remain manual, they seriously undermine the goal of fast delivery. We implement continuous integration and continuous delivery (CI/CD) through Jenkins pipelines to automate building, testing, and deploying the application.
The pipeline builds a bridge between development and operations, making it easy for developers to verify new functionality quickly and ensure quality.
Around the pipeline we have also built many tools of our own. For example, in 2017, when kubeadm was not yet an option for us, we developed a set of programs that create a Kubernetes cluster and deploy the entire SMAX stack on it with a single command, which greatly reduces the time developers and testers spend setting up environments.
Performance testing
At present, we use LoadRunner to stress test the entire cluster with about 50 key transactions, which basically cover the core functions of SMAX. Another 150 non-critical transactions are still being developed and debugged so that we can run a more complete and comprehensive stress test, and we will keep expanding the set of transactions over time.
We use the test results of the last released version as a baseline and compare each stress-test run against it to see whether performance has improved or regressed. A regression of around 10% usually puts us on alert, and we invest time and people in analyzing the cause; the LoadRunner reports combined with other monitoring reports help us locate problems faster.



Monitoring
Currently, we monitor the entire cluster from two perspectives: the virtual machines and Kubernetes. Virtual machine monitoring gives us an overall performance picture, while Kubernetes cluster monitoring drills down to the performance of each Pod.
For virtual machine monitoring, we use Zabbix to track the hardware usage of the whole cluster, such as CPU, memory, and disk I/O. These metrics are relatively coarse views of the cluster's overall state, but they give us an intuitive picture of performance trends. When an indicator becomes abnormal (5%-10% above its usual average), we can quickly spot the problem and take steps to analyze it further, for example whether newly introduced code is causing a performance issue.



For Kubernetes cluster monitoring, we built a monitoring system with Prometheus and Grafana. It lets us see each Pod's running state at a glance: when its CPU and memory are under high load, and whether the high-load window matches the time slice of the LoadRunner stress test. If it does not match, we need to analyze what caused the load; and if the proportion of time spent under high load rises above the usual average, that too is a warning sign of possible performance degradation.
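As a hedged sketch of the Pod-level collection side, Prometheus can discover Pods through Kubernetes service discovery; the fragment below keeps only Pods that opt in via a prometheus.io/scrape annotation, a common convention rather than anything specific to SMAX, and assumes Prometheus runs in-cluster with permission to list Pods:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod                          # discover every Pod in the cluster
    relabel_configs:
      # keep only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # carry the namespace and Pod name into the metrics as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```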



GitOps
GitOps stores the desired state of the system as declarative specifications (such as YAML files) in version control software (such as GitHub, GitLab, or Bitbucket), so that every change to the system is auditable and traceable, with a commit time and a committer. The infrastructure can thus be versioned just like the application.
In Kubernetes, all resources are created from declarative specifications (YAML files). When a specification changes, Kubernetes is responsible for reconciling the final state of the cluster with what the specification declares.
GitOps combines Git's version control with Kubernetes' declarative specifications to describe, create, and observe the entire system, forming a development and delivery model for Kubernetes-based infrastructure and applications.
Seen through a CI/CD lens, merging a pull request into the master branch is the CI step, and CD is Kubernetes applying the file changes from that pull request to the cluster.
We use the GitOps approach to provide infrastructure for our DevOps teams, such as Jenkins instances, PostgreSQL databases, Prometheus, and Grafana.
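A minimal sketch of the idea: the desired state lives in Git as an ordinary manifest, and a sync step pushes any merged change to the cluster. Here the sync is shown as a plain kubectl apply for illustration; dedicated GitOps tooling automates the same reconciliation. The file path, namespace, and Deployment are hypothetical.

```yaml
# infra/jenkins/deployment.yaml -- stored in Git; a merged pull request changing
# this file is the "CI" step, and applying it to the cluster is the "CD" step.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins                 # hypothetical in-cluster Jenkins instance
  namespace: devops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins
    spec:
      containers:
        - name: jenkins
          image: jenkins/jenkins:lts
          ports:
            - containerPort: 8080
```

After a pull request touching this file is merged, running kubectl apply -f infra/ from the repository (or letting a pipeline job do it) brings the cluster back in line with what Git declares.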
In DevOps we still have a lot to do, such as autoscaling, traffic monitoring, and log and trace correlation. We are already running some POCs (proofs of concept), and we believe that in the near future these attempts will bring real business value to customers.


Summary



The architecture is always evolving. We will not stop exploring new technologies, using better technology to serve the ever more complex business needs of the future and to realize commercial value quickly.
Q&A



Q: What is the most important thing for an agile team? Which part of your team has been the most successful?
A: I think the most important thing for an agile team is its culture: truly being one team, with everyone of one mind and no cliques. When everyone's goals are aligned, getting things done is much easier. How do we cultivate that culture in our teams? Every team has one or two people who love to share and dig into technology, so we hold regular internal tech-sharing sessions, kept short, and slowly cultivate everyone's geek spirit; those one or two people end up driving the whole team. That is what matters most at the level of a single team. At the level of the whole product department or the whole company, whether management gives its full backing is not the decisive factor; what matters is the courage to follow the chosen path through to the end. The most successful part of our team, I believe, is exactly this culture of being willing to share, and I hope we can keep it.
Q: During an upgrade, Kubernetes keeps the maximum number of Pods online through the rolling update strategy. How does an individual Pod then ensure a smooth handover of traffic, for example what checks are done during the postStart phase?
A: It depends on whether the Pod's workload is stateful (session information, for example), whether old and new Pods can coexist and serve traffic at the same time, and whether the Pod serves online requests or offline work. Based on that, you decide what to do at each stage of the Pod lifecycle. In the simplest terms:
  • If old and new Pods can coexist and the Pod does offline work: in the preStop stage, first let the old Pod go not-ready so that new requests are routed to the new Pod, then terminate the old Pod once it has finished its in-flight offline work (see the sketch after this list).

  • If old and new Pods can coexist but cannot serve traffic at the same time, consider a blue-green deployment and switch traffic through the Service's label selector.

  • For more complicated cases, I would need more information to analyze specifically, and the application code has to do correspondingly more work. We can discuss the details in the group.
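As a minimal sketch of the graceful-shutdown pattern in the first bullet (the image name, port, and timings are hypothetical), the old Pod stops receiving new traffic once it enters termination, and a short preStop delay plus the grace period give it time to drain:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: offline-worker                  # hypothetical Pod name
spec:
  terminationGracePeriodSeconds: 60     # time allowed for in-flight work to finish
  containers:
    - name: worker
      image: example/worker:1.0         # hypothetical image
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
      lifecycle:
        preStop:
          exec:
            # Pause briefly so endpoint updates propagate and new requests
            # stop arriving before the process receives SIGTERM.
            command: ["sh", "-c", "sleep 10"]
```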


Q: What should we pay attention to when splitting services? How do you handle configuration in a microservices architecture? What should we pay attention to when moving old services into the cluster? Can the networks of the new and old services interoperate, and how do you solve it if they cannot?
A: I personally think the points to watch during service splitting can be viewed from both the management and the technical angle:
  • At the management level, the question is how to resolve the conflict between the split services and the existing organizational structure. Management must be determined to break the old pattern, though timely compromises are sometimes unavoidable.

  • At the technical level, split along business lines: keep the same or similar business within one team as much as possible, so that business knowledge does not have to cross team boundaries. Make a coarse split first rather than a fine one, try it, and then let each team decide whether to keep splitting based on the business, for example one large team dividing into two smaller ones.


For configuration, Kubernetes offers ConfigMaps and Secrets, and a complex application can also use an open-source configuration center. As for what to watch when moving old services into the cluster, that is a big question: essentially everything shared today is a point of attention, and it will be organized into articles later, so stay tuned. Kubernetes workloads can communicate with external systems, for example through an ExternalName Service; and if the networks of the new and old services do not interoperate, techniques such as a proxy can be used to bridge them.
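To illustrate the ExternalName approach mentioned above, a minimal sketch (the names and external DNS address are hypothetical); workloads inside the cluster can then reach the old system through the stable in-cluster name legacy-db:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: legacy-db                            # hypothetical in-cluster name for the old service
spec:
  type: ExternalName
  externalName: legacy-db.corp.example.com   # assumed DNS name of the system outside the cluster
```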
Q: Using Kubernetes to manage microservices brings benefits, but it also brings complexity, such as the complexity of the container platform itself. How did you weigh the trade-offs before finally deciding to manage microservices with Kubernetes?
A: First, the complexity of your own business. If the business is simple, there is no need to use Kubernetes; if it is complex, you will inevitably need a full-featured container orchestration platform such as Kubernetes, and some even more complex businesses do secondary development on top of it. Second, the momentum behind CNCF and cloud-native technologies, which revolve around Kubernetes; help from the community matters a great deal and reduces the amount of in-house development required.
Q: Is your infrastructure your company's private cloud or a public cloud? When choosing infrastructure, would you use a public cloud to reduce infrastructure investment and focus on software development?
A: We have our own on-premises data center and also support deployment in the cloud, for example Amazon EKS, Azure AKS, GCP, and Alibaba Cloud; recently I have been trying a deployment on OpenShift. We are an enterprise software vendor, so where the product is finally deployed, on-premises, private cloud, or public cloud, is up to the customer, which means we have to support each platform. But we also run our own SaaS offering, deployed on AWS, which does simplify infrastructure investment and maintenance costs.



Source: blog.51cto.com/14992974/2547582