Architects from Vipshop, Didi, and Hujiang share practical experience on microservice granularity, high availability, and continuous delivery (Part 2)

Architect Group Exchange Meeting: each session picks one of the most popular technical topics for practical experience sharing.

In this group exchange meeting, Huang Kai of Hujiang, Zheng Minghua of Vipshop, Zhao Wei of Didi, and Xiaoqin of Qiniu were invited to exchange views on microservice granularity, high availability, and continuous delivery.

This issue continues Part 1, in which the architects from Vipshop, Didi, and Hujiang shared and exchanged practice on microservice granularity, high availability, and continuous delivery.

The first round: topic exchange

Didi Zhao Wei: During the evolution from a monolithic service to microservices, how do you keep the normal development of the business from being affected?

Vipshop Zheng Minghua: There are two ways to move from a monolithic service to microservices. One is to do it bit by bit, changing a little at a time. That takes a long time, and sometimes you find a piece simply cannot be changed because the cost is too high. Let me give an example from when we refactored the order system: we went piece by piece and split logistics out first, because logistics is relatively independent. How did we take the logistics order out first? We let logistics keep its own business logic and data and build its own model, and the business above logistics did one extra thing: double writing. On every write, both sides are written, the old order system and the new order system. The complexity of double writing is still quite high.

Didi Zhao Wei: Yes, data consistency is really troublesome.

Vipshop Zheng Minghua: We verify on read. Writes are double writes, but reads carry a check: we read from both the new and the old store and compare. If the two copies agree, fine; if a discrepancy is found, the old data takes precedence and an alarm is raised.

Didi Zhao Wei: How do you judge which side is the old data? For example, when I place an order, the write goes to both the new system and the old system.

Vipshop Zheng Minghua: There is no comparison at write time, only at read time. Let me give an example, going back to what I just said. When writing, we indeed write to two different databases. We have two ways to compare. One is a BCP system: both copies of the data flow into another place that checks them for you. We rarely use that method; apart from a few platforms, the logistics system does not use it. The other way: when a customer reads an order, the interface reads both copies and compares whether they are consistent. If they are inconsistent, the data from the old interface prevails, and the new interface's data raises an alarm, possibly with manual follow-up, to see where the problem is.
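To make the pattern concrete, here is a minimal Java sketch of the double-write-on-write, compare-on-read approach described above. The Order type, OrderStore interfaces, and alarm hook are hypothetical stand-ins for illustration, not Vipshop's actual code.

```java
import java.util.Objects;

/** Hypothetical order record and stores, for illustration only. */
record Order(String id, String status) {}

interface OrderStore {
    void save(Order order);
    Order load(String id);
}

/** Double write on the write path; compare on the read path, old data wins on mismatch. */
class DualWriteOrderService {
    private final OrderStore oldStore;
    private final OrderStore newStore;

    DualWriteOrderService(OrderStore oldStore, OrderStore newStore) {
        this.oldStore = oldStore;
        this.newStore = newStore;
    }

    void placeOrder(Order order) {
        oldStore.save(order);   // old system remains the source of truth
        newStore.save(order);   // new system is written in parallel
    }

    Order readOrder(String id) {
        Order oldCopy = oldStore.load(id);
        Order newCopy = newStore.load(id);
        if (!Objects.equals(oldCopy, newCopy)) {
            // mismatch: keep serving the old copy and raise an alarm for manual follow-up
            alarm("order mismatch for id=" + id);
        }
        return oldCopy;
    }

    private void alarm(String message) {
        System.err.println("[ALARM] " + message); // stand-in for a real alerting system
    }
}
```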

Didi Zhao Wei: How is it working now?



Vipshop Zheng Minghua: This is a fairly traditional practice, and it works. At my previous company we basically solved it the way I just described, and we had a lot to split: logistics, finance, foreign exchange, tax rebates, and orders were split out one by one following this model. But as you said, it is impossible not to affect the business at all: splitting orders and splitting the business systems takes people away from those business systems, and that alone affects them. There is another, more straightforward way, which is to build a second system. Once the whole new system is online, new features are implemented in both systems, until one day you simply retire the old system.

Didi Zhao Wei: Does that involve data migration?

Vipshop Zheng Minghua: Either way involves data migration. Once the data model changes, data has to be cleaned and migrated. If the old and new data models are incompatible, some data cannot be migrated, and the migration will lose it.

Didi Zhao Wei: Do you need to stop the service during the migration?

Vipshop Zheng Minghua: Basically there is no need to stop the service. So far, including when I was at Ali, we never stopped the service, because the new system is already handling new business and data exists on both sides, so I can simply move the old data over without affecting the live business.
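As an illustration of moving old data over while double writes keep new traffic in sync, here is a hedged sketch of a batch backfill job. The LegacyOrderDao, NewOrderDao, Order type, and batch size are assumptions for illustration, not the migration tooling actually used.

```java
import java.util.List;

/** Hypothetical batch backfill: copy historical orders into the new store while
 *  double writes keep new orders in sync, so the service never has to stop. */
class OrderBackfill {
    record Order(String id, String payload) {}
    interface LegacyOrderDao { List<Order> fetchBatch(String afterId, int batchSize); }
    interface NewOrderDao    { void upsert(Order order); }

    private final LegacyOrderDao legacy;
    private final NewOrderDao target;

    OrderBackfill(LegacyOrderDao legacy, NewOrderDao target) {
        this.legacy = legacy;
        this.target = target;
    }

    void run() {
        String cursor = "";                              // walk the old table in id order
        List<Order> batch;
        while (!(batch = legacy.fetchBatch(cursor, 500)).isEmpty()) {
            for (Order order : batch) {
                target.upsert(order);                    // idempotent upsert, safe to re-run
            }
            cursor = batch.get(batch.size() - 1).id();   // advance the cursor past this batch
        }
    }
}
```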

Didi Zhao Wei: We designed a microservice architecture from the start and never went through the evolution from monolithic services to microservices, so I am very curious: in the whole process from monolith to microservices, what points do you think deserve attention, where do the challenges lie, and what must be considered or done well?

Vipshop Zheng Minghua: After splitting a large system into multiple services, each reaching down to its own database, how do you handle the resulting distributed problems? There are indeed issues to think through, including message middleware across the many services. When services communicate with each other through messages, how to guarantee the reliability and ordering of messages has to be considered from the technical point of view.

Then there is how to split services from the business perspective. I want each business to be self-contained and independently deployable. For example, order splitting and order sourcing are two different services, and they should be completely decoupled. Orders are split by supplier or merchant, and only after an order is split can it be dispatched to different warehouses; these are two sequential actions that should not be intertwined. They correspond to different services, should be completely isolated, and their databases should be isolated as well. So from the business perspective, as long as the business is well understood, splitting services is not a big problem.
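On the message reliability and ordering concern mentioned above: with at-least-once delivery, a consumer typically has to tolerate duplicates and out-of-order events. Below is a minimal sketch under that assumption; the OrderEvent envelope with a message id and per-order sequence number is hypothetical, and in practice the dedup set and sequence map would live in Redis or a database rather than in memory.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical message envelope carrying a unique id and a per-order sequence number. */
record OrderEvent(String messageId, String orderId, long sequence, String payload) {}

/** With at-least-once delivery, duplicates and reordering must be handled by the consumer. */
class IdempotentOrderedConsumer {
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Long> lastSeqPerOrder = new ConcurrentHashMap<>();
    private final Consumer<OrderEvent> handler;

    IdempotentOrderedConsumer(Consumer<OrderEvent> handler) { this.handler = handler; }

    void onMessage(OrderEvent event) {
        if (!processedIds.add(event.messageId())) {
            return;                                   // duplicate delivery: ignore
        }
        long lastSeq = lastSeqPerOrder.getOrDefault(event.orderId(), -1L);
        if (event.sequence() <= lastSeq) {
            return;                                   // stale or out-of-order event: ignore
        }
        handler.accept(event);                        // business processing
        lastSeqPerOrder.put(event.orderId(), event.sequence());
    }
}
```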

Qiniu Xiaoqin: Splitting services definitely makes the structure more complex, but microservices acknowledge this in the concept itself: the service architecture is designed with deployment in mind, and operability ranks first among the architecture's priorities. Traditional design patterns and software architectures basically did not consider deployment and operations. So if you want to support a microservice architecture, you must have an effective set of operation and deployment tools and methods, which is one of the reasons container technologies and container clouds are attracting so much attention.

Problems of dependency and complexity come down to how boundaries and interfaces are defined and how data is connected when splitting: whether the design is reasonable enough, whether the service provider is clear about the requirements, and whether the caller understands the intent of the interface. In other words, whether the teams behind each service have the same level of understanding of what the service is for and how its interface is abstracted, and have reached consensus. As long as the interface is stable and reasonable, no matter how the implementation changes, it will not negatively affect the overall architecture, and local changes to a service can be made faster because no adjustment of a large system is involved.

Therefore, do not split for the sake of splitting; the intent of a split must map precisely to the problem being solved. In a system, defining the interface matters more than how it is implemented. Do not design interfaces that are hard to understand or unreasonable.

Didi Zhao Wei: I have been thinking about this too. If I were to take a monolithic service to microservices, I think monitoring and alerting must be in place first, and then the degradation of services probably also has to be thought through.

Vipshop Zheng Minghua: You are touching on a big problem. For a large Internet system, having no good monitoring is a disaster.

Didi Zhao Wei: As far as I know, those things tend to be missing during the transition from monolithic services to microservices. Beyond that, there is also the issue of people's capabilities: an engineer used to a monolithic service may find it uncomfortable to suddenly switch to microservices, including changing how he locates problems.

Vipshop Zheng Minghua: Both Ali and Vipshop have fairly complete monitoring. Once our service framework is used, every call from the outside into our backend services, and every call down to the data layer, is monitored in detail, including the time spent in each link; the framework monitors it for you. Take Taobao as an example: Taobao has the EagleEye system, which can trace every link of a call chain, including the time spent on each hop, and Taobao also has a detailed alerting system. Flow control, rate limiting, and degradation all have detailed plans: how a service registers, how it is discovered, how to alert when a machine fails, how to remove it from the serving pool, and so on. These are the basic facilities for doing microservices. Without them you rely on people; with a few hundred machines and a hundred or so services you might still cope, but with tens of thousands of services and tens of thousands of machines it is simply not possible.

So such basic facilities are a necessity for large-scale Internet systems. Without them, I don't think you can make a living on the Internet: you cannot sleep, because you never know where something is going wrong. With good monitoring facilities we can detect problems in time, and even small bugs may hide major dangers behind them.

I once ran into an incident where a disk filled up, because all the application logs were kept on it and disk utilization was not being monitored at the time, and the online service became abnormal. Monitoring of CPU, I/O, and disk, together with degradation, rate limiting, and system stress testing, are all measures I consider necessary for large-scale Internet systems.
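As one concrete example of the "rate limiting" facility mentioned above, here is a minimal token-bucket limiter sketch. It is a generic illustration, not any of these companies' real implementations; a gateway filter or service entry point could call tryAcquire() per request and return a degraded response when it fails.

```java
/** A simple token-bucket rate limiter; a minimal illustration of rate limiting. */
class TokenBucketLimiter {
    private final long capacity;          // maximum burst size
    private final double refillPerNano;   // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    TokenBucketLimiter(long permitsPerSecond, long capacity) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;       // request admitted
        }
        return false;          // over the limit: caller should reject or degrade
    }
}
```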



Moderator: What difficulties are encountered when refactoring and splitting a monolithic service into microservices, and how do you solve them?

Vipshop Zheng Minghua: As discussed just now, the service framework, architectural framework, and infrastructure are the problems to solve. Without those reserves, services or microservices simply cannot be realized.

Didi Zhao Wei: The other point is that the overall cost of microservices is very high: the cost of machines, the capabilities of operations staff, and so on all have to be raised.

Vipshop Zheng Minghua: Testing and problem localization are costs too.

Didi Zhao Wei: Yes, they are all costs, and the overall cost rises. A small company may not invest much in this area, simply because there is no money.

Vipshop Zheng Minghua: So whether to adopt microservices depends on the capability of the team and the company.



Moderator: How is the API Gateway designed?

Qiniu Xiaoqin: In a microservice architecture, each microservice is a service provider that the client could call directly. When multiple microservices need to be called at the same time, the client has to send several independent requests one by one, which is inefficient. By designing the API gateway as the single entry point to the system, all client requests go through the gateway, which routes them to the appropriate microservices. The client then only interacts with the gateway instead of calling specific services, and the gateway can call multiple microservices to handle a single request. This reduces the number of interactions between client and services and simplifies the client's code. Our API gateway currently provides traffic forwarding, service discovery, flow control, service degradation, permission control, fault tolerance, and monitoring.

Traffic forwarding includes load balancing, using round robin, and service discovery is based on Consul. When a user request comes in, the gateway queries Consul for the access addresses of all available nodes and forwards the request to a backend service in round-robin fashion. The returned result is simply forwarded; it is interpreted and used by the requester. The gateway also has a monitoring component that records the number of requests, failures, and so on, and reports them to a Prometheus server. Based on the monitoring data, the gateway applies flow control, service degradation, and similar handling to requests.
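A minimal sketch of the behavior just described: query Consul's health endpoint for passing instances of a service and pick one round-robin. The Consul address and service name are placeholders, and this is a simplified illustration rather than Qiniu's actual gateway code.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: discover healthy instances via Consul and pick one in round-robin order. */
class ConsulRoundRobinRouter {
    private final HttpClient http = HttpClient.newHttpClient();
    private final ObjectMapper mapper = new ObjectMapper();
    private final AtomicInteger counter = new AtomicInteger();
    private final String consulAddr;   // e.g. "http://127.0.0.1:8500" (placeholder)

    ConsulRoundRobinRouter(String consulAddr) { this.consulAddr = consulAddr; }

    /** Query Consul for instances of `service` that pass their health checks. */
    List<String> healthyInstances(String service) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(consulAddr + "/v1/health/service/" + service + "?passing=true")).build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        List<String> addrs = new ArrayList<>();
        for (JsonNode node : mapper.readTree(resp.body())) {
            JsonNode svc = node.get("Service");
            addrs.add(svc.get("Address").asText() + ":" + svc.get("Port").asInt());
        }
        return addrs;
    }

    /** Round-robin selection over the currently healthy instances. */
    String pick(String service) throws Exception {
        List<String> addrs = healthyInstances(service);
        if (addrs.isEmpty()) {
            throw new IllegalStateException("no healthy instance for " + service);
        }
        int i = Math.floorMod(counter.getAndIncrement(), addrs.size());
        return addrs.get(i);
    }
}
```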

Didi Zhao Wei: Our API gateway is divided into two parts. One part accepts external requests, the API gateway proper; the other is the backend management service for the API lifecycle. Both are provided as services. We used the gateway to solve several problems at the time.

The first is decoupling of front end and back end: internal services should not be directly exposed on the public network, and the gateway shields them. External callers make HTTP requests and do not need to depend on things like a client SDK, while the backend can keep evolving, which lowers the cost of external calls. Calls go through JSON: the gateway takes JSON parameters in, converts the JSON into objects, and converts result objects back to JSON on the way out, shielding that complexity from callers. Through the gateway we also do authentication, flow control, degradation, and monitoring, so the whole entry point can be closed off. Then there is the management backend, which mainly handles the lifecycle of the APIs, including authentication and the mapping between an API and its backend service. Since the gateway is not a single machine, once a change is completed in the management backend, all gateway instances must take effect at the same time, so we broadcast the change over MQ.
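The article does not say which MQ is used. Assuming Kafka purely for illustration, a route-change broadcast could look roughly like the sketch below, with each gateway instance subscribing to the topic under its own consumer group so every instance receives the update; the topic and class names are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

/** Illustration only: publish an "API route changed" event so every gateway instance
 *  can refresh its local routing table (the article does not name the MQ actually used). */
class RouteChangePublisher {
    private final KafkaProducer<String, String> producer;

    RouteChangePublisher(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    void publish(String apiName, String routeConfigJson) {
        // Each gateway instance consumes this topic with a distinct group id,
        // so the change is effectively broadcast to all instances at once.
        producer.send(new ProducerRecord<>("gateway-route-changes", apiName, routeConfigJson));
        producer.flush();
    }
}
```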

Hujiang Huang Kai: Isn't this API gateway a single point? All traffic is forwarded to the downstream services through the gateway; once something happens to the gateway, or it becomes very slow, the whole business flow slows down.

Didi Zhao Wei: First look at the overall traffic. For the designated-driver business, the volume is actually not that high, because it is not a high-frequency service. Although the gateway is logically a single point, the machines are not: there are multiple instances. Generally, when there is a problem it is not the gateway but a backend service, for instance a new release with a bug, or a service whose latency suddenly jumps from 5 or 6 milliseconds to 500 or 600 milliseconds, which can cause trouble. So on the gateway we do flow control, degradation, and monitoring: flow control and degradation at the gateway keep the whole thing healthy.

Hujiang Huang Kai: After a service is registered, how does your gateway discover it?

Didi Zhao Wei: We configure it manually in the management backend. Every time a service needs to be exposed, you go to the management backend and configure its mapping: the class name, method, parameters, and so on are all filled in, and then it is exposed through the API gateway. The API name has to be mapped to that configuration.

Vipshop Zheng Minghua: We are also working on gateways. Our gateway is a cluster, or rather multiple clusters, not a single one. First of all it is not one server but a fleet of servers, and that fleet may not be one small cluster but a large one made up of several small clusters, each possibly serving different businesses. So even if one gateway has a problem, as you said, it will not affect all of the business; that is the first point. The second is that the gateway is mainly a channel: stateless, with no business logic, so the probability of it having a problem is very small.

Hujiang Huang Kai: What language do you use to develop?

Vipshop Zheng Minghua: We usually develop in Java.

Hujiang Huang Kai: So for an HTTP request, you use something like an HTTP client as the forwarding tool, right?

Didi Zhao Wei: We use Dubbo. The API gateway actually sits on Dubbo and obtains the underlying services from Dubbo for forwarding; it does not go that HTTP route internally. A request comes into the gateway, we accept the JSON, convert it into objects, then into a binary stream, and forward it to the backend over Dubbo RPC.
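A hedged sketch of that JSON-in, Dubbo-RPC-out path, using Dubbo's generic invocation so the gateway does not need the backend API jars on its classpath. The interface name, parameter type, and registry address are placeholders, and in a real gateway the ReferenceConfig would be cached rather than rebuilt per request.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.ReferenceConfig;
import org.apache.dubbo.config.RegistryConfig;
import org.apache.dubbo.rpc.service.GenericService;
import java.util.Map;

/** Sketch of a gateway turning a JSON request into a Dubbo generic call. */
class DubboForwarder {
    private final ObjectMapper mapper = new ObjectMapper();

    Object forward(String interfaceName, String methodName,
                   String paramType, String jsonBody) throws Exception {
        // Reference the backend service generically, without its API jar on the classpath.
        ReferenceConfig<GenericService> ref = new ReferenceConfig<>();
        ref.setApplication(new ApplicationConfig("api-gateway"));
        ref.setRegistry(new RegistryConfig("zookeeper://127.0.0.1:2181")); // placeholder registry
        ref.setInterface(interfaceName);   // e.g. "com.example.OrderService" (placeholder)
        ref.setGeneric("true");

        GenericService service = ref.get();
        Map<String, Object> arg = mapper.readValue(jsonBody, Map.class);   // JSON -> generic object
        // Dubbo serializes the call into its binary protocol and invokes the backend over RPC.
        return service.$invoke(methodName, new String[]{paramType}, new Object[]{arg});
    }
}
```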

Hujiang Huang Kai: So your services are all developed in Java, without any other languages?

Vipshop Zheng Minghua: Not necessarily. First of all, the service itself has nothing to do with the language; what matters is the serialization protocol.

Didi Zhao Wei: It mainly comes down to serialization, because things like byte order and field lengths may differ between languages, so you either use one of the more general formats in the industry or build a set of binary serialization yourself.

Didi Zhao Wei: And some languages may not be supported. For example, newer languages like Go may not be well supported, so that can be a bit more troublesome.

Hujiang Huang Kai: Many of our services are written in other languages, such as Go, C++, and .NET. If we tried to push an RPC protocol, other departments would object because they could not integrate with it, and project efficiency would drop too low. To manage these projects in a unified way, we use RESTful as the communication method between services.

Didi Zhao Wei: Also, the performance of HTTP is likely lower than RPC over TCP. I think as a company matures, the technical stack should be unified. Looking at the industry, Ali's approach was to move from HTTP to pure Java; other big companies such as Tencent and Baidu seem to develop in multiple languages, with PHP at the front and C at the back. Only Ali seems to have done it better, with the whole technical stack unified. I think that may be the direction: the cost becomes very low, connecting business systems becomes very cheap, and later development accelerates, but it is really hard to achieve.

Vipshop Zheng Minghua: I am doing exactly that at Vipshop now, gradually moving development from PHP to Java. I have always believed this is worth doing, because within a company teams are fluid: if everything is Java, people can move between teams easily, but if one team is C and another is PHP, moving is hard. For example, when a business line is wound down, its technical team is not easy to redeploy onto other languages. From another angle, once the technology is unified, the cost of development, maintenance, communication, and coordination drops a lot.



Moderator: How do you improve the high availability of services?

Hujiang Huang Kai: Microservices lend themselves to high availability. Most microservices are stateless; as long as horizontal scaling is solved, high availability is solved. For stateless microservices, our company provides a Docker + Mesos + Marathon solution, so an application will not stay dead: if it dies, it is automatically restarted. Stateful services are much harder to make highly available, which is why I was asking earlier whether all microservices are stateless. If a service is stateful, it has to solve the high-availability problem itself.

Vipshop Zheng Minghua: If a service is stateful, client sessions have to stick to a server, and performance drops a lot.

Hujiang Huang Kai: Is the solution an active/standby mode, with clients accessing through a VIP?

Vipshop Zheng Minghua: That still does not solve it. Once the active node dies, the client's state is lost.

Vipshop Zheng Minghua: Because your clients are connected over TCP.

Didi Zhao Wei: The connection will definitely drop and will have to be re-established. That is certain even if you use a virtual IP here: behind it the connection goes to a fixed machine, and when that machine dies the internal socket is broken. Even though the external address is unchanged, the connection actually has to be rebuilt.

Didi Zhao Wei: Once a retry goes through, the whole link is connected again. But if it is disconnected at that moment, the user really does need to reconnect.

Moderator: So the client retries.

Vipshop Zheng Minghua: It may still be difficult, because the Docker container you originally connected to may already be gone, so reconnecting and retrying against it is useless.

Hujiang Huang Kai: Is there a session service to store the state? Is the service state kept in Redis or in memory?

Didi Zhao Wei: In that case your service is stateless, because the state has already been pushed into Redis. Since the service is stateless, your instances can be scaled out or scaled in without any problem.
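A minimal sketch of pushing session state into Redis so any instance can serve any request; Jedis, the key naming, and the TTL are assumptions for illustration, not any participant's actual code.

```java
import redis.clients.jedis.Jedis;

/** Sketch: keep session state in Redis so any service instance can handle any request. */
class SessionStore {
    private static final int TTL_SECONDS = 30 * 60;   // 30-minute session expiry
    private final Jedis jedis;

    SessionStore(String redisHost, int redisPort) {
        this.jedis = new Jedis(redisHost, redisPort);  // single connection for brevity; use a pool in practice
    }

    void save(String sessionId, String stateJson) {
        jedis.setex("session:" + sessionId, TTL_SECONDS, stateJson);
    }

    String load(String sessionId) {
        return jedis.get("session:" + sessionId);      // null if expired or never written
    }
}
```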

Vipshop Zheng Minghua: To achieve client stickiness, you can only route a client back once the server is up again; otherwise it is basically hard to do. Another factor is which service framework you choose, such as Dubbo, where retrying is a common pattern, so the question of how to retry after a service failure is already handled there.



Moderator: How do you do continuous delivery for online microservices?

Didi Zhao Wei: Actually we have not done continuous delivery yet; we are still trying continuous integration. If you want to do continuous delivery, it is not something a single business unit can do on its own, because a lot is involved, such as IP allocation and service-level reporting. We want to do continuous delivery through Docker: how IPs will be allocated, how ports will be allocated, how Docker instances themselves will be allocated. At the very beginning we used a tool from Tencent, but its system could not be used once we moved to images, which was really annoying, and there are also log collection and viewing to deal with. These problems cannot be solved by one business system alone; you must work with the operations department, including on the release process. So we are now trying continuous integration; we cannot do continuous delivery yet.

Vipshop Zheng Minghua: Let me put it simply: continuous integration and delivery is a fairly big thing, spanning from development to the test environment, including the grayscale environment and the pre-release environment. For example, how do you integrate in the test environment? Since you are a microservice, running the whole business may need dozens or hundreds of surrounding services; how do you then test a development version against all those other services? That is a big problem. We did a lot of work on this earlier, including building personal test environments, then a joint-debugging test environment for the whole team, and continuously releasing into the test environment. All of that has to be established before you can really do continuous release.

Didi Zhao Wei: On this problem, besides your own business, storage is also involved, such as MySQL and Redis, which genuinely makes continuous integration and continuous delivery hard. Docker and Kubernetes are addressing this, but many companies are still just trying out Docker, and I have not seen many in the industry doing it really well.

Hujiang Huang Kai: We already have a continuous integration and release solution, since all the services of our courseware system are deployed with Docker. Inspired by DevOps and OpenStack, we make full use of Docker's build-once-run-anywhere property and of Jenkins pipelines, so that as soon as a developer commits code it automatically enters a fully automated delivery process, all the way to going live.

The basic process is as follows:

1. A successful code commit triggers the Jenkins pipeline.

2. The pipeline consists of the following stages: testing, compiling, building the Docker image, uploading the image, and calling the Marathon API to publish to the QA environment (see the sketch after this list).

3. QA validates the QA environment through automated or manual testing; if there is a problem, it goes back to development for a fix and the pipeline is rerun.

4. After the tests pass, the pipeline automatically publishes to the verification environment and calls the scripts for FBT.

5. After verification finds no problem, the pipeline first performs a canary release on the production line to confirm there are no issues; if none, it deploys fully.
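For the Marathon step referenced in the list, here is a rough Java sketch of what "calling the Marathon API to publish" can look like; the Marathon URL, app id, resource settings, and image tag are placeholders, not Hujiang's actual configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch of publishing a Docker image to an environment via the Marathon API. */
class MarathonDeployer {
    public static void main(String[] args) throws Exception {
        String appJson = """
            {
              "id": "/qa/courseware-service",
              "instances": 2,
              "cpus": 0.5,
              "mem": 512,
              "container": {
                "type": "DOCKER",
                "docker": { "image": "registry.example.com/courseware-service:build-123", "network": "BRIDGE" }
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://marathon.example.com:8080/v2/apps/qa/courseware-service"))
                .PUT(HttpRequest.BodyPublishers.ofString(appJson))   // update (or create) the app definition
                .header("Content-Type", "application/json")
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Marathon responded: " + response.statusCode());
    }
}
```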

The entire continuous delivery process has several characteristics:

1. The whole release is carried out by the pipeline script calling the Marathon API, without any human involvement. In practice, some manual intervention can be added depending on how sensitive the business is; for example, the release to the production environment may be assisted by operations staff.

2. Apart from the production environment, the environments in the delivery process are not physically fixed or isolated; they are dynamically created or destroyed on demand, so each deployment is effectively a brand-new environment. The advantage is that leftover data and configuration errors are hard to accumulate, which is the most distinctive part of the whole process.

3. Docker's build-once-run-anywhere property largely avoids the deployment complexity caused by environment inconsistencies in the delivery process.

Didi Zhao Wei: We want to do these things too, but there is one more troublesome issue. When we try new technologies, we cannot take a one-size-fits-all approach and move everything over at once, because we are not familiar with them and they have not been verified in production, so we have to try gradually. The problem with trying gradually is that some of our services run in Docker and some are outside. How do you do service registration and discovery in that situation? And logs: since the service is packaged into an image, logs are its output, so how do you locate problems and check logs, and how do you do monitoring?

Hujiang Huang Kai: There are actually two ways to do registration and discovery in our framework. One is that the service actively registers its information with the registration server after it starts. We use Consul now, which means the service calls Consul's service-registration API after startup, telling Consul the service's IP address and port. The IP address here is the host's IP plus the mapped port; the host's IP is injected into Docker through an environment variable, so each container automatically knows its host's IP. This approach is intrusive. When we use an orchestration framework such as Kubernetes or Mesos, we know the port and name of each Docker service launched in the resource pool, even though the port and IP are dynamic. Another point is connecting the internal and external networks, which is a necessary step for us to realize Docker microservice orchestration. If many stateless, parallel microservices are started, a load balancer is required; as long as port mapping is enabled on that load balancer, the internal and external networks are connected.
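A hedged sketch of the self-registration approach just described: the container reads the host IP from an environment variable and registers itself with the local Consul agent over its HTTP API. The HOST_IP variable name, service name, port, and health-check URL are placeholders for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch of a container registering itself with Consul on startup. */
class ConsulSelfRegistration {
    public static void main(String[] args) throws Exception {
        String hostIp = System.getenv("HOST_IP");      // host IP injected into the container (placeholder name)
        int mappedPort = 31001;                        // host port mapped to the container port (placeholder)

        String body = """
            {
              "ID":   "order-service-%s-%d",
              "Name": "order-service",
              "Address": "%s",
              "Port": %d,
              "Check": { "HTTP": "http://%s:%d/health", "Interval": "10s" }
            }
            """.formatted(hostIp, mappedPort, hostIp, mappedPort, hostIp, mappedPort);

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://127.0.0.1:8500/v1/agent/service/register")) // local Consul agent
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .header("Content-Type", "application/json")
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Consul register status: " + response.statusCode());
    }
}
```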

Hujiang Huang Kai: The other way of registration and discovery is to continuously watch the orchestration service for changes through third-party tools, and automatically register any newly started service. As for log output, we use a volume to mount the host's local file system, "exporting" the log files from Docker to the host, and then manage and query them with the ELK stack. Each log file carries a specific Docker IP, which makes problem queries very convenient.

Didi Zhao Wei: It is very difficult for a business department to do this on its own, unless the platform or systems department steps in to help solve these problems.

Didi Zhao Wei: But the problem is that we do not have a load-balancing service yet; the teams do it themselves through ZooKeeper, and Dubbo effectively serves as our load balancer. So for us this is troublesome and hard to do, and I have not fully worked it out. I have to try with a small service first; I cannot just force everything into Docker at once. If something goes wrong, nobody can take that responsibility.

Hujiang Huang Kai: Yes, we also started the experiment with some small applications. Once the process is verified end to end, it can be rolled out on a large scale, because the process is the same.
