Ele.me Operation and Maintenance Infrastructure: An Evolution History

This article is based on the on-site talk given by Xu Wei, senior operation and maintenance manager at Ele.me, at the 10th Meizu Open Day.
Editor: Cynthia

Introduction: Ele.me was founded in 2008. At the end of 2014, its business began to grow explosively, and from 2015 to 2016 the company entered a period of rapid development in which business volume and server count grew dozens of times over. Growth at that scale inevitably brings many challenges. Through the evolution of Ele.me's operation and maintenance infrastructure, this article shares the measures and ideas used to meet those challenges at different stages.

1. The 1.0 era

The years 2014 to 2015 were Ele.me's 1.0 era, when the business was developing rapidly. At that time we thought more about what the business needed right now than about long-term architecture. Each person or team was responsible for their own slice of the work, fully aligned with business demands. As you can imagine, this short-sightedness accumulated a great deal of technical debt along the way — the so-called "pain" points.

Network pain

The pain of the network was mainly manifested in:

● No standardization: IPs were assigned haphazardly. Public IPs were bound directly to servers; some servers had 2 or even 3 IPs; some NICs were bonded, some were not;
● Frequent attacks: rapid business growth attracted a large number of attacks, and services could crash when attacked;
● Low bandwidth convergence ratio: traffic was so heavy that, for example, cache traffic would quickly saturate a switch uplink or a server's gigabit NIC;
● Lack of monitoring: the technical team did not know when there was a problem; we learned that riders or users could not place orders only from complaints relayed by customer service;
● Single points of failure: from individual machines up through each business to the overall architecture, single points were everywhere;
● Unstable link quality.

Server pain

The pain of resources is mainly manifested in:

● Untimely server delivery: from last year to this year, our highest weekly delivery volume was 3,700+ logical servers; on average, thousands of machines are delivered and recycled every month, which demands very high efficiency;
● Lack of asset management: no standards, high maintenance costs. In that period of unchecked growth, whoever needed a server simply bought one, with no record of how many servers we had or how they were configured. Nothing was standardized — one machine might have an SSD while its neighbor did not — so maintenance costs were very high;
● No guarantee of delivery quality: every machine was installed by hand. At the end of 2015 we purchased a batch of machines and threw together a temporary team to install them. Because everything was manual, installation was slow, delivery quality could not be guaranteed, and troubleshooting became even harder.

Missing basic services

The lack of basic services is mainly reflected in:

● Monitoring: we first used Zabbix, but because configurations differed, some hard disks went unmonitored, IOPS metrics were missing, and business-layer monitoring was far from complete;
● Load balancing: each business casually hung Nginx as a reverse proxy on one or two of its own servers;
● Centralized file storage: every server stored many files locally, which caused many problems for infrastructure management. In theory, SOA services on the Internet are stateless — nothing but code should live on the local disk. In practice, when a fault occurred, immature monitoring meant the business could not confirm the problem without digging through logs, which was complicated. Some people wanted logs kept for a week, others for a month, and a single day's logs could be dozens of gigabytes. Add a hard disk? Who procures and manages it, and how do you keep that standardized? Centralized logs and centralized file storage exist to solve exactly this standardization problem.

Basic services as a whole were very chaotic.

2. What have we done

Faced with so many problems, what should we do? In fact, it comes down to operation and maintenance doing three things well.

The first is standardization. From hardware to network to operating system to the technology stack, software installation method, log storage paths and names, code deployment method, and monitoring, a systematic set of standards must be established from top to bottom. With standards in place you can automate with code, and standardization plus automation forms a virtuous circle.
The second is processization: turning recurring requirements into standardized, step-by-step processes.
The third is platformization: building platforms that embody the standardization and automation.

As I understand it, operation and maintenance needs to manage two life cycles.

The first is the resource life cycle: procurement, racking, deployment, code, troubleshooting, recycling, and scrapping of servers.
The second is the application life cycle: application development, testing, launch, change, offlining, recycling, and so on.

2.1 Standardization

There is a concept about standardization that I often emphasize to everyone: give our users multiple-choice questions, not open-ended ones.

For example, users often say, "I want a machine with 24 cores, 32G of memory, and a 600G hard drive." At this point you should tell the user: I have four models — A, B, C, and D — compute type, storage type, memory type, and high-I/O type; which one do you want? This is very important. Many users are simply used to whatever machines they had before: one with a 200G disk, one with 250G. User demands come in every strange shape, and without standardization they are impossible to satisfy. We unified our server models and offer a fixed menu of types. You need to talk to users, collect their requirements, and try to identify what they really need.
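The "multiple-choice, not essay" idea can be sketched in code: a small catalog of standard models, and a helper that maps a vague request onto one of them. The model names, specs, and selection rule below are illustrative assumptions, not Ele.me's actual catalog.

```python
# Hypothetical standard model catalog; specs are illustrative only.
STANDARD_MODELS = {
    "A": {"type": "compute", "cores": 32, "mem_gb": 64,  "disk": "2x600G SAS"},
    "B": {"type": "storage", "cores": 16, "mem_gb": 64,  "disk": "12x4T SATA"},
    "C": {"type": "memory",  "cores": 32, "mem_gb": 256, "disk": "2x600G SAS"},
    "D": {"type": "high-io", "cores": 32, "mem_gb": 128, "disk": "2x1.6T SSD"},
}

def suggest_model(cores: int, mem_gb: int) -> str:
    """Return the smallest standard model that satisfies the request."""
    candidates = [
        name for name, spec in STANDARD_MODELS.items()
        if spec["cores"] >= cores and spec["mem_gb"] >= mem_gb
    ]
    if not candidates:
        raise ValueError("no standard model fits; escalate to capacity planning")
    # Prefer the candidate with the least excess memory, then fewest cores.
    return min(candidates, key=lambda n: (STANDARD_MODELS[n]["mem_gb"],
                                          STANDARD_MODELS[n]["cores"]))

# The 24-core/32G request from the text maps onto a standard model:
print(suggest_model(24, 32))  # → A
```

The point of the helper is that the user never gets a bespoke machine; every answer is one of the four offered types.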

When purchasing a model you need factory customization — for example, whether to disable power-saving mode. Each manufacturer has its pitfalls, including drive-letter drift on pass-through cards: how do you automate around them, and how does a machine come online automatically when it arrives?

Servers are also customized at the factory for racking. We organize resources into modules; the smallest module is 3 cabinets, and the number of servers per cabinet is fixed. At purchase time — say I am buying 1,000 servers — I tell the manufacturer the plan: which computer room, which cabinet, and which U-position each of those 1,000 servers goes into, and the manufacturer customizes accordingly. After the machines arrive and are racked, the manufacturer or service provider connects power, the operating system installs automatically, and even the network comes up — every layer is standardized.
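The module/cabinet/U-position plan handed to the manufacturer could look roughly like this sketch. The text only fixes 3 cabinets per module; the densities assumed here (15 servers per cabinet, 2U machines racked bottom-up) are illustrative.

```python
# Assumed layout constants: only CABINETS_PER_MODULE comes from the text.
CABINETS_PER_MODULE = 3
SERVERS_PER_CABINET = 15   # assumed fixed density per cabinet
U_PER_SERVER = 2           # assumed 2U machines, racked bottom-up

def plan_positions(n_servers: int, room: str, start_module: int = 1):
    """Yield (room, module, cabinet, u_position) for each server, so the
    manufacturer can rack every machine exactly where it was planned."""
    for i in range(n_servers):
        cabinet_idx, slot = divmod(i, SERVERS_PER_CABINET)
        module = start_module + cabinet_idx // CABINETS_PER_MODULE
        cabinet = cabinet_idx % CABINETS_PER_MODULE + 1
        u_pos = 1 + slot * U_PER_SERVER
        yield (room, module, cabinet, u_pos)

plan = list(plan_positions(100, "SH-IDC1"))
print(plan[0], plan[-1])  # first and last planned positions
```

Emitting the full plan up front is what lets the factory, the IDC, and the automated installer all agree on where each serial number lives.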

2.2 Process + Automation

[Figure: Ele.me's workflow engine]

The picture above shows Ele.me's workflow engine.

The resource life cycle contains many processes: server application (physical machine, virtual machine, cloud service, and so on) and a large number of states, including recycling. Behind each process sits automation. User input is standardized — users answer multiple-choice questions about model, configuration, and quantity — and once the form is submitted, the backend executes automatically.
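A minimal sketch of the form-then-automate idea: the workflow engine only accepts answers drawn from the offered choices, then hands the request to the automated backend. The field names, limits, and queue message below are hypothetical.

```python
# Hypothetical constrained form: every field is multiple-choice.
ALLOWED = {
    "kind":  {"physical", "virtual", "cloud"},
    "model": {"A", "B", "C", "D"},
}

def submit_request(form: dict) -> str:
    # Reject anything that is not one of the offered choices.
    for field, allowed in ALLOWED.items():
        if form.get(field) not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    qty = int(form["quantity"])
    if not 1 <= qty <= 100:
        raise ValueError("quantity out of range")
    # In the real system this would enqueue an automated delivery job.
    return f"queued: {qty} x model-{form['model']} ({form['kind']})"

print(submit_request({"kind": "virtual", "model": "A", "quantity": 10}))
# → queued: 10 x model-A (virtual)
```

Because input is constrained, the backend never has to interpret free-form requests — which is what makes fully automatic execution possible.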

2.3 Automation + Platform

● Automatic installation and initialization of physical servers. If I have thousands of servers, can I install them all in one day? 360 once installed as many as 5,000 servers in a day; our record is 2,500 physical servers in a day.
● Automated bring-up of network equipment.
● Resource management platform: all resources are managed in a unified way, serving as the management backend for the resource delivery process.
● Distributed file system, mainly used for database backups and image processing.
● Centralized log platform: all logs are centralized in ELK; do not log on to servers to read them.
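One way the automatic installation and initialization step can be made verifiable is a post-install check pass, sketched below. The checked facts and expected values are illustrative assumptions, not the actual checklist.

```python
# Hypothetical standard a freshly installed host must match before handover.
EXPECTED = {"os": "centos-7", "bonding": True, "ntp_synced": True}

def verify_host(facts: dict) -> list:
    """Return the list of checks a freshly installed host fails."""
    return [k for k, want in EXPECTED.items() if facts.get(k) != want]

# Example: a host whose NTP sync did not finish is caught immediately.
host = {"os": "centos-7", "bonding": True, "ntp_synced": False}
failures = verify_host(host)
print(failures)  # → ['ntp_synced']
```

Running such a pass over every machine is what lets two people install thousands of servers a day without hand-checking each one.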

2.4 Private Cloud Platform (ZStack)

In its early days of wild growth, Ele.me created virtual machines by hand. That raises a problem: how do you know whether a host can take another VM? Say a host can hold 6 virtual machines and 5 have already been created — where should the next one go? A business may need to be deployed across 10 physical machines, even across cabinets, so that the failure of a single physical machine or a single cabinet cannot affect the whole application. This is virtual machine resource scheduling, and for it we chose ZStack.

Why choose ZStack?

The three popular open-source options for a private cloud are OpenStack, CloudStack, and ZStack.
On the principle that simpler is better, we ruled out OpenStack: it is too heavy, and nobody had the time to hold up such a large system. Feedback from the industry was also not very positive overall.
The developer of ZStack previously worked on CloudStack. At the time, the CloudStack community was no longer maintained and did not support CentOS 7.
So we chose ZStack. ZStack also had many bugs back then, but it was relatively simple, and we could make it work well.


Features of ZStack: simple, stateless, API-driven

ZStack is relatively simple: install it and it runs. Of course, using it well is still somewhat difficult further along. Everything is done through messages. Our ZStack platform has no fancy pages; the backend is entirely custom API calls, with front-end processes driving the back-end interfaces and synchronizing through messages. ZStack currently manages more than 6,000 virtual machines for us.

3. The 2.0 era

In the 1.0 era, we did standardization and automation work so that things ran smoothly. Since 2016 we have been in the 2.0 era, which has its own pain points: What is your SLA? Do you have data? You say your efficiency is high — prove it. Is delivering 1,000 machines a day high? How is the data measured? In the IT world, everyone except God must speak with data; everything must be quantifiable and measurable.

4. What have we done

During this period, our measures to solve the pain points started from two directions: refined operation and maintenance, and data-driven operation. Note that operation (running things on data, like a business) is a different discipline from operation and maintenance.

4.1 Refinement of operation and maintenance

Refined operation and maintenance includes the following aspects:
● Continuous upgrades of the network architecture
● Establishment of server performance baselines
● Server delivery quality verification (unqualified machines are not delivered)
● Automated hardware fault repair
● Network traffic analysis
● Automated server restart
● Bug fixes: power-saving mode, bonding, ...

Continuous network architecture upgrades

In the early days we had one data center, and its core switch was a Huawei 5700SE. What does that mean? During one traffic burst, this device caused a P0-level incident. So we redefined our network standards and carried out a long series of upgrades: the core, load balancing, bandwidth aggregation to the core, and network architecture optimization.
There are also inter-IDC links. At first, some of our inter-IDC links ran over VPNs. Now intra-city links use bare fiber and cross-city links use transmission circuits. This, too, requires continuous investment.

Network Optimization

[Figure: network links between the Beijing and Shanghai IDCs and offices]

As the figure shows, the IDCs in Beijing and Shanghai are connected over bare fiber, offices reach the IDCs over dedicated lines, and everything has ample headroom. Links to third-party payment providers such as Alipay and WeChat Pay are included as well.

Server performance baseline formulation and delivery quality verification

Whether a delivered server is good enough must be backed by data. Every one of our servers has a baseline — for a compute model, for example, its computing power, I/O capability, and NIC packets-per-second can all be tested. Performance tests are run at delivery time; only a machine that reaches the baseline is delivered, otherwise it is rejected.
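Baseline gating can be sketched as a simple comparison of measured benchmark numbers against per-model floors. The baseline figures below are invented for illustration, not Ele.me's real numbers.

```python
# Hypothetical per-model performance floors a machine must reach at delivery.
BASELINES = {
    "compute": {"cpu_gflops": 400, "disk_iops": 5000,  "nic_pps": 500_000},
    "high-io": {"cpu_gflops": 300, "disk_iops": 80000, "nic_pps": 500_000},
}

def deliverable(model: str, measured: dict) -> bool:
    """True only if every measured metric meets the model's baseline."""
    base = BASELINES[model]
    return all(measured.get(metric, 0) >= floor for metric, floor in base.items())

ok = deliverable("high-io", {"cpu_gflops": 350, "disk_iops": 92000, "nic_pps": 610_000})
bad = deliverable("high-io", {"cpu_gflops": 350, "disk_iops": 42000, "nic_pps": 610_000})
print(ok, bad)  # → True False  (the second machine misses its IOPS floor)
```

The gate is deliberately all-or-nothing: a machine that fails any one floor goes back to the vendor rather than into the fleet.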

Network traffic analysis

We once hit a situation where the bandwidth of one fiber between aggregation and access was saturated. The early bandwidth convergence ratio was insufficient: four 10G ports were bundled upward, and because of the traffic-hashing algorithm, one of the four 10G member ports ran completely full. We need to know which business the traffic belongs to and how traffic at key nodes is flowing, and to raise an alarm when there is a problem.
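That incident suggests alarming on each member port of a link aggregate rather than on the aggregate total, since hashing can pin one member at line rate while the bundle looks healthy on paper. A sketch, with illustrative port names and numbers:

```python
# Hypothetical 4x10G aggregate; alarm when any single member nears line rate.
LINK_GBPS = 10.0
ALARM_RATIO = 0.9

def lag_alarms(member_gbps: dict) -> list:
    """Return member ports whose individual utilization crosses the
    threshold, even if the aggregate total looks fine."""
    return [port for port, gbps in member_gbps.items()
            if gbps / LINK_GBPS >= ALARM_RATIO]

# Aggregate is ~15.6G of a nominal 40G, yet one member is saturated:
members = {"te1/1": 9.8, "te1/2": 2.1, "te1/3": 1.7, "te1/4": 2.0}
print(lag_alarms(members))  # → ['te1/1']
```

Per-member monitoring is what turns "the 40G uplink is only 40% used" into the actionable "port te1/1 is full".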

Hardware fault repair automation

At present we have a very large server fleet, with possibly dozens of failures every week. How do we learn about a failure the moment it happens, and repair it quickly without affecting the business?

There are other tasks too, such as automated server restarts. Operation and maintenance is hard work: when there is a fault in the middle of the night and a server needs restarting, logging in through the remote management card and typing a password by hand is just too primitive. Restarts are now automated.

Generally speaking, automated fault repair follows several steps:
● Fault discovery
● Fault notification: user, IDC, supplier
● Fault handling
● Fault recovery verification
● Fault analysis

The first step is fault discovery. How do we find resource faults? Monitoring — in-band, out-of-band, and log-based, from multiple directions. All monitoring alarms flow to one place for preliminary aggregation, and finally into this system.
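The five steps above can be sketched as a small ticket state machine that carries a fault from discovery to analysis. The state names and the hand-off points are illustrative, not the actual system's schema.

```python
# Hypothetical repair-ticket lifecycle: each advance corresponds to one
# hand-off in the pipeline (business drain, vendor dispatch, and so on).
STATES = ["discovered", "business_notified", "released_for_repair",
          "vendor_notified", "repaired", "verified", "closed"]

class FaultTicket:
    def __init__(self, serial, location):
        self.serial, self.location = serial, location
        self.state = "discovered"
        self.log = [self.state]

    def advance(self):
        """Move to the next lifecycle state and record it."""
        self.state = STATES[STATES.index(self.state) + 1]
        self.log.append(self.state)
        return self.state

t = FaultTicket("SN12345", ("roomA", "cab07", "U23"))
t.advance()   # business gets a message and drains the machine
t.advance()   # business replies: OK to repair
t.advance()   # IDC and supplier are told room / cabinet / U-position
print(t.state)  # → vendor_notified
```

Keeping the full `log` per ticket is also what feeds the later analysis step (which brand, model, or component fails most).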

[Figure: fault dashboard, September 19]

This is the dashboard from September 19; you can see a lot of faults. Once faults are found, they must be notified, and notification itself is complicated: text message, phone call, or internal tool? We notify over multiple channels. Some users go further — they say, don't email me, I have an interface that accepts automatic notifications. So if one of their servers fails, we automatically send them a message; on receiving it, the business drains the machine, completes data operations and other steps, and then returns a message to the repair system: this server can now be repaired. On receiving that, we notify the IDC and the supplier: in such-and-such server room, cabinet, and U-position, the server with this serial number has this fault; please come and repair it in this time window. We also tell the IDC, through various channels, that a person with this ID number will arrive at this time with this equipment to do the on-site repair.

Our tens of thousands of servers are run by only two people; supplier maintenance and troubleshooting remain manual. After handling the fault, the supplier logs in to the external system and sends us a message saying the server is fixed. Our program automatically checks whether the fault has actually recovered; if so, it notifies the user that the resource is repaired, and the user pulls the server back into service. At the same time, all fault information goes into a database and is analyzed automatically: which brand of server is less reliable, and which model or which component fails more often. This gives us a reference.

Refined operation and maintenance: fixing various bugs

The details are the devil. Early on we were badly bitten by power-saving mode, as well as NIC problems; there were bugs all the way from hardware to services, and even more in code.

Operation and maintenance management platform

We have many cabinets and computer rooms, and their data is collected and displayed through automated systems.

[Figure: module and cabinet power view]

Operations must weigh three things: quality, efficiency, and cost. The picture above shows one module containing many cabinets. These cabinets consume a lot of electricity, which reflects cost. Many of our cabinets show yellow — yellow is a warning. A cabinet's power budget is 4,000W or 5,000W, and we try to use resources as fully as possible so that cost stays close to optimal. Our cabinets therefore run at relatively high power; we pack a great deal of equipment into each 47U cabinet, for example.

4.2 Data Operation

All things in IT must speak with data.

Asset situation


Asset data includes: how many servers we have, which rooms they are distributed across, how many cabinets there are, what brands and models the servers are, and which are occupied versus unused.

Network traffic analysis

[Figure: network traffic chart]

Network traffic analysis tells us where traffic comes from. For example, if there is an abnormal spike here, I know it came from cross-city transmission bandwidth. As everyone knows, cross-city bandwidth is very expensive, and expanding a 10G link can take three months, during which the overall business is seriously affected. We want to know who is consuming that traffic as early as possible.

Where did the servers go

[Figure: resource utilization by department]

We buy equipment from many vendors, and we need to know who is using those machines. The line in the figure is resource utilization; it belongs to the big data department, whose utilization is very high, while other departments' utilization is not. With this data I can send each department a report: how much you spent, how many servers you use, of what types, their distribution, and their utilization. This is operations thinking, not operation-and-maintenance thinking.

Resource Delivery SLA


How is our workload measured? We delivered a lot of servers — when were they delivered, what models, and how efficiently? It must all be measurable. At year-end KPI review, saying "our department did a lot of work, this project and that project" is all empty talk. Just tell me: how many projects this year, how many resources deployed, and at what efficiency. Average delivery time used to be 2 hours; now it is 20 minutes; in the future it will be 5 minutes.
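The "measurable delivery" claim reduces to simple arithmetic over timestamped delivery records. A sketch with made-up data:

```python
from datetime import datetime, timedelta

def avg_delivery_minutes(records):
    """records: list of (requested_at, delivered_at) datetime pairs.
    Return the mean request-to-delivery time in minutes."""
    total = sum((done - req).total_seconds() for req, done in records)
    return total / len(records) / 60

# Illustrative records: three deliveries averaging the "20 minutes" of the text.
t0 = datetime(2017, 9, 1, 10, 0)
records = [
    (t0, t0 + timedelta(minutes=18)),
    (t0, t0 + timedelta(minutes=25)),
    (t0, t0 + timedelta(minutes=17)),
]
print(round(avg_delivery_minutes(records), 1))  # → 20.0
```

Once every delivery carries request and completion timestamps, the SLA report is just an aggregate query, which is exactly what makes the KPI claim verifiable.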

Cost accounting


We spent a lot of money this year — where did it go? Who spent it? What was bought, who uses it, and how well is it used? The composition of these costs can be viewed from many dimensions, and we can even compare our costs against those of our peers.

Supplier quality evaluation


For example: when did each component fail, and what is its failure rate? Manufacturers are graded automatically, and the report is delivered automatically to procurement as a technical score, with no human intervention in the process. On the quality side, if a manufacturer's quality declines over a period, after-sales management can be applied to it.
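Automatic grading can be as simple as mapping per-vendor failure rates onto letter scores for procurement. The thresholds and vendor figures below are illustrative assumptions.

```python
# Hypothetical grading thresholds: failure rate -> technical score.
def vendor_score(installed: int, failures: int) -> str:
    rate = failures / installed
    if rate < 0.01:
        return "A"
    if rate < 0.03:
        return "B"
    if rate < 0.05:
        return "C"
    return "D"

# Made-up fleet data: (installed units, failures in the period) per vendor.
fleet = {"vendorX": (4000, 28), "vendorY": (2500, 95)}
report = {v: vendor_score(n, f) for v, (n, f) in fleet.items()}
print(report)  # → {'vendorX': 'A', 'vendorY': 'C'}
```

Because the score is computed from the fault database, no one has to argue with a vendor about quality; the report goes straight to procurement.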

5. Summary

This talk has mainly been about resource life cycle management. Most of the article leans toward the underlying resources, but the same thinking applies to every module: logs, the various operation and maintenance systems, monitoring, and so on.

Finally some thoughts.

Simple and usable. I have been in Internet operation and maintenance for almost ten years. In the early days, a failed server had to be examined piece by piece by someone who knew it well. Today's practice is: if one server has a problem, pull it out and quickly swap another in — cost controllable, quality controllable, efficiency first. For that to work, things must be simple and usable.
This saying was first spread by Baidu, and it is also my principle for operation and maintenance: keep everything simple. Some open-source solutions have a lot of cool features, but think hard — do you really need them? Do they really help you? What is the real core value? All software should be highly cohesive and loosely coupled, avoiding strong dependencies.

Standardize whatever can be standardized; automate whatever can be automated. Standardization is certainly the future trend. With the growth of Alibaba Cloud and Tencent Cloud, many small companies will move to the cloud. What can operation and maintenance do for future hybrid-cloud architectures? How do you achieve rapid scaling and elastic computing — which includes capacity planning, stress testing, and much more? The cornerstone of all these points is standardization and automation.

Try not to reinvent the wheel. Developers love building wheels; they feel that joining a company and not writing something of their own makes them look unskilled or unproductive. But using software well matters more than using "good" software. Tools are not inherently good or bad — what matters is whether you use them well and do the right thing at the right time. Hold no prejudice against a tool; it is just a brick, and our job is to combine the bricks and use them well.

First make it work, then make it better ("80 points is good enough"). Every beginning is hard, and the first step must simply be taken. Don't insist on something grand, perfectly designed in every respect; ask instead whether it can actually land. Landing is what matters most. Then people ask: what if much of it is still rough? Close the loop first, then optimize it step by step.

With the Internet developing so fast, applications must iterate quickly — our company deploys hundreds of times a day, and rapid iteration in agile development is essential. The process is necessarily a spiral, sometimes even two steps forward and one step back. Don't argue about whose architecture is best: only what suits you is best, what suits you differs at different stages, and constant refactoring and iteration are required.

Existing users matter as much as new users. This comes from Amazon. There is a famous Amazon story: a customer once wanted to migrate a service to Amazon, a deal estimated at tens of millions of dollars. Amazon assessed what transformation the migration would require and what impact that transformation would have on the stability of the existing business. After the assessment went up through the layers, the final conclusion was: we don't want this customer — because accepting them could not guarantee service to existing users.

This point is very important. In September 2016 our log system went live with a peak of 80,000 requests per second; by October it had reached 800,000 per second, and users kept coming to ask for access. I felt the hardware and architecture were about to buckle, said so to everyone, and won myself a one-month buffer in which we made a lot of technical changes. Now the peak exceeds 2.6 million logs per second, and everything is still collected, transmitted, stored, and analyzed in real time. A balance must be found here: serve existing users well while gradually admitting new ones. Of course, you cannot bluntly say "no way", or your brand is gone.

Embrace change and don't have a glass heart. I have worked for many years at several companies, and each company changes differently at different stages. In my team, for example, developers basically don't do operation and maintenance once standardization is done: at the bottom are hardware and operating system experts, and the rest are programmers. But what about the many people who used to do manual operation and maintenance? They start learning to code — there is change, and with it growth. My goal for the team this year is to make the team itself unnecessary by the end of 2017: unattended operation, with only 10% or 20% of our time spent on background fixes and the other 80% on value output. That goal is already in the process of landing.

On November 9-12 at the Beijing National Convention Center, at the 6th TOP100 Global Software Case Study Summit, Ele.me senior project manager Chen Shipeng will share "Self-Shorting: Supply-Side Reform for Agile Managers under a New Mode of Thinking", and Li Shuangtao, chief architect of the middleware team and of the multi-site active-active project, will share "Ele.me's Service-Wide Multi-Site Active-Active Transformation".

The TOP100 Global Software Case Study Summit, now in its sixth year, selects outstanding software R&D cases from around the world and draws over 2,000 participants each year. It includes special tracks on product, team, architecture, operation and maintenance, big data, artificial intelligence, and more, offering first-hand study of the latest R&D practices of front-line Internet companies such as Google, Microsoft, Tencent, Alibaba, and Baidu.

