Design and implementation of Tencent's internal full-link tracking system "Tianji Pavilion"

Source | Public Account: Baima Xiaoxifeng

Legend has it that in the Tianji Pavilion there is a machine that controls everything in the world, and from which all things run. The "Tianji Pavilion" in this article is a monitoring system built around distributed link tracking. Through "Tianji Pavilion", back-end developers can gain insight into the "secrets of heaven" (tianji) and solve problems quickly.

Summary

To support ever-growing business volume, the industry has widely adopted microservice architectures. Services are split along different dimensions, and an Internet application is built from many software modules. These modules may be developed by different teams, implemented in different programming languages, and deployed on thousands of servers across multiple data centers. As a result, distributed systems have become increasingly complex.

How do you quickly locate a fault? How do you accurately evaluate capacity? How do you dynamically display service links? How do you optimize system performance? These are the four major challenges that distributed systems pose to back-end developers. The industry uses link tracking systems to solve these problems, but Tencent lacked one. The "Tianji Pavilion" system was born to fill this gap.

"Tianji Pavilion" collects, stores, and analyzes the trace data, metric data, and log data of a distributed system to achieve full-link tracking, thereby solving the problems above. At present, Tianji Pavilion serves 1,200+ services, with a peak of 5 million trace reports per minute, a peak of 130 million metric reports per minute, and 30 TB of log data per day. How does Tianji Pavilion handle such a large volume of data? How does its link tracking work? How does it achieve low intrusion and low overhead? This article answers these questions one by one.

Background

The rise of microservices makes the system more and more complex

As mentioned in the summary, Penguin E-sports also adopts a microservice architecture. Figure 1 is the dependency diagram of the Penguin E-sports homepage interface. You may not be able to see this picture clearly, and that is exactly the point: the system depends on too many services. A single request involves dozens of services, and if one key service fails, all you know is that an exception occurred; to find which service caused it, you have to go into each service and read its logs. This is very inefficient.

Figure 1: Interface topology of the Penguin E-sports homepage

Pain of background development

Microservices are a double-edged sword. The advantages need no elaboration; the disadvantage is that the system becomes too complex, which brings back-end developers the following four major challenges.

  1. Faults are hard to locate: a request often involves multiple services, and these services are likely owned by multiple teams. When a problem occurs, all you know is that there is an exception; to find which service caused it, you must go into each service and read its logs, which is very inefficient. In the worst case, multiple teams have to troubleshoot together.
  2. Capacity is hard to evaluate: Penguin E-sports runs several promotional activities every month, and the form of each activity changes frequently, so the traffic entry points differ each time. Penguin E-sports has more than 500 modules, and traffic from different entry points produces different QPS increments in each module, making capacity evaluation very difficult. Before connecting to Tianji Pavilion, every large-scale Penguin E-sports event required two developers to spend an entire day on capacity evaluation, with poor results for the effort.
  3. Links are hard to sort out: when a newcomer joins the back-end team and takes over a module in this microservice system, he does not know where he sits in the architecture, who depends on his system, or what his system depends on downstream. He has to read documents and analyze the code line by line, which is time-consuming and laborious.
  4. Performance is hard to analyze: a service depends on multiple back-end services. If the latency of an interface suddenly increases, you have to start from your own service and work outward, analyzing the latency of each dependent interface in turn.

Industry solutions

The industry uses distributed link tracking systems to solve the problems above. Dapper is the distributed tracing system in Google's production environment and can be regarded as the ancestor of modern link tracking systems. Google published the Dapper paper in 2010, and afterwards major Internet companies launched their own link tracking systems based on Dapper's ideas, including Twitter's Zipkin, Naver's Pinpoint (from Korea), Apache HTrace, Alibaba's EagleEye, Sina's Watchman, and Uber's Jaeger.

Our way out

Tencent was relatively weak in link tracking and needed to fill this gap as soon as possible. The link tracking systems above all provide the basic tracing function, but each has problems when deployed in our production environment. Google's Dapper and Alibaba's EagleEye are not open source; Pinpoint implements call interception and data collection through bytecode injection and can only be used with Java servers; Zipkin and Jaeger mainly collect trace data, while their collection of metric data and log data is weak, as is their ability to analyze the three kinds of data together; the other APM systems were not compelling either. In the end, we built Tencent's general-purpose link tracking system, Tianji Pavilion, following the OpenTracing specification and the architecture of Alibaba's EagleEye.

Tianji Pavilion Introduction

What is Tianji Pavilion

Tianji Pavilion is a monitoring system with distributed link tracking at its core. It collects, stores, and analyzes the trace data, metric data, and log data of a distributed system, and combines them with stress test data and TNM2 data to provide fault diagnosis, capacity evaluation, system sorting, and other functions, greatly reducing developers' operational burden, as shown in Figure 2. The detailed data collection process is described in the "Data collection" section below.

Figure 2 Schematic diagram of the functions of Tianji Pavilion

Note: the reported data referred to above includes "trace data", "metric data", and "log data".

Trace data: the span data of link tracking, mainly used for link tracing, call-chain restoration, and topology drawing.

Metric data: RPC module-call statistics, including minute-level RPC request volume, success rate, latency distribution, error code distribution, and so on.

Log data: business logs.

What are the functions of Tianji Pavilion

  • Fault location: Tianji Pavilion can restore the call chain from the trace data and, combined with the log data, quickly locate a fault. For example, if a user complains that his video like count is wrong, a developer can find the trace of the complaining user's request by the user's QQ number. Looking at the trace details, it is immediately clear that the request timed out when fetching the like result, so the developer can quickly locate and handle the problem. The UI is shown in Figure 3.

Figure 3 Trace ui interface

  • Link sorting: with trace data, drawing a business topology diagram comes naturally. Figure 4 is the topology diagram of one Penguin E-sports service.

Figure 4 rpc call spanning tree

  • Capacity assessment: Tianji Pavilion's link tracking system provides the call topology of each service; the metric statistics module provides RPC metric data; the TNM2 system provides service deployment information; and the stress testing system measures the performance bottleneck of each server. Based on all this data, Tianji Pavilion can accurately assess whether each service's deployment is reasonable. Before a large-scale event, developers only need to provide the QPS increment of the entry interface, and Tianji Pavilion estimates the future QPS increment of every dependent service and gives recommended scaling numbers.

Figure 5: Capacity assessment results

  • Performance analysis: Tianji Pavilion's stress testing system can measure the performance of a single service; combined with link tracking and metric statistics, it can analyze the performance bottlenecks of the entire system and find directions for optimization.
  • Other functions: besides the four functions above, Tianji Pavilion offers many practical features such as real-time alerting, overload protection, name service, and metric data query. You are welcome to try the system at http://tjg.oa.com.

The overall structure of Tianji Pavilion

Tianji Pavilion comprises four systems: link tracking, stress testing, capacity management, and name service. The overall architecture is shown in Figure 6.

  1. Link tracking system: link tracking is the core of Tianji Pavilion. It collects, stores, and analyzes the trace data, metric data, and log data of RPC calls, enabling fast fault location and link sorting, shown in blue in Figure 6.
  2. Stress testing system: provides automated stress testing to find the performance bottleneck of each service, shown in the upper left corner of Figure 6.
  3. Capacity management system: combines trace data, metric data, stress test data, and TNM2 data to provide accurate capacity evaluation, shown in pink in Figure 6.
  4. Name service: not covered further here.

Figure 6 The overall architecture of Tianji Pavilion

Technical realization of link tracking

The principle of link tracking

Rpc call scene description

First consider a simple RPC call chain (see Figure 7): a request passes through the access-layer CGI to downstream svr1; svr1 first calls svr2, and after getting the result calls svr3; finally it combines the results of svr2 and svr3 and returns them to the user via the CGI. Three RPCs occur here, and we use ①②③④⑤⑥ to indicate the order of the RPC messages. How can we trace and restore this call chain?

In short, Tianji Pavilion has the RPC framework report data at the start and end of each RPC, uses JStorm to process the reported data in real time, and stores it in HBase. The management console then analyzes the HBase data and visualizes it, achieving link tracking.

Figure 7: Simple rpc call scenario

Trace report data description

In the Tianji Pavilion tracing tree, each service interface is a node, and the connection between nodes is a span. The span is the main data structure of the link tracking system; it records the relationship between the caller and the callee of an RPC. A span contains the following data:

  1. TraceID: each service request is assigned a globally unique TraceID. During an RPC, the framework passes the TraceID to the downstream server via in-band data, so all spans reported by the RPCs involved in one request carry the same TraceID. The system can use the TraceID to associate all the RPC spans of a request into one span set.
  2. SpanID: the span ID. Each merged span has a unique SpanID under a given TraceID. To simplify later processing, the "caller span" and "callee span" of one RPC share the same SpanID; in Figure 8, spans of the same color have the same SpanID.
  3. ParentID: the ID of the parent span. RPC calls are hierarchical, so spans, as the storage structure of the call relationship, are hierarchical too. As shown in Figure 8, the trace is displayed as a tracing tree whose root is the vertex of the call; the top-level span therefore has parentID=0. In the example of Figure 8, the span of the CGI's request to svr1 is the top-level span, with spanID=1 and parentID=0. The span of svr1's request to svr2 is a child of the top-level span, with spanID=2 and parentID=1. Likewise, the span of svr1's request to svr3 has spanID=3 and parentID=1; span2 and span3 are siblings. In this way the set of RPC spans can be restored into a call tree.
  4. Annotation: besides the call tree, developers care a great deal about the timing and latency of each RPC. For this, we define four RPC log annotations:
    1. Client sends request: client send, abbreviated cs
    2. Client receives reply: client recv, abbreviated cr
    3. Server receives request: server recv, abbreviated sr
    4. Server sends reply: server send, abbreviated ss

The "caller span" records the cs and cr timestamps, and the "callee span" reports the sr and ss timestamps. A "merged span" has all four timestamps, from which the latency of almost every stage can be calculated.

  5. Name: the interface name, recording which interface of the server this RPC called.
  6. Result: the call result, i.e. the RPC return value; 0 generally indicates success.
  7. Caller: caller information, including the caller's server name and IP.
  8. Callee: callee information, including the callee's server name, IP, and port.
  9. Other information: the span structure also stores some other information that helps with problem analysis; it does not affect link tracking, so it is not detailed here.
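The fields above can be sketched as a plain data structure. This is only an illustrative Python sketch, not the actual srf/jce definition; the field names and types are assumptions:

```python
from dataclasses import dataclass

# Illustrative span structure per the field list above; names are
# assumptions, not the real srf/jce schema.
@dataclass
class Span:
    trace_id: str        # shared by every span of one request
    span_id: int         # caller span and callee span of one RPC share this
    parent_id: int       # 0 for the top-level span
    name: str = ""       # interface name of the callee
    result: int = 0      # RPC return value, 0 means success
    caller: str = ""     # caller server name / IP
    callee: str = ""     # callee server name / IP / port
    cs: int = 0          # client send (ms), filled by the caller span
    cr: int = 0          # client recv (ms), filled by the caller span
    sr: int = 0          # server recv (ms), filled by the callee span
    ss: int = 0          # server send (ms), filled by the callee span

# The top-level span of Figure 8: the CGI's call to svr1.
root = Span(trace_id="111111", span_id=1, parent_id=0, name="svr1.query")
assert root.parent_id == 0  # the top-level span has parentID=0
```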

Trace report process description

After all this, your impression of a span may still be a bit vague, so let us continue with the service call of Figure 7 as an example. If we connect the application of Figure 7 to Tianji Pavilion, we see the effect of Figure 8:

Figure 8 Span report details

A complete span contains both client-side and server-side data, so a complete span is reported twice. The data reported at the start point of the RPC is called the "caller span", and the data reported at the end point is called the "callee span". Caller spans and callee spans are collectively called sub-spans. JStorm merges the sub-spans in real time into "merged spans" and stores them in HBase. The reporting process is shown in Figure 8: there are 3 RPCs and 6 sub-spans are reported, and the computing layer merges these 6 sub-spans into 3 merged spans. The detailed process follows. (Warning: this process is rather intricate; readers not interested in the reporting details can skip it.)

The first sub-span: before launching the RPC, the CGI generates a client span with traceID=111111, spanID=1, parentID=0 (no parent span), and passes these three IDs to svr1 via in-band data. After the CGI receives the response from svr1, it completes the span timestamps cs=t1 and cr=t12 and reports the caller span. See the caller span reported by the CGI in Figure 8.

The second sub-span: svr1 extracts the client span IDs from the request data and generates a callee span with the same three IDs: traceID=111111, spanID=1, parentID=0. After svr1 sends its reply to the CGI, it completes the timestamps sr=t2 and ss=t11 and reports the callee span, shown in Figure 8 as the callee span reported by svr1.

The third sub-span: before initiating the RPC to svr2, svr1 generates a client span with traceID=111111, spanID=2 (each caller span generates a new spanID), and parentID=1 (using the spanID of svr1's own callee span as the parentID), and passes these three IDs to svr2 via in-band data. After svr1 receives the response from svr2, it completes the timestamps cs=t3 and cr=t6 and reports the caller span. See the caller span with spanID=2 reported by svr1 in Figure 8.

The fourth sub-span: svr2, following step 2, reports the callee span with spanID=2.

The fifth sub-span: svr1, following step 3, reports the caller span with spanID=3.

The sixth sub-span: svr3, following step 4, reports the callee span with spanID=3.
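The merge step described above can be sketched as follows. This is only an illustration of the keying logic, assuming sub-spans arrive as plain dicts; the real system does this inside JStorm:

```python
# Sketch of the merge step: sub-spans sharing (traceID, spanID) are
# combined into one merged span. The real implementation runs in a
# JStorm bolt; dicts here are an illustrative stand-in.
def merge_sub_spans(sub_spans):
    merged = {}
    for s in sub_spans:
        key = (s["trace_id"], s["span_id"])
        m = merged.setdefault(key, {"trace_id": s["trace_id"],
                                    "span_id": s["span_id"],
                                    "parent_id": s["parent_id"]})
        # The caller span contributes cs/cr, the callee span sr/ss.
        for t in ("cs", "cr", "sr", "ss"):
            if t in s:
                m[t] = s[t]
    return merged

# The six sub-spans of Figure 8 collapse into three merged spans
# (timestamps t1..t12 written as small integers).
reports = [
    {"trace_id": "111111", "span_id": 1, "parent_id": 0, "cs": 1, "cr": 12},  # CGI
    {"trace_id": "111111", "span_id": 1, "parent_id": 0, "sr": 2, "ss": 11},  # svr1
    {"trace_id": "111111", "span_id": 2, "parent_id": 1, "cs": 3, "cr": 6},   # svr1
    {"trace_id": "111111", "span_id": 2, "parent_id": 1, "sr": 4, "ss": 5},   # svr2
    {"trace_id": "111111", "span_id": 3, "parent_id": 1, "cs": 7, "cr": 10},  # svr1
    {"trace_id": "111111", "span_id": 3, "parent_id": 1, "sr": 8, "ss": 9},   # svr3
]
spans = merge_sub_spans(reports)
assert len(spans) == 3
assert spans[("111111", 1)]["cs"] == 1 and spans[("111111", 1)]["ss"] == 11
```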

Trace restoration

Tianji Pavilion can fetch from HBase all spans with traceID=111111. Then, using the spanID and parentID fields, it can restore the call tree and also calculate various latencies. For example:

Total latency of the CGI request = span1.cr - span1.cs = t12 - t1

(Note: span1 refers to the merged span with spanID=1)

Network latency of ① in the figure = span1.sr - span1.cs = t2 - t1

Latency of svr1 calling svr2 = span2.cr - span2.cs = t6 - t3

Latency of svr1 calling svr3 = span3.cr - span3.cs = t10 - t7
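Restoring the tree and computing the latencies above can be sketched like this. It is a standalone illustration that writes the timestamps t1..t12 of Figure 8 as small integers:

```python
# Illustrative restoration: group merged spans into a call tree by
# parentID, then compute the latencies listed above (times in ms,
# t1..t12 written as 1..12).
def build_tree(spans):
    children = {}
    for s in spans:
        children.setdefault(s["parent_id"], []).append(s["span_id"])
    return children

spans = [
    {"span_id": 1, "parent_id": 0, "cs": 1, "sr": 2, "ss": 11, "cr": 12},
    {"span_id": 2, "parent_id": 1, "cs": 3, "sr": 4, "ss": 5,  "cr": 6},
    {"span_id": 3, "parent_id": 1, "cs": 7, "sr": 8, "ss": 9,  "cr": 10},
]
tree = build_tree(spans)
assert tree[0] == [1]          # span1 is the root of the call tree
assert tree[1] == [2, 3]       # the svr2 and svr3 calls hang under span1

span1, span2, span3 = spans
total = span1["cr"] - span1["cs"]       # total latency = t12 - t1
net1 = span1["sr"] - span1["cs"]        # network latency of ① = t2 - t1
svr2_cost = span2["cr"] - span2["cs"]   # svr1 -> svr2 latency = t6 - t3
assert (total, net1, svr2_cost) == (11, 1, 3)
```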

Time difference correction

Note that span timestamps are accurate only to the millisecond, and each machine's clock has some error, which makes latency calculations inaccurate. When displaying results, Tianji Pavilion corrects the times in the following two steps (see Figure 9).

  1. Ensure that the four timestamps cs, sr, ss, cr of each span are strictly ordered, i.e. t1<t2<t3<t4.
  2. If t1<t2<t3<t4 does not hold, the clock skew of the machine is large and further correction is needed: t2=t1+((t4-t1)-(t3-t2))/2 and t3=t4-((t4-t1)-(t3-t2))/2.

Note: this correction assumes that the latency of ① in Figure 9 is approximately equal to that of ②. Practice shows this assumption works well.

Figure 9 Time difference correction
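The two-step correction can be sketched as follows; as the note says, it assumes the request leg ① and the reply leg ② cost roughly the same:

```python
# Sketch of the time correction: if cs < sr < ss < cr does not hold,
# recentre the server timestamps under the assumption that the request
# (①) and reply (②) network legs take equal time, per the formulas above.
def correct(cs, sr, ss, cr):
    if cs < sr < ss < cr:
        return cs, sr, ss, cr                # clocks already consistent
    server_cost = ss - sr                    # (t3 - t2), unaffected by skew
    net = (cr - cs) - server_cost            # total network time of ① + ②
    sr = cs + net // 2                       # t2 = t1 + ((t4-t1)-(t3-t2))/2
    ss = cr - net // 2                       # t3 = t4 - ((t4-t1)-(t3-t2))/2
    return cs, sr, ss, cr

# A skewed server clock: sr/ss appear to happen *before* cs.
cs, sr, ss, cr = correct(100, 40, 60, 140)   # server_cost=20, net=20
assert (cs, sr, ss, cr) == (100, 110, 130, 140)
assert cs < sr < ss < cr
```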

Link tracking architecture

The architecture of Tianji Pavilion is similar to that of Alibaba's EagleEye in 2016 and is divided into four layers: data collection, real-time computing, data storage, and offline analysis. Details are shown in Figure 10.

Figure 10: Tianjige link tracking architecture

Data collection

Tianji Pavilion collects three types of data: trace data, metric data, and business logs. They are stored in four kinds of storage: HBase, Habo, ES, and disk, as shown in Figure 11.

Figure 11: Tianji Pavilion data collection architecture

  • Trace data: the span data described above, mainly used for link tracking and restoration; stored in HBase.
  • Metric data: per-interface metrics such as success counts, failure counts, return codes, and latency distributions; stored in the Habo system.
  • Business logs: divided into cold and hot. Cold data includes all logs and is stored on disk. Hot data refers to logs of sampled traces and of requests where errors occurred; it is stored in the ES system.

These three kinds of data work together to improve monitoring and fault location. First, JStorm computes the metric data in real time and generates an alert when it detects an anomaly. A developer can click the alert to open the link tracking view and find the point of failure, then check the business logs related to the alert to roughly locate its cause (see Figure 3).

The focus of data collection is achieving low intrusion and low overhead. Only when these two design goals are met will businesses choose to connect to Tianji Pavilion. Here we focus on how trace data collection achieves them.

Low intrusion: this part is relatively simple. Tianji Pavilion chose to instrument the bottom layer of the srf framework. The Value-Added Product Department uses the srf framework uniformly, so once a business upgrades to the new srf framework, a simple rebuild is enough to connect to Tianji Pavilion painlessly.

Low overhead: overhead here means "the performance degradation of the monitored system caused by generating and collecting trace data". We want this overhead to be negligible. The biggest cost of trace data collection is the creation and destruction of spans: creating and destroying a root span takes 205 ns on average, while the same operations on other spans take 172 ns. The difference comes from the root span having to allocate a globally unique ID for the trace. Reducing span generation therefore greatly reduces overhead. Tianji Pavilion uses the following four methods to keep the cost minimal while ensuring that important events are not missed.

  • Sampled reporting: link tracking does not require every request to be reported. Tianji Pavilion's initial sampling rate was 1/1024. This simple scheme is very effective for high-throughput services, but such a rate easily misses important events in low-throughput services, which in fact can tolerate higher sampling rates. In the end we adopted an expected sampling rate per unit of time: the number of sampled traces per unit time is held near a target, so low-traffic, low-load services automatically get a higher sampling rate, while high-traffic, high-load services get a lower one, keeping the overhead under control.

(Note: sampling requires that the multiple RPCs of one request are either all sampled or all unsampled; otherwise the link cannot be connected. Tianji Pavilion samples at the request entry point and transmits the sampling flag downstream via in-band data, ensuring that one request uses a single sampling decision.)
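A minimal sketch of such rate-targeted sampling, assuming a per-second budget. The class name and window logic are illustrative assumptions, not Tianji Pavilion's actual implementation:

```python
import time

# Hypothetical sketch of rate-targeted sampling: instead of a fixed
# 1/1024 probability, aim for an expected number of sampled traces per
# second, so low-traffic services get a higher effective rate and
# high-traffic services a lower one.
class AdaptiveSampler:
    def __init__(self, target_per_sec=2):
        self.target = target_per_sec
        self.window_start = time.monotonic()
        self.seen = 0

    def should_sample(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:   # start a new 1-second window
            self.window_start, self.seen = now, 0
        self.seen += 1
        # Sample while this second's budget is not exhausted; the
        # decision is made once at the entry point and carried
        # downstream in-band so a request is all-or-nothing.
        return self.seen <= self.target

sampler = AdaptiveSampler(target_per_sec=2)
decisions = [sampler.should_sample() for _ in range(5)]
assert decisions == [True, True, False, False, False]
```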

  • Dyed reporting: Tianji Pavilion supports a dyeing (forced sampling) mechanism. Currently dyeing by uin is recommended; requests from the company's internal accounts are always sampled and reported.
  • Error reporting: the original purpose of link tracking is to help developers locate and analyze problems, so requests that hit errors must be reported. Two points deserve attention.
  1. First, preventing avalanches: if a back-end service fails, all front-end requests would be reported, and since reporting consumes considerable performance, this could trigger a service avalanche. Tianji Pavilion therefore caps reporting: once more than 50 reports are sent in one second, further reports are dropped.
  2. Second, reverse generation: to reduce cost, unsampled RPCs do not generate a TraceID or SpanID. If an error occurs in an unsampled RPC, the call relationship must be reconstructed backwards, from callee to caller. In this case Tianji Pavilion has the callee generate an ID on behalf of the parent span and pass the parent spanID back to the caller in the reply packet; the caller builds the call relationship from it. Note that only branches with errors are reported; error-free branches are omitted, so the tracing tree degenerates into a tracing line, which is still enough to locate the problem.
  • Lock-free shared-memory reporting: the reporting API serializes the span to binary with jce and writes it to a local lock-free shared-memory queue; an agent then consumes the shared memory and reports to Kafka in batches. The queue is an open-source high-performance lock-free shared-memory queue with throughput of up to 8 million operations per second.
  • Performance loss verification: with Tianji Pavilion reporting enabled, measured QPS dropped by less than 3%. The test server's logic reads a small piece of cmem data (a few bytes) and returns it, a fairly simple operation; with comparable CPU consumption, QPS dropped by about 2.9% (details in the table below). The more complex the business logic, the smaller the relative cost of reporting and the less noticeable the performance loss.

Figure 12 Performance verification table

Real-time calculation

Why choose Jstorm

The computing layer of Tianji Pavilion has two main tasks:

  1. Merge the caller span and callee span of the same RPC into a merged span and store it in HBase.
  2. Aggregate metric data in minute-granularity time windows and notify the alerting service when an anomaly is found. The second task has strict real-time requirements, making a stream computing engine a good fit. After the comparison in the table below, we initially chose JStorm (in fact, Flink now beats JStorm in many respects, and we plan to migrate to Flink later).

Challenges of real-time computing

As a monitoring system, it needs to be real-time, consistent, and deterministic.

Real-time behavior is the easy part; JStorm naturally satisfies it. Currently Tianji Pavilion can raise an alert within about 3 minutes, 2 to 3 minutes faster than the module-call monitoring system.

Consistency means that the data processed by JStorm matches the data reported by the agents, which requires the system to be self-healing: for example, after JStorm restarts, the data can be recomputed and the monitoring curves must not dip. To ensure consistency, Tianji Pavilion takes the following measures:

  1. Deploy in multiple locations to achieve remote disaster recovery.
  2. Use zookeeper to ensure that there is only one master.
  3. Enable the ack mechanism to ensure that each piece of data is processed correctly once.

Note: the drawbacks of enabling ack are also obvious: JStorm's memory consumption increases significantly and its processing throughput drops. Since Tianji Pavilion's JStorm cluster is currently short of machines, the ack mechanism is temporarily disabled.

What is determinism? Suppose JStorm observes that the request count suddenly drops in a certain minute. Should an alert be sent? The decision is hard because we do not know whether the monitored object really has a problem or whether the monitoring system's own stream computing has a problem: is the stream computing system stuck somewhere, or is there a problem in the data collection channel? Tianji Pavilion implements a completeness algorithm to provide this determinism, somewhat similar to the snapshot algorithm adopted in Apache Flink. The data reported to Tianji Pavilion carries log timestamps, and JStorm's time windows are based on log time. Once a sufficient number of logs for the next minute have arrived, we consider the data of the previous minute complete, and an alert can be issued.
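The completeness check can be sketched as follows; the threshold value and the bucketing are illustrative assumptions:

```python
from collections import Counter

# Sketch of the completeness algorithm: a minute window is declared
# complete (and safe to alert on) only after a threshold of events
# carrying the *next* minute's log time has been seen. The threshold
# here is illustrative.
class CompletenessChecker:
    def __init__(self, threshold=100):
        self.threshold = threshold
        self.counts = Counter()              # events per minute bucket

    def observe(self, log_time_sec):
        self.counts[log_time_sec // 60] += 1

    def is_complete(self, minute):
        # Enough data from minute+1 implies minute's data has all
        # arrived, rather than being stuck in the pipeline.
        return self.counts[minute + 1] >= self.threshold

checker = CompletenessChecker(threshold=3)
for t in [60, 65, 118, 120, 121, 125]:       # seconds: minutes 1,1,1,2,2,2
    checker.observe(t)
assert checker.is_complete(1)                # 3 events from minute 2 arrived
assert not checker.is_complete(2)            # minute 3 has no events yet
```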

Storage selection

From Figure 11 we can see that link tracking stores three types of data: trace data, metric data, and log data.

The storage of trace data has the following requirements; after comparison, we chose HBase.

  1. Large capacity: the current daily report volume is 1 TB, and it will double once Tianji Pavilion is promoted more widely.
  2. Life cycle: trace data is only useful for a short time; we keep just the last 10 days of data and want expired data to be evicted automatically.
  3. High concurrency: trace data is write-heavy and read-light, and requires storage that supports highly concurrent writes.
  4. Read and write by key: data must be accessible with the TraceID as the key.

The volume of metric data is even larger, reaching 100 million records per minute at peak. Besides massive storage capacity, metric data storage requires streaming writes, real-time analysis, high availability, and multi-dimensional filtering. We finally chose Habo (an OLAP system built on Druid) to store metric data. Log data is stored on disk; to ease querying, hot logs related to traces are stored in ES.

Technical realization of capacity assessment

SNG services were originally deployed on a large number of virtual machines, making scaling out and in a heavy, time-consuming process. Yet the business frequently runs large-scale activities, and the microservice architecture is complex, so before each event development time had to be spent on architecture sorting and capacity evaluation. To address this pain point, Tianji Pavilion implements two capacity evaluation methods.

  • Entrance-based capacity assessment.
  • Capacity assessment based on business indicators.

Entrance-based capacity assessment

Tianji Pavilion achieves accurate capacity evaluation thanks to the following key inputs:

  1. Link tracking: automatically sorts out the downstream dependencies of a given entry point.
  2. Stress testing system: obtains the capacity bottleneck of each server.
  3. Metric statistics: the actual QPS and other metric data of each interface.
  4. TNM system: provides the deployment status of each module and the CPU consumption of each machine.

With the data above, we can calculate the current capacity of each module. For a capacity evaluation, developers only need to specify the request increment of the entry module; Tianji Pavilion then combines link tracking to accurately estimate the resulting request increment of each dependent module, and to judge whether those modules need to be expanded and by how much. So how is each module's request increment calculated? The principle is shown in Figure 13.

Figure 13: Link tracking topology diagram and conduction coefficient

The figure above is a topology diagram drawn by Tianji Pavilion through link tracking. In the figure, A, B, C, D, ... each represent a service.

The number on an edge, e.g. 4w, represents the rate at which an interface is called. For example, service A is called 40,000 times per second; service A calls service D 20,000 times per second; and service D is called 38,000 times per second in total.

The green numbers in the figure are the conduction coefficients. For example, the conduction coefficient from A to D is 0.5, because A is requested 40,000 times per second and in turn calls service D 20,000 times per second. With the topology diagram and the conduction coefficients, it is easy to estimate the request increments of downstream services from the entry point. As shown in Figure 13, suppose A's requests increase by 20,000/sec; then each downstream service gains the request increment marked in red, as shown in Figure 14.

Figure 14: Capacity assessment model
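The propagation of an entry increment through the conduction coefficients can be sketched like this. The graph and coefficients below are made up in the spirit of Figure 13, and a production version would walk services in topological order:

```python
# Sketch of entry-based capacity evaluation: propagate the entry QPS
# increment through the topology using the conduction coefficients
# (callee QPS gained per unit of caller QPS). The graph is a made-up
# illustration, not the actual Figure 13 data.
def propagate(graph, entry, delta_qps):
    increments = {entry: delta_qps}
    stack = [entry]
    while stack:
        svc = stack.pop()
        for callee, coeff in graph.get(svc, []):
            inc = increments[svc] * coeff
            increments[callee] = increments.get(callee, 0) + inc
            stack.append(callee)
    return increments

graph = {
    "A": [("D", 0.5), ("B", 1.0)],   # A->D coefficient 0.5, as in the text
    "B": [("D", 0.9)],
}
inc = propagate(graph, "A", 20000)   # A's requests grow by 2w/sec
assert inc["A"] == 20000
# D gains increments from both A (20000*0.5) and B (20000*1.0*0.9).
assert inc["D"] == 10000 + 18000
```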

Entrance-based capacity assessment is simple and accurate. Figure 15 shows an actual evaluation result from Tianji Pavilion.

Figure 15: Evaluation results of a certain activity capacity of Penguin E-sports

Capacity assessment based on business indicators

"Penguin E-sports" was the first business connected to Tianji Pavilion. It has a key metric, PCU: the number of users watching live broadcasts simultaneously. Most Penguin E-sports service requests are positively correlated with PCU, so Tianji Pavilion developed a PCU-based capacity evaluation model specifically for Penguin E-sports. Figure 5 shows Tianji Pavilion's PCU-based capacity evaluation results.

Pressure measurement system

Tianji Pavilion's stress testing system diverts live production traffic for stress testing, so there is no need to construct requests by hand: one click on the start button automatically completes the stress test and generates a detailed report. The report includes the stress test result details (Figure 16), a performance trend graph (Figure 17), a period-over-period comparison of stress test results (Figure 18), and other performance indicators.

Figure 16: Details of pressure test results

Figure 17: Pressure test performance trend

Figure 18: Period-over-period comparison of stress test results

Real-time alerting

Tianji Pavilion's alerting has two characteristics.

  • High real-time performance: thanks to JStorm's real-time computation, the alert delay of Tianji Pavilion is 2 to 3 minutes.
  • Alert convergence: Tianji Pavilion knows the dependencies between services, so it can converge alerts. If service D in Figure 14 fails, the three services A, B, and C will all be affected and become abnormal, but Tianji Pavilion generates only one alert and describes the relationship between A, B, C, and D, as shown in Figure 19.

Figure 19: Alarm convergence
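Alert convergence via the dependency graph can be sketched as follows; this is an illustration of the idea, not the actual alerting code:

```python
# Sketch of alert convergence: given the dependency graph and the set
# of currently abnormal services, report only the "deepest" abnormal
# services (those with no abnormal dependency of their own) as root
# causes; the upstream services' alarms are folded into that one alert.
def converge(depends_on, abnormal):
    roots = []
    for svc in abnormal:
        if not any(dep in abnormal for dep in depends_on.get(svc, [])):
            roots.append(svc)
    return sorted(roots)

# A, B, C all depend on D; D fails and drags the other three down.
depends_on = {"A": ["D"], "B": ["D"], "C": ["D"], "D": []}
alerts = converge(depends_on, abnormal={"A", "B", "C", "D"})
assert alerts == ["D"]   # one alert, naming D as the root cause
```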

Summary & plan

Legend has it that in the Tianji Pavilion there is a machine that controls everything in the world, and from which all things run. The "Tianji Pavilion" in this article is a monitoring system built by engineers of the Value-Added Product Department in their spare time, with the goal of keeping everything under control: through "Tianji Pavilion", developers can gain insight into the "secrets of heaven" and solve problems quickly.

At present, Tianji Pavilion has achieved good results in fault location, capacity evaluation, and link sorting, roughly equivalent to the level of Alibaba's EagleEye around 2014, but there is still a large gap to the industry's state of the art. The revolution is not yet complete, and comrades must keep working. We hope more enthusiasts will join the construction of Tianji Pavilion in 2019, complete the following plans, and glimpse the "secrets of heaven" as soon as possible.

  1. Promotion: integrate the Tianji Pavilion API into the taf framework to ease further adoption.
  2. Multi-language support: provide a Go language API.
  3. Modularization: turn collection, real-time computing, persistence, alerting, and other stages into building blocks, so businesses can configure the modules they need and customize the data processing pipeline.
  4. Platformization: allow businesses to develop their own view plug-ins on the Tianji Pavilion platform.
  5. Full-link stress testing: implement full-link stress testing based on the business topology diagram.
  6. Correlation identification: correlate traces with operational events (version releases, configuration changes, network failures, etc.) for preliminary root-cause analysis.
  7. Open-source collaboration.

