A detailed look at the CCE service: one-stop alarm configuration and the cloud-native log view

This article is shared from the Huawei Cloud Community post "CCE Service Logs and Alarms of a New Generation of Cloud Native Observable Platform", by Cloud Container Future.

Alarms and logs are the main tools operations staff use to locate problems quickly and recover from failures. A typical workflow starts with an alarm: operations staff receive it, make an initial judgment about the scope and impact of the anomaly, locate the root cause through the logs of the relevant components, and then restore the system. Providing a simple, easy-to-use platform for managing alarms and logs is therefore a major concern for every cloud-native platform.

Compared with traditional systems, cloud-native environments run a far larger number of applications, and the volume of operations data such as metrics, events, and logs is larger still. Alarm configuration also spans multiple systems: configuring alarm recipients involves the message notification system, metric threshold rules involve the monitoring system, and log keyword alarms involve the log management system. As a result, configuring alarms in a cloud-native environment is complex, requires jumping between different systems, and leaves the workflow fragmented.

Logs in cloud-native environments are similarly complex. They include container standard output logs, log files inside containers, node logs, and more. They may be spread across different hosts, and their locations are not fixed, which makes them hard to find. The key challenge for a log service is therefore to help operations staff quickly and accurately find the complete chain of logs around the time of a failure and present it clearly.

Figure 1  Challenges in logs and alarms

To address these alarm and log problems in cloud-native environments, the Huawei Cloud CCE service has launched the alarm center and the log center, which provide "one-stop alarm configuration" and a "cloud-native log view".

One-stop alarm configuration

To let users complete the basic alarm configuration of a system in very little time, the CCE and AOM services have launched a dedicated cloud-native alarm template that configures the alarm rules of a cloud-native system with one click. The template distills Huawei Cloud's day-to-day operations experience. It covers cluster failure events as well as common failure scenarios such as resource monitoring thresholds for clusters, nodes, and workloads. Users only need to enable the alarm center in CCE and bind the email address or mobile phone number of the person who should be notified of faults.

Figure 2  One-click activation

The alarm center also supports alarm notification group configuration, alarm rule configuration, and alarm viewing and tracing, so operations staff can configure and handle alarms in one place and close the loop.

The alarm center provides alarm notification groups based on the Huawei Cloud SMN service. With a notification group configured, the system can promptly notify the appropriate operations staff when a fault occurs, according to the type and severity of the problem, so that they can intervene.

Figure 3  Configure alarm notification group
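
To make the idea of routing by type and severity concrete, here is a minimal sketch. It does not use the CCE, AOM, or SMN APIs: `Alarm`, `NotificationGroup`, `route_alarm`, the severity names, and the example addresses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


# Hypothetical data model, not the SMN or AOM schema.
@dataclass(frozen=True)
class Alarm:
    name: str
    alarm_type: str  # assumed values: "event" or "metric"
    severity: str    # assumed values: "critical", "major", "minor"


@dataclass(frozen=True)
class NotificationGroup:
    name: str
    alarm_types: frozenset  # alarm types this group subscribes to
    severities: frozenset   # severities this group subscribes to
    contacts: tuple         # email addresses or phone numbers


def route_alarm(alarm, groups):
    """Return the de-duplicated contacts of every group matching the alarm."""
    contacts = []
    for group in groups:
        if alarm.alarm_type in group.alarm_types and alarm.severity in group.severities:
            for contact in group.contacts:
                if contact not in contacts:
                    contacts.append(contact)
    return contacts


# Example groups (illustrative only).
GROUPS = [
    NotificationGroup("on-call", frozenset({"event", "metric"}),
                      frozenset({"critical"}), ("oncall@example.com",)),
    NotificationGroup("platform-team", frozenset({"metric"}),
                      frozenset({"critical", "major"}), ("platform@example.com",)),
]

if __name__ == "__main__":
    print(route_alarm(Alarm("NodeCPUHigh", "metric", "major"), GROUPS))
```

In this sketch a major metric alarm reaches only the platform team, while a critical one reaches both groups.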

Alarm rules can be deployed with one click from the alarm template, which covers the common metric alarms and event alarms in a cluster. Users can also select and configure individual alarm rules as they see fit.

Figure 4  Configuring alarm rules
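
As a rough illustration of what a metric threshold rule expresses, the sketch below fires when the most recent samples all exceed a threshold. The article does not describe how AOM evaluates rules, so the "N consecutive samples" condition, the `ThresholdRule` class, the `is_firing` function, and the metric name are assumptions made for this example.

```python
from dataclasses import dataclass


# Hypothetical rule shape, not the AOM alarm rule schema.
@dataclass(frozen=True)
class ThresholdRule:
    metric: str
    threshold: float
    consecutive: int  # number of most recent samples that must all breach


def is_firing(rule, samples):
    """Return True if the last `consecutive` samples all exceed the threshold."""
    if rule.consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < rule.consecutive:
        return False  # not enough data to decide yet
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)


# Example: CPU usage above 80% for three consecutive samples.
CPU_RULE = ThresholdRule("node_cpu_usage_percent", 80.0, 3)

if __name__ == "__main__":
    print(is_firing(CPU_RULE, [50, 85, 90, 95]))
    print(is_firing(CPU_RULE, [85, 90, 70]))
```

Requiring several consecutive breaches is one common way to avoid alarms on a single short spike; a template would ship a set of such rules with preset thresholds.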

When an alarm is triggered, the alarm recipients are notified promptly and can view and clear the alarm through the visual interface of the alarm center. To help users trace faults that have already occurred, the alarm center also supports viewing historical alarms that have been cleared.

Figure 5 Alarm list

Cloud native log view

To fit the characteristics of cloud-native workloads and help operations staff query logs quickly and locate faults accurately, the Huawei Cloud CCE service has launched the log center, which provides a dedicated page layout built around a cloud-native perspective.

Figure 6  Log Center

The log center supports filtering by K8s resource objects such as workloads and Pods. It also displays logs by category, including K8s control plane logs, audit logs, and business logs. The page as a whole is more concise, and key information such as the main log content and the associated K8s resources is more prominent, so operations staff can focus on the logs at the point of failure without distraction.

Figure 7  Multi-dimensional filtering
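
The multi-dimensional filtering described above can be sketched as a simple in-memory filter. This is not the log center's query interface: `LogRecord`, `filter_logs`, the field names, and the sample records are hypothetical and serve only to show how resource and category filters combine.

```python
from dataclasses import dataclass


# Hypothetical record shape, not the log center's actual schema.
@dataclass(frozen=True)
class LogRecord:
    log_type: str   # assumed values: "business", "audit", "control-plane"
    namespace: str
    workload: str
    pod: str
    message: str


def filter_logs(records, log_type=None, namespace=None, workload=None,
                pod=None, keyword=None):
    """Keep records matching every filter that is not None."""
    wanted = {"log_type": log_type, "namespace": namespace,
              "workload": workload, "pod": pod}
    result = []
    for record in records:
        # Skip the record if any requested field differs.
        if any(v is not None and getattr(record, k) != v for k, v in wanted.items()):
            continue
        if keyword is not None and keyword not in record.message:
            continue
        result.append(record)
    return result


# Sample records (illustrative only).
RECORDS = [
    LogRecord("business", "default", "web", "web-7d9f-abcde", "GET /healthz 200"),
    LogRecord("business", "default", "web", "web-7d9f-fghij", "connection timeout to db"),
    LogRecord("audit", "kube-system", "kube-apiserver", "kube-apiserver-0",
              "delete pod default/web-7d9f-abcde"),
]

if __name__ == "__main__":
    for record in filter_logs(RECORDS, workload="web", keyword="timeout"):
        print(record.pod, record.message)
```

Combining a workload filter with a keyword narrows the view to the one Pod that logged the timeout, which is the kind of focus the page layout aims for.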

The log center also provides configuration management for log collection policies and lets users freely choose which K8s resource objects to collect from. To lower the barrier to using logs further, the log center offers collection configuration templates for control plane logs, audit logs, and container standard output logs, each of which can be turned on or off with one click.

Figure 8  Collection template

This article has given a brief introduction to the capabilities of the alarm center and the log center. We hope these capabilities improve your operations experience, and we will keep optimizing them. We look forward to your trying them out and to your suggestions for improvement.

Origin my.oschina.net/u/4526289/blog/10151116