Exploring the Prerequisites for Putting AIOps into Practice


Based on the author's own understanding of the technology and the industry, this article explores the prerequisites for putting AIOps into practice in an enterprise.

Keywords: automated operations, AIOps, technical operations PaaS, BlueKing (蓝鲸).

Author: Zhang Min


AIOps concept

Gartner introduced the concept of AIOps in 2016. AIOps combines artificial intelligence with IT operations, and Gartner predicted that its adoption rate would reach 50% by 2020.


Put simply, AIOps builds on existing operations data (logs, monitoring data, application information, and so on) and uses machine learning to tackle the problems that automated operations alone cannot solve.


Software that merely contains some "algorithmic logic" is not real AIOps. The key test of whether something is truly AIOps is whether it can automatically learn patterns from data and use those patterns to recommend actions for the current environment.

Figure: Gartner's conceptual diagram of AIOps


Key points of the AIOps concept:

  1. Intelligent operations has a big data platform and a machine learning (algorithm) platform at its core.

  2. Intelligent operations must work together with the monitoring, service desk, and automation systems. It extracts data from the various monitoring systems, provides services to users, and drives the automation system to execute the operations decisions that its models produce.


AIOps applications:

By computing on and analyzing data, AIOps supports operations scenarios such as intelligent monitoring, intelligent fault analysis and handling, and intelligent IT knowledge graphs.


AIOps value:

Traditional operations faces a flood of multi-dimensional operations data. Losses must be stopped and decisions made quickly, yet analysis and judgment by human experts often take hours or longer.

AIOps mines operations data with machine learning to help people, or even replace them, in making faster and more effective decisions.


Putting intelligent operations into practice in the enterprise can improve service SLAs, improve the user experience, and shorten troubleshooting time, bringing value to the business. Ultimately, it can lead to truly unattended operations.


AIOps Applications

At present, large traditional enterprise customers explore and build AIOps mainly around the following areas:


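Detecting problems: machine-learning-based anomaly detection.

For example, the anomaly thresholds applied to monitoring data today are often static. They cannot account for change windows, special holidays, or the normal peaks and troughs of the business. Simple thresholds and period-over-period comparison algorithms cover only a limited range of cases and easily produce both missed alerts and false alerts.

KPI anomaly detection based on historical data or labeled samples can discover problems as soon as they occur. The detection model can cover most curve types and adapts well to changes over the business lifecycle.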
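To make the contrast between a static threshold and a baseline derived from history concrete, here is a minimal Python sketch. It is not the detection model of any particular product. The function names, the window size, and the factor `k=3.0` are assumptions chosen for illustration. A point is flagged when it deviates from the rolling median of the recent history by more than `k` times the median absolute deviation (MAD).

```python
from statistics import median


def static_threshold_alerts(values, threshold):
    """Return the indices whose value exceeds a fixed upper threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def rolling_mad_alerts(values, window=6, k=3.0):
    """Return the indices that deviate strongly from their recent history.

    For each point, the baseline is the median of the previous `window`
    points and the tolerance is k * MAD (median absolute deviation).
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        base = median(history)
        mad = median([abs(v - base) for v in history])
        # Guard against a perfectly flat history (MAD == 0).
        tolerance = k * max(mad, 1e-9)
        if abs(values[i] - base) > tolerance:
            alerts.append(i)
    return alerts


# Hypothetical KPI (for example requests per second) with a sudden drop.
kpi = [10, 11, 10, 12, 11, 10, 11, 2, 11, 10, 11, 11]

static_alerts = static_threshold_alerts(kpi, threshold=30)
dynamic_alerts = rolling_mad_alerts(kpi)
```

On this made-up series the fixed upper threshold of 30 never fires, so the drop at index 7 is missed, while the history-based rule flags it. A real detector would also need to handle seasonality and planned changes, which this sketch does not attempt.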


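Root cause analysis: machine-learning-based fault tree mining to locate the source of a fault and its cause.

For example, the first step is precise fault localization: when a business anomaly is detected across multiple metrics, determine which specific metric caused it. Then, using fault tree mining and a knowledge graph, carry out precise root cause analysis and localization.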
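The first step described above, narrowing a multi-metric anomaly down to the metric most likely responsible, can be sketched roughly as follows. This is only an assumption-laden illustration: it ranks metrics by a z-score against their own history and does not implement fault tree mining or a knowledge graph. The metric names and values are hypothetical.

```python
from statistics import mean, pstdev


def rank_suspect_metrics(history, current):
    """Rank metrics by how far the current value is from its own history.

    history: dict mapping metric name -> list of past values
    current: dict mapping metric name -> latest value
    Returns a list of (metric, z_score), largest |z_score| first.
    """
    scores = []
    for name, past in history.items():
        mu = mean(past)
        sigma = pstdev(past) or 1e-9  # avoid division by zero on flat series
        scores.append((name, (current[name] - mu) / sigma))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical metrics behind one business KPI.
history = {
    "cpu_util": [40, 42, 41, 39, 40],
    "db_latency_ms": [20, 21, 19, 20, 20],
    "error_rate": [1, 1, 2, 1, 1],
}
current = {"cpu_util": 43, "db_latency_ms": 95, "error_rate": 2}

ranking = rank_suspect_metrics(history, current)
top_suspect = ranking[0][0]
```

Here `db_latency_ms` ranks first because its latest value is far outside its usual range. Such a ranking only points to where to look; establishing the actual cause still needs the dependency knowledge that fault trees and knowledge graphs encode.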


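Predicting the future: metric forecasting based on machine learning models.

For example, a variety of regression and statistical methods can forecast business data at different levels of granularity, including business metric forecasting and capacity forecasting, such as forecasting component capacity and resource capacity for the Double 11 (Singles' Day) shopping event.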
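As a minimal illustration of regression-based capacity forecasting, the sketch below fits a straight-line trend by ordinary least squares and extrapolates it. It assumes evenly spaced samples and a linear trend, which is rarely sufficient for event traffic such as Double 11. The function names and the disk-usage numbers are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares fit of y = slope * t + intercept for t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return slope, intercept


def forecast(values, steps_ahead):
    """Extrapolate the fitted trend `steps_ahead` periods past the last sample."""
    slope, intercept = fit_linear_trend(values)
    t = len(values) - 1 + steps_ahead
    return slope * t + intercept


# Hypothetical daily disk usage in GB.
disk_usage_gb = [100, 110, 120, 130, 140]

slope, intercept = fit_linear_trend(disk_usage_gb)
usage_in_3_days = forecast(disk_usage_gb, steps_ahead=3)
```

With this perfectly linear sample the fitted slope is 10 GB per day, so the three-day forecast is 170 GB. In practice the forecast would be compared against the provisioned capacity to decide when to scale.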


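IT-assisted decision support: go deep into business operations scenarios and apply IT to support operational decisions.

Examples include revenue forecasting and public opinion analysis and forecasting.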


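Algorithms can be obtained through cooperation with academia or from the community. In the early stage, when training data sets and feedback data are scarce, unsupervised learning is used. Concretely, pattern recognition techniques determine whether metrics are correlated, and the correlation is measured by the similarity distance between time series curves.


Machine learning libraries provide various algorithms for computing the similarity of time series curves, such as Euclidean distance, Manhattan distance, and Minkowski distance.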
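The three distances named above are closely related: Minkowski distance with p = 1 is Manhattan distance and with p = 2 is Euclidean distance. The sketch below implements them in plain Python for two equally long, time-aligned series. The function names are illustrative and are not taken from any specific library.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equally long series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan_distance(a, b):
    """Sum of absolute differences (Minkowski with p = 1)."""
    return minkowski_distance(a, b, 1)


def euclidean_distance(a, b):
    """Square root of summed squared differences (Minkowski with p = 2)."""
    return minkowski_distance(a, b, 2)


# Two hypothetical metric curves sampled at the same timestamps.
series_a = [1, 2, 3]
series_b = [4, 6, 3]

d_manhattan = manhattan_distance(series_a, series_b)   # |3| + |4| + |0|
d_euclidean = euclidean_distance(series_a, series_b)   # sqrt(9 + 16 + 0)
```

A smaller distance means the curves are more similar. Because these distances depend on scale, metrics with different units are usually normalized before they are compared.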


Once enough data has accumulated, the algorithms evolve toward supervised learning, such as random forest, GBDT (Gradient Boosted Decision Tree), and neural networks.


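AIOps requirements on infrastructure

From a technical point of view, AIOps needs two core elements: data and algorithm models. Data must be supported by a complete operations big data system, while algorithm models must be supported by a complete mining framework, together with an automation system that carries out decisions.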


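Operations big data:

It needs to integrate many kinds of data sources and provide one-stop, low-barrier data development as well as unified, diverse data storage and querying.


Data mining:

It needs end-to-end, visual data modeling, with support for multiple machine learning frameworks, an interactive modeling IDE, and visual sample labeling.


Automation system:

It needs to integrate the enterprise CMDB, job execution, an orchestration engine, and custom scenarios.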


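More fundamentally, these functional modules must interact effectively rather than exist as isolated pieces. A platform architecture is needed to support each customized scenario, and in particular to break down data silos and functional silos. Only then can the intelligent operations lifecycle be put into practice effectively:

Data collection → data modeling → machine learning and mining → automated execution → feedback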
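The lifecycle above can be pictured as a closed loop in which each stage consumes the output of the previous one. The following toy sketch shows only that shape. Every function, host name, and number in it is hypothetical, and each stage is a stand-in for what would be a full platform module in practice.

```python
from statistics import mean


def collect():
    """Data collection: returns hypothetical CPU samples per host."""
    return {"web-1": [35, 38, 36, 90], "web-2": [30, 31, 29, 32]}


def build_model(samples):
    """Data modeling: the baseline is the mean of all but the latest sample."""
    return {host: mean(vals[:-1]) for host, vals in samples.items()}


def detect(samples, baselines, factor):
    """Mining: flag hosts whose latest sample exceeds factor * baseline."""
    return [h for h, vals in samples.items() if vals[-1] > factor * baselines[h]]


def execute(hosts):
    """Automated execution: only records the action that would be run."""
    return [f"restart-service:{h}" for h in hosts]


def feedback(factor, false_positives):
    """Feedback: loosen the rule slightly for each confirmed false alarm."""
    return factor + 0.1 * false_positives


factor = 1.5
samples = collect()
baselines = build_model(samples)
anomalous_hosts = detect(samples, baselines, factor)
actions = execute(anomalous_hosts)
factor = feedback(factor, false_positives=0)
```

The point of the sketch is the data flow between the stages, not the rule itself: if collection, modeling, mining, and execution live in separate silos, the feedback step has nothing to connect to.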


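Tencent BlueKing (蓝鲸), a PaaS development framework that Tencent IEG uses internally to build an integrated enterprise R&D and operations system, decouples atomic capabilities from scenarios and can therefore fully support the AIOps lifecycle.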

Figure: BlueKing PaaS supporting AIOps in practice


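PaaS capability module layer:

1. The control module uses agents, common protocols, and API interfaces to bring under management all enterprise IT resources that need unified operations. These include servers, storage, networks, virtualization platforms, databases, middleware, base applications, business applications, cloud management platforms, and containers in public, private, or hybrid clouds. It provides one unified pipeline for ingesting data and one unified pipeline for executing commands.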


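2. Each atomic platform in the platform layer is a centralized implementation of one or more related functions:

Configuration module (CMDB):

The central store and consumption point for the configuration information of all enterprise IT objects.


Job module:

The job center for automated orchestration at the level of script execution and file distribution against IT objects.


Orchestration module:

A cross-system orchestration and scheduling engine that enables operations work covering full-lifecycle scenarios.


Data ingestion, development, and storage:

The operations big data platform. It is the center for ingesting, cleaning, storing, computing on (in real time and offline), presenting, and consuming operations and business data, and it is the key to data-driven operations and decision support.


AI mining:

End-to-end, visual data modeling, with support for multiple machine learning frameworks, an interactive modeling IDE, visual sample labeling, and user-written algorithms.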


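PaaS architecture layer

iPaaS layer:

The API Gateway (unified access module) provides unified access to the atomic platforms, such as the configuration management (CMDB) platform, job platform, data platform, and mining platform. It integrates, drives, and schedules them so that the SaaS applications for operations scenarios in the upper layer can drive and call them.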
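The idea of a single access point in front of several atomic platforms can be sketched as a small dispatcher. This is a conceptual illustration only: the class, the platform and action names, and the handlers are all hypothetical and do not reflect BlueKing's actual API.

```python
class ApiGateway:
    """Toy gateway that routes (platform, action) calls to registered handlers."""

    def __init__(self):
        self._handlers = {}

    def register(self, platform, action, handler):
        """Expose a platform capability under a (platform, action) key."""
        self._handlers[(platform, action)] = handler

    def call(self, platform, action, **params):
        """Invoke a registered capability; unknown routes raise LookupError."""
        try:
            handler = self._handlers[(platform, action)]
        except KeyError:
            raise LookupError(f"no handler for {platform}.{action}") from None
        return handler(**params)


def list_hosts(app):
    """Stand-in for a CMDB query; the data is made up."""
    inventory = {"shop": ["10.0.0.1", "10.0.0.2"]}
    return inventory.get(app, [])


def run_script(hosts, script):
    """Stand-in for a job platform call; nothing is actually executed."""
    return {host: f"ran {script}" for host in hosts}


gateway = ApiGateway()
gateway.register("cmdb", "list_hosts", list_hosts)
gateway.register("job", "run_script", run_script)

# An upper-layer scenario app composes atomic capabilities through one entry point.
hosts = gateway.call("cmdb", "list_hosts", app="shop")
result = gateway.call("job", "run_script", hosts=hosts, script="clear_cache.sh")
```

Because the scenario code only talks to the gateway, an atomic platform can be replaced or extended without changing the applications built on top of it, which is the decoupling the text describes.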


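aPaaS developer center:

The developer center provides a complete front-end and back-end development framework. When new operations requirements arise, the enterprise can use the developer center to build the corresponding operations system quickly and deploy it with one click.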


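Operations scenario application layer

All operations scenarios on the platform run at this layer, including configuration management and consumption, IT monitoring and fault self-healing, operations automation, operations process management, data analysis, and intelligent operations scenarios.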


Figure: Tencent BlueKing atomic platform, data platform architecture


Figure: Tencent BlueKing atomic platform, mining module example


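Exploring the prerequisites for putting AIOps into practice

Overall, introducing and using AIOps does require certain conditions, but an enterprise does not need to have everything ready before it starts.


For example, many enterprises feel that they must have complete data and the right talent before they can begin applying AIOps. However, what counts as complete data only becomes clear after some exploration; what matters most in AIOps talent is an understanding of intelligent operations scenarios; and algorithms only perform well after continuous tuning against real conditions.


As long as there is a pain point and intelligent operations can deliver value, AIOps can be introduced first and can then gradually drive the development of intelligent operations in the enterprise.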


In summary, the prerequisites for putting AIOps into practice fall into three areas: the infrastructure platform, algorithms, and people and organization.


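None of the three, however, has to be fully in place before practice can begin:


Infrastructure platform:

Construction can start from automation capability and integrated data capability, rather than from building, on day one, a model design framework that is simple and easy for operations staff to use.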


Algorithms:

Many general-purpose algorithms for the operations domain are already available. Enterprises can cooperate with academia, the community, and parties such as Tencent that have practical experience putting these algorithms into production. After an algorithm is introduced, it needs continuous debugging and tuning to reach better accuracy. Enterprises can also build algorithms in-house, but algorithm work is essentially scientific research, so from a business standpoint the decision should weigh overall cost-effectiveness.


People and organization:

When preparing the team, focus on generalist operations staff who work across technical fields. They understand operations scenarios better and know which problems intelligent operations can actually solve, rather than pursuing intelligence for its own sake.


The above draws on several references and research materials, combined with the author's personal impressions and understanding from business experience. Comments and discussion are welcome.



Origin blog.51cto.com/11811406/2421155