A Brief Look at CAPES and Its AI Characteristics

1. System Components

Figure 1: CAPES architecture

The system consists of two parts: the CAPES tuning system and the target system.

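The CAPES tuning system has three components: the control node, the Replay DB, and the control model. The control node in turn contains the Interface Daemon, the DRL Engine, and the Action Checker.
The target system runs two agents: a Monitor Agent and a Control Agent.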

The system runs as follows:

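The Monitor Agent periodically sends the performance indicators of the target system's nodes, together with the reward, to the Interface Daemon of the CAPES tuning system.
The Interface Daemon writes the received performance data into the Replay DB.
The DRL Engine reads the performance data from the Replay DB and uses it for training.
At fixed intervals, the DRL Engine sends an action to the target system's Control Agent through the Interface Daemon. These actions are also stored in the Replay DB for experience replay.
Before an action is sent to the target system, the Action Checker pre-screens it and rejects actions that are plainly unrealistic, such as lowering the CPU clock frequency to 0.
Finally, the Control Agent receives the action and adjusts the target node accordingly.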
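The flow above can be sketched as a single control tick. This is only an illustration of how the pieces hand data to one another, not the CAPES implementation: the class and function names, the placeholder policy, and the sanity rule are all hypothetical, and the parameter name is borrowed from the test setup described later in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    node: str
    metrics: dict   # performance indicators reported by the Monitor Agent
    reward: float


@dataclass
class ReplayDB:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)


def drl_engine_propose(replay_db):
    # Placeholder policy (assumption): a real DRL Engine would query a trained network.
    last = replay_db.observations[-1]
    return {"node": last.node, "param": "max_rpc_in_flight", "value": 8}


def action_is_sane(action):
    # Hypothetical Action Checker rule: reject non-positive parameter values.
    return action["value"] > 0


def control_tick(replay_db, observation, apply_action):
    """One pass through the flow: store, propose, check, record, apply."""
    replay_db.observations.append(observation)   # Interface Daemon -> Replay DB
    action = drl_engine_propose(replay_db)        # DRL Engine reads the Replay DB
    if not action_is_sane(action):                # Action Checker pre-screening
        return None
    replay_db.actions.append(action)              # kept for experience replay
    apply_action(action)                          # Control Agent adjusts the node
    return action
```

Here `apply_action` stands in for the Control Agent; in a test it can be as simple as `list.append`.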

Some analysis

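Why use a Replay DB? Deep reinforcement learning is known to be unstable when a nonlinear function approximator is used. To prevent overfitting and dampen this instability, the learning rate of the target Q-network has to be kept low. Storing the system state, action, and reward of every step in a database and replaying them later effectively breaks the temporal correlation that arises when a system trains on consecutive states.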
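A minimal sketch of experience replay, assuming an in-memory buffer rather than the database CAPES uses. The class name, capacity, and transition layout are illustrative choices. The point it shows is that training batches are drawn uniformly at random, so consecutive (and therefore correlated) steps are unlikely to end up in the same batch in their original order.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self._items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._items.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the time ordering.
        pool = list(self._items)
        return rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return len(self._items)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible in experiments.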

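Why use an Action Checker? During learning the program may propose actions whose outcome would be extremely bad and that are not feasible in a real deployment, so such actions need to be filtered out before they are applied.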
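One simple way to picture such a checker is a table of allowed ranges per parameter. The sketch below assumes exactly that; the parameter name and the bounds are hypothetical and are not taken from CAPES.

```python
# Hypothetical bounds: (lowest allowed value, highest allowed value).
LIMITS = {
    "cpu_freq_mhz": (800, 3300),
}


def check_action(param, value, limits=LIMITS):
    """Return True only if the parameter is known and the value is in range."""
    if param not in limits:
        return False
    low, high = limits[param]
    return low <= value <= high


def filter_actions(candidates, limits=LIMITS):
    """Keep only the (param, value) pairs that pass the check."""
    return [(p, v) for p, v in candidates if check_action(p, v, limits)]
```

With these bounds, a proposal to set the CPU clock to 0 is dropped, while a value inside the range passes through unchanged.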

2. AI Feature Analysis

Both raw system statuses and secondary system statuses (those derived from raw statuses) can be included.
Examples of raw system status include the number of CPUs, CPU utilization, free memory, the separate read/write I/O rate of each storage device, and buffer size.
Examples of secondary system status include the total number of active threads, inbound/outbound buffer size, congestion window size, and packet sizes.
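As a rough illustration of how such statuses might be fed to a model, the sketch below flattens raw and secondary statuses into one numeric vector with a fixed key order. The function name and the example keys are assumptions made for this note, not something prescribed by CAPES.

```python
def build_feature_vector(raw, secondary, raw_keys, secondary_keys):
    """Concatenate raw and secondary statuses in a fixed, explicit order.

    A fixed key order matters: the model sees positions, not names, so every
    observation must place the same indicator at the same index.
    """
    raw_part = [float(raw[k]) for k in raw_keys]
    secondary_part = [float(secondary[k]) for k in secondary_keys]
    return raw_part + secondary_part


# Hypothetical example values.
raw = {"cpu_utilization": 0.5, "free_memory_mb": 1024}
secondary = {"congestion_window": 8}
vector = build_feature_vector(
    raw, secondary,
    raw_keys=["cpu_utilization", "free_memory_mb"],
    secondary_keys=["congestion_window"],
)
```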

Limitations of existing tuning approaches

Nonlinear behavior: hard to predict (environment, noise); when systems are pushed to the limit, the efficiency of many components can drop rapidly.

Delay between an action and the resulting change: hard to debug, and the delay can vary in length.

The parameter space is huge: tuning takes much effort and is time consuming.

Requires domain knowledge.

Must be workload-responsive, and workloads are not stable.

Difficult to trace back.

No high-precision model; multiple objectives; distributed system (must be scalable).

NEXP-hard problem: it needs access to the entire history of observations, so approximation is required.

3. Measuring System Performance

Energy usage, operations per second, data transfer throughput, latency, etc.
Benchmarks (to be filled in after further study)

4. Test Results and Discussion

Test setup
Test system
A distributed file system (reason: each node's I/O requests can be spread across servers in parallel)
Test parameters
max_rpc_in_flight: Lustre congestion window size.
I/O rate limit: how many outgoing I/O requests are allowed per second.
Hardware configuration
Four dedicated servers and five dedicated clients.
Hardware: an Intel Xeon [email protected], 16 GB RAM, and one Intel 330 SSD for the OS.
The network is gigabit Ethernet with a measured peak aggregate throughput of ~500 MB/s.
Each storage server node uses one 7200 RPM HGST Travelstar Z7K500 hard drive, whose raw I/O performance is measured at 113 MB/s for sequential read and 106 MB/s for sequential write.
No workload is memory intensive.

Test results:

Figure 2

Figure 3

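Three workloads were tested: random read/write, Filebench file server, and sequential read/write.
On the random read/write workload, CAPES brings no clear gain in read-heavy scenarios and can even lower throughput. Write-heavy scenarios benefit considerably: with a read:write ratio of 1:9, throughput improves by up to 45% after 12 hours of training.
On the Filebench file server and sequential read/write workloads, 12 hours of training yields almost no improvement, while 24 hours of training yields an improvement of about 17%.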
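To make the percentages concrete, the sketch below computes a relative throughput gain over a baseline. It assumes the figures above are relative gains of tuned over default throughput, which is this note's reading; the sample throughput values are hypothetical and chosen only to reproduce a 45% gain.

```python
def relative_gain(baseline, tuned):
    """Relative improvement of tuned over baseline throughput, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (tuned - baseline) / baseline


# Hypothetical numbers: 100 MB/s with default settings, 145 MB/s after tuning.
gain = relative_gain(100.0, 145.0)
print(f"{gain:.0%}")
```

A negative result would correspond to the read-heavy case, where tuning can lower throughput.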

Figure 4

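To check for overfitting, three sessions were chosen, with a large number of unrelated file operations carried out between sessions. Each session lasts four hours: two hours with the default settings and two hours with the CAPES-tuned policy. Training ran continuously for two weeks. The tests showed no sign of overfitting.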

5. Summary and Reflections

Cluster parameter tuning
Tuning algorithms
Structure of the paper's argument

Next steps:

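Study CPU load formulas and look for parameters that can be tuned by hand
Explore other tuning algorithms
Environmental noise and workload characteristics
Learn about benchmarks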

Reference:

Yan Li, Kenneth Chang, Oceane Bel, Ethan L. Miller, and Darrell D. E. Long. 2017. CAPES: Unsupervised Storage Performance Tuning Using Neural Network-Based Deep Reinforcement Learning. In Proceedings of SC17, Denver, CO, USA, November 12–17, 2017, 14 pages. DOI: 10.1145/3126908.3126951
