ϵ-Diagnosis: An Unsupervised Method for Diagnosing Small-window Long-tail Latency in Large-scale Microservice Systems

Paper title｜ ϵ-Diagnosis: Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms
Venue｜ WWW 2019
Paper link｜ https://monadyn.github.io/Papers/p3215-shan.pdf
Source code｜ Not released

TL;DR

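Diagnosing service-level objective (SLO) violations in large-scale web applications is difficult. The paper identifies the small-window long-tail latency (SWLT) problem and, to locate the root causes of SWLT, proposes a diagnosis algorithm, $\epsilon$-Diagnosis.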

Contributions

We identify a new type of tail latency problem, small-window long-tail latency (e.g., in a 1-minute or 1-second period), which has a heavy-tail and high-variance characterization.
We propose an unsupervised and low-cost root-cause analysis algorithm, ϵ-Diagnosis, which uses a two-sample test algorithm and ϵ-statistics for measuring the similarity of time series, to identify root causes of SWLT from millions of metrics for online web services at runtime.

Algorithm/Model

The paper's system framework is shown in the following figure (not reproduced here):

The next figure (not reproduced here) shows how the metrics of each container change in real data:

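The method rests on one assumption: when an anomaly occurs, the root-cause metrics in the faulty container change substantially. Root-cause metrics and containers can therefore be identified from the magnitude of change in each metric. The paper does this with $\epsilon$-statistics test (energy distance correlation) algorithms.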

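The algorithm consists of the following steps:

Detecting SWLT: this is effectively anomaly detection.
When the response time exceeds a given threshold, the alerting system is triggered and data analysis begins.
Selecting two samples from the snapshot:
Collect the metric values before and after the anomaly, for example throughput, QPS, concurrent load, response time, number of error logs, number of logs, and number of database connections. Each sample is represented as a time-series vector: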

$S_{(t)}=\left[x_{1}, x_{2}, \ldots, x_{n}\right]$
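The first two steps can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names `detect_swlt` and `split_snapshot`, the fixed window length, and all sample values are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def detect_swlt(response_times: List[float], threshold: float) -> Optional[int]:
    """Return the index of the first response time above the threshold, or None."""
    for i, rt in enumerate(response_times):
        if rt > threshold:
            return i
    return None


def split_snapshot(
    metric: List[float], onset: int, window: int
) -> Tuple[List[float], List[float]]:
    """Return (normal, abnormal): `window` samples before and from the onset."""
    normal = metric[max(0, onset - window):onset]
    abnormal = metric[onset:onset + window]
    return normal, abnormal


# Hypothetical data: response times in ms and one container metric (CPU usage).
response_times = [20, 22, 21, 23, 180, 210, 190, 205]
cpu_usage = [0.30, 0.32, 0.31, 0.29, 0.91, 0.95, 0.93, 0.96]

onset = detect_swlt(response_times, threshold=100)
if onset is not None:
    s_n, s_a = split_snapshot(cpu_usage, onset, window=4)
```

In this sketch each metric yields one pair of vectors, a normal-period sample and an abnormal-period sample, which is the input to the later comparison steps.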

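Two-sample null hypothesis test. Let $S_A$ be the sample from the abnormal period and $S_N$ the sample from the normal period. The null hypothesis is that the two samples follow the same distribution; it is rejected when the p-value $P$ falls below the significance level $\alpha$: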

$\left\{\begin{array}{l}{P<\alpha, S_{A} \neq S_{N}} \\ {P \geqslant \alpha, S_{A}=S_{N}}\end{array}\right.$
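One way to obtain such a p-value is a permutation test. The sketch below uses the energy distance between the two samples as the test statistic; the choice of statistic, the number of permutations, the significance level, and the sample values are all assumptions for illustration, since these notes do not record the exact procedure used in the paper.

```python
import random
from typing import List, Sequence


def mean_abs_diff(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average |x - y| over all pairs (x in xs, y in ys)."""
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Energy statistic 2E|X-Y| - E|X-X'| - E|Y-Y'| for two 1-D samples."""
    return 2 * mean_abs_diff(xs, ys) - mean_abs_diff(xs, xs) - mean_abs_diff(ys, ys)


def permutation_p_value(
    s_a: Sequence[float], s_n: Sequence[float], n_perm: int = 999, seed: int = 0
) -> float:
    """Estimate P(statistic >= observed) under random relabeling of the pool."""
    rng = random.Random(seed)
    observed = energy_distance(s_a, s_n)
    pooled: List[float] = list(s_a) + list(s_n)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_distance(pooled[:len(s_a)], pooled[len(s_a):]) >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so P is never 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical samples of one metric in the normal and abnormal periods.
s_n = [0.30, 0.32, 0.31, 0.29, 0.33, 0.28]
s_a = [0.91, 0.95, 0.93, 0.96, 0.94, 0.92]
alpha = 0.05  # assumed significance level
p_value = permutation_p_value(s_a, s_n)
differs = p_value < alpha  # True means the null hypothesis is rejected
```

A permutation test needs no distributional assumption about the metric, which fits the unsupervised setting, but its cost grows with the number of permutations and with the square of the window length.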

ϵ-Statistics (Energy distance correlation)

$\rho^{2}\left(S_{A}, S_{N}\right)=\left\{\begin{array}{ll}{\frac{\operatorname{cov}^{2}\left(S_{A}, S_{N}\right)}{\sqrt{\sigma^{2}\left(S_{A}\right) \sigma^{2}\left(S_{N}\right)}},} & {\sigma^{2}\left(S_{A}\right) \sigma^{2}\left(S_{N}\right)>0} \\ {0,} & {\sigma^{2}\left(S_{A}\right) \sigma^{2}\left(S_{N}\right)=0}\end{array}\right.$
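The formula above can be computed directly if $\operatorname{cov}^{2}$ and $\sigma^{2}$ are read as the sample squared distance covariance and distance variance (the standard definition of distance correlation). The sketch below makes that assumption and also assumes that both samples have the same length and are paired by index; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def _centered_distances(xs: Sequence[float]) -> List[List[float]]:
    """Pairwise |x_i - x_j| matrix, double-centered by row, column and grand mean."""
    n = len(xs)
    d = [[abs(a - b) for b in xs] for a in xs]
    row_mean = [sum(r) / n for r in d]  # equals the column means (d is symmetric)
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation_sq(s_a: Sequence[float], s_n: Sequence[float]) -> float:
    """rho^2(S_A, S_N) = cov^2 / sqrt(var_A * var_B), or 0 if the product is 0."""
    if len(s_a) != len(s_n):
        raise ValueError("both samples must have the same length")
    n = len(s_a)
    a = _centered_distances(s_a)
    b = _centered_distances(s_n)
    cov_sq = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    var_a = sum(v * v for r in a for v in r) / n**2
    var_b = sum(v * v for r in b for v in r) / n**2
    if var_a * var_b <= 0:
        return 0.0  # second branch of the piecewise definition
    return cov_sq / math.sqrt(var_a * var_b)


# Hypothetical series for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
linear = [2 * v + 1 for v in x]   # an affine transform of x
constant = [7.0] * len(x)         # a metric that never changes

rho_same = distance_correlation_sq(x, x)
rho_linear = distance_correlation_sq(x, linear)
rho_constant = distance_correlation_sq(x, constant)  # 0.0 via the zero-variance branch
```

The score lies in [0, 1]. How it is thresholded or combined with the hypothesis test to rank candidate root-cause metrics is not spelled out in these notes, so any ranking built on this function should be treated as an assumption.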