Paper title | ϵ-Diagnosis: Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms
Venue | WWW 2019
Paper link | https://monadyn.github.io/Papers/p3215-shan.pdf
Source code | Not released

TL;DR

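Diagnosing service-level objective (SLO) violations in large-scale web applications is difficult. The paper identifies the small-window long-tail latency (SWLT) problem and proposes a diagnosis algorithm, ϵ-Diagnosis, to locate the root causes of SWLT.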

Contributions

  • We identify a new type of tail latency problem, small-window long-tail latency (e.g., in an 1-minute or 1-second period), which has a heavy-tail and high-variance characterization.
  • We propose an unsupervised and low-cost root-cause analysis algorithm–ϵ-Diagnosis, using two-sample test algorithm and ϵ-statistics for measuring the similarity of time series, to identify root-causes of SWLT from millions of metrics for on-line web services at runtime.

Algorithm/Model

The system framework of the paper is as follows:

The following shows how the metrics of each container change in real data:

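The method rests on one assumption: when an anomaly occurs, the root-cause metrics of the faulty container change substantially. The root-cause metrics and the faulty container can therefore be identified from the magnitude of the metric changes. The paper does this with the ϵ-statistics test (energy distance correlation) algorithm.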

The algorithm consists of the following steps:

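  • Detecting SWLT: this amounts to anomaly detection.
    When the response time exceeds a preset threshold, the alerting system is triggered and data analysis begins.
  • Selecting two samples from the snapshot:
    Collect the metric values before and after the anomaly, e.g. throughput, QPS, concurrent loads, response time, number of error logs, number of logs, number of database connections, etc. Each metric is represented as a time-series vector: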

$$S_{(t)}=\left[x_{1}, x_{2}, \ldots, x_{n}\right]$$

  • Two-sample null hypothesis test.

$$\left\{\begin{array}{l}{P<\alpha, \quad S_{A} \neq S_{N}} \\ {P \geqslant \alpha, \quad S_{A}=S_{N}}\end{array}\right.$$

  • ϵ-Statistics (Energy distance correlation)

$$\rho^{2}\left(S_{A}, S_{N}\right)=\left\{\begin{array}{ll}{\frac{\operatorname{cov}^{2}\left(S_{A}, S_{N}\right)}{\sqrt{\sigma^{2}\left(S_{A}\right) \sigma^{2}\left(S_{N}\right)}},} & {\sigma^{2}\left(S_{A}\right) \sigma^{2}\left(S_{N}\right)>0} \\ {0,} & {\sigma^{2}\left(S_{A}\right) \sigma^{2}\left(S_{N}\right)=0}\end{array}\right.$$
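
To make the ρ² formula concrete, here is a minimal NumPy sketch of a squared distance correlation between two equal-length samples. It is an illustration only, not the paper's implementation: the function names are hypothetical, and it assumes that cov² and σ² in the formula are the sample distance covariance and distance variance computed from double-centered pairwise distance matrices.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise |x_i - x_j| matrix, double-centered (row, column, grand mean)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    d = np.abs(x - x.T)
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation_sq(s_a, s_n):
    """Squared distance correlation rho^2(S_A, S_N) for equal-length samples.

    Returns 0.0 when the product of the distance variances is zero,
    mirroring the second branch of the formula.
    """
    a = _centered_distances(s_a)
    b = _centered_distances(s_n)
    dcov2 = (a * b).mean()                      # cov^2(S_A, S_N)
    dvar_prod = (a * a).mean() * (b * b).mean() # sigma^2(S_A) * sigma^2(S_N)
    if dvar_prod <= 0:
        return 0.0
    return float(dcov2 / np.sqrt(dvar_prod))
```

Under these assumptions, calling `distance_correlation_sq(s_a, s_n)` on a metric's abnormal and normal windows gives a value in [0, 1]; a constant window on either side falls into the zero-variance branch.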

The overall flow of the algorithm is as follows:
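
As a rough illustration of how the steps could chain together, here is a minimal Python sketch. It is not the paper's implementation. All function names, the `window` length, and the default `alpha` are hypothetical, and the two-sample test is a standard permutation test on the energy distance, swapped in here as one plausible way to obtain the P value compared against α.

```python
import numpy as np


def energy_distance(x, y):
    """Energy distance 2*E|X-Y| - E|X-X'| - E|Y-Y'| for 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dxy = np.abs(x[:, None] - y[None, :]).mean()
    dxx = np.abs(x[:, None] - x[None, :]).mean()
    dyy = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * dxy - dxx - dyy


def two_sample_p_value(s_a, s_n, n_perm=300, seed=0):
    """Permutation p-value for H0: S_A and S_N follow the same distribution."""
    rng = np.random.default_rng(seed)
    s_a = np.asarray(s_a, dtype=float)
    s_n = np.asarray(s_n, dtype=float)
    observed = energy_distance(s_a, s_n)
    pooled = np.concatenate([s_a, s_n])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if energy_distance(perm[:len(s_a)], perm[len(s_a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def detect_swlt(response_times, threshold):
    """Step 1: index of the first response time above the threshold, or None."""
    over = np.flatnonzero(np.asarray(response_times, dtype=float) > threshold)
    return int(over[0]) if over.size else None


def diagnose(metrics, response_times, threshold, window, alpha=0.05):
    """Steps 2-3: compare each metric before/after the alert; keep P < alpha."""
    t = detect_swlt(response_times, threshold)
    if t is None or t < window:
        return []
    suspects = []
    for name, series in metrics.items():
        series = np.asarray(series, dtype=float)
        s_n = series[t - window:t]   # normal sample, before the alert
        s_a = series[t:t + window]   # abnormal sample, after the alert
        if len(s_a) < 2:
            continue
        p = two_sample_p_value(s_a, s_n)
        if p < alpha:                # reject H0: S_A != S_N
            suspects.append((name, p))
    return sorted(suspects, key=lambda item: item[1])
```

A call such as `diagnose({"cpu": cpu_series, "mem": mem_series}, response_times, threshold=500.0, window=30)` would return the metrics whose distribution changed around the alert, ordered by P value; the metric names and numbers here are placeholders.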

Experiment Detail

Thoughts

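The proposed method is simple and easy to use, but it depends on having many system metrics available. Our data contains only a few key metrics, so the method cannot characterize our anomalies well.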
