时间序列趋势检验方法 Mann-Kendall Test 及其 Python 实现

背景

一般的统计方法难以对趋势性指标进行异常检测，因此需要采用一种方法支持缓慢变化的异常进行识别，如下图所示。

本文采用 Mann-Kendall Test 趋势检验方法 来检验序列趋势，上升或者下降。

其实根据最小二乘法拟合曲线求斜率也可以，但是上升趋势不一定是线性的。

Mann-Kendall Test

理论

Mann-Kendall Test (Mann 1945, Kendall 1975, Gilbert 1987) 的目的是统计评估变量随着时间变化，是否有单调上升或下降的趋势。单调上升（下降）的趋势意味着该变量随时间增加（减少），但此趋势可能是/不是线性的。

Mann-Kendall Test 可替代参数线性回归分析——线性回归可检验线性拟合直线的斜率是否不为零。

回归分析要求拟合回归线的残差是正态分布的，Mann-Kendall Test 不需要这种假设，Mann-Kendall Test 是非参数检验（不要求服从任何分布）。

Mann-Kendall Test 的基础假设：

当没有趋势时，随时间获得的数据是独立同分布的。独立的假设是说数据随着时间不是连续相关的。
所获得的时间序列上的数据代表了采样时的真是条件。（样本具有代表性）
样本的采集、处理和测量方法提供了总体样本中的无偏且具有代表性的观测值。

Mann-Kendall Test 不要求数据是正态分布，也不要求变化趋势是线性的。如果有缺失值或者值低于一个或多个检测限制，是可以计算 Mann-Kendall Test 的，但检测性能会受到不利影响。独立性假设要求样本之间的时间足够大，这样在不同时间收集的测量值之间不存在相关性。

Mann-Kendall Test 是检验是否拒绝零假设(null hypothesis: $H_0$ )，并接受替代假设（alternative hypothesis: $H_a$ ）：

$H_0$ ：没有单调趋势
$H_a$ ：存在单调趋势

初始假设是： $H_0$ 为真，在拒绝 $H_0$ 并接受 $H_a$ 时，数据的统计值需要达到一定的置信度。

流程

包含 $n$ 样本点的时间序列 $x_1,x_2,…,x_n$ ；
确定所有 $n(n-1)/2$ 个 $x_j-x_k$ 差值的符号，其中 $j>k$ ；
令 $sgn(x_j-x_k)$ 作为指示函数，依据 $x_j-x_k$ 的正负号取值为 1， 0或-1；
计算
$S=\sum_{k-1}^{n-1} \sum_{j-k+1}^{n} sgn\left(x_{j}-x_{k}\right)$
即差值为正的数量减去差值为负的数量。
- 如果 $S$ 是一个正数，那么后一部分的观测值相比之前的观测值会趋向于变大；
- 如果 $S$ 是一个负数，那么后一部分的观测值相比之前的观测值会趋向于变小。
分两种情况
- 如果 $n\leq10$ ，需要在概率表 (Gilbert 1987, Table A18, page 272) 中查找 $S$ 。
  - 如果此概率小于 $\alpha$ (认为没有趋势时的截止概率)，那就拒绝零假设，认为趋势存在。
  - 如果在概率表中找不到 $S$ 「存在结数据（tied data values）情况」，就用表中的下一个值。
    比如 $S=12$ ，如果概率表中没有 $S=12$ ，那么就用 $S=13$ 来处理也是一样的。
- 如果 $n > 10$ ，则依以下步骤 6-10 来判断有无趋势；
计算 $S$ 的方差如下：
$\operatorname{VAR}(S)=\frac{1}{18}\left[n(n-1)(2 n+5)-\sum_{p-1}^{g} t_{p}\left(t_{p}-1\right)\left(2 t_{p}+5\right)\right]$
其中 $g$ 是结组（tied groups）的数量， $t_p$ 是第 $p$ 组的观测值的数量。
例如：在观测值的时间序列 $\{23, 24, 29, 6, 29, 24, 24, 29, 23\}$ 中有 $g = 3$ 个重复值（结值），相应地对于结值 (tied value) 23 有 $t1=2$ 、结值 24 有 $t2=3$ 、结值 29 有 $t3=3$ 。当因为有相等值或未检测到而出现结时， $VAR(S)$ 可以通过*Helsel (2005, p. 191)*中的结修正方法来调整。
计算 Mann-Kendall Test 统计量 $Z_{MK}$
$Z_{M K}=\left\{\begin{array}{cl} \frac{S-1}{\sqrt{V A R(S)}}, & S>0 \\ 0 & , \quad S=0 \\ \frac{S+1}{\sqrt{V A R(S)}}, & S<0 \end{array}\right.$
需要测试零假设：
- $H_0$ (没有单调趋势)对比替代假设
- $H_a$ (有单调增趋势)
其1型容忍概率为 $\alpha$ ， $0<\alpha<0.50$ 。
如果 $Z_{1-\alpha} \leq Z_{MK}$ ，则拒绝零假设 $H0$ ，接受替代假设 $H_a$ ，其中 $Z_{1−\alpha}$ 是标准正态分布的 $100(1−\alpha)^{th}$ 百分位。
如果 $Z_{MK}\leq -Z_{1−\alpha}$ ，就拒绝零假设 $H_0$ ，接受替代假设 $H_a$ -> 有单调递减趋势；
如果 $|Z_{MK}|≥Z_{1−\frac{\alpha}{2}}$ ，就拒绝零假设 $H0$ ，接受替代假设 $H_a$ ，其中竖线代表绝对值 -> 有单调递增或者递减趋势；

Python 实现

import math
from scipy.stats import norm, mstats
def mk_test(x, alpha=0.05):
    """
    This function is derived from code originally posted by Sat Kumar Tomer
    (satkumartomer@gmail.com)
    See also: http://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm
    The purpose of the Mann-Kendall (MK) test (Mann 1945, Kendall 1975, Gilbert
    1987) is to statistically assess if there is a monotonic upward or downward
    trend of the variable of interest over time. A monotonic upward (downward)
    trend means that the variable consistently increases (decreases) through
    time, but the trend may or may not be linear. The MK test can be used in
    place of a parametric linear regression analysis, which can be used to test
    if the slope of the estimated linear regression line is different from
    zero. The regression analysis requires that the residuals from the fitted
    regression line be normally distributed; an assumption not required by the
    MK test, that is, the MK test is a non-parametric (distribution-free) test.
    Hirsch, Slack and Smith (1982, page 107) indicate that the MK test is best
    viewed as an exploratory analysis and is most appropriately used to
    identify stations where changes are significant or of large magnitude and
    to quantify these findings.
    Input:
        x:   a vector of data
        alpha: significance level (0.05 default)
    Output:
        trend: tells the trend (increasing, decreasing or no trend)
        h: True (if trend is present) or False (if trend is absence)
        p: p value of the significance test
        z: normalized test statistics
    Examples
    --------
      >>> x = np.random.rand(100)
      >>> trend,h,p,z = mk_test(x,0.05)
    """
    n = len(x)

    # calculate S
    s = 0
    for k in range(n-1):
        for j in range(k+1, n):
            s += np.sign(x[j] - x[k])

    # calculate the unique data
    unique_x, tp = np.unique(x, return_counts=True)
    g = len(unique_x)

    # calculate the var(s)
    if n == g:  # there is no tie
        var_s = (n*(n-1)*(2*n+5))/18
    else:  # there are some ties in data
        var_s = (n*(n-1)*(2*n+5) - np.sum(tp*(tp-1)*(2*tp+5)))/18

    if s > 0:
        z = (s - 1)/np.sqrt(var_s)
    elif s < 0:
        z = (s + 1)/np.sqrt(var_s)
    else: # s == 0:
        z = 0

    # calculate the p_value
    p = 2*(1-norm.cdf(abs(z)))  # two tail test
    h = abs(z) > norm.ppf(1-alpha/2)

    if (z < 0) and h:
        trend = 'decreasing'
    elif (z > 0) and h:
        trend = 'increasing'
    else:
        trend = 'no trend'

    return trend