Definition of KL Divergence

Relative entropy, also known as Kullback-Leibler (KL) divergence, measures the difference between two probability distributions. Given two separate distributions $p(x)$ and $q(x)$ over the same random variable $x$, the KL divergence quantifies how far apart they are: the closer the two distributions, the smaller the KL divergence; the farther apart, the larger it becomes.

In machine learning, $p$ usually denotes the true distribution of the data and $q$ the distribution predicted by the model, so the KL divergence between them can serve as the loss. The formulas are as follows:

  • For discrete distributions:

$$KL(p \| q)=\sum p(x) \log \frac{p(x)}{q(x)}$$

  • For continuous distributions:

$$KL(p \| q)=\int p(x) \log \frac{p(x)}{q(x)} d x$$
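For discrete distributions, this can be computed directly with SciPy's `entropy` function, which returns the KL divergence when given two probability vectors (a small aside; SciPy is not otherwise used in this post, and the probability values here are arbitrary):

```python
from scipy.stats import entropy

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# With two arguments, entropy(p, q) computes sum(p * log(p / q)), i.e. KL(p || q);
# base selects the logarithm base (natural log by default)
print(entropy(p, q))
print(entropy(p, q, base=2))
```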

1-D Gaussian Distributions

Suppose we have two random variables $x_1$ and $x_2$ that follow the Gaussian distributions $N_{1}\left(\mu_{1}, \sigma_{1}^{2}\right)$ and $N_{2}\left(\mu_{2}, \sigma_{2}^{2}\right)$ respectively. How do we compute the KL divergence between these two distributions?

Recall the Gaussian density:

$$N(\mu, \sigma^{2})=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}}$$

Then $KL(p_1 \| p_2)$ equals:

$$\begin{array}{l} \int p_{1}(x) \log \frac{p_{1}(x)}{p_{2}(x)} d x \\\\ =\int p_{1}(x)\left(\log p_{1}(x)-\log p_{2}(x)\right) d x \\\\ =\int p_{1}(x) \times \left(\log \frac{1}{\sqrt{2 \pi \sigma_{1}^{2}}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}}-\log \frac{1}{\sqrt{2 \pi \sigma_{2}^{2}}} e^{-\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}}\right) d x \\\\ =\int p_{1}(x) \times \left(-\frac{1}{2} \log 2 \pi-\log \sigma_{1}-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}+\frac{1}{2} \log 2 \pi+\log \sigma_{2}+\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}\right) d x \\\\ =\int p_{1}(x)\left(\log \frac{\sigma_{2}}{\sigma_{1}}+\left[\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}\right]\right) d x \\\\ =\int\left(\log \frac{\sigma_{2}}{\sigma_{1}}\right) p_{1}(x) d x+\int\left(\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}\right) p_{1}(x) d x-\int\left(\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}\right) p_{1}(x) d x \\\\ =\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{1}{2 \sigma_{2}^{2}} \int\left(x-\mu_{2}\right)^{2} p_{1}(x) d x-\frac{1}{2 \sigma_{1}^{2}} \int\left(x-\mu_{1}\right)^{2} p_{1}(x) d x \end{array}$$

The last term is a variance computation: since $\int\left(x-\mu_{1}\right)^{2} p_{1}(x) d x=\sigma_{1}^{2}$, it reduces to $-\frac{1}{2}$. The expression above therefore becomes:

$$\begin{array}{l} =\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{1}{2 \sigma_{2}^{2}} \int\left(x-\mu_{2}\right)^{2} p_{1}(x) d x-\frac{1}{2} \\\\ =\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{1}{2 \sigma_{2}^{2}} \int\left(x-\mu_{1}+\mu_{1}-\mu_{2}\right)^{2} p_{1}(x) d x-\frac{1}{2} \\\\ =\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{1}{2 \sigma_{2}^{2}}\left[\int\left(x-\mu_{1}\right)^{2} p_{1}(x) d x+\int\left(\mu_{1}-\mu_{2}\right)^{2} p_{1}(x) d x+2 \int\left(x-\mu_{1}\right)\left(\mu_{1}-\mu_{2}\right) p_{1}(x) d x\right]-\frac{1}{2} \\\\ =\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{1}{2 \sigma_{2}^{2}}\left[\int\left(x-\mu_{1}\right)^{2} p_{1}(x) d x+\left(\mu_{1}-\mu_{2}\right)^{2}\right]-\frac{1}{2} \\\\ =\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}-\frac{1}{2} \end{array}$$

Here the cross term vanishes because $\int\left(x-\mu_{1}\right) p_{1}(x) d x=0$.

Now suppose $N_{2}$ is the standard normal distribution, i.e., $\mu_{2}=0, \sigma_{2}^{2}=1$. The KL divergence between $N_1$ and the standard normal is then:

$$KL\left(\mu_{1}, \sigma_{1}\right)=-\log \sigma_{1}+\frac{\sigma_{1}^{2}+\mu_{1}^{2}}{2}-\frac{1}{2}$$

From this expression we can see that the KL divergence reaches its minimum of zero when $\mu_{1}=0, \sigma_{1}^{2}=1$.
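This closed form can be checked numerically against PyTorch's `torch.distributions` (a minimal sketch; the values of `mu1` and `sigma1` are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

mu1, sigma1 = torch.tensor(0.5), torch.tensor(2.0)

# Closed form derived above: KL(N(mu1, sigma1^2) || N(0, 1))
kl_closed = -torch.log(sigma1) + (sigma1 ** 2 + mu1 ** 2) / 2 - 0.5

# Built-in KL between the same two Gaussians (Normal takes mean and std)
kl_builtin = kl_divergence(Normal(mu1, sigma1), Normal(0.0, 1.0))

print(kl_closed, kl_builtin)  # the two values should agree
```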

KL Divergence of Multivariate Gaussian Distributions

The density of a multivariate Gaussian distribution is:

$$p\left(x_{1}, x_{2}, \ldots, x_{n}\right)=\frac{1}{\sqrt{(2 \pi)^{n} \operatorname{det}(\Sigma)}} e^{-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)}$$

Since the individual dimensions are usually assumed to be independent, the covariance matrix is diagonal. The KL divergence between two $d$-dimensional Gaussians $p_1$ and $p_2$ is stated here without derivation:

$$KL(p_{1} \| p_{2})=\frac{1}{2}\left[\log \frac{\operatorname{det}\left(\Sigma_{2}\right)}{\operatorname{det}\left(\Sigma_{1}\right)}-d+\operatorname{tr}\left(\Sigma_{2}^{-1} \Sigma_{1}\right)+\left(\mu_{2}-\mu_{1}\right)^{T} \Sigma_{2}^{-1}\left(\mu_{2}-\mu_{1}\right)\right]$$
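This formula can likewise be verified with `torch.distributions.MultivariateNormal` (a sketch under arbitrary diagonal covariances; the specific numbers are illustrative):

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

d = 3
mu1, mu2 = torch.zeros(d), torch.ones(d)
# Diagonal covariances, matching the independence assumption above
sigma1 = torch.diag(torch.tensor([1.0, 2.0, 0.5]))
sigma2 = torch.eye(d)

# Manual evaluation of the closed-form expression
diff = mu2 - mu1
sigma2_inv = torch.inverse(sigma2)
kl_manual = 0.5 * (torch.logdet(sigma2) - torch.logdet(sigma1) - d
                   + torch.trace(sigma2_inv @ sigma1)
                   + diff @ sigma2_inv @ diff)

# Built-in KL between the same two Gaussians
p1 = MultivariateNormal(mu1, covariance_matrix=sigma1)
p2 = MultivariateNormal(mu2, covariance_matrix=sigma2)

print(kl_manual, kl_divergence(p1, p2))  # should agree
```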

Python Example

Suppose we have samples from two discrete distributions $p$ and $q$: the samples of $p$ are $\{1,1,2,2,3\}$ and the samples of $q$ are $\{1,1,1,1,1,2,3,3,3,3\}$.

The two sample sets differ in size, but both contain the three values 1, 2, and 3. The empirical probabilities are:

x=1x=1时,p(x=1)=25=0.4,q(x=1)=510=0.5p(x=1)=\frac{2}{5}=0.4,q(x=1)=\frac{5}{10}=0.5;

x=2x=2时,p(x=2)=25=0.4,q(x=2)=110=0.1p(x=2)=\frac{2}{5}=0.4,q(x=2)=\frac{1}{10}=0.1;

x=3x=3时,p(x=3)=15=0.2,q(x=3)=410=0.4p(x=3)=\frac{1}{5}=0.2,q(x=3)=\frac{4}{10}=0.4;

Substituting into the KL divergence formula:

$$D(P \| Q)=0.4 \log _{2} \frac{0.4}{0.5}+0.4 \log _{2} \frac{0.4}{0.1}+0.2 \log _{2} \frac{0.2}{0.4} \approx 0.47$$

PyTorch Implementation

The example above can be reproduced in PyTorch:

```python
In [1]: import torch

In [2]: p = torch.tensor([0.4, 0.4, 0.2], dtype=torch.float32)

In [3]: q = torch.tensor([0.5, 0.1, 0.4], dtype=torch.float32)

In [4]: (p * torch.log2(p / q)).sum()
Out[4]: tensor(0.4712)
```

Built-in Function

```python
torch.nn.functional.kl_div(q.log(), p, reduction='sum')
```

Note that the positions of $p$ and $q$ in this function are reversed relative to the formula: to compute $KL(p \| q)$ you must call `kl_div(q.log(), p)`. Also, the first argument must be given as log probabilities, hence `q.log()`.

```python
In [10]: p = torch.tensor([0.4, 0.4, 0.2], dtype=torch.float32)

In [11]: q = torch.tensor([0.5, 0.1, 0.4], dtype=torch.float32)

In [12]: torch.nn.functional.kl_div(q.log(), p, reduction='sum')
Out[12]: tensor(0.3266)
```

The result differs from the manual calculation, but both are correct: `kl_div` uses the natural logarithm (base $e$) rather than base 2, so the two results differ by a factor of $\ln 2$ ($0.4712 \times \ln 2 \approx 0.3266$).
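To confirm, evaluating the manual formula with the natural logarithm reproduces the `kl_div` output (a quick sketch):

```python
import torch

p = torch.tensor([0.4, 0.4, 0.2])
q = torch.tensor([0.5, 0.1, 0.4])

# Same manual KL computation as before, but with the natural log
print((p * torch.log(p / q)).sum())  # tensor(0.3266), matching kl_div
```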

Categorical KL Divergence

```python
import torch
import torch.nn.functional as F

# p_logit: [batch, class_num]
# q_logit: [batch, class_num]
def kl_categorical(p_logit, q_logit):
    # Turn logits into probabilities / log probabilities, then compute
    # KL(p || q) per sample and average over the batch
    p = F.softmax(p_logit, dim=-1)
    _kl = torch.sum(p * (F.log_softmax(p_logit, dim=-1)
                         - F.log_softmax(q_logit, dim=-1)), dim=1)
    return torch.mean(_kl)
```
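A quick illustrative call with random logits (the shapes follow the comments above):

```python
p_logit = torch.randn(4, 10)  # batch of 4 samples, 10 classes
q_logit = torch.randn(4, 10)

print(kl_categorical(p_logit, q_logit))  # scalar: mean KL over the batch
```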
