$$m(\alpha) \triangleq \langle\alpha, 1\rangle = \sum_i \alpha_i$$
OT
Probability vectors $(\alpha, \beta) \in \mathbb{R}_{+}^N \times \mathbb{R}_{+}^M$, satisfying $\sum_i \alpha_i=\sum_j \beta_j=1$.
Cost matrix $\mathrm{C} \in \mathbb{R}^{N \times M}$.
$$\mathrm{OT}(\alpha, \beta) \triangleq \inf_{\pi \geqslant 0,\ \pi_1=\alpha,\ \pi_2=\beta}\langle\pi, \mathrm{C}\rangle, \qquad \langle\pi, \mathrm{C}\rangle=\sum_{i, j} \pi_{i, j} \mathrm{C}_{i, j}$$
where $\left(\pi_1, \pi_2\right) \triangleq\left(\pi \mathbb{1}, \pi^{\top} \mathbb{1}\right)$ are the two marginals of the plan $\pi$.
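The definitions above can be sketched in a few lines of plain Python. This is only an illustration under assumed toy data: the helper names `marginals` and `transport_cost` and the numbers are hypothetical, and the plan used is the independent coupling $\alpha \otimes \beta$, which is feasible but generally not optimal, so its cost is only an upper bound on $\mathrm{OT}(\alpha, \beta)$.

```python
def marginals(pi):
    """Return (pi 1, pi^T 1): row sums and column sums of the plan."""
    rows = [sum(row) for row in pi]
    cols = [sum(row[j] for row in pi) for j in range(len(pi[0]))]
    return rows, cols


def transport_cost(pi, C):
    """Inner product <pi, C> = sum_{i,j} pi_ij * C_ij."""
    return sum(p * c for row_p, row_c in zip(pi, C) for p, c in zip(row_p, row_c))


# Hypothetical toy data (not from the source).
alpha = [0.5, 0.5]
beta = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]

# Independent coupling alpha (x) beta: satisfies both marginal constraints.
pi = [[a * b for b in beta] for a in alpha]
pi_1, pi_2 = marginals(pi)
cost = transport_cost(pi, C)
print(pi_1, pi_2, cost)
```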
Csiszár divergences
Entropy function $\varphi: \mathbb{R}_{+} \rightarrow \mathbb{R}_{+}$:
nonnegative, convex, lower semicontinuous, with $\varphi(1)=0$.
Define $\varphi_{\infty}^{\prime} \triangleq \lim_{x \rightarrow \infty} \frac{\varphi(x)}{x}$.
The Csiszár divergence is then
$$\mathrm{D}_{\varphi}(\mu \mid \nu) \triangleq \sum_{\nu_i>0} \varphi\left(\frac{\mu_i}{\nu_i}\right) \nu_i+\varphi_{\infty}^{\prime} \sum_{\nu_i=0} \mu_i$$
One instance is the KL divergence, obtained with
$$\varphi(x)=x \log x-x+1$$
$$\mathrm{KL}(\mu \mid \nu) \triangleq \sum_i\left[\log \left(\frac{\mu_i}{\nu_i}\right) \mu_i-\mu_i+\nu_i\right]$$
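A minimal sketch of the Csiszár divergence for nonnegative vectors, specialised to the KL entropy. The function names are hypothetical; the only facts used beyond the definitions are that $\varphi(0)=1$ by continuity and that $\varphi_\infty'=+\infty$ for $\varphi(x)=x\log x-x+1$.

```python
import math


def phi_kl(x):
    """Entropy phi(x) = x log x - x + 1, extended by continuity with phi(0) = 1."""
    return 1.0 if x == 0.0 else x * math.log(x) - x + 1.0


def csiszar_divergence(mu, nu, phi, phi_inf):
    """D_phi(mu | nu); phi_inf is the limit of phi(x)/x as x -> infinity."""
    total = 0.0
    for m, n in zip(mu, nu):
        if n > 0.0:
            total += phi(m / n) * n
        elif m > 0.0:
            # Mass of mu where nu vanishes (guarded to avoid inf * 0).
            total += phi_inf * m
    return total


def kl(mu, nu):
    """Generalised KL divergence between nonnegative vectors."""
    return csiszar_divergence(mu, nu, phi_kl, math.inf)


print(kl([1.0, 2.0], [1.0, 1.0]))   # equals 2*log(2) - 1
print(kl([0.0, 1.0], [1.0, 0.0]))   # inf: mu has mass where nu has none
```

Note that the vectors need not have the same total mass, which is what makes this form usable in the unbalanced setting.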
Legendre transform
Let $I \subset \mathbb{R}$ be an interval and $f:I\to \mathbb{R}$ a convex function. The Legendre transform of $f$ is $f^*:I^*\to \mathbb{R}$,
$$f^*\left(x^*\right)=\sup_{x\in I}\left(x^*x-f\left(x\right)\right),\quad x^*\in I^*$$
where $I^* = \left\{x^*\in \mathbb{R}: f^*\left(x^*\right)<\infty\right\}$.
Similarly, for a convex function $f:X\to \mathbb{R}$ defined on a convex set $X\subset \mathbb{R}^n$, the transform $f^*:X^*\to \mathbb{R}$ is
$$f^*\left(\mathbf{x}^*\right)=\sup_{\mathbf{x}\in X}\left(\langle \mathbf{x}^*,\mathbf{x}\rangle - f\left(\mathbf{x}\right)\right), \quad \mathbf{x}^*\in X^*$$
where $X^*=\left\{\mathbf{x}^* \in \mathbb{R}^n: \sup_{\mathbf{x} \in X}\left(\left\langle \mathbf{x}^*, \mathbf{x}\right\rangle-f(\mathbf{x})\right)<\infty\right\}$.
Properties
Separable sum
$$f\left(x_1, x_2\right)=g\left(x_1\right)+h\left(x_2\right) \quad f^*\left(y_1, y_2\right)=g^*\left(y_1\right)+h^*\left(y_2\right)$$
Scaling (for $\alpha > 0$)
$$\begin{array}{cc} f(x)=\alpha g(x) & f^*(y)=\alpha g^*(y / \alpha) \\ f(x)=\alpha g(x / \alpha) & f^*(y)=\alpha g^*(y) \end{array}$$
Translation
$$f(x)=g(x-b) \quad f^*(y)=b^T y+g^*(y)$$
Adding an affine function
$$f(x)=g(x)+a^T x+b \quad f^*(y)=g^*(y-a)-b$$
Invertible linear transformation
Let $A$ be a nonsingular square matrix.
$$f(x)=g(A x) \quad f^*(y)=g^*\left(A^{-T} y\right)$$
Examples
Indicator function
$$\iota_{\{1\}}(x)=\begin{cases}0, & x=1\\ +\infty, & \text{otherwise}\end{cases}$$
$$\iota_{\{1\}}^*(y)=y$$
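A short derivation of this conjugate, directly from the definition of the Legendre transform: every $x \neq 1$ makes the objective $-\infty$, so the supremum is attained at $x=1$.

$$\iota_{\{1\}}^*(y)=\sup_{x \in \mathbb{R}}\left(x y-\iota_{\{1\}}(x)\right)=1 \cdot y-0=y$$

In particular $-\iota_{\{1\}}^*(-f)=f$, which is the substitution that reduces the general dual functional to the balanced one.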
KL divergence
$$f\left(\mu\right)=\mathrm{KL}(\mu \mid \nu) = \sum_i\left[\log \left(\frac{\mu_i}{\nu_i}\right) \mu_i-\mu_i+\nu_i\right]$$
Proof: for fixed $\mathbf{y}$, maximize over $\mu$ the function
$$g\left(\mu\right)= \mathbf{y}^T\mu-\sum_i\left[\log \left(\frac{\mu_i}{\nu_i}\right) \mu_i-\mu_i+\nu_i\right]$$
$$\frac{\partial g}{\partial \mu_i}=y_i-\log\frac{\mu_i}{\nu_i}=0\Rightarrow \mu_i=\nu_i e^{y_i}$$
Therefore
$$f^*\left(\mathbf{y}\right)=\sum_i\left(\nu_i e^{y_i}-\nu_i\right)=\langle\nu, e^{\mathbf{y}}-1\rangle$$
Example 3
$$\varphi(x)=x \log x-x+1$$
$$\varphi^*\left(y\right) = e^y -1$$
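The closed form $\varphi^*(y)=e^y-1$ can be sanity-checked numerically. The sketch below approximates the supremum in the Legendre transform by a maximum over a uniform grid on $[0, 20]$; the grid range and spacing are arbitrary assumptions chosen so that the maximizer $x=e^y$ lies inside the grid for the tested values of $y$.

```python
import math


def phi(x):
    """phi(x) = x log x - x + 1, with phi(0) = 1 by continuity."""
    return 1.0 if x == 0.0 else x * math.log(x) - x + 1.0


def legendre_on_grid(f, y, xs):
    """Approximate f*(y) = sup_x (x*y - f(x)) by a maximum over grid points xs."""
    return max(x * y - f(x) for x in xs)


xs = [i * 1e-3 for i in range(20001)]   # uniform grid on [0, 20]
gaps = []
for y in (-1.0, 0.0, 1.0, 2.0):
    approx = legendre_on_grid(phi, y, xs)
    exact = math.exp(y) - 1.0
    gaps.append(abs(approx - exact))
    print(y, approx, exact)
```

Because the grid maximum never exceeds the true supremum, the approximation is always a slight underestimate.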
Unbalanced optimal transport
$$\begin{aligned} \mathrm{UOT}(\alpha, \beta) \triangleq \inf_{\pi \geqslant 0}\ &\langle\pi, \mathrm{C}\rangle+\varepsilon \mathrm{KL}(\pi \mid \alpha \otimes \beta) \\ &+\mathrm{D}_{\varphi_1}\left(\pi_1 \mid \alpha\right)+\mathrm{D}_{\varphi_2}\left(\pi_2 \mid \beta\right) \end{aligned}$$
where $m\left(\alpha\right)$ need not equal $m\left(\beta\right)$.
Dual
$$\operatorname{UOT}(\alpha, \beta)=\sup_{(f, g)} \mathcal{F}_{\varepsilon}(f, g)$$
where
$$\begin{aligned} \mathcal{F}_{\varepsilon}(f, g) \triangleq & \left\langle\alpha,-\varphi_1^*(-f)\right\rangle+\left\langle\beta,-\varphi_2^*(-g)\right\rangle \\ & -\varepsilon\left\langle\alpha \otimes \beta, e^{\frac{f \oplus g-\mathrm{C}}{\varepsilon}}-1\right\rangle \end{aligned}$$
Here $(f \oplus g)_{i,j} = f_i + g_j$, and $\varphi^*$ is the Legendre transform of $\varphi$, applied entrywise:
$$\varphi_1^*(-f) \triangleq \left(\varphi_1^*\left(-f_i\right)\right)_i \in \mathbb{R}^N$$
If $\varepsilon=0$, the last term becomes the constraint $f\oplus g \le \mathrm{C}$.
Sinkhorn algorithm
The Sinkhorn algorithm is one way to solve the dual problem of OT/UOT.
When $\mathrm{D}_\varphi= \rho \mathrm{KL}$, it converges linearly with rate $\left(1+\frac{\varepsilon}{\rho}\right)^{-1}$, which tends to $1$ when $\varepsilon \ll \rho$.
Translation-invariant formulation
When $\varphi_1(x) = \varphi_2(x)=\iota_{\{1\}}(x)$ (balanced OT),
$$\mathcal{F}_{\varepsilon}(f, g)=\langle\alpha, f\rangle+\langle\beta, g\rangle-\varepsilon\left\langle\alpha \otimes \beta, e^{\frac{f \oplus g-\mathrm{C}}{\varepsilon}}-1\right\rangle$$
For any constant $\lambda \in \mathbb{R}$,
$$\mathcal{F}_{\varepsilon}(f+\lambda, g-\lambda)=\mathcal{F}_{\varepsilon}(f, g) +\lambda m\left(\alpha\right)-\lambda m\left(\beta\right)=\mathcal{F}_{\varepsilon}(f, g)$$
since $m(\alpha)=m(\beta)$ in the balanced case.
This invariance does not hold for general UOT.
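The contrast between the balanced and unbalanced dual can be checked numerically. The sketch below evaluates $\mathcal{F}_\varepsilon$ for the indicator entropy and for $\mathrm{D}_\varphi=\rho\,\mathrm{KL}$ (using $\varphi^*(y)=\rho(e^{y/\rho}-1)$), then shifts the potentials by $\lambda$. All function names and the toy data are hypothetical; the weights are chosen with $m(\alpha)=m(\beta)=1$ so that the balanced case applies.

```python
import math


def exp_term(alpha, beta, C, f, g, eps):
    """eps * <alpha (x) beta, exp((f (+) g - C) / eps) - 1>."""
    return eps * sum(
        a * b * (math.exp((f[i] + g[j] - C[i][j]) / eps) - 1.0)
        for i, a in enumerate(alpha) for j, b in enumerate(beta))


def dual_balanced(alpha, beta, C, f, g, eps):
    """F_eps with phi_1 = phi_2 = indicator of {1}, so that -phi*(-f) = f."""
    linear = sum(a * x for a, x in zip(alpha, f)) + sum(b * x for b, x in zip(beta, g))
    return linear - exp_term(alpha, beta, C, f, g, eps)


def dual_kl(alpha, beta, C, f, g, eps, rho):
    """F_eps with D_phi = rho * KL, so that phi*(y) = rho * (exp(y / rho) - 1)."""
    tf = sum(-a * rho * (math.exp(-x / rho) - 1.0) for a, x in zip(alpha, f))
    tg = sum(-b * rho * (math.exp(-x / rho) - 1.0) for b, x in zip(beta, g))
    return tf + tg - exp_term(alpha, beta, C, f, g, eps)


# Hypothetical toy data with m(alpha) = m(beta) = 1.
alpha, beta = [0.5, 0.5], [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
f, g = [0.1, -0.2], [0.3, 0.0]
eps, rho, lam = 0.5, 1.0, 0.7
f_shift = [x + lam for x in f]
g_shift = [x - lam for x in g]

diff_balanced = (dual_balanced(alpha, beta, C, f_shift, g_shift, eps)
                 - dual_balanced(alpha, beta, C, f, g, eps))
diff_kl = (dual_kl(alpha, beta, C, f_shift, g_shift, eps, rho)
           - dual_kl(alpha, beta, C, f, g, eps, rho))
print(diff_balanced, diff_kl)
```

The first difference vanishes up to rounding, while the second does not: the KL penalty terms change under the shift even though the exponential term is unchanged.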
Let $\left(f^\star,g^\star\right)$ be the maximizer of $\mathcal{F}_{\varepsilon}$, and suppose the Sinkhorn algorithm is initialized at $f_0=f^{\star}+\tau$, $\tau \in\mathbb{R}$.
For $\mathrm{D}_\varphi=\rho \mathrm{KL}$ one gets $f_t = f^\star + \left(\frac{\rho}{\varepsilon + \rho}\right)^{2t}\tau$: the iterates are sensitive to translation, and the error decays very slowly when $\varepsilon \ll \rho$.
To address this, the authors propose
$$\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g}) \triangleq \sup_{\lambda \in \mathbb{R}} \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda)$$
which satisfies $\mathcal{H}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda)=\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})$.
For $\mathrm{UOT}\left(\alpha,\beta\right)$, $\mathcal{F}_{\varepsilon}(f, g)$ and $\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})$ yield the same solutions, related by
$$\begin{aligned} (f, g)&=\left(\bar{f}+\lambda^{\star}(\bar{f}, \bar{g}),\ \bar{g}-\lambda^{\star}(\bar{f}, \bar{g})\right), \\ \text{where}\quad \lambda^{\star}(\bar{f}, \bar{g}) &\triangleq \underset{\lambda \in \mathbb{R}}{\operatorname{argmax}}\ \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda). \end{aligned}$$
The authors assume that $\varphi^*$ is strictly convex, so that $\lambda^\star$ is unique.
Properties of $\mathcal{H}_\varepsilon$
Recall
$$m(\alpha) \triangleq\langle\alpha, 1\rangle=\sum_i \alpha_i$$
$$\lambda^{\star}(\bar{f}, \bar{g}) \triangleq \underset{\lambda \in \mathbb{R}}{\operatorname{argmax}}\ \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda)$$
Property 1
Let $\varphi_1^*,\varphi_2^*$ be smooth and strictly convex.
Then there is a unique maximizer $\lambda^{\star}(\bar{f}, \bar{g})$.
Moreover, $(\tilde{\alpha}, \tilde{\beta})=\nabla \mathcal{H}_0(\bar{f}, \bar{g})$ satisfies
$$\tilde{\alpha}=\nabla \varphi_1^*\left(-\bar{f}-\lambda^\star(\bar{f}, \bar{g})\right) \alpha, \qquad \tilde{\beta} = \nabla \varphi_2^*\left(-\bar{g}+\lambda^{\star}(\bar{f}, \bar{g})\right) \beta$$
and $m(\tilde{\alpha})=m(\tilde{\beta})$.
(Here $\varphi^*$ is a scalar function applied entrywise, so $\nabla \varphi^*$ should be read as a diagonal matrix: $\nabla\varphi^*\left(f\right) = \mathrm{diag}\left(\left(\varphi^*\right)^{\prime}\left(f_1\right), \left(\varphi^*\right)^{\prime}\left(f_2\right), \dots, \left(\varphi^*\right)^{\prime}\left(f_n\right)\right)$.)
Proof:
For any $(\bar{f}, \bar{g})$, define $\mathcal{G}_{\varepsilon}(\lambda) \triangleq \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda)$.
Following [Liero et al., 2015], $\lim\limits_{x\to \infty}\varphi\left(x\right)=+\infty$, so $\mathcal{G}_{\varepsilon}\to -\infty$ as $\lambda \to \pm \infty$, i.e. $-\mathcal{G}_{\varepsilon}$ is coercive.
Hence $\mathcal{G}_{\varepsilon}$ attains its global maximum on $\mathbb{R}$.
Since $\varphi^*$ is strictly convex, $\mathcal{G}_{\varepsilon}$ is strictly concave and the maximizer is unique.
Differentiating $\mathcal{G}_\varepsilon$ (the $\varepsilon$-term does not depend on $\lambda$, because $(\bar{f}+\lambda)\oplus(\bar{g}-\lambda)=\bar{f}\oplus\bar{g}$) gives, at $\lambda=\lambda^\star$,
$$\begin{aligned} &\frac{\mathrm{d}\mathcal{G}_\varepsilon}{\mathrm{d}\lambda}=\left\langle\alpha, \nabla\varphi_1^*\left(-\bar{f}-\lambda\right)\right\rangle-\left\langle\beta, \nabla \varphi_2^*(-\bar{g}+\lambda)\right\rangle=0\\ &\Rightarrow \left\langle\alpha, \nabla\varphi_1^*\left(-\bar{f}-\lambda\right)\right\rangle = \left\langle\beta, \nabla \varphi_2^*(-\bar{g}+\lambda)\right\rangle\\ &\Rightarrow\langle \tilde{\alpha}, 1\rangle = \langle\tilde{\beta}, 1\rangle\\ &\Rightarrow m(\tilde{\alpha})=m(\tilde{\beta}) \end{aligned}$$
Property 2
Let $\mathrm{D}_{\varphi_i} = \rho_i \mathrm{KL}$. Then
$$\lambda^{\star}(\bar{f}, \bar{g})=\frac{\rho_1 \rho_2}{\rho_1+\rho_2} \log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right]$$
Proof: in this case $\left(\varphi_i^*\right)^{\prime}(x)=e^{x/\rho_i}$, so the first-order condition on $\lambda^\star$ reads
$$\begin{aligned} & \left\langle\alpha, e^{-\frac{\bar{f}+\lambda^{\star}}{\rho_1}}\right\rangle=\left\langle\beta, e^{-\frac{\bar{g}-\lambda^{\star}}{\rho_2}}\right\rangle, \\ & \Leftrightarrow e^{-\frac{\lambda^{\star}}{\rho_1}}\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle=e^{+\frac{\lambda^{\star}}{\rho_2}}\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle, \\ & \Leftrightarrow-\frac{\lambda^{\star}}{\rho_1}+\log \left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle=\frac{\lambda^{\star}}{\rho_2}+\log \left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle, \\ & \Leftrightarrow \lambda^{\star}\left(\frac{1}{\rho_1}+\frac{1}{\rho_2}\right)=\log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right], \\ & \Leftrightarrow \lambda^{\star}(\bar{f}, \bar{g})=\frac{\rho_1 \rho_2}{\rho_1+\rho_2} \log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right] . \end{aligned}$$
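The closed form for $\lambda^\star$ is easy to check numerically: after shifting by $\lambda^\star$, the two exponential masses in the first line of the proof must coincide. The sketch below uses hypothetical helper names and arbitrary toy data with $m(\alpha) \neq m(\beta)$.

```python
import math


def weighted_exp_sum(weights, potential, rho):
    """<w, exp(-potential / rho)>."""
    return sum(w * math.exp(-p / rho) for w, p in zip(weights, potential))


def lambda_star(alpha, beta, f_bar, g_bar, rho1, rho2):
    """Optimal shift for D_phi_i = rho_i * KL (closed form of Property 2)."""
    a = weighted_exp_sum(alpha, f_bar, rho1)
    b = weighted_exp_sum(beta, g_bar, rho2)
    return rho1 * rho2 / (rho1 + rho2) * math.log(a / b)


# Hypothetical toy data; the total masses differ on purpose.
alpha, beta = [0.3, 0.9], [0.5, 0.2, 0.4]
f_bar, g_bar = [0.1, -0.4], [0.2, 0.0, -0.3]
rho1, rho2 = 1.0, 2.0

lam = lambda_star(alpha, beta, f_bar, g_bar, rho1, rho2)
mass_a = weighted_exp_sum(alpha, [x + lam for x in f_bar], rho1)
mass_b = weighted_exp_sum(beta, [x - lam for x in g_bar], rho2)
print(lam, mass_a, mass_b)
```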
Property 3
With $\mathrm{D}_{\varphi_i} = \rho_i \mathrm{KL}$ as in Property 2, let $\tau_1 = \frac{\rho_1}{\rho_1 + \rho_2}$ and $\tau_2 = \frac{\rho_2}{\rho_1 + \rho_2}$.
Then
$$\begin{aligned} \mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})=\ &\rho_1 m(\alpha)+\rho_2 m(\beta)-\varepsilon\left\langle\alpha \otimes \beta, e^{\frac{\bar{f} \oplus \bar{g}-\mathrm{C}}{\varepsilon}}-1\right\rangle \\ &-\left(\rho_1+\rho_2\right)\left(\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle\right)^{\tau_1}\left(\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle\right)^{\tau_2} \end{aligned}$$
In particular, when $\rho_1=\rho_2=\rho$ and $\varepsilon=0$,
$$\mathcal{H}_0(\bar{f}, \bar{g})=\rho\left[m(\alpha)+m(\beta)-2 \sqrt{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho}}\right\rangle\left\langle\beta, e^{-\frac{\bar{g}}{\rho}}\right\rangle}\right]$$
Proof:
$$\varphi_i(x)=\rho_i\left(x \log x-x+1\right)$$
$$\varphi_i^*\left(x\right)=\rho_i\left(e^{\frac{x}{\rho_i}}-1\right)$$
$$\begin{aligned} \mathcal{F}_{0}(\bar{f}, \bar{g}) & =\left\langle\alpha,-\rho_1\left(e^{-\frac{\bar{f}}{\rho_1}}-1\right)\right\rangle+\left\langle\beta,-\rho_2\left(e^{-\frac{\bar{g}}{\rho_2}}-1\right)\right\rangle \\ & =\rho_1 m(\alpha)+\rho_2 m(\beta)-\rho_1\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle-\rho_2\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle \end{aligned}$$
By Property 2,
$$\begin{aligned} \left\langle\alpha, e^{-\frac{\bar{f}+\lambda^{\star}(\bar{f}, \bar{g})}{\rho_1}}\right\rangle & =\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle \cdot \exp \left(-\frac{\rho_2}{\rho_1+\rho_2} \log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right]\right) \\ & =\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle \cdot\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle^{-\frac{\rho_2}{\rho_1+\rho_2}} \cdot\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle^{\frac{\rho_2}{\rho_1+\rho_2}} \\ & =\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle^{\frac{\rho_1}{\rho_1+\rho_2}} \cdot\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle^{\frac{\rho_2}{\rho_1+\rho_2}} \end{aligned}$$
Similarly,
$$\left\langle\beta, e^{-\frac{\bar{g}-\lambda^{\star}(\bar{f}, \bar{g})}{\rho_2}}\right\rangle = \left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle^{\frac{\rho_1}{\rho_1+\rho_2}} \cdot\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle^{\frac{\rho_2}{\rho_1+\rho_2}}$$
Substituting both identities into $\mathcal{F}_{\varepsilon}\left(\bar{f}+\lambda^{\star}, \bar{g}-\lambda^{\star}\right)$ gives the result.
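As a numerical cross-check of Property 3, the sketch below compares the closed form of $\mathcal{H}_\varepsilon$ with $\mathcal{F}_\varepsilon$ evaluated at the shift $\lambda^\star$ from Property 2, and verifies the translation invariance. It assumes $\varepsilon>0$ and $\mathrm{D}_{\varphi_i}=\rho_i\mathrm{KL}$; function names and data are hypothetical.

```python
import math


def wexp(w, p, rho):
    """<w, exp(-p / rho)>."""
    return sum(wi * math.exp(-pi / rho) for wi, pi in zip(w, p))


def exp_term(alpha, beta, C, f, g, eps):
    """eps * <alpha (x) beta, exp((f (+) g - C) / eps) - 1>."""
    return eps * sum(
        a * b * (math.exp((f[i] + g[j] - C[i][j]) / eps) - 1.0)
        for i, a in enumerate(alpha) for j, b in enumerate(beta))


def dual_kl(alpha, beta, C, f, g, eps, rho1, rho2):
    """F_eps(f, g) for D_phi_i = rho_i * KL."""
    return (rho1 * (sum(alpha) - wexp(alpha, f, rho1))
            + rho2 * (sum(beta) - wexp(beta, g, rho2))
            - exp_term(alpha, beta, C, f, g, eps))


def lambda_star(alpha, beta, f, g, rho1, rho2):
    """Closed-form optimal shift (Property 2)."""
    return rho1 * rho2 / (rho1 + rho2) * math.log(
        wexp(alpha, f, rho1) / wexp(beta, g, rho2))


def h_closed_form(alpha, beta, C, f, g, eps, rho1, rho2):
    """H_eps(f, g) from the closed form of Property 3."""
    t1, t2 = rho1 / (rho1 + rho2), rho2 / (rho1 + rho2)
    a, b = wexp(alpha, f, rho1), wexp(beta, g, rho2)
    return (rho1 * sum(alpha) + rho2 * sum(beta)
            - exp_term(alpha, beta, C, f, g, eps)
            - (rho1 + rho2) * a ** t1 * b ** t2)


# Hypothetical toy data.
alpha, beta = [0.3, 0.9], [0.5, 0.2, 0.4]
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
f_bar, g_bar = [0.1, -0.4], [0.2, 0.0, -0.3]
eps, rho1, rho2, mu = 0.5, 1.0, 2.0, 1.3

lam = lambda_star(alpha, beta, f_bar, g_bar, rho1, rho2)
h_via_f = dual_kl(alpha, beta, C, [x + lam for x in f_bar],
                  [x - lam for x in g_bar], eps, rho1, rho2)
h_direct = h_closed_form(alpha, beta, C, f_bar, g_bar, eps, rho1, rho2)
h_shifted = h_closed_form(alpha, beta, C, [x + mu for x in f_bar],
                          [x - mu for x in g_bar], eps, rho1, rho2)
print(h_via_f, h_direct, h_shifted)
```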
Translation invariant Sinkhorn
Unbalanced Sinkhorn
For any initialization $f_0$,
$$\begin{aligned} & g_{t+1}(y)=-\operatorname{aprox}_{\varphi_2^*}\left(-\operatorname{Smin}_{\varepsilon}^\alpha\left(\mathrm{C}(\cdot, y)-f_t\right)\right) \\ & f_{t+1}(x)=-\operatorname{aprox}_{\varphi_1^*}\left(-\operatorname{Smin}_{\varepsilon}^\beta\left(\mathrm{C}(x, \cdot)-g_{t+1}\right)\right) \end{aligned}$$
where the softmin is defined as $\operatorname{Smin}_{\varepsilon}^\alpha(f) \triangleq-\varepsilon \log \left\langle\alpha, e^{-f / \varepsilon}\right\rangle$
and the anisotropic prox is
$$\operatorname{aprox}_{\varphi^*}(x) \triangleq \arg \min_{y \in \mathbb{R}}\ \varepsilon e^{\frac{x-y}{\varepsilon}}+\varphi^*(y)$$
If $\mathrm{D}_\varphi = \rho \mathrm{KL}$, then $\operatorname{aprox}_{\varphi^*}(x)=\frac{\rho}{\varepsilon+\rho} x$.
In the norm $\|\cdot\|_\infty$, the softmin is non-expansive (1-Lipschitz) and $\operatorname{aprox}_{\varphi^*}$ is $\left(1+\frac{\varepsilon}{\rho}\right)^{-1}$-contractive.
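A minimal sketch of these updates for $\mathrm{D}_{\varphi_1}=\mathrm{D}_{\varphi_2}=\rho\,\mathrm{KL}$, where $-\operatorname{aprox}(-x)=\frac{\rho}{\varepsilon+\rho}x$. It also checks the translation sensitivity stated earlier: starting from $f^\star+\tau$, the error after $t$ iterations is $\left(\frac{\rho}{\varepsilon+\rho}\right)^{2t}\tau$. The function names, toy data, and the iteration count used to approximate $f^\star$ are assumptions made for illustration.

```python
import math


def softmin(weights, values, eps):
    """Smin_eps^w(values) = -eps * log <w, exp(-values / eps)>."""
    return -eps * math.log(sum(w * math.exp(-v / eps) for w, v in zip(weights, values)))


def sinkhorn_step(alpha, beta, C, f, eps, rho):
    """One (g, f) update of unbalanced Sinkhorn with a rho * KL penalty on both sides."""
    k = rho / (eps + rho)   # slope of the anisotropic prox for rho * KL
    n, m = len(alpha), len(beta)
    g = [k * softmin(alpha, [C[i][j] - f[i] for i in range(n)], eps) for j in range(m)]
    f_new = [k * softmin(beta, [C[i][j] - g[j] for j in range(m)], eps) for i in range(n)]
    return f_new, g


# Hypothetical toy data.
alpha, beta = [0.4, 0.8], [0.5, 0.3, 0.6]
C = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0]]
eps, rho, tau = 0.5, 1.0, 3.0

# Approximate the fixed point f* by iterating long enough.
f_star = [0.0, 0.0]
for _ in range(200):
    f_star, g_star = sinkhorn_step(alpha, beta, C, f_star, eps, rho)

# Restart from a translated optimum and record the error f_t - f*.
f = [x + tau for x in f_star]
errors = []
for t in range(1, 6):
    f, g = sinkhorn_step(alpha, beta, C, f, eps, rho)
    errors.append(f[0] - f_star[0])
print(errors)
```

With $\varepsilon=0.5$ and $\rho=1$ the error shrinks by a factor $(2/3)^2$ per iteration; for $\varepsilon \ll \rho$ this factor approaches $1$, which is the slowdown that motivates the translation-invariant variant.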
TI-Sinkhorn
TI-Sinkhorn performs alternating dual ascent on $\mathcal{H}_\varepsilon$.
Let
$$\Psi_1\left(\bar{g}\right)\triangleq\arg\max_{\bar{f}}\ \mathcal{H}_{\varepsilon}\left(\bar{f},\bar{g}\right), \qquad \Psi_2\left(\bar{f}\right)\triangleq\arg\max_{\bar{g}}\ \mathcal{H}_{\varepsilon}\left(\bar{f}, \bar{g}\right)$$
The TI-Sinkhorn iteration is
$$\bar{g}_{t+1}=\Psi_2\left(\bar{f}_t\right), \qquad \bar{f}_{t+1}=\Psi_1\left(\bar{g}_{t+1}\right)$$
and finally
$$\left(f_t,g_t\right)\triangleq \left(\bar{f}_t + \lambda^\star\left(\bar{f}_t,\bar{g}_t\right),\ \bar{g}_t - \lambda^\star\left(\bar{f}_t,\bar{g}_t\right)\right)$$
This algorithm inherits the invariance of $\mathcal{H}_\varepsilon$:
$$\Psi_2\left(\bar{f}_t+\mu\right)=\Psi_2\left(\bar{f}_t\right)-\mu$$
That is, if $\bar{f}_t$ is replaced by $\bar{f}_t+\mu$, then $\bar{g}_{t+1}$ becomes $\bar{g}_{t+1}-\mu$.
Property 4: for fixed $\left(\bar{f},\bar{g}\right)$, let
$$\hat{f}\triangleq \operatorname{Smin}_\varepsilon^{\alpha}\left(\mathrm{C}\left(\cdot,y\right)-\bar{f}\right), \qquad \hat{g}\triangleq \operatorname{Smin}_\varepsilon^{\beta}\left(\mathrm{C}\left(x,\cdot\right)-\bar{g}\right)$$
$$\psi_1 \triangleq \Psi_1(\bar{g}), \qquad \psi_2 \triangleq \Psi_2(\bar{f})$$
Then
$$\begin{aligned} & \psi_1=-\operatorname{aprox}_{\varphi_1^*}\left(-\hat{g}-\lambda^{\star}\left(\psi_1, \bar{g}\right)\right)-\lambda^{\star}\left(\psi_1, \bar{g}\right), \\ & \psi_2=-\operatorname{aprox}_{\varphi_2^*}\left(-\hat{f}+\lambda^{\star}\left(\bar{f}, \psi_2\right)\right)+\lambda^{\star}\left(\bar{f}, \psi_2\right) . \end{aligned}$$
Proof:
$$\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})=\mathcal{F}_{\varepsilon}\left(\bar{f}+\lambda^{\star}(\bar{f}, \bar{g}), \bar{g}-\lambda^{\star}(\bar{f}, \bar{g})\right)$$
Setting the derivative with respect to $\bar{g}$ to zero (the contribution of $\partial \lambda^\star$ vanishes because $\lambda^\star$ maximizes $\mathcal{F}_\varepsilon$ along the shift direction) gives
$$\begin{aligned} \beta \odot e^{\bar{g} / \varepsilon}\left\langle\alpha, e^{(\bar{f}-\mathrm{C}) / \varepsilon}\right\rangle &=\beta \odot\nabla \varphi_2^*\left(-\bar{g}+\lambda^{\star}(\bar{f}, \bar{g})\right)\\ e^{\bar{g} / \varepsilon}\left\langle\alpha, e^{(\bar{f}-\mathrm{C}) / \varepsilon}\right\rangle &=\nabla \varphi_2^*\left(-\bar{g}+\lambda^{\star}(\bar{f}, \bar{g})\right) \end{aligned}$$
Likewise, setting the derivative with respect to $\bar{f}$ to zero gives
$$e^{\bar{f} / \varepsilon}\left\langle\beta, e^{(\bar{g}-\mathrm{C}) / \varepsilon}\right\rangle=\nabla \varphi_1^*\left(-\bar{f}-\lambda^{\star}(\bar{f}, \bar{g})\right)$$
Writing $g=\bar{g}-\lambda^{\star}(\bar{f}, \bar{g})$, the condition in $\bar{g}$ becomes
$$e^{g / \varepsilon}\left\langle\alpha, e^{\left(\bar{f}+\lambda^{\star}(\bar{f}, \bar{g})-\mathrm{C}\right) / \varepsilon}\right\rangle=\nabla \varphi_2^*(-g)$$
which is exactly the optimality condition defining $\operatorname{aprox}_{\varphi_2^*}$, so
$$g=\bar{g}-\lambda^{\star}(\bar{f}, \bar{g})=-\operatorname{aprox}_{\varphi_2^*}\left(-\operatorname{Smin}_{\varepsilon}^\alpha\left(\mathrm{C}-\bar{f}-\lambda^{\star}(\bar{f}, \bar{g})\right)\right)$$
Since $\operatorname{Smin}_{\varepsilon}^\alpha\left(\mathrm{C}-\bar{f}-\lambda^{\star}\right)=\hat{f}-\lambda^{\star}$ and $\bar{g}=\Psi_2(\bar{f})=\psi_2$, this is the claimed formula for $\psi_2$. The formula for $\psi_1$ follows in the same way from the condition in $\bar{f}$.
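For $\mathrm{D}_{\varphi_i}=\rho_i\mathrm{KL}$ the last identity of the proof, $\bar{g}-\lambda^\star=-\operatorname{aprox}_{\varphi_2^*}\left(-\operatorname{Smin}_\varepsilon^\alpha(\mathrm{C}-\bar{f}-\lambda^\star)\right)$, can be solved explicitly because $\operatorname{aprox}$ is linear and $\lambda^\star$ has a closed form. The sketch below is a derivation made for these notes under that KL assumption, not necessarily the closed form used by the authors; `ti_update_g` and the toy data are hypothetical. It computes $\bar{g}=\Psi_2(\bar{f})$ and checks the invariance $\Psi_2(\bar{f}+\mu)=\Psi_2(\bar{f})-\mu$.

```python
import math


def softmin(weights, values, eps):
    """Smin_eps^w(values) = -eps * log <w, exp(-values / eps)>."""
    return -eps * math.log(sum(w * math.exp(-v / eps) for w, v in zip(weights, values)))


def lambda_star(alpha, beta, f_bar, g_bar, rho1, rho2):
    """Closed-form optimal shift for rho_i * KL penalties."""
    a = sum(w * math.exp(-x / rho1) for w, x in zip(alpha, f_bar))
    b = sum(w * math.exp(-x / rho2) for w, x in zip(beta, g_bar))
    return rho1 * rho2 / (rho1 + rho2) * math.log(a / b)


def ti_update_g(alpha, beta, C, f_bar, eps, rho1, rho2):
    """Solve g_bar - lam = k * (f_hat - lam) together with lam = lambda_star(f_bar, g_bar)."""
    n, m = len(alpha), len(beta)
    k = rho2 / (eps + rho2)
    f_hat = [softmin(alpha, [C[i][j] - f_bar[i] for i in range(n)], eps) for j in range(m)]
    # Substituting g_bar = k * f_hat + (1 - k) * lam into lambda_star gives a
    # scalar linear equation in lam, solved here in closed form.
    s = rho1 * rho2 / (rho1 + rho2)
    log_a = math.log(sum(w * math.exp(-x / rho1) for w, x in zip(alpha, f_bar)))
    log_b = math.log(sum(w * math.exp(-k * x / rho2) for w, x in zip(beta, f_hat)))
    lam = s * (log_a - log_b) / (1.0 - s * (1.0 - k) / rho2)
    return [k * x + (1.0 - k) * lam for x in f_hat], lam


# Hypothetical toy data.
alpha, beta = [0.4, 0.8], [0.5, 0.3, 0.6]
C = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0]]
f_bar = [0.2, -0.1]
eps, rho1, rho2, mu = 0.5, 1.0, 2.0, 0.8

g_bar, lam = ti_update_g(alpha, beta, C, f_bar, eps, rho1, rho2)
g_shift, lam_shift = ti_update_g(alpha, beta, C, [x + mu for x in f_bar], eps, rho1, rho2)
print(g_bar, g_shift)
```

Shifting $\bar{f}$ by $\mu$ moves the output by exactly $-\mu$, in contrast with the plain unbalanced Sinkhorn update, where a shift of the input is only damped by the factor $\frac{\rho}{\varepsilon+\rho}$.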
Remaining sections
Still reading; to be continued.