Reading notes: Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1-D Frank-Wolfe

$m(\alpha) \triangleq \langle\alpha, \mathbb{1}\rangle=\sum_i \alpha_i$

OT

Probability vectors $(\alpha, \beta) \in \mathbb{R}_{+}^N \times \mathbb{R}_{+}^M$ satisfy $\sum_i \alpha_i=\sum_j \beta_j=1$.

Given a cost matrix $\mathrm{C} \in \mathbb{R}^{N \times M}$, the optimal transport cost is
$$\mathrm{OT}(\alpha, \beta) \triangleq \inf _{\pi \geqslant 0,\ \pi_1=\alpha,\ \pi_2=\beta}\langle\pi, \mathrm{C}\rangle, \qquad \langle\pi, \mathrm{C}\rangle=\sum_{i, j} \pi_{i, j} \mathrm{C}_{i, j},$$
where $\left(\pi_1, \pi_2\right) \triangleq\left(\pi \mathbb{1}, \pi^{\top} \mathbb{1}\right)$ are the two marginals of $\pi$.
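The constraint set and the objective can be made concrete with a small sketch. The helper names (`marginals`, `transport_cost`) and the 2x2 data are illustrative assumptions, not taken from the paper; the code only evaluates two feasible couplings and does not solve the OT problem.

```python
import math

def marginals(pi):
    """Return (pi @ 1, pi^T @ 1) for a coupling stored as a list of rows."""
    pi1 = [sum(row) for row in pi]
    pi2 = [sum(row[j] for row in pi) for j in range(len(pi[0]))]
    return pi1, pi2

def transport_cost(pi, C):
    """Inner product <pi, C> = sum_ij pi_ij * C_ij."""
    return sum(p * c for prow, crow in zip(pi, C) for p, c in zip(prow, crow))

alpha = [0.5, 0.5]
beta = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]

pi_product = [[a * b for b in beta] for a in alpha]  # independent coupling alpha (x) beta
pi_better = [[0.25, 0.25], [0.0, 0.5]]               # keeps more mass on the zero-cost diagonal

for pi in (pi_product, pi_better):
    pi1, pi2 = marginals(pi)
    feasible = all(math.isclose(x, y) for x, y in zip(pi1 + pi2, alpha + beta))
    print(feasible, transport_cost(pi, C))
```

Both couplings satisfy the marginal constraints, but they have different costs, which is why the definition takes an infimum over all feasible $\pi$.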

Csiszár divergences

An entropy function is a map $\varphi: \mathbb{R}_{+} \rightarrow \mathbb{R}_{+}$ that is nonnegative, convex, lower semicontinuous, and satisfies $\varphi(1)=0$.

Define $\varphi_{\infty}^{\prime} \triangleq \lim _{x \rightarrow \infty} \frac{\varphi(x)}{x}$.

The Csiszár divergence is then
$$\mathrm{D}_{\varphi}(\mu \mid \nu) \triangleq \sum_{\nu_i>0} \varphi\left(\frac{\mu_i}{\nu_i}\right) \nu_i+\varphi_{\infty}^{\prime} \sum_{\nu_i=0} \mu_i .$$
One instance is the KL divergence, obtained with

$$\varphi(x)=x \log x-x+1,$$
$$\mathrm{KL}(\mu \mid \nu) \triangleq \sum_i\left[\log \left(\frac{\mu_i}{\nu_i}\right) \mu_i-\mu_i+\nu_i\right].$$
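As a sanity check, the general two-sum definition can be evaluated with the KL entropy and compared against the closed-form KL expression. This is a minimal sketch: the function names are illustrative, and $\varphi(0)=1$ and $\varphi'_\infty=+\infty$ are the standard values for the KL entropy, assumed here rather than stated in the notes.

```python
import math

def phi_kl(x):
    """KL entropy phi(x) = x log x - x + 1, extended by continuity with phi(0) = 1."""
    return 1.0 if x == 0.0 else x * math.log(x) - x + 1.0

def csiszar_divergence(mu, nu, phi, phi_inf):
    """D_phi(mu | nu) following the two-sum definition."""
    total = 0.0
    for m, n in zip(mu, nu):
        if n > 0.0:
            total += phi(m / n) * n
        elif m > 0.0:
            total += phi_inf * m  # mass of mu outside the support of nu
    return total

def kl_direct(mu, nu):
    """Closed-form KL for nu > 0, with the convention 0 log 0 = 0."""
    return sum((m * math.log(m / n) if m > 0.0 else 0.0) - m + n
               for m, n in zip(mu, nu))

mu = [0.2, 0.5, 0.0]
nu = [0.3, 0.3, 0.4]
print(csiszar_divergence(mu, nu, phi_kl, math.inf), kl_direct(mu, nu))
```

The two printed values agree up to rounding, and the divergence becomes infinite as soon as $\mu$ puts mass where $\nu$ vanishes.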

Legendre transform

Let $I \subset \mathbb{R}$ be an interval and $f:I\to \mathbb{R}$ a convex function. The Legendre transform of $f$ is $f^*:I^*\to \mathbb{R}$,
$$f^*\left(x^*\right)=\sup_{x\in I}\left(x^*x-f\left(x\right)\right),\quad x^*\in I^*,$$
where $I^* = \left\{x^*\in \mathbb{R}: \sup_{x\in I}\left(x^*x-f(x)\right)<\infty\right\}$.

Similarly, for a convex function $f:X\to \mathbb{R}$ defined on a convex set $X\subset \mathbb{R}^n$, the transform $f^*:X^*\to \mathbb{R}$ is
$$f^*\left(\mathbf{x}^*\right)=\sup_{\mathbf{x}\in X}\left(\langle \mathbf{x}^*,\mathbf{x}\rangle - f\left(\mathbf{x}\right)\right),\quad \mathbf{x}^*\in X^*,$$
where $X^*=\left\{\mathbf{x}^* \in \mathbb{R}^n: \sup_{\mathbf{x} \in X}\left(\left\langle \mathbf{x}^*, \mathbf{x}\right\rangle-f(\mathbf{x})\right)<\infty\right\}$.

Properties

Separable sum

$$f\left(x_1, x_2\right)=g\left(x_1\right)+h\left(x_2\right) \quad\Longrightarrow\quad f^*\left(y_1, y_2\right)=g^*\left(y_1\right)+h^*\left(y_2\right)$$

Scaling (with $\alpha>0$)

$$\begin{array}{ll} f(x)=\alpha g(x) & \Longrightarrow\quad f^*(y)=\alpha g^*(y / \alpha) \\ f(x)=\alpha g(x / \alpha) & \Longrightarrow\quad f^*(y)=\alpha g^*(y) \end{array}$$

Translation

$$f(x)=g(x-b) \quad\Longrightarrow\quad f^*(y)=b^T y+g^*(y)$$

Adding an affine function

$$f(x)=g(x)+a^T x+b \quad\Longrightarrow\quad f^*(y)=g^*(y-a)-b$$

Invertible linear transformation

With $A$ a nonsingular square matrix,
$$f(x)=g(A x) \quad\Longrightarrow\quad f^*(y)=g^*\left(A^{-T} y\right)$$
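The one-dimensional rules above can be checked numerically. The sketch below approximates the supremum by ternary search on a bounded interval and uses $g(x)=x^2/2$, whose conjugate is $g^*(y)=y^2/2$. The helper `conjugate`, the interval $[-50, 50]$ and the constants are illustrative assumptions.

```python
def conjugate(f, y, lo=-50.0, hi=50.0, iters=200):
    """Approximate f*(y) = sup_x (x*y - f(x)) for convex f by ternary search on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if a * y - f(a) < b * y - f(b):
            lo = a  # the maximizer lies to the right of a
        else:
            hi = b  # the maximizer lies to the left of b
    x = 0.5 * (lo + hi)
    return x * y - f(x)

def g(x):
    return 0.5 * x * x

def g_star(y):
    return 0.5 * y * y

alpha, b, a, c, y = 3.0, 1.5, 0.5, 2.0, 2.0

# Each entry pairs the numerical conjugate with the value predicted by the rule.
checks = {
    "f = alpha*g(x)":       (conjugate(lambda x: alpha * g(x), y), alpha * g_star(y / alpha)),
    "f = alpha*g(x/alpha)": (conjugate(lambda x: alpha * g(x / alpha), y), alpha * g_star(y)),
    "f = g(x - b)":         (conjugate(lambda x: g(x - b), y), b * y + g_star(y)),
    "f = g(x) + a*x + c":   (conjugate(lambda x: g(x) + a * x + c, y), g_star(y - a) - c),
}
for name, (numeric, predicted) in checks.items():
    print(name, numeric, predicted)
```

The search only works because every maximizer in this example lies well inside the chosen interval; for other functions the bounds would need adjusting.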

Examples

Indicator function

$$\iota_{\{1\}}(x)=\begin{cases}0, & x=1\\ +\infty, & \text{otherwise}\end{cases}$$
$$\iota_{\{1\}}^*(y)=y$$

KL divergence

$$f\left(\mu\right)=\mathrm{KL}(\mu \mid \nu) = \sum_i\left[\log \left(\frac{\mu_i}{\nu_i}\right) \mu_i-\mu_i+\nu_i\right]$$

Proof: for a fixed $\mathbf{y}$, the conjugate is the supremum over $\mu$ of
$$g\left(\mu\right)= \mathbf{y}^T\mu-\sum_i\left[\log \left(\frac{\mu_i}{\nu_i}\right) \mu_i-\mu_i+\nu_i\right].$$

Setting the partial derivatives to zero,
$$\frac{\partial g}{\partial \mu_i}=y_i-\log\frac{\mu_i}{\nu_i}=0\ \Rightarrow\ \mu_i=\nu_i e^{y_i}.$$

Therefore
$$f^*\left(\mathbf{y}\right)=\sum_i\left(\nu_i e^{y_i}-\nu_i\right)=\langle\nu, e^{\mathbf{y}}-1\rangle.$$
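The stationary point $\mu_i=\nu_i e^{y_i}$ and the resulting closed form can be verified numerically. A minimal sketch with illustrative names and test vectors; it checks one point and two perturbations rather than performing a full maximization.

```python
import math

def kl(mu, nu):
    """KL(mu | nu) for strictly positive vectors."""
    return sum(m * math.log(m / n) - m + n for m, n in zip(mu, nu))

def objective(mu, y, nu):
    """<y, mu> - KL(mu | nu), the function maximized in the proof."""
    return sum(t * m for t, m in zip(y, mu)) - kl(mu, nu)

def kl_conjugate(y, nu):
    """Closed form <nu, exp(y) - 1>."""
    return sum(n * (math.exp(t) - 1.0) for t, n in zip(y, nu))

nu = [0.3, 0.7, 1.2]
y = [0.5, -1.0, 0.2]
mu_star = [n * math.exp(t) for t, n in zip(y, nu)]  # stationary point from the proof

print(objective(mu_star, y, nu), kl_conjugate(y, nu))
print(objective([1.1 * m for m in mu_star], y, nu))  # strictly smaller
print(objective([0.9 * m for m in mu_star], y, nu))  # strictly smaller
```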

Example 3

The scalar case of the KL computation (with $\nu=1$):
$$\varphi(x)=x \log x-x+1, \qquad \varphi^*\left(y\right) = e^y -1$$

Unbalanced optimal transport

$$\mathrm{UOT}(\alpha, \beta) \triangleq \inf _{\pi \geqslant 0}\ \langle\pi, \mathrm{C}\rangle+\varepsilon \mathrm{KL}(\pi \mid \alpha \otimes \beta) +\mathrm{D}_{\varphi_1}\left(\pi_1 \mid \alpha\right)+\mathrm{D}_{\varphi_2}\left(\pi_2 \mid \beta\right)$$

Here $m\left(\alpha\right)$ need not equal $m\left(\beta\right)$.

Dual

$$\operatorname{UOT}(\alpha, \beta)=\sup _{(f, g)} \mathcal{F}_{\varepsilon}(f, g)$$

where
$$\mathcal{F}_{\varepsilon}(f, g) \triangleq \left\langle\alpha,-\varphi_1^*(-f)\right\rangle+\left\langle\beta,-\varphi_2^*(-g)\right\rangle -\varepsilon\left\langle\alpha \otimes \beta, e^{\frac{f \oplus g-\mathrm{C}}{\varepsilon}}-1\right\rangle,$$
with $(f \oplus g)_{i,j} = f_i + g_j$. Here $\varphi^*$ is the Legendre transform of $\varphi$, applied entrywise:

$$\varphi_1^*(-f) \triangleq \left(\varphi_1^*\left(-f_i\right)\right)_i \in \mathbb{R}^N$$

If $\varepsilon=0$, the last term becomes the constraint $f\oplus g \le \mathrm{C}$.
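To see the primal and dual objectives side by side, the sketch below evaluates both for the special case $\mathrm{D}_{\varphi_1}=\mathrm{D}_{\varphi_2}=\rho\,\mathrm{KL}$, using $\varphi^*(y)=\rho(e^{y/\rho}-1)$. Function names and data are illustrative assumptions. By weak duality, any dual value should lie below any primal value, which is all the example checks.

```python
import math

def kl(mu, nu):
    """KL(mu | nu) for flat lists with nu > 0, using 0 log 0 = 0."""
    return sum((m * math.log(m / n) if m > 0.0 else 0.0) - m + n
               for m, n in zip(mu, nu))

def primal(pi, C, alpha, beta, eps, rho):
    """<pi, C> + eps*KL(pi | alpha (x) beta) + rho*KL(pi1 | alpha) + rho*KL(pi2 | beta)."""
    N, M = len(alpha), len(beta)
    cost = sum(pi[i][j] * C[i][j] for i in range(N) for j in range(M))
    flat_pi = [pi[i][j] for i in range(N) for j in range(M)]
    flat_ref = [alpha[i] * beta[j] for i in range(N) for j in range(M)]
    pi1 = [sum(pi[i]) for i in range(N)]
    pi2 = [sum(pi[i][j] for i in range(N)) for j in range(M)]
    return cost + eps * kl(flat_pi, flat_ref) + rho * kl(pi1, alpha) + rho * kl(pi2, beta)

def dual(f, g, C, alpha, beta, eps, rho):
    """F_eps(f, g) with phi*(y) = rho * (exp(y / rho) - 1) on both sides."""
    N, M = len(alpha), len(beta)
    term_f = sum(-rho * a * (math.exp(-fi / rho) - 1.0) for a, fi in zip(alpha, f))
    term_g = sum(-rho * b * (math.exp(-gj / rho) - 1.0) for b, gj in zip(beta, g))
    term_eps = eps * sum(alpha[i] * beta[j] * (math.exp((f[i] + g[j] - C[i][j]) / eps) - 1.0)
                         for i in range(N) for j in range(M))
    return term_f + term_g - term_eps

alpha = [0.4, 0.9]
beta = [0.5, 0.3, 0.6]
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
eps, rho = 0.5, 1.0

pi_a = [[a * b for b in beta] for a in alpha]
pi_b = [[0.2, 0.1, 0.0], [0.1, 0.2, 0.5]]
f, g = [0.1, -0.3], [0.2, 0.0, -0.1]

print(dual(f, g, C, alpha, beta, eps, rho),
      primal(pi_a, C, alpha, beta, eps, rho),
      primal(pi_b, C, alpha, beta, eps, rho))
```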

Sinkhorn algorithm

One way to solve the dual problem of OT/UOT is the Sinkhorn algorithm.

When $\mathrm{D}_\varphi= \rho\,\mathrm{KL}$, its convergence rate is $\left(1+\frac{\varepsilon}{\rho}\right)^{-1}$, which tends to $1$ when $\varepsilon \ll \rho$, so convergence becomes slow.

Translation-invariant formulation

In the balanced case $\varphi_1(x) = \varphi_2(x)=\iota_{\{1\}}(x)$, the dual functional is
$$\mathcal{F}_{\varepsilon}(f, g)=\langle\alpha, f\rangle+\langle\beta, g\rangle-\varepsilon\left\langle\alpha \otimes \beta, e^{\frac{f \oplus g-\mathrm{C}}{\varepsilon}}-1\right\rangle.$$
For any constant $\lambda \in \mathbb{R}$, since $(f+\lambda)\oplus(g-\lambda)=f\oplus g$ and $m(\alpha)=m(\beta)$,
$$\mathcal{F}_{\varepsilon}(f+\lambda, g-\lambda)=\mathcal{F}_{\varepsilon}(f, g) +\lambda m\left(\alpha\right)-\lambda m\left(\beta\right)=\mathcal{F}_{\varepsilon}(f, g).$$

This invariance does not hold for general UOT.
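A quick numerical illustration of both statements, assuming equal masses $m(\alpha)=m(\beta)=1$ for the balanced functional and $\mathrm{D}_\varphi=\rho\,\mathrm{KL}$ for the unbalanced one. All names and numbers are illustrative.

```python
import math

def eps_term(f, g, C, alpha, beta, eps):
    """eps * <alpha (x) beta, exp((f (+) g - C) / eps) - 1>."""
    return eps * sum(alpha[i] * beta[j] * (math.exp((f[i] + g[j] - C[i][j]) / eps) - 1.0)
                     for i in range(len(alpha)) for j in range(len(beta)))

def dual_balanced(f, g, C, alpha, beta, eps):
    """Balanced dual: phi = indicator of {1}, so -phi*(-f) = f."""
    lin = sum(a * fi for a, fi in zip(alpha, f)) + sum(b * gj for b, gj in zip(beta, g))
    return lin - eps_term(f, g, C, alpha, beta, eps)

def dual_kl(f, g, C, alpha, beta, eps, rho):
    """Unbalanced dual with rho*KL penalties: -phi*(-f) = -rho * (exp(-f / rho) - 1)."""
    tf = sum(-rho * a * (math.exp(-fi / rho) - 1.0) for a, fi in zip(alpha, f))
    tg = sum(-rho * b * (math.exp(-gj / rho) - 1.0) for b, gj in zip(beta, g))
    return tf + tg - eps_term(f, g, C, alpha, beta, eps)

def shift(f, g, lam):
    """Apply the translation (f + lam, g - lam)."""
    return [x + lam for x in f], [y - lam for y in g]

alpha = [0.5, 0.5]
beta = [0.2, 0.3, 0.5]
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
f, g = [0.1, -0.2], [0.3, 0.0, -0.1]
fs, gs = shift(f, g, 0.7)

print(dual_balanced(f, g, C, alpha, beta, 0.5), dual_balanced(fs, gs, C, alpha, beta, 0.5))
print(dual_kl(f, g, C, alpha, beta, 0.5, 1.0), dual_kl(fs, gs, C, alpha, beta, 0.5, 1.0))
```

The first line prints two equal values (up to rounding); the second prints two different values, because the exponential terms $e^{-f/\rho}$ and $e^{-g/\rho}$ do not cancel under the shift.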

Let $\left(f^\star,g^\star\right)$ be the optimal solution of $\mathcal{F}_{\varepsilon}$, and suppose the Sinkhorn algorithm is initialized at $f_0=f^{\star}+\tau$ with $\tau \in\mathbb{R}$.

For $\mathrm{D}_\varphi=\rho\,\mathrm{KL}$ one gets $f_t = f^\star + \left(\frac{\rho}{\varepsilon + \rho}\right)^{2t}\tau$. The iterates are therefore sensitive to translations, and when $\varepsilon \ll \rho$ the error shrinks very slowly.
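Plugging numbers into $\left(\frac{\rho}{\varepsilon+\rho}\right)^{2t}\tau$ shows how quickly the iteration count grows as $\varepsilon/\rho$ shrinks. The helper names and the target reduction factor $10^{-6}$ are illustrative choices.

```python
import math

def shift_error(tau, eps, rho, t):
    """Remaining translation error after t Sinkhorn iterations, per the formula above."""
    return (rho / (eps + rho)) ** (2 * t) * tau

def iterations_needed(eps, rho, factor):
    """Smallest t with (rho / (eps + rho))^(2t) <= factor, for 0 < factor < 1."""
    kappa = rho / (eps + rho)
    return math.ceil(math.log(factor) / (2.0 * math.log(kappa)))

for eps in (1.0, 0.1, 0.01):
    t = iterations_needed(eps, 1.0, 1e-6)
    print(eps, t, shift_error(1.0, eps, 1.0, t))
```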

To address this, the authors introduce
$$\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g}) \triangleq \sup _{\lambda \in \mathbb{R}} \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda),$$
which by construction satisfies $\mathcal{H}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda)=\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})$.

For $\mathrm{UOT}\left(\alpha,\beta\right)$, maximizing $\mathcal{F}_{\varepsilon}(f, g)$ and maximizing $\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})$ yield the same solution, recovered as
$$(f, g)=\left(\bar{f}+\lambda^{\star}(\bar{f}, \bar{g}),\ \bar{g}-\lambda^{\star}(\bar{f}, \bar{g})\right) \quad\text{where}\quad \lambda^{\star}(\bar{f}, \bar{g}) \triangleq \underset{\lambda \in \mathbb{R}}{\operatorname{argmax}}\ \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda).$$
The authors assume that $\varphi^*$ is strictly convex, so that $\lambda^\star$ is unique.

Properties of $\mathcal{H}_\varepsilon$

Recall that $m(\alpha) \triangleq\langle\alpha, \mathbb{1}\rangle=\sum_i \alpha_i$ and

$$\lambda^{\star}(\bar{f}, \bar{g}) \triangleq \underset{\lambda \in \mathbb{R}}{\operatorname{argmax}}\ \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda).$$

Property 1

Assume $\varphi_1^*,\varphi_2^*$ are smooth and strictly convex.

Then there exists a unique maximizer $\lambda^{\star}(\bar{f}, \bar{g})$.

Moreover, $(\tilde{\alpha}, \tilde{\beta})=\nabla \mathcal{H}_0(\bar{f}, \bar{g})$ satisfies $\tilde{\alpha}=\nabla \varphi_1^*\left(-\bar{f}-\lambda^\star(\bar{f}, \bar{g})\right) \alpha$ and $\tilde{\beta} = \nabla \varphi_2^*\left(-\bar{g}+\lambda^{\star}(\bar{f}, \bar{g})\right) \beta$,

and $m(\tilde{\alpha})=m(\tilde{\beta})$.

(Here $\varphi^*$ is a scalar function applied entrywise, so $\nabla \varphi^*$ should be read as a diagonal matrix, i.e. $\nabla\varphi^*\left(f\right) = \operatorname{diag}\left(\left(\varphi^*\right)^{\prime}\left(f_1\right), \left(\varphi^*\right)^{\prime}\left(f_2\right), \ldots, \left(\varphi^*\right)^{\prime}\left(f_n\right)\right)$.)

Proof:

For any $(\bar{f}, \bar{g})$, define $\mathcal{G}_{\varepsilon}(\lambda) \triangleq \mathcal{F}_{\varepsilon}(\bar{f}+\lambda, \bar{g}-\lambda)$.

By [Liero et al., 2015], $\lim\limits_{x\to \infty}\varphi\left(x\right)=+\infty$, so $\mathcal{G}_{\varepsilon}(\lambda)\to -\infty$ as $\lambda \to \pm \infty$; that is, $-\mathcal{G}_{\varepsilon}$ is coercive.

Hence $\mathcal{G}_{\varepsilon}$ attains its global maximum on $\mathbb{R}$.

Since $\varphi^*$ is strictly convex, $\mathcal{G}_{\varepsilon}$ is strictly concave and the maximizer is unique.

Differentiating $\mathcal{G}_\varepsilon$ with respect to $\lambda$ (the $\varepsilon$ term does not depend on $\lambda$) and setting the derivative to zero at $\lambda=\lambda^\star$:
$$\begin{aligned} &\frac{\mathrm{d}\mathcal{G}_\varepsilon}{\mathrm{d}\lambda}=\left\langle\alpha, \nabla\varphi_1^*\left(-\bar{f}-\lambda\right)\right\rangle-\left\langle\beta, \nabla \varphi_2^*(-\bar{g}+\lambda)\right\rangle=0\\ &\Rightarrow \left\langle\alpha, \nabla\varphi_1^*\left(-\bar{f}-\lambda\right)\right\rangle = \left\langle\beta, \nabla \varphi_2^*(-\bar{g}+\lambda)\right\rangle\\ &\Rightarrow\langle \tilde{\alpha}, \mathbb{1}\rangle = \langle\tilde{\beta}, \mathbb{1}\rangle\\ &\Rightarrow m(\tilde{\alpha})=m(\tilde{\beta}) \end{aligned}$$
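Because $\mathrm{d}\mathcal{G}_\varepsilon/\mathrm{d}\lambda$ is strictly decreasing in $\lambda$, the root can be located by bisection for any smooth, strictly convex $\varphi_i^*$. The sketch below does this for the KL case $(\varphi_i^*)'(x)=e^{x/\rho_i}$ and confirms the mass equality. The bracket $[-50, 50]$, the function names and the data are illustrative assumptions.

```python
import math

def mass_gap(lam, f_bar, g_bar, alpha, beta, dphi1, dphi2):
    """dG/dlambda = <alpha, phi1*'(-f_bar - lam)> - <beta, phi2*'(-g_bar + lam)>."""
    left = sum(a * dphi1(-fi - lam) for a, fi in zip(alpha, f_bar))
    right = sum(b * dphi2(-gj + lam) for b, gj in zip(beta, g_bar))
    return left - right

def solve_lambda(f_bar, g_bar, alpha, beta, dphi1, dphi2, lo=-50.0, hi=50.0, iters=200):
    """Bisection on the decreasing map lam -> dG/dlambda (root assumed inside [lo, hi])."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass_gap(mid, f_bar, g_bar, alpha, beta, dphi1, dphi2) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho1, rho2 = 1.0, 2.0

def dphi1(x):
    return math.exp(x / rho1)  # derivative of rho1 * (exp(x / rho1) - 1)

def dphi2(x):
    return math.exp(x / rho2)

alpha = [0.4, 0.9]
beta = [0.5, 0.3, 0.6]
f_bar = [0.1, -0.3]
g_bar = [0.2, 0.0, -0.1]

lam = solve_lambda(f_bar, g_bar, alpha, beta, dphi1, dphi2)
alpha_tilde = [a * dphi1(-fi - lam) for a, fi in zip(alpha, f_bar)]
beta_tilde = [b * dphi2(-gj + lam) for b, gj in zip(beta, g_bar)]
print(lam, sum(alpha_tilde), sum(beta_tilde))
```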

Property 2

If $\mathrm{D}_{\varphi_i} = \rho_i\,\mathrm{KL}$, then

$$\lambda^{\star}(\bar{f}, \bar{g})=\frac{\rho_1 \rho_2}{\rho_1+\rho_2} \log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right]$$
Proof: with $\left(\varphi_i^*\right)'(x)=e^{x/\rho_i}$, the first-order condition $m(\tilde{\alpha})=m(\tilde{\beta})$ reads
$$\begin{aligned} & \left\langle\alpha, e^{-\frac{\bar{f}+\lambda^{\star}}{\rho_1}}\right\rangle=\left\langle\beta, e^{-\frac{\bar{g}-\lambda^{\star}}{\rho_2}}\right\rangle \\ & \Leftrightarrow e^{-\frac{\lambda^{\star}}{\rho_1}}\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle=e^{+\frac{\lambda^{\star}}{\rho_2}}\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle \\ & \Leftrightarrow-\frac{\lambda^{\star}}{\rho_1}+\log \left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle=\frac{\lambda^{\star}}{\rho_2}+\log \left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle \\ & \Leftrightarrow \lambda^{\star}\left(\frac{1}{\rho_1}+\frac{1}{\rho_2}\right)=\log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right] \\ & \Leftrightarrow \lambda^{\star}(\bar{f}, \bar{g})=\frac{\rho_1 \rho_2}{\rho_1+\rho_2} \log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right]. \end{aligned}$$
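The closed form can be checked directly: it should maximize the $\lambda$-dependent part of $\mathcal{F}_\varepsilon(\bar f+\lambda,\bar g-\lambda)$ and equalize the two masses. A minimal sketch with illustrative names and data:

```python
import math

def lambda_star(f_bar, g_bar, alpha, beta, rho1, rho2):
    """Closed-form maximizer of lam -> F_eps(f_bar + lam, g_bar - lam) for rho_i * KL."""
    a = sum(ai * math.exp(-fi / rho1) for ai, fi in zip(alpha, f_bar))
    b = sum(bj * math.exp(-gj / rho2) for bj, gj in zip(beta, g_bar))
    return rho1 * rho2 / (rho1 + rho2) * math.log(a / b)

def lambda_part(lam, f_bar, g_bar, alpha, beta, rho1, rho2):
    """Terms of F_eps(f_bar + lam, g_bar - lam) that depend on lam (the eps term does not)."""
    tf = sum(-rho1 * ai * (math.exp(-(fi + lam) / rho1) - 1.0) for ai, fi in zip(alpha, f_bar))
    tg = sum(-rho2 * bj * (math.exp(-(gj - lam) / rho2) - 1.0) for bj, gj in zip(beta, g_bar))
    return tf + tg

alpha = [0.4, 0.9]
beta = [0.5, 0.3, 0.6]
f_bar = [0.1, -0.3]
g_bar = [0.2, 0.0, -0.1]
rho1, rho2 = 1.0, 2.0

lam = lambda_star(f_bar, g_bar, alpha, beta, rho1, rho2)
for delta in (-0.5, -0.05, 0.0, 0.05, 0.5):
    print(delta, lambda_part(lam + delta, f_bar, g_bar, alpha, beta, rho1, rho2))
```

The printed value is largest at `delta = 0.0`.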

Property 3

Let $\tau_1 = \frac{\rho_1}{\rho_1 + \rho_2}$ and $\tau_2 = \frac{\rho_2}{\rho_1 + \rho_2}$. Then

$$\begin{aligned} \mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})={} & \rho_1 m(\alpha)+\rho_2 m(\beta)-\varepsilon\left\langle\alpha \otimes \beta, e^{\frac{\bar{f} \oplus \bar{g}-\mathrm{C}}{\varepsilon}}-1\right\rangle \\ & -\left(\rho_1+\rho_2\right)\left(\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle\right)^{\tau_1}\left(\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle\right)^{\tau_2}. \end{aligned}$$
In particular, when $\rho_1=\rho_2=\rho$ and $\varepsilon=0$,
$$\mathcal{H}_0(\bar{f}, \bar{g})=\rho\left[m(\alpha)+m(\beta)-2 \sqrt{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho}}\right\rangle\left\langle\beta, e^{-\frac{\bar{g}}{\rho}}\right\rangle}\right].$$
Proof:

$$\varphi_i(x)=\rho_i\left(x \log x-x+1\right), \qquad \varphi_i^*\left(x\right)=\rho_i\left(e^{\frac{x}{\rho_i}}-1\right)$$

$$\begin{aligned} \mathcal{F}_{0}(\bar{f}, \bar{g}) & =\left\langle\alpha,-\rho_1\left(e^{-\frac{\bar{f}}{\rho_1}}-1\right)\right\rangle+\left\langle\beta,-\rho_2\left(e^{-\frac{\bar{g}}{\rho_2}}-1\right)\right\rangle \\ & =\rho_1 m(\alpha)+\rho_2 m(\beta)-\rho_1\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle-\rho_2\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle \end{aligned}$$

By Property 2,
$$\begin{aligned} \left\langle\alpha, e^{-\frac{\bar{f}+\lambda^{\star}(\bar{f}, \bar{g})}{\rho_1}}\right\rangle & =\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle \cdot \exp \left(-\frac{\rho_2}{\rho_1+\rho_2} \log \left[\frac{\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle}{\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle}\right]\right) \\ & =\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle \cdot\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle^{-\frac{\rho_2}{\rho_1+\rho_2}} \cdot\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle^{\frac{\rho_2}{\rho_1+\rho_2}} \\ & =\left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle^{\frac{\rho_1}{\rho_1+\rho_2}} \cdot\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle^{\frac{\rho_2}{\rho_1+\rho_2}} \end{aligned}$$
Similarly,
$$\left\langle\beta, e^{-\frac{\bar{g}-\lambda^{\star}(\bar{f}, \bar{g})}{\rho_2}}\right\rangle = \left\langle\alpha, e^{-\frac{\bar{f}}{\rho_1}}\right\rangle^{\frac{\rho_1}{\rho_1+\rho_2}} \cdot\left\langle\beta, e^{-\frac{\bar{g}}{\rho_2}}\right\rangle^{\frac{\rho_2}{\rho_1+\rho_2}}.$$
Substituting both identities into $\mathcal{F}_{\varepsilon}\left(\bar{f}+\lambda^{\star}, \bar{g}-\lambda^{\star}\right)$ gives the result.
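The closed form of $\mathcal{H}_\varepsilon$ can be compared numerically with $\mathcal{F}_\varepsilon$ evaluated at the optimal shift, and its translation invariance can be checked at the same time. A minimal sketch for the $\rho_i\,\mathrm{KL}$ case; function names and data are illustrative.

```python
import math

def eps_term(f, g, C, alpha, beta, eps):
    return eps * sum(alpha[i] * beta[j] * (math.exp((f[i] + g[j] - C[i][j]) / eps) - 1.0)
                     for i in range(len(alpha)) for j in range(len(beta)))

def dual_kl(f, g, C, alpha, beta, eps, rho1, rho2):
    """F_eps(f, g) with phi_i*(x) = rho_i * (exp(x / rho_i) - 1)."""
    tf = sum(-rho1 * a * (math.exp(-fi / rho1) - 1.0) for a, fi in zip(alpha, f))
    tg = sum(-rho2 * b * (math.exp(-gj / rho2) - 1.0) for b, gj in zip(beta, g))
    return tf + tg - eps_term(f, g, C, alpha, beta, eps)

def lambda_star(f, g, alpha, beta, rho1, rho2):
    a = sum(ai * math.exp(-fi / rho1) for ai, fi in zip(alpha, f))
    b = sum(bj * math.exp(-gj / rho2) for bj, gj in zip(beta, g))
    return rho1 * rho2 / (rho1 + rho2) * math.log(a / b)

def H_closed(f, g, C, alpha, beta, eps, rho1, rho2):
    """Closed form of H_eps from Property 3."""
    a = sum(ai * math.exp(-fi / rho1) for ai, fi in zip(alpha, f))
    b = sum(bj * math.exp(-gj / rho2) for bj, gj in zip(beta, g))
    t1, t2 = rho1 / (rho1 + rho2), rho2 / (rho1 + rho2)
    return (rho1 * sum(alpha) + rho2 * sum(beta) - eps_term(f, g, C, alpha, beta, eps)
            - (rho1 + rho2) * a ** t1 * b ** t2)

alpha = [0.4, 0.9]
beta = [0.5, 0.3, 0.6]
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
f_bar, g_bar = [0.1, -0.3], [0.2, 0.0, -0.1]
eps, rho1, rho2 = 0.5, 1.0, 2.0

lam = lambda_star(f_bar, g_bar, alpha, beta, rho1, rho2)
f_opt = [x + lam for x in f_bar]
g_opt = [y - lam for y in g_bar]
print(H_closed(f_bar, g_bar, C, alpha, beta, eps, rho1, rho2),
      dual_kl(f_opt, g_opt, C, alpha, beta, eps, rho1, rho2))
```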

Translation invariant Sinkhorn

Unbalanced Sinkhorn

For any initialization $f_0$, iterate
$$\begin{aligned} & g_{t+1}(y)=-\operatorname{aprox}_{\varphi_2^*}\left(-\operatorname{Smin}_{\varepsilon}^\alpha\left(\mathrm{C}(\cdot, y)-f_t\right)\right) \\ & f_{t+1}(x)=-\operatorname{aprox}_{\varphi_1^*}\left(-\operatorname{Smin}_{\varepsilon}^\beta\left(\mathrm{C}(x, \cdot)-g_{t+1}\right)\right) \end{aligned}$$
where the softmin is defined as $\operatorname{Smin}_{\varepsilon}^\alpha(f) \triangleq-\varepsilon \log \left\langle\alpha, e^{-f / \varepsilon}\right\rangle$
and the anisotropic prox is
$$\operatorname{aprox}_{\varphi^*}(x) \triangleq \arg \min _{y \in \mathbb{R}}\ \varepsilon e^{\frac{x-y}{\varepsilon}}+\varphi^*(y).$$
If $\mathrm{D}_\varphi = \rho\,\mathrm{KL}$, then $\operatorname{aprox}_{\varphi^*}(x)=\frac{\rho}{\varepsilon+\rho} x$.
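The KL closed form follows from a one-line first-order condition. The derivation below is a worked sketch: the auxiliary name $h$ is introduced here for illustration, and $\varphi^*(y)=\rho\left(e^{y/\rho}-1\right)$ is the conjugate used in the proof of Property 3.

```latex
\begin{aligned}
h(y) &= \varepsilon e^{\frac{x-y}{\varepsilon}} + \rho\left(e^{\frac{y}{\rho}} - 1\right), \\
h'(y) &= -e^{\frac{x-y}{\varepsilon}} + e^{\frac{y}{\rho}} = 0
\;\Longleftrightarrow\; \frac{x-y}{\varepsilon} = \frac{y}{\rho}
\;\Longleftrightarrow\; y = \frac{\rho}{\varepsilon+\rho}\, x .
\end{aligned}
```

Since $h$ is a sum of two convex functions of $y$, this stationary point is the minimizer.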

Under $\|\cdot\|_\infty$, the softmin is 1-contractive (non-expansive) and, for $\mathrm{D}_\varphi=\rho\,\mathrm{KL}$, $\operatorname{aprox}_{\varphi^*}$ is $\left(1+\frac{\varepsilon}{\rho}\right)^{-1}$-contractive.
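For $\mathrm{D}_{\varphi_1}=\mathrm{D}_{\varphi_2}=\rho\,\mathrm{KL}$ the iteration reduces to a softmin followed by a multiplication by $\kappa=\frac{\rho}{\varepsilon+\rho}$. The sketch below is a plain-Python illustration on a tiny problem, not a reference implementation: the names, the data and the fixed iteration counts are assumptions. It also reproduces the translation-sensitivity formula: starting from $f^\star+\tau$, one iteration leaves an error of $\kappa^2\tau$.

```python
import math

def softmin(weights, values, eps):
    """Smin_eps^w(v) = -eps * log <w, exp(-v / eps)>, stabilized by factoring out min(v)."""
    vmin = min(values)
    s = sum(w * math.exp(-(v - vmin) / eps) for w, v in zip(weights, values))
    return vmin - eps * math.log(s)

def sinkhorn_kl(C, alpha, beta, eps, rho, f0, iters):
    """Unbalanced Sinkhorn for rho*KL penalties, where aprox(x) = rho / (eps + rho) * x."""
    N, M = len(alpha), len(beta)
    kappa = rho / (eps + rho)
    f = list(f0)
    g = [0.0] * M
    for _ in range(iters):
        g = [kappa * softmin(alpha, [C[i][j] - f[i] for i in range(N)], eps) for j in range(M)]
        f = [kappa * softmin(beta, [C[i][j] - g[j] for j in range(M)], eps) for i in range(N)]
    return f, g

C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
alpha = [0.4, 0.9]
beta = [0.5, 0.3, 0.6]
eps, rho = 1.0, 1.0

# With eps = rho the contraction factor is 1/4 per iteration, so 100 iterations suffice here.
f_star, g_star = sinkhorn_kl(C, alpha, beta, eps, rho, [0.0, 0.0], 100)

tau = 0.8
f1, _ = sinkhorn_kl(C, alpha, beta, eps, rho, [x + tau for x in f_star], 1)
kappa = rho / (eps + rho)
print([a - b for a, b in zip(f1, f_star)], kappa ** 2 * tau)
```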

TI-Sinkhorn

Perform alternating dual ascent on $\mathcal{H}_\varepsilon$:

$$\Psi_1\left(\bar{g}\right)\triangleq\arg\max_{\bar f} \mathcal{H}_{\varepsilon}\left(\cdot,\bar{g}\right), \qquad \Psi_2\left(\bar{f}\right)\triangleq\arg\max_{\bar g} \mathcal{H}_{\varepsilon}\left(\bar{f}, \cdot\right)$$
The TI-Sinkhorn algorithm iterates
$$\bar{g}_{t+1}=\Psi_2\left(\bar{f}_t\right), \qquad \bar{f}_{t+1}=\Psi_1\left(\bar{g}_{t+1}\right).$$

Finally, the potentials are recovered as
$$\left(f_t,g_t\right)\triangleq \left(\bar{f}_t + \lambda^\star\left(\bar{f}_t,\bar{g}_t\right),\ \bar{g}_t - \lambda^\star\left(\bar{f}_t,\bar{g}_t\right)\right).$$

The algorithm inherits the invariance of $\mathcal{H}_\varepsilon$:
$$\Psi_2\left(\bar{f}_t+\mu\right)=\Psi_2\left(\bar{f}_t\right)-\mu,$$
that is, if $\bar{f}_t$ is replaced by $\bar{f}_t+\mu$, then $\bar{g}_{t+1}$ becomes $\bar{g}_{t+1}-\mu$.
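For $\rho_i\,\mathrm{KL}$ penalties each half-step can be written in closed form. The sketch below is an illustration under explicit assumptions: the update is obtained here by combining the first-order condition $\bar g-\lambda^\star=\kappa_2\left(\operatorname{Smin}_\varepsilon^\alpha(\mathrm{C}-\bar f)-\lambda^\star\right)$ with the closed form of $\lambda^\star$ from Property 2 and solving for $\lambda^\star$, which gives the coefficient $\frac{\rho_1(\varepsilon+\rho_2)}{\rho_1+\rho_2+\varepsilon}$ (and symmetrically for $\bar f$). This algebra was worked out for these notes and is not quoted from the paper; all names and data are hypothetical. The plain Sinkhorn loop is included only to compare the two limits.

```python
import math

def softmin(weights, values, eps):
    """Smin_eps^w(v) = -eps * log <w, exp(-v / eps)>, stabilized."""
    vmin = min(values)
    return vmin - eps * math.log(sum(w * math.exp(-(v - vmin) / eps) for w, v in zip(weights, values)))

def log_dot(weights, values, scale):
    """log <w, exp(-v / scale)>, stabilized."""
    vmin = min(values)
    return -vmin / scale + math.log(sum(w * math.exp(-(v - vmin) / scale) for w, v in zip(weights, values)))

def lambda_star(fb, gb, alpha, beta, rho1, rho2):
    """Closed form from Property 2."""
    return rho1 * rho2 / (rho1 + rho2) * (log_dot(alpha, fb, rho1) - log_dot(beta, gb, rho2))

def ti_sinkhorn_kl(C, alpha, beta, eps, rho1, rho2, iters):
    N, M = len(alpha), len(beta)
    k1, k2 = rho1 / (eps + rho1), rho2 / (eps + rho2)
    fb, gb = [0.0] * N, [0.0] * M
    for _ in range(iters):
        # g-bar update (Psi_2): g_bar = k2 * h + (1 - k2) * lam, with lam = lambda_star(fb, g_bar)
        h = [softmin(alpha, [C[i][j] - fb[i] for i in range(N)], eps) for j in range(M)]
        lam = (rho1 * (eps + rho2) / (rho1 + rho2 + eps)
               * (log_dot(alpha, fb, rho1) - log_dot(beta, h, eps + rho2)))
        gb = [k2 * hj + (1.0 - k2) * lam for hj in h]
        # f-bar update (Psi_1): f_bar = k1 * h - (1 - k1) * lam, with lam = lambda_star(f_bar, gb)
        h = [softmin(beta, [C[i][j] - gb[j] for j in range(M)], eps) for i in range(N)]
        lam = (rho2 * (eps + rho1) / (rho1 + rho2 + eps)
               * (log_dot(alpha, h, eps + rho1) - log_dot(beta, gb, rho2)))
        fb = [k1 * hi - (1.0 - k1) * lam for hi in h]
    lam = lambda_star(fb, gb, alpha, beta, rho1, rho2)
    return [x + lam for x in fb], [y - lam for y in gb]

def sinkhorn_kl(C, alpha, beta, eps, rho1, rho2, iters):
    """Plain unbalanced Sinkhorn, used here only as a reference."""
    N, M = len(alpha), len(beta)
    k1, k2 = rho1 / (eps + rho1), rho2 / (eps + rho2)
    f, g = [0.0] * N, [0.0] * M
    for _ in range(iters):
        g = [k2 * softmin(alpha, [C[i][j] - f[i] for i in range(N)], eps) for j in range(M)]
        f = [k1 * softmin(beta, [C[i][j] - g[j] for j in range(M)], eps) for i in range(N)]
    return f, g

C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
alpha = [0.4, 0.9]
beta = [0.5, 0.3, 0.6]
f_ti, g_ti = ti_sinkhorn_kl(C, alpha, beta, 1.0, 1.0, 2.0, 500)
f_ref, g_ref = sinkhorn_kl(C, alpha, beta, 1.0, 1.0, 2.0, 500)
print(f_ti, f_ref)
print(g_ti, g_ref)
```

On this small instance both loops should reach the same potentials after the final $\lambda^\star$ correction, which is the consistency one expects from the equivalence of maximizing $\mathcal{F}_\varepsilon$ and $\mathcal{H}_\varepsilon$.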

Property 4: for fixed $\left(\bar{f},\bar{g}\right)$, let
$$\hat{f}\triangleq \operatorname{Smin}_\varepsilon^{\beta}\left(\mathrm{C}\left(x,\cdot\right)-\bar{g}\right), \qquad \hat{g}\triangleq \operatorname{Smin}_\varepsilon^{\alpha}\left(\mathrm{C}\left(\cdot,y\right)-\bar{f}\right),$$
$$\psi_1 \triangleq \Psi_1(\bar{g}), \qquad \psi_2 \triangleq \Psi_2(\bar{f}).$$

Then
$$\begin{aligned} & \psi_1=-\operatorname{aprox}_{\varphi_1^*}\left(-\hat{f}-\lambda^{\star}\left(\psi_1, \bar{g}\right)\right)-\lambda^{\star}\left(\psi_1, \bar{g}\right), \\ & \psi_2=-\operatorname{aprox}_{\varphi_2^*}\left(-\hat{g}+\lambda^{\star}\left(\bar{f}, \psi_2\right)\right)+\lambda^{\star}\left(\bar{f}, \psi_2\right) . \end{aligned}$$

Proof:
$$\mathcal{H}_{\varepsilon}(\bar{f}, \bar{g})=\mathcal{F}_{\varepsilon}\left(\bar{f}+\lambda^{\star}(\bar{f}, \bar{g}), \bar{g}-\lambda^{\star}(\bar{f}, \bar{g})\right)$$

Differentiating with respect to $\bar{g}$ (the terms through $\lambda^\star$ vanish by optimality of $\lambda^\star$) and setting the gradient to zero:
$$\begin{aligned} \beta \odot e^{\bar{g} / \varepsilon}\left\langle\alpha, e^{(\bar{f}-\mathrm{C}) / \varepsilon}\right\rangle &=\beta \odot\nabla \varphi_2^*\left(-\bar{g}+\lambda^{\star}(\bar{f}, \bar{g})\right)\\ e^{\bar{g} / \varepsilon}\left\langle\alpha, e^{(\bar{f}-\mathrm{C}) / \varepsilon}\right\rangle &=\nabla \varphi_2^*\left(-\bar{g}+\lambda^{\star}(\bar{f}, \bar{g})\right) \end{aligned}$$
Differentiating with respect to $\bar{f}$:
$$e^{\bar{f} / \varepsilon}\left\langle\beta, e^{(\bar{g}-\mathrm{C}) / \varepsilon}\right\rangle=\nabla \varphi_1^*\left(-\bar{f}-\lambda^{\star}(\bar{f}, \bar{g})\right)$$
Writing $g\triangleq\bar{g}-\lambda^{\star}(\bar{f}, \bar{g})$, the condition in $\bar g$ becomes
$$e^{g / \varepsilon}\left\langle\alpha, e^{\left(\bar{f}+\lambda^{\star}(\bar{f}, \bar{g})-\mathrm{C}\right) / \varepsilon}\right\rangle=\nabla \varphi_2^*(-g),$$
which is the first-order condition defining the anisotropic prox, so
$$g=\bar{g}-\lambda^{\star}(\bar{f}, \bar{g})=-\operatorname{aprox}_{\varphi_2^*}\left(-\operatorname{Smin}_{\varepsilon}^\alpha\left(\mathrm{C}-\bar{f}-\lambda^{\star}(\bar{f}, \bar{g})\right)\right).$$

Since $\bar{g}=\Psi_2(\bar{f})$, the claim for $\psi_2$ follows; the claim for $\psi_1$ is obtained in the same way from the condition in $\bar f$.

The rest of the paper: still reading.

Reposted from blog.csdn.net/qq_39942341/article/details/131628570