Download link

IEEE INFOCOM 2018

The gap between one person and another cannot be generalized…

0x00 Abstract

We analyze the convergence rate of distributed gradient descent from a theoretical point of view, based on which we propose a control algorithm that determines the best trade-off between local update and global parameter aggregation to minimize the loss function under a given resource budget. The performance of the proposed algorithm is evaluated via extensive experiments with real datasets, both on a networked prototype system and in a larger-scale simulated environment.

Key Words: Convergence Rate, Federated Learning, Resource Budget

0x01 INTRODUCTION

In this paper, we address the problem of how to efficiently utilize the limited computation and communication resources at the edge for the optimal learning performance. We consider a typical edge computing architecture where edge nodes are interconnected with the remote cloud via network elements, such as gateways and routers.

The frequency of global aggregation is configurable; one can aggregate at an interval of one or multiple local updates. Each local update consumes computation resource of the edge node, and each global aggregation consumes communication resource of the network. The amount of consumed resources may vary over time, and there is a complex relationship among the frequency of global aggregation, the model training accuracy, and resource consumption.

Main contributions:

  • We analyze the convergence rate of gradient-descent based distributed learning from a theoretical perspective, and obtain a novel convergence bound that incorporates non-i.i.d. data distributions among nodes and an arbitrary number of local updates between two global aggregations.
  • Using the above theoretical result, we propose a control algorithm that learns the data distribution, system dynamics, and model characteristics, based on which it dynamically adapts the frequency of global aggregation in real time to minimize the learning loss under a fixed resource budget.
  • Experimental evaluation with real datasets, both on a networked prototype system and in a larger-scale simulated environment.

0x03 PRELIMINARIES AND DEFINITIONS

A. Loss Function

$$F_i(w)= \frac{1}{|\mathcal{D}_i|}\sum_{j\in\mathcal{D}_i}f_j(w) \tag{1}$$

$$F(w)=\frac{\sum_{j\in\cup_i\mathcal{D}_i}f_j(w)}{|\cup_i\mathcal{D}_i|}=\frac{\sum_{i=1}^{N}D_iF_i(w)}{D} \tag{2}$$

The subscript $i$ denotes node-level quantities; symbols without a subscript denote global quantities.
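To make (1) and (2) concrete, here is a minimal NumPy sketch; the squared-error per-sample loss and all variable names are assumptions for illustration, not something specified by the paper:

```python
import numpy as np

# Per-sample loss f_j(w); a toy squared-error loss is assumed here.
def sample_loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

# Local loss F_i(w), eq. (1): average of f_j over node i's samples D_i.
def local_loss(w, X_i, y_i):
    return np.mean([sample_loss(w, x, y) for x, y in zip(X_i, y_i)])

# Global loss F(w), eq. (2): D_i-weighted average of the local losses.
def global_loss(w, datasets):                 # datasets = [(X_1, y_1), ...]
    D = sum(len(X_i) for X_i, _ in datasets)
    return sum(len(X_i) * local_loss(w, X_i, y_i) for X_i, y_i in datasets) / D
```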

B. The Learning Problem

$$w^*=\arg\min_{w} F(w) \tag{3}$$

We want to find the optimal weights; there is generally no closed-form solution, so gradient descent has to be used.

C. Distributed Gradient Descent

By the nature of federated learning, a global model aggregation is performed after a certain number of local update rounds. Therefore, for any given iteration on a node, the resulting model is either the purely locally trained model or the globally aggregated model.

Let $\widetilde{w}_i(t)$ denote the model parameters obtained at iteration $t$:

$$\widetilde{w}_i(t) = \begin{cases} w_i(t), & \text{no aggregation is performed}\\ w(t), & \text{aggregation is performed} \end{cases}$$

where

  • $w_i(t)$ is computed via local gradient descent:

    $$w_i(t) = \widetilde{w}_i(t-1) - \eta\nabla F_i(\widetilde{w}_i(t-1)) \tag{4}$$

  • $w(t)$ is the weighted aggregation:

    $$w(t)=\frac{\sum_{i=1}^N D_i w_i(t)}{D} \tag{5}$$

In total, each node performs $T$ local update steps, with one global aggregation every $\tau$ local updates, so the whole training process involves $\frac{T}{\tau}$ aggregations (for convenience, we only consider the case where $\frac{T}{\tau}$ is an integer).

The rationale behind Algorithm 1 is that when τ=1\tau=1, i.e., when we perform global aggregation after every local update step, the distributed gradient descent (ignoring communication aspects) is equivalent to the centralized gradient descent, where the latter assumes that all data samples are available at a centralized location and the global loss function and its gradient can be observed directly. This is due to the linearity of the gradient operator.
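As a rough illustration of the procedure described above (not the paper's Algorithm 1 listing verbatim), here is a minimal NumPy simulation; the toy quadratic local losses and all parameter values below are assumptions:

```python
import numpy as np

def distributed_gd(grad_fns, sizes, w0, eta, T, tau):
    """T local gradient steps per node (eq. (4)), weighted aggregation (eq. (5))
    every tau steps. grad_fns[i](w) returns the local gradient of F_i; sizes[i] = D_i."""
    D = sum(sizes)
    w_global = w0.copy()
    w_local = [w0.copy() for _ in grad_fns]          # the \tilde{w}_i
    for t in range(1, T + 1):
        # Local update on every node, eq. (4)
        w_local = [w - eta * g(w) for w, g in zip(w_local, grad_fns)]
        # Global aggregation every tau steps, eq. (5)
        if t % tau == 0:
            w_global = sum(D_i * w for D_i, w in zip(sizes, w_local)) / D
            w_local = [w_global.copy() for _ in grad_fns]
    return w_global

# Toy usage: two nodes with quadratic local losses F_i(w) = 0.5 * ||w - c_i||^2
centers = [np.array([1.0, 0.0]), np.array([0.0, 3.0])]
grad_fns = [lambda w, c=c: w - c for c in centers]
w_final = distributed_gd(grad_fns, sizes=[100, 50], w0=np.zeros(2), eta=0.1, T=60, tau=10)
```

With `tau=1` the returned model coincides with centralized gradient descent on $F$, matching the equivalence noted above.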

0x04 PROBLEM FORMULATION

The notion of “resources” here is generic and can include time, energy, monetary cost etc. related to both computation and communication.

Therefore, a natural question is how to make efficient use of a given amount of resources to minimize the loss function of model training. For the distributed gradient-descent based learning approach presented above, the question narrows down to determining the optimal values of TT and τ\tau, so that F(w)F(w) is minimized subject to a given resource constraint for this learning task.

Formally, we define that each step of local update at all participating nodes consumes $c$ units of resource, and each step of global aggregation consumes $b$ units of resource, where $c \geq 0$ and $b \geq 0$ are both real numbers.

For given $T$ and $\tau$, the total amount of consumed resource is then $T\left(c+\frac{b}{\tau}\right)$. Let $R$ denote the total resource budget.
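As a purely illustrative example (numbers assumed, not taken from the paper): with $c=1$, $b=5$, and $\tau=5$, each local step effectively costs $c+\frac{b}{\tau}=2$ units, so a budget $R=1000$ admits at most $T=500$ local steps, i.e., $100$ global aggregations.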

Further, the resource consumptions $c$ and $b$ can be time-varying in practice, which makes the problem even more challenging than (6) alone.

The target optimization problem:

$$\min_{\tau,\,T}\; F(w(T)) \quad \text{s.t.}\quad T\left(c+\frac{b}{\tau}\right) \leq R \tag{6}$$

0x05 CONVERGENCE ANALYSIS

A. Definitions

To derive an upper bound on $F(w(T))-F(w^*)$, we first introduce some definitions:

  • $K=\frac{T}{\tau}$: the whole training process consists of $K$ intervals
  • $[k]\mathop{=}\limits^{def}[(k-1)\tau,k\tau]$: denotes one interval
  • $v_{[k]}(t)$: an auxiliary parameter vector defined on interval $[k]$
    • It follows centralized gradient descent, which has been studied extensively and serves as the performance ceiling for distributed gradient descent.

    • $$v_{[k]}(t)=v_{[k]}(t-1)-\eta\nabla F(v_{[k]}(t-1)) \tag{7}$$

    • Here the loss function $F(\cdot)$ has access to all data samples (this explains why, in the figure below, $F(v)$ decreases faster)

Observations from the figure above:

  • At every integer multiple of $\tau$ (the dividing lines in the figure), global aggregation is performed, i.e., $\widetilde{w}_i((k-1)\tau)=w((k-1)\tau)$

  • At each interval boundary, $v_{[k]}((k-1)\tau)=w((k-1)\tau)$

    The figure can be read as a god's-eye view, which is why the continuous evolution of the global loss can be observed.


To obtain the convergence bound of Algorithm 1, two steps are needed:

  • Obtain the gap between $w(k\tau)$ and $v_{[k]}(k\tau)$ within each interval $[k]$
  • Combine this gap with the convergence rate of $v_{[k]}(t)$ to obtain the convergence rate of $w(t)$

$v_{[k]}(t)$ follows centralized gradient descent, whose convergence rate is relatively easy to obtain; adding the gap accumulated within each interval, we can then estimate the convergence rate of $w(t)$.
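Schematically, for the last interval $[K]$ the two pieces combine as follows (a rough sketch of the decomposition, not the paper's exact derivation):

$$F(w(T)) - F(w^*) = \underbrace{F(w(T)) - F(v_{[K]}(T))}_{\leq\, \rho h(\tau)\ \text{(gap, Theorem 1)}} + \underbrace{F(v_{[K]}(T)) - F(w^*)}_{\text{centralized GD convergence}}$$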


Lemma 1: $F(w)$ is convex, $\rho$-Lipschitz, and $\beta$-smooth.

$Proof$: This is straightforward from Assumption 1 (each local loss $F_i(w)$ is assumed convex, $\rho$-Lipschitz, and $\beta$-smooth), the definition of $F(w)$, and the triangle inequality.

Definition 1 (Gradient Divergence):

For any $i$ and $w$, we define $\delta_i$ as an upper bound of $||\nabla F_i(w)-\nabla F(w)||$, i.e.,

$$||\nabla F_i(w)-\nabla F(w)|| \leq \delta_i \tag{8}$$

We also define $\delta = \frac{\sum_i D_i\delta_i}{D}$.
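Intuitively, $\delta$ measures how non-i.i.d. the data distribution across nodes is. A tiny sketch of how one might estimate it from per-node gradients at a single point $w$ (an estimate only, since $\delta_i$ is defined as an upper bound over all $w$; names and setup are illustrative assumptions):

```python
import numpy as np

def gradient_divergence(local_grads, sizes):
    """Estimate delta at one w: delta_i ~ ||grad F_i(w) - grad F(w)||,
    then the D_i-weighted average, per Definition 1."""
    D = sum(sizes)
    global_grad = sum(D_i * g for D_i, g in zip(sizes, local_grads)) / D
    deltas = [np.linalg.norm(g - global_grad) for g in local_grads]
    return sum(D_i * d for D_i, d in zip(sizes, deltas)) / D
```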

B. Main Results

Theorem 1.

For any interval $[k]$ and $t\in[k]$, we have

$$||w(t)-v_{[k]}(t)|| \leq h(t-(k-1)\tau) \tag{9}$$

where

$$h(x)\mathop{=}\limits^{def} \frac{\delta}{\beta}\left((\eta\beta+1)^x-1\right)-\eta\delta x \tag{10}$$

for any $x > 0$. Furthermore, as $F(\cdot)$ is $\rho$-Lipschitz, we have $F(w(t))-F(v_{[k]}(t))\leq\rho h(t-(k-1)\tau)$.

$Proof$: By mathematical induction; I will go through the details when I have time.
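Note that $h(1)=\frac{\delta}{\beta}\big((\eta\beta+1)-1\big)-\eta\delta=0$, so the gap bound only becomes non-trivial when an interval contains at least two local updates ($\tau\geq 2$); this matches the later observation that $\tau=1$ recovers centralized gradient descent.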

Theorem 2.

The convergence upper bound of Algorithm 1 after $T$ iterations is given by

$$F(w(T))-F(w^*)\leq \frac{1}{T\left(\omega\eta\left(1-\frac{\beta\eta}{2}\right)-\frac{\rho h(\tau)}{\tau\epsilon^2}\right)} \tag{11}$$

when the following conditions are satisfied:

  • $\eta\leq \frac{1}{\beta}$
  • $\omega\eta(1-\frac{\beta\eta}{2})-\frac{\rho h(\tau)}{\tau\epsilon^2} > 0$
  • $F(v_{[k]}(t))-F(w^*)\geq \epsilon$ for all $t$ and $k$ for which $v_{[k]}(t)$ is defined
  • $F(w(T))-F(w^*)\geq \epsilon$

where $\epsilon >0$ and $\omega=\min_k\frac{1}{||v_{[k]}((k-1)\tau)-w^*||^2}$.

$Proof$: Not read yet…

0x06 CONTROL ALGORITHM

A. Approximate Solution to (6)

Combining problem (6) with Theorem 2, the target problem becomes:

$$\min_{\tau,\,T}\; \frac{1}{T\left(\omega\eta\left(1-\frac{\beta\eta}{2}\right)-\frac{\rho h(\tau)}{\tau\epsilon^2}\right)} \quad \text{s.t.}\quad T\leq \frac{R\tau}{c\tau+b} \tag{12}$$

η\etaϵ\epsilon设置的较小,能满足定理2的要求的时候,公式(12)可以转换成如下:

$$\max_{\tau=1,2,3,\ldots}\; \frac{R\tau}{c\tau+b}\left(\omega\eta\left(1-\frac{\beta\eta}{2}\right)-\frac{\rho h(\tau)}{\tau\epsilon^2}\right) \quad \text{s.t.}\quad \text{Conditions 1, 3, 4 in Theorem 2} \tag{13}$$

The optimal solution to (13) always satisfies condition 2 in Theorem 2, because $\tau=1$ (in which case $h(\tau) = 0$) makes the objective function positive, thus satisfying the condition.

Dividing the objective of (13) by $\frac{R\omega}{c}$ yields the following objective function, where we define $a=\frac{b}{c}$:

$$G(\tau)\mathop{=}\limits^{def} \frac{\tau}{\tau+a}\left(\eta\left(1-\frac{\beta\eta}{2}\right)-\frac{\rho h(\tau)}{\tau\omega\epsilon^2}\right) \tag{14}$$

Our goal is now to find the optimal $\tau$:

$$\tau^* = \mathop{argmax}\limits_{\tau} G(\tau) \tag{15}$$

The optimal $T^*=\frac{R\tau^*}{c\tau^*+b}$ then follows.

Proposition 1.

When $\eta\leq\frac{1}{\beta}$, $G(\tau)$ is a strictly concave function for $\tau\geq1$.
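Since $G(\tau)$ is strictly concave for $\tau\geq 1$, a simple scan that stops as soon as $G$ starts to decrease finds $\tau^*$. A minimal sketch following (10), (14), and (15); all parameter values ($\eta,\beta,\rho,\delta,\omega,\epsilon,c,b,R$) are made up purely for illustration, and the side conditions of Theorem 2 are ignored here:

```python
def h(x, eta, beta, delta):
    # Gap function, eq. (10)
    return delta / beta * ((eta * beta + 1) ** x - 1) - eta * delta * x

def G(tau, eta, beta, rho, delta, omega, eps, a):
    # Normalized objective, eq. (14)
    return tau / (tau + a) * (eta * (1 - beta * eta / 2)
                              - rho * h(tau, eta, beta, delta) / (tau * omega * eps ** 2))

def find_tau_star(eta, beta, rho, delta, omega, eps, a, tau_max=1000):
    # By concavity (Proposition 1), stop at the first tau where G stops increasing.
    best_tau, best_val = 1, G(1, eta, beta, rho, delta, omega, eps, a)
    for tau in range(2, tau_max + 1):
        val = G(tau, eta, beta, rho, delta, omega, eps, a)
        if val <= best_val:
            break
        best_tau, best_val = tau, val
    return best_tau

# Illustrative (made-up) parameter values:
eta, beta, rho, delta, omega, eps, c, b, R = 0.01, 10.0, 1.0, 0.5, 1.0, 1.0, 1.0, 5.0, 1000.0
tau_star = find_tau_star(eta, beta, rho, delta, omega, eps, a=b / c)
T_star = R * tau_star / (c * tau_star + b)   # optimal T, per the formula above
```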

B. Dynamic Control Algorithm

Here, once the hyperparameters are determined, $\varphi\mathop{=}\limits^{def} \frac{\rho}{\epsilon^2\omega}$.
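These notes do not reproduce the paper's dynamic control algorithm itself. The following is only a rough, non-authoritative sketch of the kind of control loop described (adapt the aggregation frequency online under a resource budget), reusing `find_tau_star` from the sketch above; the callables and the estimation step are assumptions, not the paper's exact procedure:

```python
def adaptive_training(run_local_updates, aggregate, estimate_parameters,
                      w0, eta, R, tau_init=1):
    """Hypothetical control loop: alternate tau local steps and one aggregation,
    re-estimating parameters and recomputing tau* until the budget R is spent.

    run_local_updates(w, eta, tau) -> (local_models, c)   # measured per-step cost c
    aggregate(local_models)        -> (w, b)              # measured aggregation cost b
    estimate_parameters(w)         -> (beta, rho, delta, omega, eps)
    """
    tau, w, spent = tau_init, w0, 0.0
    while True:
        local_models, c = run_local_updates(w, eta, tau)   # tau steps of eq. (4)
        w, b = aggregate(local_models)                      # eq. (5)
        spent += tau * c + b
        if spent >= R:                                      # budget exhausted
            break
        beta, rho, delta, omega, eps = estimate_parameters(w)
        tau = find_tau_star(eta, beta, rho, delta, omega, eps, a=b / c)
    return w
```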

References

  • Bernoulli's inequality
  • Convex optimization