## 0x00 Abstract

A main challenge of FL is that the training data are usually non-Independent, Identically Distributed (non-IID) across clients, which may introduce bias into model training and degrade accuracy. To address this issue, this paper proposes a novel FL algorithm that alleviates the accuracy degradation caused by non-IID data at clients. First, we observe that clients with different degrees of non-IID data exhibit different weight divergence from clients holding IID data. Inspired by this, we use weight divergence to recognize the non-IID degree of each client's data.

## 0x01 INTRODUCTION

‘‘Federated learning with non-IID data’’ proposes that public data could be distributed to clients such that the clients’ data become IID.

The work in ‘‘Hybrid-FL for wireless networks: Cooperative learning mechanism using non-IID data’’ proposes a Hybrid-FL scheme, which allows a small number of clients to upload their data to the FL server.

The work in ‘‘Astraea: Self-balancing federated learning for improving classification accuracy of mobile deep learning applications’’ assumes that the clients are willing to send their data distribution/label information to the server.

However, few works study how to identify the non-IID degrees of the data at clients.

Similar to ‘‘Federated learning with non-IID data’’, we assume that there exists a limited amount of approximately uniform data, which could be gathered from clients (e.g., 1% of the collection of the data from clients) or from publicly available data.

Main Contribution:

• We define $weight\ divergence$ between clients, and observe that a larger weight divergence indicates a higher non-IID degree of the data at a client.
• Based on weight divergence, we propose an efficient client selection algorithm (CSFedAvg), in which clients with lower non-IID degrees of data are chosen more often for training.

## 0x02 PRELIMINARY

There are the following two questions:

• How can we differentiate the clients according to the degrees of their non-IID data, and how should we choose them for model training?

• If we let the clients with a lower degree of non-IID data participate in the training more often than the clients with a higher degree of non-IID data, can the accuracy of FL be improved?

## 0x03 THE DESIGN OF CSFedAvg

### A. OBSERVATION ON WEIGHT DIVERGENCE BETWEEN CLIENT MODELS

An intuition is that the divergence between the models of clients with all-IID data should be smaller than that between the models of clients with non-IID data.

$d_{m,n}=\frac{||w_{m}(t)-w_n(t)||}{||w_n(t)||}$

It is observed that the average weight divergence of the clients with non-IID data is always higher than that of the clients with IID data, which confirms our intuition.

Furthermore, the weight divergences of clients with IID data are very close to each other.
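
As a quick illustration, here is a minimal Python sketch of the pairwise weight divergence $d_{m,n}$ defined above, assuming each client model is flattened into a single vector; the toy weights below are random stand-ins, not models from the paper.

```python
import numpy as np

def weight_divergence(w_m, w_n):
    """Relative L2 distance d_{m,n} = ||w_m - w_n|| / ||w_n||."""
    w_m, w_n = np.ravel(w_m), np.ravel(w_n)
    return np.linalg.norm(w_m - w_n) / np.linalg.norm(w_n)

# Toy check with synthetic weights: a small perturbation (an "IID-like"
# pair) yields a smaller divergence than a large one (an "IID vs. non-IID"
# pair).
rng = np.random.default_rng(0)
w_ref = rng.normal(size=1000)
w_close = w_ref + 0.01 * rng.normal(size=1000)
w_far = w_ref + 0.5 * rng.normal(size=1000)
print(weight_divergence(w_close, w_ref))  # ~0.01
print(weight_divergence(w_far, w_ref))    # ~0.5
```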

### B. THE FRAMEWORK OF CSFEDAVG

The core idea of CSFedAvg is to use an auxiliary dataset held by the server to assess whether a client's data distribution is close to that of the auxiliary dataset.

The main procedure of CSFedAvg is as follows:

• Global Model Downloading

The central server randomly selects some clients to form a random client set.

The clients in this set and in the candidate set download the model from the central server.

• Client Model Update

• The above clients perform local training on their models.

• The central server performs local training on the auxiliary dataset.

• Client Set Update and Model Aggregation

• Using the weight divergence (WD) between the central model and the models uploaded by clients, update the candidate client set and the final selection set.
• Aggregate the models of the random client set and the final selection set (the candidate set does not participate in aggregation).
• Repeat the above steps until a fixed number of rounds or the target accuracy is reached.

### C. THE ALGORITHM DESIGN OF CSFedAvg

• $\mathbb{N}_t,\mathbb{C}_t,\mathbb{S}$: the random client set, the candidate client set, and the final selection set
• $w_k(t),w_0(t)$: the model of client $k$ in round $t$, and the model trained by the central server on the auxiliary data
• $p,\alpha$: two selection-ratio hyperparameters

The main steps of CSFedAvg in round $t$ are as follows (a code sketch of these steps appears after the list):

• The central server randomly selects $pK$ clients to form $\mathbb{N}_t$. The clients in $\mathbb{N}_t$, $\mathbb{C}_t$, and $\mathbb{S}$ download the central model and train locally, while the central server trains $w_0(t-1)$ on the auxiliary dataset.

• The clients in $\mathbb{N}_t$, $\mathbb{C}_t$, and $\mathbb{S}$ upload their local models.

• The central server computes the WD between each uploaded model and its own model, denoted by $d_k(t)$:

$d_k(t)=\frac{||w_k(t)-w_0(t)||}{||w_0(t)||}$

• Sort the models in $\mathbb{N}_t$ by WD in ascending order:

• If $|\mathbb{S}|< \alpha K$, select the first $\lfloor \alpha|\mathbb{N}_t|\rfloor$ clients from $\mathbb{C}_t$ and add them to $\mathbb{S}$ (exit condition: all of $\mathbb{C}_t$ has been considered, or $|\mathbb{S}|=\lfloor \alpha K \rfloor$).
• Update $\mathbb{C}_t$ to the first $\lfloor \alpha|\mathbb{N}_t|\rfloor$ clients.
• If $|\mathbb{S}|\geq \alpha K$, the central server directly performs model aggregation (here the aggregation covers $w_0$ and the models of $\mathbb{N}_t$ and $\mathbb{S}$; see Algorithm 1).
• Model aggregation:

• If $\mathbb{S}$ is not yet full, the aggregation is:

$w(t)=\sum_{c_k \in \mathbb{N}_t\cup \mathbb{S}}\frac{|\mathcal{D}_k|}{|\mathcal{D}|}w_k(t),\quad |\mathcal{D}|=\sum_{c_k \in \mathbb{N}_t\cup \mathbb{S}}|\mathcal{D}_k|$

• If $\mathbb{S}$ is full, the aggregation is:

$w(t)=\frac{|\mathcal{D}_0|}{|\mathcal{D}|}w_0(t)+\sum_{c_k \in \mathbb{N}_t\cup \mathbb{S}}\frac{|\mathcal{D}_k|}{|\mathcal{D}|}w_k(t),\quad |\mathcal{D}|=|\mathcal{D}_0|+\sum_{c_k \in \mathbb{N}_t\cup \mathbb{S}}|\mathcal{D}_k|$

• Repeat until the loop termination condition is met.
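
Below is a simplified Python sketch of the per-round set update and aggregation described above, assuming every model is a flat NumPy vector. The function and argument names are mine, local training and communication are elided, and ambiguities in the description (e.g., exactly how $\mathbb{C}_t$ is refreshed from $\mathbb{N}_t$) are resolved with one plausible reading rather than the authors' exact rule.

```python
import numpy as np

def csfedavg_update(w0, uploads, sizes, aux_size, N_t, C, S, alpha, K):
    """Per-round client-set update and aggregation (a simplified sketch).

    w0:      server model trained on the auxiliary data (flat vector)
    uploads: client id -> uploaded flat weight vector, for N_t | C | S
    sizes:   client id -> |D_k|;  aux_size: |D_0|
    N_t, C, S: random, candidate, and final selection sets of client ids
    """
    # WD of each uploaded model from the server model: d_k(t).
    d = {k: np.linalg.norm(w - w0) / np.linalg.norm(w0)
         for k, w in uploads.items()}

    cap = int(np.floor(alpha * K))          # target size of S
    top = int(np.floor(alpha * len(N_t)))   # slots refreshed each round

    if len(S) < cap:
        # Promote the lowest-WD candidates into S until S is full or C
        # is exhausted, then refresh C with the lowest-WD clients of N_t.
        for k in sorted(C, key=d.get)[:top]:
            if len(S) == cap:
                break
            S.add(k)
        C = set(sorted(N_t - S, key=d.get)[:top])

    # Weighted aggregation over N_t and S; once S is full, the server
    # model w0 joins the average with weight |D_0| / |D|.
    members = sorted(N_t | S)
    data_total = sum(sizes[k] for k in members)
    if len(S) >= cap:
        total = aux_size + data_total
        w_new = (aux_size / total) * w0
    else:
        total = data_total
        w_new = np.zeros_like(w0)
    for k in members:
        w_new = w_new + (sizes[k] / total) * uploads[k]
    return w_new, C, S
```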

Algorithm 1

Algorithm 2

• The fact that clients only enter $\mathbb{S}$ and never leave it also deserves thought: early in training, does a similar WD really indicate a similar data distribution?

• How should the two ratio hyperparameters be set? The experiments only explore this empirically, without proof; is the right choice also related to the size of the auxiliary dataset?

• The distribution of the auxiliary dataset directly determines the selection of $\mathbb{S}$!

• What if the auxiliary dataset is very small? Does it need to accompany the whole training process, and would the model overfit to it? If it is large, is the validation accuracy improved simply because of the auxiliary dataset?

## 0x04 SIMULATION RESULTS

### A. SETUP

• Dataset: CIFAR-10, Fashion MNIST

These two datasets will be distributed to $K = 500$ clients, with each client holding 500 samples of CIFAR-10, and each client holding 600 samples of Fashion MNIST.

• IID/Non-IID

For both the Hybrid I and Hybrid II settings, only a small number of clients, $\gamma\times 100\%$ of them, are assigned IID data, while the other clients are randomly assigned images from a limited number of classes.

Note that this setting still requires that some clients hold IID data.
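
For concreteness, here is a toy Python sketch of such a hybrid split. `classes_per_client` and the exact sampling are my assumptions, since the paper's precise Hybrid I/II recipes are not reproduced here; with 500 samples per client and 500 clients, samples necessarily repeat across clients.

```python
import numpy as np

def hybrid_split(labels, K=500, gamma=0.06, classes_per_client=2,
                 per_client=500, seed=0):
    """Toy sketch of the Hybrid setting: gamma*K clients get IID samples,
    the rest draw from only a few classes each."""
    rng = np.random.default_rng(seed)
    n_iid = int(gamma * K)
    by_class = {c: np.where(labels == c)[0] for c in np.unique(labels)}
    shards = []
    # IID clients: uniform random draws over the whole dataset.
    for _ in range(n_iid):
        shards.append(rng.choice(len(labels), size=per_client, replace=False))
    # Non-IID clients: draws restricted to a few randomly chosen classes.
    for _ in range(K - n_iid):
        cls = rng.choice(list(by_class), size=classes_per_client, replace=False)
        pool = np.concatenate([by_class[c] for c in cls])
        shards.append(rng.choice(pool, size=per_client, replace=False))
    return shards  # shards[k] holds the sample indices of client k
```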

• CNN

Specifically, we set four 3 × 3 convolution layers (32, 32, 64, 64 channels, each of which was activated by ReLU and batch normalized, and every two of which were followed by 2 × 2 max pooling), two fully connected layers (384 and 192 units with ReLU activation), and a final output layer with 10 units for training both the CIFAR-10 and Fashion MNIST datasets.
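
A PyTorch sketch of this architecture follows; the ReLU-before-batch-norm ordering and the padding are my reading of the description, and the class and argument names are mine (Fashion MNIST would use `in_ch=1, side=28`).

```python
import torch.nn as nn

def conv_block(cin, cout):
    # 3x3 convolution, ReLU activation, then batch normalization.
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                         nn.ReLU(), nn.BatchNorm2d(cout))

class PaperCNN(nn.Module):
    def __init__(self, in_ch=3, side=32, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_ch, 32), conv_block(32, 32), nn.MaxPool2d(2),
            conv_block(32, 64), conv_block(64, 64), nn.MaxPool2d(2),
        )
        flat = 64 * (side // 4) ** 2  # two 2x2 poolings halve each side twice
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 384), nn.ReLU(),
            nn.Linear(384, 192), nn.ReLU(),
            nn.Linear(192, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```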

• $p=0.1$, mini-batch SGD

• Baselines: FedAvg, FedProx

### B. PERFORMANCE COMPARISON ON TRAINING ACCURACY

We set $\alpha = \gamma = 6\%$.

There are two reasons why CSFedAvg can mitigate more accuracy loss on Hybrid I setting.

Firstly, clients with a higher degree of non-IID data lead to larger weight divergence when the numbers of clients with IID and non-IID data are held constant.

Secondly, clients with a higher degree of non-IID data accelerate the degradation of the global model's performance.

### C. PERFORMANCE COMPARISON ON THE NUMBER OF COMMUNICATION ROUNDS

We record the Time of Arrival at a desired accuracy, denoted by ToA@x, where x denotes the target accuracy. Similarly, we measure the accuracy of each algorithm after running 300 rounds, denoted by $Accuracy$ in the table.
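
A trivial Python helper for this metric, under the assumption that per-round accuracies are already logged:

```python
def toa(accuracies, target):
    """First communication round at which a run reaches the target
    accuracy (ToA@target), or None if it never does."""
    for rnd, acc in enumerate(accuracies, start=1):
        if acc >= target:
            return rnd
    return None
```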

Hmm, 739? 893? Were these numbers not caught in checking?

### D. THE IMPACT OF SELECTION FACTOR $\alpha$ ON ACCURACY AND COMMUNICATION ROUNDS

We record the number of communication rounds required to reach a target accuracy on both the CIFAR-10 and Fashion MNIST datasets.