The FL protocol iteratively asks randomly selected clients to download a trainable model from a server, update it with their own data, and upload the updated model back to the server, while the server aggregates the multiple client updates to further improve the model. Although clients in this protocol never have to disclose their private data, the overall training process can become inefficient when some clients have limited computational resources (i.e., require a longer update time) or are under poor wireless channel conditions (a longer upload time).
Our new FL protocol, which we refer to as FedCS, mitigates this problem and performs FL efficiently while actively managing clients based on their resource conditions. Specifically, FedCS solves a client selection problem with resource constraints, which allows the server to aggregate as many client updates as possible and to accelerate performance improvement in ML models.
Key Words: Federated Learning, Client Selection, Greedy Algorithm
In particular, we consider the problem of running FL in a cellular network used by heterogeneous mobile devices with different data resources, computational capabilities, and wireless channel conditions.
Our main contribution is a new protocol referred to as FedCS, which can run FL efficiently while an operator of mobile edge computing (MEC) frameworks actively manages the resources of heterogeneous clients. Specifically, FedCS sets a certain deadline for clients to download, update, and upload ML models in the FL protocol.
This is technically formulated as a client-selection problem that determines which clients participate in the training process and when each client has to complete the process, while considering the computation and communication resource constraints imposed by the clients; this problem can be solved in a greedy fashion.
0x02 FEDERATED LEARNING
A. Federated Learning
The only technical requirement is that each client must have a certain level of computational resources, because the Update and Upload steps consist of multiple iterations of the forward propagation and backpropagation of the model.
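As a rough illustration, the Update step amounts to a few local epochs of mini-batch SGD. The sketch below (PyTorch; the function and argument names are ours, and the epoch count and learning rate mirror the experimental setup later in this post) shows the forward/backward loop whose result is then uploaded.

```python
import torch

# A minimal sketch of the client-side Update step (names are ours).
def client_update(model, loader, epochs=5, lr=0.25):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:               # one mini-batch at a time
            opt.zero_grad()
            loss = loss_fn(model(x), y)   # forward propagation
            loss.backward()               # backpropagation
            opt.step()
    return model.state_dict()             # the weights to Upload to the server
```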
B. Heterogeneous Client Problem in FL
Protocol 1 can experience major problems when training ML models in a practical cellular network, mainly because it does not take into account the heterogeneous data sizes, computational capacities, and channel conditions of each client.
All such problems with heterogeneous client resources become bottlenecks in the FL training process: the server can complete the Aggregation step only after it has received all client updates. One may set a deadline for the randomly selected clients to complete the Update and Upload steps and ignore any update submitted after the deadline. However, this straightforward approach leads to inefficient use of network bandwidth and wastes the resources of delayed clients.
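For concreteness, the naive policy being critiqued looks roughly like the following sketch (hypothetical names): the server keeps only the updates that arrive before the deadline, so the computation and bandwidth spent by the late clients are simply thrown away.

```python
# Naive deadline policy (our sketch): drop any update arriving late.
def aggregate_with_deadline(arrivals, deadline):
    """arrivals: list of (update, arrival_time) pairs from random clients."""
    kept = [update for update, t in arrivals if t <= deadline]
    wasted = len(arrivals) - len(kept)  # these clients' work is discarded
    return kept, wasted
```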
0x03 FEDCS: FEDERATED LEARNING WITH CLIENT SELECTION
B. FedCS Protocol
FedCS estimates the time each client needs for the Scheduled Update and Upload step, and uses these estimates to decide which clients to select for training.
C. Algorithm for the Client Selection Step
Our goal in the Client Selection step is to allow the server to aggregate as many client updates as possible within a specified deadline.
While the MEC operator selects clients, it also schedules the available radio resource blocks for the selected clients to prevent network congestion. This work assumes that once the clients finish updating their models, they upload them to the server one by one.
Now, the objective of Client Selection, namely accepting as many client updates as possible, can be achieved by maximizing the number of selected clients, i.e., $|\mathbb{S}|$, where $\mathbb{S}$ denotes the set of selected clients. To describe the constraint, we define $\Theta_i$, the estimated elapsed time from the beginning of the Scheduled Update and Upload step until the $i$-th client completes the update and upload procedures.
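In symbols, the problem reads roughly as follows (our reconstruction; the paper's full formulation also charges the times for client selection, model distribution, and aggregation against the same round deadline $T_{\mathrm{round}}$):

$$
\max_{\mathbb{S}} \; |\mathbb{S}| \quad \text{s.t.} \quad \Theta_{|\mathbb{S}|} \le T_{\mathrm{round}}
$$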
Selection of $\mathbb{S}$:
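A minimal sketch of the greedy idea (our own names; `elapsed_after` stands in for the paper's recursive estimate of $\Theta$ when a candidate is appended to the upload schedule): at each step, add the client whose inclusion increases the estimated elapsed time the least, and stop once even the cheapest addition would miss the deadline.

```python
def select_clients(candidates, deadline, elapsed_after):
    selected = []
    remaining = set(candidates)
    while remaining:
        # the client whose addition yields the smallest new elapsed time
        best = min(remaining, key=lambda k: elapsed_after(selected, k))
        if elapsed_after(selected, best) > deadline:
            break  # even the cheapest remaining client would miss the deadline
        selected.append(best)
        remaining.remove(best)
    return selected
```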
0x04 PERFORMANCE EVALUATION
A. Simulated Environment
We simulated a MEC environment implemented on the cellular network of an urban microcell consisting of an edge server, a base station (BS), and K = 1000 clients, on a single workstation with GPUs. The BS and server were co-located at the center of the cell with a radius of 2 km, and the clients were uniformly distributed in the cell.
B. Experimental Setup of ML Tasks
C. Global Models and Their Updates
Our model consisted of six 3 × 3 convolution layers (32, 32, 64, 64, 128, 128 channels, each of which was activated by ReLU and batch normalized, and every two of which were followed by 2 × 2 max pooling) followed by three fully-connected layers (382 and 192 units with ReLU activation and another 10 units activated by soft-max).
CIFAR-10: 4.6 million model parameters
Fashion-MNIST: 3.6 million model parameters
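For reference, here is one plausible PyTorch rendering of the described CIFAR-10 model (our reconstruction from the text above; layer-ordering details, and hence the exact parameter count, may differ from the figures given):

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution with batch normalization and ReLU, as described above
    return [nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU()]

model = nn.Sequential(
    *conv_block(3, 32),   *conv_block(32, 32),   nn.MaxPool2d(2),
    *conv_block(32, 64),  *conv_block(64, 64),   nn.MaxPool2d(2),
    *conv_block(64, 128), *conv_block(128, 128), nn.MaxPool2d(2),
    nn.Flatten(),                       # 32x32 input -> 4x4x128 after pooling
    nn.Linear(128 * 4 * 4, 382), nn.ReLU(),
    nn.Linear(382, 192), nn.ReLU(),
    nn.Linear(192, 10), nn.Softmax(dim=1),  # usually omitted with CrossEntropyLoss
)
```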
We used a mini-batch size of 50, 5 epochs per round, an initial learning rate of 0.25 for the stochastic gradient descent updates, and a learning-rate decay of 0.99.
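In code (assuming, as one common reading, that the 0.99 decay is applied multiplicatively once per round):

```python
BATCH_SIZE = 50        # mini-batch size
EPOCHS_PER_ROUND = 5   # local epochs per round
INITIAL_LR = 0.25      # initial SGD learning rate
LR_DECAY = 0.99        # per-round multiplicative decay (our assumption)

def learning_rate(round_idx):
    return INITIAL_LR * LR_DECAY ** round_idx
```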
We determined the mean computational capability of each client randomly from a range of 10 to 100, which was used as the value for computing update times. As a result, each update time, $T^{\mathrm{UD}}_k$, used in Client Selection varied from 5 to 500 seconds on average.
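These figures are consistent with a simple model of the update time, sketched below under our assumptions (5 local epochs, the stated 10 to 100 range interpreted as samples processed per second, and clients holding on the order of 100 to 1000 samples):

```python
import random

def simulated_update_time(n_samples, epochs=5):
    # mean capability drawn from the stated 10-100 range (samples/second)
    throughput = random.uniform(10, 100)
    # e.g. 1000 samples at 10 samples/s -> 5 * 1000 / 10 = 500 seconds
    return epochs * n_samples / throughput
```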
We empirically set the round deadline $T_{\mathrm{round}}$ to 3 minutes and the total training time budget $T_{\mathrm{final}}$ to 400 minutes.
D. Evaluation Details
Nevertheless, our selection of model architectures was sufficient to show how our new protocol allows for efficient training under resource-constrained settings; it was not aimed at achieving the best accuracies.