Federated Learning FedAvg: Communication-Efficient Learning of Deep Networks from Decentralized Data

        With the growth of computing power, machine learning has become a widely used technology for analyzing and processing massive amounts of data. However, its development faces two major challenges. First, data security is difficult to guarantee, and privacy leakage is an urgent problem. Second, because of network-security isolation and industry privacy requirements, data sits in isolated islands that cannot be safely shared, and a model trained only on each department's independent data cannot reach global optimality. To address these problems, Google proposed federated learning (FL).

        This article mainly interprets and summarizes the key content of "Communication-Efficient Learning of Deep Networks from Decentralized Data", the pioneering work of federated learning.

Link to the paper: Communication-Efficient Learning of Deep Networks from Decentralized Data

Source code implementation: https://gitcode.net/mirrors/WHDY/fedavg?utm_source=csdn_github_accelerator 

Table of contents

Summary

1 Introduction

1.1 Source of the problem

1.2 Contribution of this article

1.3 Features of federated learning

1.4 Federated optimization

1.5 Related work

1.6 Framework of federated learning

2. Algorithm introduction

2.1 Federated Stochastic Gradient Descent (FedSGD)

2.2 Federated Averaging Algorithm (FedAvg)

3. Experiment design and implementation

3.1 Model initialization

3.2 Dataset setup

3.2.1 MNIST dataset

3.2.2 Shakespeare dataset

3.3 Experimental optimization

3.3.1 Increasing parallelism

3.3.2 Increasing client computation

3.4 Over-optimizing on client datasets

3.5 CIFAR experiment

3.6 Large-Scale LSTM Experiments

4. Summary and outlook

 Summary

Modern mobile devices hold large amounts of data suitable for training models, and models trained on this data can greatly improve the user experience. For example, language models can improve speech recognition accuracy and text-entry efficiency, and image models can automatically select good photos. However, this data is often privacy-sensitive, and its total volume across devices is large, so it is not practical to upload every device's data to a data center and train a model there with conventional methods. The authors propose an alternative: learn a shared model from data that remains distributed across the devices (never uploaded to a data center) by aggregating locally computed updates. They call this decentralized approach "federated learning". They present a practical method for federated learning of deep networks that repeatedly averages models during training, and validate it experimentally with five different model architectures and four datasets. The results show that the method is robust to unbalanced and non-IID data. In this setting, communication cost is the principal bottleneck, and the experiments show that the method needs 10-100x fewer communication rounds than synchronized stochastic gradient descent.

1 Introduction

1.1 Source of the problem

        Mobile devices hold a large amount of data suitable for machine learning tasks, and leveraging this data can in turn improve the user experience. For example, image recognition models can help users pick good photos. However, this data is highly private and very large in volume, so we cannot simply move it to a cloud server for centralized training. The paper proposes a distributed machine learning approach called federated learning: the server sends the global model to the clients, each client trains on its local dataset, and the trained weights are uploaded to the server to update the global model.

1.2 Contribution of this article

  • Identifies training models on decentralized data stored across mobile devices as an important research direction
  • Proposes a simple and practical algorithm for learning in this decentralized setting
  • Provides an extensive empirical evaluation of the proposed algorithm

        Specifically, the paper introduces the "federated averaging" algorithm, which combines local stochastic gradient descent on each client with model averaging on the server. The authors evaluate the algorithm extensively; the results show that it is robust to unbalanced and non-IID data and reduces the number of communication rounds needed to train a deep network on decentralized data by orders of magnitude.

1.3 Features of federated learning

  • Training on real data from mobile devices has clear advantages over training on proxy data stored in data centers
  • Because this data is privacy-sensitive and large in total volume, it should not be uploaded to a data center for model training
  • For supervised tasks, labels can often be inferred naturally from user interaction with the application

1.4 Federated optimization

        Traditional distributed learning focuses on how to parallelize the training of a large neural network, while the data may still reside in a few large training centers. Federated learning, by contrast, focuses on the data itself: it keeps the data local and improves the learning model according to the characteristics of that data. Compared with typical distributed optimization problems, federated optimization has several key properties:

  • Non-IID: The characteristics and distribution of data are different among different parties
  • Unbalanced: some users use the service or application much more than others, so the amount of local training data differs widely between clients
  • Massively distributed: the number of clients participating in the optimization is much larger than the average number of examples per client
  • Limited communication: mobile devices are frequently offline or on slow, expensive connections, so efficient communication between client and server cannot be taken for granted

 This paper focuses on the non-IID and unbalanced properties of the optimization task, as well as the critical constraint of limited communication.

Note: IID stands for the independent and identically distributed assumption.

        The objective function for a (typically non-convex) neural network is

\min_{w \in \mathbb{R}^d} f(w), \quad f(w)=\frac{1}{n}\sum_{i=1}^{n} f_i(w)

For a machine learning problem we take f_i(w)=\ell(x_i, y_i; w), that is, the loss of the prediction on example (x_i, y_i) made with model parameters w.

        Suppose there are K clients, the set of indices of the data points held by client k is P_{k}, and n_{k}=\left | P_{k} \right |. The objective above can then be written as

f(w)=\sum_{k=1}^{K}\frac{n_k}{n}F_k(w), \quad F_k(w)=\frac{1}{n_k}\sum_{i \in P_k} f_i(w)

If the partition P_{k} is formed by distributing the training examples over the clients uniformly at random, this is the IID setting, and then

\mathbb{E}_{P_k}\left [ F_k(w) \right ]=f(w)

where the expectation is over the set of examples assigned to client k. If this assumption does not hold, the setting is called Non-IID.
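To make these definitions concrete, here is a minimal NumPy sketch (the function names and the toy linear model are illustrative assumptions, not the paper's code) that evaluates the federated objective f(w) as the n_k/n-weighted sum of the per-client losses F_k(w):

```python
import numpy as np

def client_loss(w, X_k, y_k):
    """F_k(w): mean squared loss of a toy linear model on client k's local data.
    The linear model just stands in for f_i(w) = l(x_i, y_i; w)."""
    preds = X_k @ w
    return np.mean((preds - y_k) ** 2)

def federated_objective(w, client_data):
    """f(w) = sum_k (n_k / n) * F_k(w), with n = sum_k n_k."""
    n = sum(len(y_k) for _, y_k in client_data)
    return sum(len(y_k) / n * client_loss(w, X_k, y_k) for X_k, y_k in client_data)

# toy example: 3 clients with different amounts of local data (unbalanced)
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
client_data = []
for n_k in (50, 200, 10):
    X_k = rng.normal(size=(n_k, 2))
    y_k = X_k @ w_true + 0.1 * rng.normal(size=n_k)
    client_data.append((X_k, y_k))

print(federated_objective(np.zeros(2), client_data))
```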

1.5 Related work

        Among related work: distributed training of perceptrons by iteratively averaging locally trained models was studied in 2010; distributed training of deep neural networks for speech recognition was studied in 2015; and a 2015 paper studied asynchronous training with "soft" averaging. All of these works consider distributed training in a cluster or data-center setting and do not address federated learning with unbalanced and non-IID data, but they suggest the basic approach of iteratively averaging locally trained models. Closest in motivation to this paper is work that discusses the privacy advantages of keeping user data on the device; those authors also focus on training deep networks, emphasize privacy, and reduce communication overhead by sharing only a subset of parameters in each round. However, that work likewise does not consider unbalanced, non-IID data, and its empirical evaluation is limited.

1.6 Framework of federated learning

2 Algorithm Introduction

2.1 Federated Stochastic Gradient Descent (FedSGD)

With a fixed learning rate η, each of the K clients computes the gradient of its local loss at the current global model w_t:

g_k = \nabla F_k(w_t)

The central server aggregates the clients' gradients to update the model parameters:

w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K}\frac{n_k}{n} g_k

where \sum_{k=1}^{K}\frac{n_k}{n} g_k = \nabla f(w_t), so one round of FedSGD is equivalent to one full-batch gradient step on the global objective.
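Continuing the toy setup from the previous sketch (reusing the client_data list and the squared loss; all names remain illustrative), one FedSGD round can be written as:

```python
import numpy as np

def client_gradient(w, X_k, y_k):
    """g_k = grad F_k(w) for the toy squared loss."""
    return 2.0 / len(y_k) * X_k.T @ (X_k @ w - y_k)

def fedsgd_round(w, client_data, lr):
    """One FedSGD round: each client reports its full-batch gradient and the
    server applies the n_k/n-weighted aggregate, i.e. a full gradient step on f(w)."""
    n = sum(len(y_k) for _, y_k in client_data)
    agg = sum(len(y_k) / n * client_gradient(w, X_k, y_k)
              for X_k, y_k in client_data)
    return w - lr * agg

# e.g. w = fedsgd_round(np.zeros(2), client_data, lr=0.1)
```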

2.2 Federated Averaging Algorithm (FedAvg)

Each client first updates the local model by taking a gradient step on its own data:

w_{t+1}^{k} \leftarrow w_t - \eta \nabla F_k(w_t)

The central server then takes a weighted average of the clients' updated parameters:

w_{t+1} \leftarrow \sum_{k=1}^{K}\frac{n_k}{n} w_{t+1}^{k}

So far this is just FedSGD rewritten. The key idea of FedAvg is that each client can iterate the local update w^{k} \leftarrow w^{k} - \eta \nabla F_k(w^{k}) several times before sending its parameters to the central server for the weighted average.

The amount of computation in FedAvg is controlled by three parameters:

  • C: the fraction of clients selected to participate in each round
  • E: the number of training epochs each client runs over its local dataset in each round
  • B: the local minibatch size used for the client updates (B=∞ means the full local dataset is treated as a single minibatch)

For a client with n_{k} local data samples, the number of local parameter updates per round is u_{k}=E\,n_{k}/B.

Note: FedSGD is just a special case of FedAvg, that is, when the parameters E=1, B=∞, FedAvg is equivalent to FedSGD.
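The following minimal FedAvg sketch (still the toy linear model; names such as client_update and fedavg_round are illustrative, not the paper's reference implementation) shows the structure of the algorithm: in each round a fraction C of clients is selected, each runs E epochs of minibatch SGD with batch size B on its local data, and the server takes a data-size-weighted average of the returned weights.

```python
import numpy as np

def client_update(w, X_k, y_k, E, B, lr, rng):
    """ClientUpdate(k, w): run E epochs of minibatch SGD (batch size B) on the
    client's local data, starting from the global model w."""
    w = w.copy()
    n_k = len(y_k)
    for _ in range(E):                       # E local epochs
        idx = rng.permutation(n_k)
        for start in range(0, n_k, B):       # minibatches of size B
            b = idx[start:start + B]
            grad = 2.0 / len(b) * X_k[b].T @ (X_k[b] @ w - y_k[b])
            w -= lr * grad
    return w

def fedavg_round(w, client_data, C, E, B, lr, rng):
    """One communication round: sample a fraction C of the clients, run local
    training on each, then take the data-size-weighted average of the results."""
    m = max(int(C * len(client_data)), 1)
    selected = rng.choice(len(client_data), size=m, replace=False)
    n_sel = sum(len(client_data[k][1]) for k in selected)
    new_w = np.zeros_like(w)
    for k in selected:
        X_k, y_k = client_data[k]
        new_w += len(y_k) / n_sel * client_update(w, X_k, y_k, E, B, lr, rng)
    return new_w

# e.g. with the toy client_data from the sketch in Section 1.4:
# w, rng = np.zeros(2), np.random.default_rng(0)
# for t in range(20):
#     w = fedavg_round(w, client_data, C=0.5, E=5, B=10, lr=0.05, rng=rng)
```

With E=1 and B set to the full local dataset size, client_update performs a single full-batch gradient step and the round reduces to FedSGD, matching the note above.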
 
Schematic diagram of the relationship between FedSGD and FedAvg:

3 Experiment design and implementation

3.1 Model initialization

Experiment settings:
  • Dataset: non-overlapping IID subsets of MNIST, 600 samples per model
  • E=20; C=1; B=50; the central server aggregates once
  • The two models are trained either from the same or from different random initializations, and their parameters are combined with a mixing weight θ

Study the effect of model averaging on model performance:

        There are two cases: the two models are trained from different random initializations, or from the same initialization. A combined model is then formed by a weighted sum of the two parameter sets, controlled by the mixing weight θ.

        The results show that averaging two models trained from different initializations produces a model that is worse than either parent. In contrast, averaging two models trained from the same initialization significantly reduces the loss over the full training set, and the averaged model outperforms both parent models.

        This conclusion is important support for federated learning: in each round the server distributes the global model so that every client starts training from the same parameters, which makes the subsequent model averaging effective at reducing the training loss.
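The averaging operation itself is just a convex combination of the two parents' parameters, applied layer by layer. A minimal sketch (the dict-of-arrays model representation and all names are assumptions for illustration):

```python
import numpy as np

def interpolate_params(params_a, params_b, theta):
    """Combine two models' parameters layer-by-layer: theta*w_a + (1-theta)*w_b."""
    assert params_a.keys() == params_b.keys()
    return {name: theta * params_a[name] + (1.0 - theta) * params_b[name]
            for name in params_a}

# toy "parent models": two weight sets for the same 2-layer architecture
rng = np.random.default_rng(0)
model_a = {"W1": rng.normal(size=(784, 200)), "b1": np.zeros(200),
           "W2": rng.normal(size=(200, 10)),  "b2": np.zeros(10)}
model_b = {"W1": rng.normal(size=(784, 200)), "b1": np.zeros(200),
           "W2": rng.normal(size=(200, 10)),  "b2": np.zeros(10)}

averaged = interpolate_params(model_a, model_b, theta=0.5)
```

In the experiment θ is swept between 0 and 1; θ = 0.5 corresponds to an equal-weight average of the two parent models.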

3.2 Dataset setup

        The preliminary study uses two datasets and three model families: the first two models are used for digit recognition on the MNIST dataset, and the third is used for next-character prediction on the collected works of Shakespeare.

3.2.1 MNIST dataset

2NN: a multi-layer perceptron with two hidden layers of 200 units each and ReLU activations;

CNN: two convolutional layers with 5x5 kernels (32 and 64 channels respectively, each followed by 2x2 max pooling), then a fully connected layer with 512 units and ReLU, and a final softmax output layer;

IID: the data is randomly shuffled and distributed to 100 clients, each receiving 600 examples;

Non-IID: the data is sorted by digit label and divided into 200 shards of size 300; each client receives 2 shards, so most clients only see examples of two digits;
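A hedged sketch of these two partitioning schemes (the function names and the use of NumPy are my own; the paper only describes the procedure): the IID split shuffles all indices and deals 600 to each of 100 clients, while the pathological non-IID split sorts by label, cuts the sorted indices into 200 shards of 300, and hands each client two shards.

```python
import numpy as np

def iid_partition(labels, num_clients=100, rng=None):
    """Shuffle all example indices and deal them out evenly: 600 per client."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(labels))
    return np.array_split(idx, num_clients)

def pathological_non_iid_partition(labels, num_clients=100, shards_per_client=2, rng=None):
    """Sort indices by label, cut them into num_clients * shards_per_client
    equal shards, and assign each client `shards_per_client` random shards."""
    rng = rng or np.random.default_rng(0)
    sorted_idx = np.argsort(labels, kind="stable")
    shards = np.array_split(sorted_idx, num_clients * shards_per_client)
    order = rng.permutation(len(shards))
    return [np.concatenate([shards[order[c * shards_per_client + s]]
                            for s in range(shards_per_client)])
            for c in range(num_clients)]

# usage with the 60,000 MNIST training labels (shape (60000,), values 0-9):
# clients_iid  = iid_partition(train_labels)
# clients_skew = pathological_non_iid_partition(train_labels)
```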

3.2.2 Shakespeare dataset

LSTM: the input characters are embedded into a learned 8-dimensional space and processed by two stacked LSTM layers with 256 units each; the output of the second LSTM layer feeds a softmax output layer with one unit per character. The model is trained on unrolled sequences of 80 characters;

Unbalanced Non-IID: each speaking role in the plays forms one client, for a total of 1146 clients;

Balanced IID: the same data is split evenly and randomly across the 1146 clients;
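For reference, a PyTorch sketch of an architecture matching this description (8-dimensional character embedding, two 256-unit LSTM layers, a per-character softmax output, unroll length 80); the class name and the vocabulary size of 86 are assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level LSTM: embed -> 2-layer LSTM(256) -> per-character logits."""
    def __init__(self, vocab_size=86, embed_dim=8, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # softmax applied in the loss

    def forward(self, x):          # x: (batch, 80) character ids
        h, _ = self.lstm(self.embed(x))
        return self.out(h)         # (batch, 80, vocab_size) logits

model = CharLSTM()
logits = model(torch.randint(0, 86, (4, 80)))   # unroll length of 80 characters
```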

3.3 Experimental optimization

        When optimizing over data stored in a data center, communication cost is relatively small and computation dominates. In federated optimization the balance is reversed: any single device holds only a small amount of data, and modern mobile devices have relatively fast processors, so communication cost is the main concern. We therefore want to spend additional local computation to reduce the number of communication rounds needed to train the model. There are two main ways to do this: increase parallelism, and increase the amount of computation performed by each client.

3.3.1 Increasing parallelism

The parameter E is fixed, and C and B are discussed.

  • When B=∞, increasing the fraction of participating clients brings only a small improvement;
  • When B=10, there is a significant improvement, especially in the non-IID case;
  • With B=10, convergence speeds up markedly once C≥0.1, but increasing the number of participating clients further brings diminishing returns.

3.3.2 Increasing client computation

The amount of computation per client can be increased by decreasing B or by increasing E; the short calculation after the list below makes this bookkeeping concrete.

  • Adding more local SGD updates per round can dramatically reduce communication cost;
  • For the unbalanced non-IID Shakespeare data, the number of communication rounds falls by an even larger factor. One speculation is that some clients (roles) have relatively large local datasets, which makes the additional local training more valuable;
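To make the bookkeeping u_k = E·n_k/B from Section 2.2 concrete, here is a tiny helper (illustrative, not from the paper) that counts local gradient steps per round for a client with n_k examples:

```python
def local_updates_per_round(n_k, E, B):
    """u_k = E * n_k / B: local gradient steps a client performs per round
    (B = float('inf') means full-batch, i.e. one step per epoch)."""
    steps_per_epoch = 1 if B == float("inf") else -(-n_k // B)   # ceil(n_k / B)
    return E * steps_per_epoch

# a client holding 600 MNIST examples:
print(local_updates_per_round(600, E=1, B=float("inf")))   # 1   -> FedSGD
print(local_updates_per_round(600, E=5, B=10))             # 300 -> FedAvg with more local work
```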

 The above experimental results are displayed in the form of a line graph, where the blue line represents the result of federated stochastic gradient descent:

  • Compared with FedSGD, FedAvg not only reduces the number of communication rounds but also reaches higher test accuracy. The authors speculate that model averaging has a regularizing effect similar to dropout;

3.4 Over-optimizing on client datasets

        With E=5 and E=25, a very large number of local updates can cause the training loss of federated averaging to plateau or even diverge. In practice, therefore, for some models it helps convergence to reduce the amount of local computation per round (lower E or raise B) in the later stages of training.

3.5 CIFAR experiment

The experiments on the CIFAR-10 dataset use the model from the TensorFlow tutorial, which consists of two convolutional layers, two fully connected layers and a final linear transformation layer, with roughly 10^6 parameters. The table below gives the number of communication rounds needed by baseline SGD, FedSGD, and FedAvg to reach three different test-accuracy targets.
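As a point of reference, a hedged PyTorch sketch of a model matching this description; the 64-channel conv layers, the 384- and 192-unit fully connected layers, and the 24x24 input crops follow the TensorFlow CIFAR-10 tutorial, the tutorial's local response normalization layers are omitted, and none of this is the paper's actual code.

```python
import torch
import torch.nn as nn

class CifarNet(nn.Module):
    """Roughly the TensorFlow-tutorial CIFAR model described above: two 5x5 conv
    layers, two fully connected layers and a final linear layer over 10 classes,
    about 1e6 parameters when applied to 24x24 crops."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 384), nn.ReLU(),
            nn.Linear(384, 192), nn.ReLU(),
            nn.Linear(192, num_classes),
        )

    def forward(self, x):          # x: (batch, 3, 24, 24) random crops
        return self.classifier(self.features(x))

print(sum(p.numel() for p in CifarNet().parameters()))  # roughly 1e6 parameters
```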

Curves of FedSGD and FedAvg at different learning rates:

3.6 Large-Scale LSTM Experiments

 To demonstrate the effectiveness of the method on a problem of practical scale, the authors run a large-scale next-word prediction task.

The training set contains 10 million public posts from a large social network. Posts are grouped by author, giving over 500,000 clients, and each client's dataset is capped at 5,000 words. The model is a 256-node LSTM with a 10,000-word vocabulary. The input and output embeddings for each word are 192-dimensional and are trained jointly with the model; there are 4,950,544 parameters in total, and an unroll of 10 words is used.

Optimal learning rate curves for federated averaging and federated stochastic gradient descent:

  • For the same target accuracy, FedAvg needs far fewer communication rounds than FedSGD, and the variance of its test accuracy is smaller;
  • E=1 performs better than E=5 in this setting;

4 Summary and outlook

         Our experiments show that federated learning is practical: FedAvg trains high-quality models in relatively few communication rounds, as demonstrated on a variety of model architectures: a multi-layer perceptron, two different convolutional networks, a two-layer character LSTM, and a large-scale word-level LSTM. While federated learning offers many practical privacy benefits, providing stronger guarantees through differential privacy, secure multi-party computation, or a combination of the two is an interesting direction for future work. Note that both classes of techniques apply most naturally to synchronous algorithms such as FedAvg.

