Survey of Medical Applications of Federated Learning
Abstract
Objectives
Medical artificial intelligence (AI) has recently attracted considerable attention. However, training medical AI models is challenging due to privacy-protection regulations. Among the proposed solutions, federated learning (FL) stands out. FL involves transmitting only model parameters without sharing the original data, making it particularly suitable for the medical field, where data privacy is paramount. This study reviews the application of FL in the medical domain.
Methods
We conducted a literature search using the keywords "federated learning" in combination with "medical," "healthcare," or "clinical" on Google Scholar and PubMed. After reviewing titles and abstracts, 58 papers were selected for analysis. These FL studies were categorized based on the types of data used, the target disease, the use of open datasets, the local model of FL, and the neural network model. We also examined issues related to heterogeneity and security.
Results
In the investigated FL studies, the most commonly used data type was image data, and the most studied target diseases were cancer and COVID-19. The majority of studies utilized open datasets. Furthermore, 72% of the FL articles addressed heterogeneity issues, while 50% discussed security concerns.
Conclusions
FL in the medical domain appears to be in its early stages, with most research using open data and focusing on specific data types and diseases for performance verification purposes. Nonetheless, medical FL research is anticipated to be increasingly applied and to become a vital component of multi-institutional research.
I. Introduction
Artificial intelligence (AI) has recently emerged as a promising tool in medical research and applications [1,2]. The performance of AI, particularly in machine learning and deep learning, improves and stabilizes with access to large datasets. Consequently, researchers have been motivated to amass substantial amounts of data, often referred to as big data. However, traditional AI models necessitate centralized data repositories, which pose significant concerns for the protection of sensitive medical information. In response, privacy-protection regulations such as the General Data Protection Regulation (GDPR) in the European Union [3], the Health Insurance Portability and Accountability Act (HIPAA) in the United States [4], and the Personal Information Protection Act in Korea [5] have been enacted to secure personal data, including medical and healthcare records. As a result, medical AI development must adhere to these regulations, requiring researchers to implement appropriate privacy-preserving methods.
Several methods for protecting privacy have been proposed, including de-identification techniques such as differential privacy [6,7], the generation of synthetic data [8–10], homomorphic encryption [11–13], and federated learning (FL) [14]. Our focus is on the privacy-preserving attributes of FL. The process of FL is as follows: each client independently trains a model on their local data, ensuring that individual data remains secure and is not exposed externally. The clients then transmit their model parameters to a central server. This server aggregates the parameters received from all clients to create a global model. Once the central server distributes the global model back to the clients, they perform additional local training using this model. The updated parameters are then sent back to the central server, which uses them to develop the subsequent iteration of the global model.
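The training loop described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in any reviewed study: it assumes a one-parameter linear model, synthetic client data, and a sample-size-weighted average on the server in the style of FedAvg [14]. All function names, data, and hyperparameters are hypothetical.

```python
import random

def local_train(w, data, lr=0.05, epochs=5):
    """Client step: a few epochs of SGD on private (x, y) pairs for the model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
            w -= lr * grad
    return w

def aggregate(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

def federated_training(clients, rounds=20, w0=0.0):
    """Alternate local training and server aggregation for a fixed number of rounds."""
    w_global = w0
    for _ in range(rounds):
        # Each client starts from the current global model; only parameters leave the client.
        local = [local_train(w_global, data) for data in clients]
        w_global = aggregate(local, [len(d) for d in clients])
    return w_global

rng = random.Random(0)

def make_client(n):
    """Hypothetical site data: y = 3x plus small noise (an assumption for illustration)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.01)) for x in xs]

clients = [make_client(20), make_client(50), make_client(30)]
w_final = federated_training(clients)
print(w_final)
```

In this sketch the raw (x, y) pairs never leave `local_train`; the server only sees one number per client per round, which is the property that makes FL attractive for hospital data.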
The fact that FL utilizes local client resources without the need for centralized resources and data has attracted interest as a privacy-preserving alternative. This is particularly relevant for hospitals, which often hold clinical data but are hesitant to share or expose the data due to privacy concerns. Originally, FL was proposed to leverage the unused resources of handheld devices. The primary distinction between FL in institutional settings, such as hospitals (referred to as on-site FL), and FL that taps into the resources of handheld devices (known as on-device FL), lies in the number of clients. On-device FL typically involves a significantly larger number of clients than on-site FL, where the clients often form a data silo and number in the single to double digits. Since medical data are housed within a hospital, FL involving medical data typically takes the form of on-site FL.
Review papers on FL in the medical domain have been published [15–19]; however, these studies have only introduced a limited number of examples of medical FL research. Our study differs from existing FL reviews by concentrating on specific instances of medical FL research. Additionally, we have organized the selected FL papers according to (1) the types of data utilized, (2) the targeted disease, (3) the use of an open dataset, (4) the local model of FL, and (5) the neural network model employed. This study categorizes and analyzes current medical FL research to provide insights into areas that have been well-explored and those that remain underexplored.
II. Methods
A literature search was conducted using the keywords "federated learning" combined with "medical," "healthcare," or "clinical" on Google Scholar and PubMed. This search took place in September 2022 and was not restricted by publication year. Initially, the search yielded 129,000 papers on Google Scholar and 173 on PubMed. We arbitrarily chose to review the first 400 articles listed on Google Scholar, which is more than double the number of articles found on PubMed. From this initial set of 400 papers, we carefully selected 58 papers by applying specific exclusion criteria. These criteria excluded papers that only presented methodology without using medical data, papers that were inaccessible or could not be downloaded, and studies that were duplicates.
To extract insights from the papers we reviewed, we organized the studies according to several criteria: (1) the types of data utilized, including images, free text, signals, and laboratory data; (2) the disease of interest; (3) the employment of open datasets; (4) the type of local model applied in FL, distinguishing between machine learning and neural network models; and (5) the implementation of neural network models within the context of FL.
According to previous literature reviews on FL [15–17,20], heterogeneity and security concerns were frequently discussed. Therefore, we explored the extent to which studies addressed these issues. We also examined whether any of the papers proposed countermeasures to mitigate these concerns.
III. Results
According to our literature survey, the first article [14] on FL was published in 2016 and the first article [21] on the medical applications of FL was published in 2019. In 2019, three articles were published, followed by 11 in 2020, 29 in 2021, and 15 in 2022. This trend demonstrates a growing interest in FL research within the medical field. Table 1 provides a summary of these studies, categorizing them by their target diseases and the types of data utilized [21–78].
1. Data Types
The most frequently utilized data type in the reviewed studies was image data, represented in 36 studies. This was followed by laboratory data in nine studies, free text and signals, each in six studies, mobile health data in a single study, and genomic data also in one study. Among the 58 studies reviewed, only one combined two types of data—image and laboratory data [35].
Among the 36 studies that used image data, the distribution was as follows: 24 studies used radiology images, six studies utilized pathology images, three studies focused on skin images, two studies examined ultrasound images, and two studies involved other types of images, including fundus and surgical images. Notably, one study utilized two types of images: radiology and ultrasound [43]. Of the 24 studies involving radiology images, chest radiology images were the most common, with 10 studies [26,35–43] examining chest X-rays and chest computed tomography (CT) scans. Six studies investigated nervous system radiology images, delving into topics such as autism spectrum disorder using functional magnetic resonance imaging (MRI) [49], brain age prediction [50], brain tumor segmentation [21], multiple sclerosis lesion segmentation [51], brain tumor MRI [22], and glioblastoma using multiparametric MRI [27]. Additionally, three studies were dedicated to radiology image reconstruction, with two focusing on MRI [44,45] and one on CT [46]. The remaining five studies explored a variety of radiology images, including mammography [24,47], prostate MRI [23], cardiac MRI [52], and pancreatic CT [25]. Pathology images were the focus of six studies, which included applications of differential privacy to pathological images [28], use of the open datasets Camelyon16 and Camelyon17 [29], analysis of gigapixel whole-slide images [30], brain pathology segmentation [31], colorectal cancer data analysis [32], and examination of tumor-infiltrating lymphocytes in whole-slide images [33]. Three studies focused on skin images, tackling issues such as skin disease detection using the Dermatology Atlas dataset [53,54] and melanoma detection with the dermoscopic skin lesion image dataset [56]. Two studies involved ultrasound images [34,43], and the final two studies used other image types, specifically fundus [48] and surgical images [55].
The use of free text in studies was primarily associated with natural language processing, as evidenced by six studies. These investigations encompassed a range of applications: a violence risk assessment [57], benchmarking bidirectional encoder representations from transformers (BERT) models [58], a named entity recognition task [59], detecting adverse events related to vaccines [60], developing a medical relation extraction model [61], and creating a deep learning-based personalized clinical decision support system [62].
Signal data primarily consisted of time-series data obtained from medical devices, and six case studies used this type of data. Research on FL using signal data has largely concentrated on disease research involving heart-activity data or the development of health-monitoring systems. The identified objectives for FL research using signal data included predicting the severity of major depressive disorder based on heart rate variability [63], detecting arrhythmias through electrocardiography [64], automatically detecting stress using heart-activity signals [65], implementing wearable healthcare solutions [66], monitoring health at home [67], and developing health-monitoring systems that employ wearable sensing devices [68].
Furthermore, nine studies utilized laboratory data. These studies focused on predicting various outcomes, including disease risks from electronic health record systems [69], adverse drug reactions [70], mortality within 7 days of hospitalization in COVID-19 patients [71], acute kidney injury within three and seven days of admission [72], patient mortality and length of stay in the intensive care unit [74], and intensive care unit mortality using the MIMIC-III benchmark database [76]. Additionally, they involved evaluating FL with existing datasets [73] and assessing the performance of FL on two typical electronic health record machine learning tasks [75].
Data that did not fall into the categories of image, free text, signals, or laboratory data were categorized as “other.” Two studies that belonged to the “other” category were depression detection from mobile data [77] and prediction of disease from genomic data [78].
2. Target Diseases
The most common target disease was cancer (17 studies, one of which [34] involved thyroid nodules) [21–34,56,69,73]. The second most common target disease was COVID-19 (12 studies) due to the recent worldwide COVID-19 pandemic [35–43,71–73]. The remaining 29 studies did not include specific target diseases.
3. Use of Open Datasets
We evaluated whether the data utilized in the studies were open or private [21–78] (Table 2), as the source of the data is important in medical research. Overall, 37 studies used open data, 11 used private data, and 10 used a combination. As can be seen, open datasets were primarily used.
4. Local Models Used in FL
We investigated whether the local models used for FL research were neural networks or conventional machine learning models [21–78] (Table 2). In total, 53 studies used neural networks as the local models, one used machine learning as the local model, and four used a combination of both. Thus, the majority of the investigated FL studies chose neural networks as the local models.
5. Utilization of Neural Network Models in FL
Among the studies we evaluated, neural network algorithms were employed in most cases [21–77,79,80], with only one study being the exception [78] (Table 3). Of these studies, 35 utilized convolutional neural networks (CNNs) as their primary model [81]. CNNs were predominantly used in research involving image and signal data [63,64,66]. In contrast, studies that focused on free text primarily implemented recurrent neural networks, including long short-term memory networks [82], and some incorporated the more recent BERT [83]. For laboratory test data, which typically has lower complexity compared to image or signal data, simpler neural network architectures were favored, such as shallow networks with one or two hidden layers, or multilayer perceptron models.
Regarding optimization methods, the Adam optimizer [84] emerged as the most commonly used, being adopted in 35 of the 57 studies. Stochastic gradient descent (SGD) [85], including its mini-batch variant, was the second most utilized method, serving as the optimization technique in 13 studies. Notably, two studies did not explicitly specify the optimization method employed [38,62].
6. Commonly Mentioned Issues
Although many algorithms employed in FL assume independent and identically distributed (IID) data, real-world data often deviate from this assumption, being non-IID. FedAvg [14], a prominent algorithm in FL, demonstrates slow convergence and suboptimal accuracy when dealing with non-IID data. Since the data characteristics vary across clients, the performance of the global model suffers. This problem is known as the heterogeneity issue.
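To make the non-IID setting concrete, the sketch below simulates label skew across clients. It assumes a Dirichlet-based partition, a device often used in FL benchmarking to generate heterogeneous splits; it is an illustrative choice here, not a method attributed to the reviewed studies. The function name, the `alpha` value, and the toy labels are hypothetical. Smaller `alpha` tends to give each client a more lopsided class mix.

```python
import random
from collections import Counter

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with label skew controlled by alpha."""
    rng = random.Random(seed)
    parts = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Draw a Dirichlet sample by normalizing independent Gamma draws.
        g = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(g)
        props = [v / total for v in g]
        # Convert the proportions into cut points over this class's samples.
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += props[c]
            end = len(idx) if c == num_clients - 1 else int(round(cum * len(idx)))
            parts[c].extend(idx[start:end])
            start = end
    return parts

# Toy example: two classes, three simulated hospitals.
labels = [0] * 100 + [1] * 100
parts = dirichlet_partition(labels, num_clients=3, alpha=0.3)
for client_id, indices in enumerate(parts):
    print(client_id, Counter(labels[i] for i in indices))
```

Every sample is assigned to exactly one client, so the union of the client datasets is the original dataset; only the per-client class proportions differ.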
FL offers advantages in terms of privacy; however, it is still susceptible to security attacks. The most concerning attack in the medical field is the inference attack, which aims to deduce sensitive information from the learning data. Research on inference attacks in FL includes a study on the use of generative adversarial networks for such attacks [86], an examination of inference attacks in vertical FL [87], an analysis of membership inference attacks that could lead to privacy breaches [88], and an investigation into source inference attacks that can extract more information than traditional inference methods [89]. A poisoning attack compromises the performance of FL by reducing the accuracy of the global model through malicious updates. Various studies have explored poisoning attacks, including those on model poisoning that aim to cause misclassification [90,91], research on data poisoning where malicious participants submit updates from incorrectly labeled data [92], a study on FL poisoning attacks utilizing generative adversarial networks [93], and an examination of FL's vulnerability to Sybil-based poisoning attacks [94]. Collectively, these security threats [95] to FL are referred to as "security issues."
Among the 58 articles, 42 (72%) mentioned non-IID data or heterogeneity issues, and 29 (50%) noted security issues. In addition, 22 articles (38%) pointed out both issues, whereas 10 articles (17%) mentioned neither issue.
1) Countermeasures against the heterogeneity issue
To address the issue of heterogeneity, researchers have proposed algorithms that perform well with non-IID data. Two notable algorithms are as follows. Li et al. [96] introduced FedProx, which enhances stability in heterogeneous environments by incorporating a proximal term. Karimireddy et al. [97] identified that data heterogeneity can cause client drift, leading to a decline in FL performance. To counteract this, they developed the SCAFFOLD algorithm, which corrects client drift and has been shown to be at least as efficient as SGD. Li et al. [98] evaluated the accuracy and communication efficiency of several leading FL algorithms, including FedAvg, FedProx, SCAFFOLD, and FedNova, across a range of non-IID scenarios. Their experiments indicated that no single algorithm consistently outperformed the others under the various non-IID conditions. This issue of heterogeneity has also been noted in the widely studied context of handheld device-based FL [96]. However, FL in hospitals (on-site FL) involves a significantly smaller number of clients, ranging from single to double digits, which can make the model more susceptible to bias and exacerbate heterogeneity issues. Consequently, these issues are more pronounced in hospital FL, necessitating the development of specific countermeasures. Despite extensive research aimed at enhancing the performance of global models in non-IID situations, an approach that is both cost-effective and universally effective in all non-IID contexts has yet to be discovered [98].
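The proximal term used by FedProx can be illustrated with a small sketch. Assuming the local objective takes the form F_k(w) + (mu / 2) * ||w - w_global||^2, a local gradient step simply adds mu * (w - w_global) to the loss gradient. The code below is a hypothetical one-dimensional example with a made-up quadratic client loss; the function name and all numeric values are illustrative only.

```python
def fedprox_step(w, grad_loss, w_global, mu, lr):
    """One gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2."""
    return [wi - lr * (g + mu * (wi - wg))
            for wi, g, wg in zip(w, grad_loss, w_global)]

def run_local(mu, steps=200, lr=0.1):
    """Hypothetical client whose loss (w - 5)^2 pulls away from the global model at 0."""
    w_global = [0.0]
    w = list(w_global)
    for _ in range(steps):
        grad = [2.0 * (w[0] - 5.0)]  # gradient of the client's own loss
        w = fedprox_step(w, grad, w_global, mu, lr)
    return w[0]

w_plain = run_local(mu=0.0)  # no proximal term: behaves like plain local SGD
w_prox = run_local(mu=2.0)   # proximal term keeps the client closer to the global model
print(w_plain, w_prox)
```

With mu = 0 the client converges to its own optimum, whereas with mu > 0 the local solution is held between the client optimum and the global model. This is the mechanism by which the proximal term limits how far local updates drift under heterogeneous data.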
2) Countermeasures against security issues
In medical FL, the use of patient data necessitates stringent security and privacy protections. To protect against security threats, measures such as differential privacy and homomorphic encryption can be implemented. Our review of FL studies revealed that 11 papers [21,24,28,30,35,48,49,58,60,73,78] employed differential privacy as a security measure, while four papers [36,48,66,67] utilized homomorphic encryption. Differential privacy emerged as the most commonly adopted security measure. It can be easily applied by adding Gaussian noise [6,24,30,48]. In contrast, homomorphic encryption is more challenging to implement than differential privacy and incurs additional computational costs [11,99,100]. Consequently, differential privacy has been more frequently adopted than homomorphic encryption in medical FL research.
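As a rough illustration of how Gaussian noise can be added to a client update, the sketch below clips the update to a fixed L2 norm and then perturbs each coordinate. This is a simplified sketch under stated assumptions: the function name, `clip_norm`, and `noise_multiplier` are hypothetical, and the code does not compute a privacy budget. The actual differential privacy guarantee depends on how the noise is calibrated and accounted for over training rounds.

```python
import math
import random

def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm <= clip_norm, then add Gaussian noise per coordinate."""
    norm = math.sqrt(sum(u * u for u in update))
    # Scale down only if the update exceeds the clipping norm.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    return [u * scale + rng.gauss(0.0, sigma) for u in update]

rng = random.Random(0)
raw_update = [3.0, 4.0]  # hypothetical parameter update with L2 norm 5
private_update = clip_and_noise(raw_update, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(private_update)
```

Clipping bounds the influence of any single client update, and the noise masks what remains. A larger `noise_multiplier` gives stronger masking at the cost of model accuracy.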
IV. Discussion
In this survey, studies within the medical domain that utilized FL were reviewed. The selected FL papers were categorized based on the following criteria: (1) the types of data used, (2) the target disease, (3) the use of an open dataset, (4) the local model of FL, and (5) the employment of neural network models in FL.
Most studies used image data, while relatively few studies utilized free text, signal, and laboratory data. In the broader context of medical research, free text, signals, and laboratory data are frequently used; however, these data types appear to be underrepresented in the field of medical FL research. Cancer and COVID-19 emerged as the most frequently studied diseases in medical FL. In contrast, there have been relatively few FL studies focusing on cardiovascular diseases [101,102] and neurological disorders [103,104], such as Alzheimer's disease, epilepsy, Parkinson’s disease, and schizophrenia, despite the active research efforts in these areas. Upon examining the data types and target diseases within medical FL research, a pattern of high research frequency for certain data types and diseases becomes evident. It is noteworthy that among the data types commonly used in medical research and the diseases that are the focus of active study, there are instances where FL is less frequently applied. This observation suggests that FL has the potential to be leveraged across a diverse range of data types and for the study of various diseases.
We also investigated whether the datasets used were open or private. Most studies utilized open datasets, while a smaller number relied on proprietary data. FL appears to be in its nascent phase, with open datasets predominantly used for initial testing purposes, such as performance validation. However, as the field matures and the volume of research utilizing authentic medical data grows, the utilization of proprietary data is expected to rise accordingly.
Most local models for medical FL research were neural networks, while very few were machine learning models. Considering that certain types of medical data, such as laboratory results, are captured in tabular formats that exhibit low data complexity [69–78], there is a need for FL research that utilizes machine learning. Machine learning models typically have lower complexity than neural network models and could be more suitable for these types of data.
We investigated neural network models and optimization methods. CNNs, the most widely utilized type of deep learning model, were employed in 35 out of 57 studies. The prevalent use of CNNs is likely due to the fact that image data were the most common type of data in these studies. Additionally, CNNs have been applied to the analysis of signal data [63,64,66]. The optimization method most frequently used was Adam, which was adopted in 35 studies. The application of Adam optimization was not limited to any particular data type; rather, it was employed across a broad range of data types. SGD was the optimization method used in 13 studies. Similar to Adam optimization, SGD was not predominantly used for any specific data types.
Moreover, we investigated the heterogeneity and security issues, which have been examined in many previous review papers. We found that although many algorithms have been proposed to address the issue of heterogeneity, there is still no low-cost, universally effective solution for all non-IID scenarios [98]. Given that FL in hospitals is a form of cross-silo FL with a limited number of clients, heterogeneity issues are more pronounced, necessitating further research.
Furthermore, we identified numerous security threats within FL, and measures such as differential privacy and homomorphic encryption have been proposed to mitigate these risks. Specifically, medical FL involves the use of patient data, which necessitates robust privacy and security safeguards. This makes it necessary to implement security enhancement measures, including differential privacy, to protect this sensitive information.
Currently, FL in the medical field is in its early stages, with a significant amount of research focusing on specific data types, such as imaging data, and particular diseases, such as cancer and COVID-19. As the field evolves, it is anticipated that FL will be applied to a broader range of data types and disease research. While many studies at present concentrate on open data, it is expected that the utilization of private data in research will grow in the future. Most FL local models in use today are based on neural networks. However, given the existence of tabular medical data, such as laboratory results, there is a potential for increased research into machine learning models, which typically have simpler structures than neural network models, for use as FL local models. As a result, medical FL research is poised to be actively pursued and is likely to become a critical component of collaborative research across multiple institutions.
Acknowledgments
This work was supported by the Bio-Industrial Technology Development Program (No. 20014841), funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.