European Nuclear Medicine Guide
Chapter 20

Artificial Intelligence, Machine Learning, Deep Learning and Radiomics

1. Introduction to Artificial Intelligence

The key to optimal patient care in routine clinical settings lies in quick, fact-based diagnoses and in treatment planning built on reliable decision-making processes. This has held true for centuries; however, the amount of patient data available and being generated has changed in recent years due to ongoing hospital digitalization and technological breakthroughs such as improved imaging devices, more accurate blood analysis techniques, and efficient data analysis approaches. In particular, so-called in silico methods under the overarching umbrella of Artificial Intelligence (AI) hold great potential for medicine [204].

Nowadays, patients can be monitored and their disease outcome predicted in real time according to their individual risk of therapy failure or success, based solely on electronic health records (EHR), molecular data, or medical images that would otherwise be interpreted only through the experience and intuition of the clinician [205]. The automated and reliable clustering of patients to find an optimal treatment or matching clinical studies is also no longer a distant futuristic concept [206].

1.1 A primer to understand AI approaches

Over the last decade, we have been observing rapid and ongoing progress in the development of AI, which can be defined as the independent learning of a computational entity based on available information of any kind. The amount of information obtained from patients or pre-clinical studies is too complex, vast, and heterogeneous to be comprehensively interpreted by humans without technological support [207]. With the dawn of AI, which encompasses the concepts of Machine Learning (ML) and Deep Learning (DL), a specific type of ML, the supportive analysis of high-dimensional, patient-specific information could enable clinicians to improve their diagnostic, prognostic, and therapeutic decisions.

Numerous AI architectures have been developed and are used to classify, impute, predict, and cluster datasets based on so-called features. Such features can include relevant patient-specific information such as medical traits and clinical measurements like blood test parameters, smartwatch sensor data, conventional imaging, or hybrid imaging data such as SPECT/CT or PET/MRI, which open the door to multi-parametric assessment of diseases [208]. One term that is used frequently is Radiomics, which refers to the high-throughput extraction of quantitative features from images to build diagnostic or predictive models through ML or DL [15]. The main difference between ML and DL approaches lies in how the features are obtained. In ML, handcrafted (or engineered) features are extracted and then selected through ML algorithms or by manual pre-selection or labelling, often performed by a domain expert. As a result, ML has the advantage of a rational understanding of the importance of each individual feature and of the interpretability of the resulting models (usually a combination of selected handcrafted features). However, the selection of features may be suboptimal, because it may introduce a certain bias. In contrast, DL does not rely on handcrafted feature extraction and selection or precise labelling, but rather utilizes internal nodes representing learned features that are automatically determined and weighted according to their importance for a certain decision based on the given data. This approach can lead to complex and robust models with less bias and less dependence on prior expert knowledge [209], but it is also more dependent on large amounts of high-quality training data and less explainable, as the resulting networks may contain millions of parameters.
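
To make the distinction concrete, the following minimal Python sketch illustrates the handcrafted-feature route: radiomic features are explicitly extracted (here with PyRadiomics) and then fed to a classical ML classifier. The file names and labels are hypothetical placeholders; a DL model would instead learn its features directly from the voxel data.

```python
# Minimal sketch of the handcrafted-feature (radiomics + ML) route.
# File names and labels are hypothetical placeholders.
from radiomics import featureextractor
from sklearn.ensemble import RandomForestClassifier
import numpy as np

extractor = featureextractor.RadiomicsFeatureExtractor()

feature_rows, labels = [], []
for image_path, mask_path, label in [
    ("patient01_pet.nii.gz", "patient01_mask.nii.gz", 0),
    ("patient02_pet.nii.gz", "patient02_mask.nii.gz", 1),
]:
    result = extractor.execute(image_path, mask_path)
    # Keep only the numeric feature values (skip diagnostic metadata).
    row = [v for k, v in result.items() if k.startswith("original_")]
    feature_rows.append(row)
    labels.append(label)

# A classical ML classifier operating on the engineered features;
# in a DL pipeline, this explicit extraction step would not exist.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.array(feature_rows, dtype=float), labels)
```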

The technical implementation of an AI algorithm that utilizes features for prediction is called a model. The underlying mathematics of such a model seeks combinations of linear and non-linear decision boundaries or patterns in a given set of information to separate individual data points, such as patients or diseases. The model either classifies data points into distinct groups or clusters similar objects together.

An AI analysis is supervised if the training label (e.g., disease, treatment type, patient outcome, etc.) is provided to the algorithm, e.g., in the case of retrospective data. Supervised learning algorithms differ in the way they compute the decision boundary, which maps an input data point to a specific class based on the previously learned pattern. An overview of these approaches can be found in [210]. The most common state-of-the-art supervised learning algorithms are Random Forest, Support Vector Machine, K-Nearest Neighbour, Logistic Regression, and boosting methods such as Gradient Boosting or AdaBoost; most of these have been used for radiomics [211].
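
As a hedged illustration of how such algorithms can be compared in practice, the following Python sketch (using scikit-learn, with random toy data standing in for real features) evaluates several of the classifiers named above with cross-validated AUC.

```python
# Sketch: comparing common supervised classifiers on a feature matrix X
# with binary labels y (random toy data stands in for real features).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 patients, 20 features
y = rng.integers(0, 2, size=100)        # binary outcome labels

models = {
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.2f}")
```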

Unsupervised learning is applied to learn from unlabelled data, in which each data point consists only of the features and the true labels are unknown. This is mainly utilized by clustering methods, which try to find common patterns to group the data points or patients into clusters based on their feature similarities. Clustering methods differ in how groups are determined and in the similarity measure used. The most commonly used clustering approaches include hierarchical clustering and k-means clustering [212]. After clustering patients into groups, it is possible to use these newly identified groups as labels in order to apply the aforementioned supervised approaches. Further methods belonging to the domain of unsupervised learning include dimensionality reduction techniques such as principal component analysis (PCA), uniform manifold approximation and projection for dimension reduction (UMAP), and t-distributed stochastic neighbour embedding (t-SNE) [213]. These techniques are less frequently exploited in radiomic studies. Deep learning-based approaches to unsupervised learning include autoencoders and generative adversarial networks, which can be used for dimensionality reduction, denoising, standardization, and data augmentation [214–216].
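
The following minimal scikit-learn sketch, again on toy data, illustrates both ideas: k-means clustering to discover groups and PCA to reduce dimensionality.

```python
# Sketch: unsupervised clustering and dimensionality reduction on a
# feature matrix X (toy data stands in for real radiomic features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# Reduce 50 features to 2 principal components, e.g. for visualization.
X_2d = PCA(n_components=2).fit_transform(X)

# Group patients into 3 clusters based on feature similarity.
cluster_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# The discovered cluster labels can now serve as targets for the
# supervised approaches described above.
print(cluster_labels[:10])
```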

1.2 Transferring AI approaches to nuclear medicine

Despite being quantitative by nature, nuclear medicine images are, in most clinical publications, clinical trials, and obviously in routine clinical practice, exploited in a very restrictive manner (i.e., analysed mostly visually or semi-quantitatively) [217]. Where radiologists or nuclear medicine physicians mostly rely on the recognition of a handful of semantic features, e.g., to detect and describe tumours, thousands of agnostic features can potentially be extracted from medical images [205], including some not even visible to an expertly trained eye [218]. This complexity within medical images, extending beyond the scope of the human brain, is amenable to analysis by ML and DL approaches that may reveal the additional information the images hold. The field of Radiomics has progressed from a direct selection of predefined features that can be used alone or in combination as inputs to ML classifiers, to obtaining indirectly learned features without a priori definition using DL data-driven methodology [219].

2. The Pitfalls of AI in Nuclear Medicine

Several challenges make the successful application of AI to functional as well as hybrid studies in clinical routine difficult. The common denominator in most AI applications is high-quality data. Given the wide range of hybrid imaging systems and their specific characteristics, such studies are particularly prone to the “garbage-in, garbage-out” issue. It is therefore important to generate the data according to standardized guidelines and to understand the origin of the data at hand, especially when data come from multiple sites. The sections below address some of the pitfalls that AI scientists should be aware of.

2.1 Data management

Good Clinical Practice (GCP) as well as Good Laboratory Practice (GLP) guidelines define how standardized clinical processes and high-level medical research shall be conducted to achieve high-quality, trustworthy, and reusable data [28]. Nevertheless, it is difficult to ascertain how, and to what extent, such guidelines have been followed by particular research groups. While high-impact journals require certain processing steps to be reported in line with the above guidelines, practices around the publication of research data vary. Many journals do not make data publication mandatory; even where publication of data is mandatory, there is no guarantee that it was properly peer reviewed [220]. This phenomenon renders most AI-related medical studies challenging for other research groups to reproduce on their own datasets. Recommendations are currently being produced to make data in such investigations FAIR (i.e., findable, accessible, interoperable, and reusable) [221].

2.2 Properties of imaging data

Typical properties of imaging data further complicate their successful analysis with AI. Because various imaging and clinical protocols change over time even within a single centre, a retrospective patient cohort may contain missing, inhomogeneous, or unstructured data records. Clinicians who supply the data for AI analysis are therefore often required to delete incomplete cases from the collected database, which may dramatically reduce the amount of exploitable data to a level insufficient for training complex AI algorithms [222]. Furthermore, data in the field are often imbalanced, as, e.g., subtypes of diseases or adverse outcome events are typically not represented with the same frequency in a given patient population [223]. Imbalanced data are one of the main reasons why AI-established predictive models may perform poorly on a minority disease subgroup [224], especially if adequate imbalance management approaches were not applied [225]. The imbalanced nature of data subgroups is particularly true for tumours, where hybrid imaging plays a prominent role in the detection and characterization phases [226]. Data may also vary between centres and over time, e.g., due to different metabolic processes of the human body presented in PET images, especially if patients have undergone different treatments prior to imaging [227].
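
As one hedged illustration of imbalance management, the sketch below shows two common options, cost-sensitive class weighting in scikit-learn and synthetic minority oversampling (SMOTE), assuming the imbalanced-learn package is available; the toy data stand in for a real cohort with a rare disease subgroup.

```python
# Sketch: two common ways to manage class imbalance, assuming
# scikit-learn and imbalanced-learn are installed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.array([0] * 180 + [1] * 20)      # 10% minority subgroup

# Option 1: cost-sensitive learning via class weights.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: oversample the minority class with synthetic examples (SMOTE).
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))               # classes are now balanced
```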

2.3 Data access in-house

Novel AI solutions often require large amounts of standardized and curated retrospective data. Nevertheless, old in-house data might have been acquired with suboptimal devices and protocols or only partially archived. As an example, raw data, particularly in the case of PET, are often not stored at all, which makes standardization efforts through modified reconstruction parameters impossible [228]. Even though recent hospital digitalization endeavours are highly welcome for their promise of improved patient care, digitalization on its own cannot be considered a remedy for AI applications. Without appropriate, applicable standardization processes, the heterogeneity of digitized big data may still degrade predictive AI performance [227].

2.4 Multi-centric data

Multi-centric data are generally hard to access and to process in a normalized way. First, there is a certain element of reluctance among most clinicians and some imaging scientists to share data in general. Second, local hospital rules and sharing processes may be overcomplicated and time-consuming, which delays successful research built on multi-centric collaborations. Last, even if the willingness to share is present and the data have gone through local anonymization processes, imaging data may still reveal certain characteristics of individuals [229]. All these factors together, especially in light of the otherwise highly appreciated provisions of the General Data Protection Regulation, appear to challenge the establishment of publicly available, multi-centric imaging datasets, which could boost AI-related research [208]. Despite some databases providing small multi-centric imaging datasets, such as TCIA (https://www.cancerimagingarchive.net/), the lack of multi-centric data is generally considered one of the major reasons why only few AI solutions have been integrated into routine clinical practice [230].

2.5 Evaluation

Existing AI solutions applied in the field of functional and hybrid imaging research are either radiomics- or DL-based, with radiomics approaches currently predominating [227]. There are multiple reasons for this. On the one hand, radiomics models are simpler, being built on so-called engineered or manually handcrafted features [231], which makes them easier to apply and interpret than more complex DL frameworks. On the other hand, given that most research groups only have access to small datasets, simple radiomics models, which have fewer unknown parameters to optimize, can be trained more effectively on such limited datasets. DL approaches are powerful alternatives to radiomics, but they have many more unknown parameters to identify and optimize during the training process, hence requiring larger data samples for proper training [232]. Irrespective of the choice of AI approach, functional and hybrid imaging AI studies operating with small, single-centre data are generally prone to producing overfitted models [233].

There is a certain element of bias in the selection of AI methods as well, typically driven by prior expertise and familiarity with AI tools, or by the popularity of certain AI methods that may be suboptimal for a given study. The “no free lunch theorem” states that no AI approach is universally superior; rather, the ideal AI approach is data- and application-specific [234,235]. This suggests that one should test multiple AI models on the available data to understand the underlying characteristics of the data and the applicability of each method. Nevertheless, to date, this approach is rarely found in the corresponding literature [227]. Furthermore, different performance metrics, such as the area under the receiver operating characteristic (ROC) curve (AUC) or the F2-score, also make established model performances hard to compare among research groups, especially because different AI tools tend to utilize different metrics for the training process itself [236]. The lack of proper cross-validation in single-centre studies is one of the major concerns regarding AI-driven predictive models [227]. Even though today's mainstream processing capacities may allow advanced cross-validation to be performed, e.g., cross-validation of radiomics models with a high Monte Carlo fold count [237], this practice is rarely followed in the corresponding literature, potentially relegating most works to the level of advanced correlation analyses rather than clinically applicable predictive models [238]. Similarly, since DL training may be extremely time-consuming, the vast majority of studies utilizing DL either perform a one-fold training-validation approach or use a very low cross-validation count, which leaves room for selection bias and high variance of DL-related predictive performance. Due to the aforementioned challenges, the vast majority of PET and hybrid imaging research focusing on AI is single-centre only [239], which on its own potentially introduces an overestimation of predictive model performance [240].
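
For illustration, the following sketch shows what a high-fold-count Monte Carlo cross-validation might look like using scikit-learn's ShuffleSplit; the toy data and fold count are placeholders.

```python
# Sketch: Monte Carlo cross-validation with a high fold count, reporting
# mean AUC and its spread (toy data stands in for a real cohort).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
y = rng.integers(0, 2, size=120)

# 100 random train/test splits instead of a single one-fold evaluation;
# the spread across folds exposes the variance a single split would hide.
mc_cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = cross_val_score(RandomForestClassifier(), X, y,
                         cv=mc_cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```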

Last, the lack of interpretability of predictive models is a general concern for clinicians. AI predictive models can be considered “black boxes”, from which understanding the basic underlying mechanisms and gathering new knowledge is rarely possible [241]. The same is true for the outputs of predictive models, which are typically probability-based and need further processing. In contrast, there is an inherent wish to simplify the results of otherwise complex predictive models to the level of “green-yellow-red” outputs, which may challenge the establishment of a truly personalized treatment decision process.

3. The promise of AI in Nuclear Medicine

Despite these challenges, AI will, without any doubt, transform healthcare. It has the potential to play a pivotal role in personalized/precision/systems medicine, where integrating large amounts of multi-modal data into a single model or clinical decision support system (CDSS) might be central. AI shows great promise in the field of nuclear medicine and is already setting new standards. At this point, there is no universal nuclear medicine AI algorithm that can replace all parts of the medical imaging workflow; research has therefore focused on developing specialized solutions for each task. A typical medical imaging workflow can be divided into planning, image acquisition, interpretation, and reporting [242]. AI has the potential to assist, guide, and/or replace elements in all of these steps. In the following, we examine the areas of acquisition, interpretation, and reporting, where AI is already being utilized. A non-exhaustive list of examples for each area is given in Table 1.

3.1 Image acquisition

Rather than focusing on replacing medical doctors by directly predicting a disease outcome, much attention has been given to supportive approaches, such as utilizing AI to improve image quality [243]. This is typically an image-to-image task, with the advantage that training data are usually easily and widely accessible: given a high-quality image, a low-quality image can be simulated. Such a training scheme allows the generation of perfectly co-registered paired data for training. It is therefore possible to build and train a model that predicts a high-quality image from a low-quality input, allowing faster image acquisition protocols, noise reduction, or better dosimetry, to the benefit of both patients and personnel.
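
A minimal sketch of this idea follows, assuming only NumPy and treating the image values and dose fraction as hypothetical: a low-count PET-like image is simulated from a high-quality one by rescaling and Poisson resampling. Real low-dose simulation is preferably performed on raw/sinogram data where available.

```python
# Sketch: generating a perfectly co-registered training pair by simulating
# a low-count (noisy) image from a high-quality one. Values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
high_quality = rng.gamma(shape=2.0, scale=50.0, size=(128, 128))  # stand-in image

dose_fraction = 0.01  # e.g. simulate a 1% dose acquisition
# Scale activity down, resample with Poisson counting noise, scale back up.
low_quality = rng.poisson(high_quality * dose_fraction) / dose_fraction

# (low_quality, high_quality) now form an input/target pair for training
# an image-to-image denoising network.
```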

Examples of applications for noise reduction in MRI acquired over a short time range, as well as in low-dose CT and PET, are given in Table 1. Rather than focusing on quantitative image quality metrics, these methods need to demonstrate clinical accuracy before they can be implemented in routine hospital settings. If such validation is achieved, we will enter a new era of low-dose PET imaging [244]. One domain where low-dose PET appears ready for clinical implementation is the assessment of dementia. Chen et al. showed that reading a noise-reduced image acquired with only 1% of the original radiotracer dose had high accuracy for amyloid status determination (89%), similar to the intra-reader reproducibility of the full-dose images (91%).

Examples where PET images are synthesized from MR images, completely removing any PET dose, have also been proposed. Guo et al., using structural and functional MR images as input, were able to predict oxygen-15 water PET cerebral blood flow (CBF) maps with higher accuracy than any of the MRI sequences designed to measure CBF alone [245]. Wei et al. were able to predict myelin content as measured with 11C-Pittsburgh Compound B (PiB) using only multi-modal MRI [246].

In hybrid PET imaging, one of the largest challenges has been to achieve accurate attenuation correction (AC) without CT. Several studies have demonstrated the ability of DL-based networks to generate artificial CT from MRI input alone (Table 1), or even to map non-attenuation-corrected PET directly to attenuation- and scatter-corrected PET, bypassing the need for AC altogether [247].

3.2 Interpretation and reporting

A large part of AI research in nuclear medicine aims at replacing manual tasks such as delineation. Automated delineation could free the physician for higher-value tasks [248] and would allow more data to be collected for research. Several automatic segmentation challenges exist, e.g., for brain tumours [249], lung nodules [250], and ischemic stroke lesions [251,252]. Few of the reported methods have moved into clinical routine despite impressive results in some patient populations, probably owing to the extremely diverse appearance of these diseases, which requires large amounts of labelled training data originating from various centres [242].

Another large area of research is the early detection of Alzheimer's disease and mild cognitive impairment using DL [253–255]. Ding et al. showed that DL was able to outperform human interpreters in the early diagnosis of Alzheimer's disease, with 82% specificity and 100% sensitivity (AUC: 0.98) [256]. Similarly, Kim et al. used 54 normal and 54 abnormal 123I-ioflupane SPECT scans to train a network that predicts the diagnosis of Parkinson's disease [257], achieving a sensitivity of 96% at 67% specificity (AUC: 0.87).

In oncology, there is a need to predict overall survival or response to therapy. This task is often not achievable with imaging alone, which is why several studies incorporate non-imaging features. Papp et al. combined PET features, histopathologic features, and patient characteristics in an ML model to predict 36-month survival in 70 patients with treatment-naïve gliomas (AUC: 0.9) [258]. Xiong et al. demonstrated the feasibility of predicting local disease control under chemoradiotherapy in patients with oesophageal cancer using radiomic features from 18F-FDG PET/CT [259], and Milgrom et al. found five features extracted from mediastinal sites to be highly predictive of primary refractory disease in 251 patients with stage I or II Hodgkin lymphoma [260].

The major drawback for networks predicting disease evolution is the amount of available training data. While image-to-image translation, e.g., MR-to-CT generation, essentially has one output value for each input value, disease prediction has only a single label for the entire data input. This increases the amount of required training data significantly, depending on the complexity of the disease, often to a level that a single department cannot provide. One way to overcome the lack of data is to build shared databases with data from multiple hospitals. A standardization approach is essential for successful implementation, especially for MRI, where a vast number of sequences are in use and even the same sequence varies between scanners. Work by Gao et al. has shown that this can be overcome, again by using a DL network to transform the MR input images into a standardized MR image [214]. Similarly, radiomic features themselves can be harmonized to achieve better cross-validation in a multi-centric setting [261].

Area                  Example publication
MR acquisition        [69]
CT (low-dose)         [70]
PET (low-dose)        [71–74]
MR → PET              [53,54]
MR → CT               [75–81]
NAC-PET → CT          [82–84]
NAC-PET → AC-PET      [55]
Delineation           [57–60]
Disease detection     [61–65]
Survival prediction   [66–68]
MR standardization    [21]

Table 1. Example applications of deep learning. NAC: non-attenuation-corrected; AC: attenuation-corrected; AD: Alzheimer's disease.

4. Best Practices

4.1 Standardized software tools

Standardized tools play an important role in facilitating the universal applicability of predictive models by promoting reproducibility. Even though custom frameworks are sometimes used for data analysis in the field of nuclear medicine, a broad range of free and open-source software is available that can help improve the standardization of analysis workflows. The most commonly known AI frameworks for the development of DL-based predictive models include TensorFlow [262], Keras [263], and PyTorch [264]. For radiomics-driven analyses, standardized frameworks include PyRadiomics [265], LIFEx [266], MITK [267], and MPRAD [268]. Additionally, there is a variety of tools and libraries for general-purpose ML, including Scikit-learn [269] for Python as well as rpart [270] and caret [271] for R. Oftentimes, custom code is required to use and extend pre-existing standardized frameworks. To make maximum use of such implementations, they should be documented thoroughly and shared with the research community.
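
As a minimal, non-authoritative example of such a framework in use, the following sketch defines and compiles a small Keras (TensorFlow) classification network; the layer sizes and input shape are arbitrary placeholders.

```python
# Sketch: a minimal Keras (TensorFlow) classification network, built with
# one of the standardized frameworks named above; sizes are placeholders.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),        # e.g. a 2D image patch
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
```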

4.2 Standardized imaging protocols

Another, equally important target for standardization is the (multi-centric) imaging protocol, as the repeatability of extracted ML features can only be guaranteed if a unified and AI-friendly protocol is followed during image acquisition. As an example, optimal PET protocol settings that minimize multi-centric variations of radiomic features have been presented in [272]. Furthermore, ComBat feature-domain harmonization has been proposed to deal with multi-centric radiomic variations [273]. Besides embracing existing EANM Research Ltd. (EARL) guidelines, future EARL guidelines shall also focus on pursuing such AI-driven requirements.
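
The sketch below illustrates, in a deliberately simplified form, the location-scale idea underlying feature-domain harmonization; actual ComBat additionally applies empirical Bayes shrinkage and covariate adjustment, so this is an illustration rather than a substitute, and the toy data stand in for real feature tables.

```python
# Sketch: a simplified location-scale harmonization of radiomic features
# across two centres, illustrating the idea behind ComBat (which also
# uses empirical Bayes shrinkage and covariate adjustment).
import numpy as np

rng = np.random.default_rng(0)
centre_a = rng.normal(loc=5.0, scale=1.0, size=(40, 8))   # features, centre A
centre_b = rng.normal(loc=8.0, scale=2.5, size=(40, 8))   # same features, centre B

def align(batch, ref_mean, ref_std):
    # Shift/rescale each feature so both batches share reference moments.
    return (batch - batch.mean(axis=0)) / batch.std(axis=0) * ref_std + ref_mean

pooled = np.vstack([centre_a, centre_b])
m, s = pooled.mean(axis=0), pooled.std(axis=0)
harmonized = np.vstack([align(centre_a, m, s), align(centre_b, m, s)])
```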

4.3 Data management

In addition, several data science tools are available that can improve the quality of data management for ML-based analyses. For example, some tools allow patterns within the given data to be identified more easily, providing a better basis for the ML algorithms. These tools cover feature selection [274], image standardization [214], class balancing [275,276], and outlier detection [277].
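
As a hedged example, the following scikit-learn sketch combines two of these utilities, univariate feature selection and outlier detection, on toy data.

```python
# Sketch: feature selection and outlier detection applied to a toy
# feature matrix with binary labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = rng.integers(0, 2, size=100)

# Keep the 10 features most associated with the outcome.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Flag atypical cases (-1) that may indicate corrupted or outlier records.
outlier_flags = IsolationForest(random_state=0).fit_predict(X)
print((outlier_flags == -1).sum(), "potential outliers")
```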

4.4 Handling of Limited Data

Some tools focus on handling small amounts of data and improving the generalizability of the resulting predictive models. Data augmentation [225], for example, consists in generating additional synthetic data exhibiting the same patterns as the native images (see the sketch below). Simple data augmentation techniques include procedures such as flipping, rotation, and translation of the input images. More sophisticated techniques incorporate methods such as generative adversarial networks (GANs) to create completely new synthetic images that preserve key patterns [215,278,279]. It should be noted that data augmentation has to be restricted to the images used to train a predictive model, never to those used for its validation or testing. Another technique often applied successfully to small amounts of data is transfer learning [280,281]. Transfer learning is a general ML concept; it is especially useful for adapting a DL model that has previously been trained on a larger amount of data. The principle is to reuse the first layers of the network trained on the large dataset, since the features extracted by these first layers capture generalizable patterns such as dots or edges, even across domains (including from non-medical to medical images). Shin et al. demonstrated the benefit of transfer learning from non-medical images for computer-aided detection (CADe) problems and consistently achieved better performance compared with training the networks from scratch [104]. Depending on the similarity between the new task and the previous model's domain, different numbers of layers may be transferred.
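
A minimal Keras sketch of both techniques follows; the frozen ImageNet-pretrained backbone, the new classification head, and the augmentation layers are illustrative choices, not a recommended recipe.

```python
# Sketch: transfer learning by reusing the early layers of an
# ImageNet-pretrained network, plus simple data augmentation.
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False   # freeze the generic early feature extractors

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new task-specific head
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Simple augmentation (flips/rotations), applied only to training images.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])
```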

4.5 Data sharing

Even when applying the above-mentioned techniques, the amount of data available to build a predictive model might not be sufficient. DL algorithms are often trained on tens of thousands, and sometimes even millions, of images; such numbers are hard to obtain in the field of medical imaging. To counteract the scarcity of data in nuclear medicine, data should be made available to other researchers. The most prominent platform supporting this kind of data sharing is The Cancer Imaging Archive (TCIA) [282].

Choosing the most appropriate ML algorithm is then a critical point in the design of an analysis pipeline. One of the most important criteria for selecting an algorithm is the size of the available dataset. While DL algorithms tend to be very scalable and can deliver excellent performance on large datasets, they often struggle to generalize from a few images [283]. Traditional ML approaches, such as those applied for radiomics, are less scalable but perform comparatively well on small datasets. Depending on the ML strategy chosen for an analysis, distinct algorithm-specific aspects must be taken into consideration. For DL-based algorithms, one consideration is whether transfer learning models are available and appropriate for the given application. Iteratively trainable algorithms should be considered when medical data cannot be shared between centres, since each centre can then train and update the model with its data on site. For any algorithm using manually selected or handcrafted features, standardization guidelines such as those of the Image Biomarker Standardisation Initiative should be taken into consideration [284]. Furthermore, techniques such as ComBat and radiomics quality scores may be used to increase the robustness of such handcrafted features [285]. Also, different types of algorithms provide distinct techniques for enabling interpretability and, therefore, some might be more suitable for answering specific questions [286,287].
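
As a simplified illustration of an iteratively trainable algorithm, the sketch below updates a scikit-learn SGDClassifier with partial_fit across three hypothetical centres without pooling their raw data; real federated learning requires considerably more machinery (secure aggregation, communication protocols, etc.).

```python
# Sketch: iterative training across centres with partial_fit, so that
# the raw data never leave the local site (a strong simplification).
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for centre in range(3):                      # three hypothetical centres
    X_local = rng.normal(size=(50, 10))      # data stay on site
    y_local = rng.integers(0, 2, size=50)
    model.partial_fit(X_local, y_local, classes=classes)
```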

4.6 Explainable AI

Overall, the field of AI is currently shifting from the use of black-box models to interpretable analysis pipelines. Current techniques to uncover features from predictive models include activation maps, filter visualizations, maximum activation maps, and feature weighting [287]. However, care must be taken when interpreting the results of these techniques in isolation [286,287].
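
One simple, model-agnostic example is permutation feature importance; the following scikit-learn sketch (on toy data) ranks features by how much shuffling each one degrades model performance.

```python
# Sketch: permutation feature importance for a fitted classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)

clf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Rank features by how much shuffling them degrades performance.
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```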

4.7 Performance evaluation scheme

The choice of performance metrics is critical for communicating and comparing the outcomes of ML-based studies. Oftentimes, the most effective choice is to report multiple metrics, such as AUC, (balanced) accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, to show the model's capabilities from as many angles as possible. In addition, it should be reported how these metrics were obtained, e.g., the cross-validation scheme. In the ideal case of a sufficiently large dataset, the evaluation scheme requires separating the available dataset into three groups: a training set, a validation set, and an independent test set. The training set is used for building the model. Several models with distinct hyperparameters can be built using this training set, and the resulting models can be validated by measuring their predictive performance on the validation set. However, as knowledge from the validation set is incorporated into the model, another, independent test set has to be employed: the model with the best validation performance is then evaluated on the independent test set. The resulting model must not be tuned any further based on its performance on the test set, as this would lead to overfitting towards the test set and consequently to an overestimation of the model performance. If the model is to be improved further, yet another dataset must be added for its evaluation.
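
The following sketch illustrates this three-way scheme with scikit-learn; the split proportions and the hyperparameter grid are arbitrary placeholders.

```python
# Sketch: training/validation/test evaluation via two successive splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 2, size=300)

# Hold out an independent test set first, then carve out a validation set.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_model, best_auc = None, -1.0
for n_trees in (50, 100, 200):                       # hyperparameter search
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_model, best_auc = model, auc

# Report once on the untouched test set; no further tuning after this.
print("test AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
```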

Figure 1: Typical radiomics and deep learning challenges (left) and possible solutions (right) associated with AI-driven nuclear medicine applications through (top to bottom) the processing steps of imaging, data preparation, feature generation, analysis and validation, as well as interpretation. The section number in which the given step is detailed is indicated on the right side of each step.

5. Social and Ethical Considerations

AI methods seek to solve individual problems within one specific task. While they may excel at interpreting image and contextual information, they are so far unable to make associations the way a human brain does and cannot replace clinicians in all the tasks they perform [217]. Visvikis et al. likewise conclude that AI has not yet achieved the same level of performance as a human expert in all situations, and that a fully artificial nuclear medicine physician therefore still belongs to the domain of science fiction. However, the role of physicians, including nuclear medicine physicians, is likely to evolve as these new techniques are integrated into their practice [288].

5.1 Improved quality of diagnostics and therapy through CDSS

AI models are now being developed to be less and less like black boxes lacking interpretability and transparency, which was formerly the most important reason for patients and clinicians to view this technology sceptically. It is indeed understandable to mistrust unfamiliar interfaces and to hesitate to give a machine or mathematical algorithm the responsibility for life-critical decisions [241,289]. This is also a reason why current research focuses on supportive systems rather than systems that make decisions autonomously, such as self-driving cars. Although many comparisons between physicians and predictive machines are made, in medicine it is rather the human plus the machine than the human versus the machine [290]. For this reason, a Clinical Decision Support System (CDSS) should be seen as an extension of the diagnostic toolkit, much like a stethoscope, that a clinician can utilize when reaching a therapeutic decision.

Ribeiro et al. demonstrated that model explanations are very useful in trust-related tasks in the textual and image domains for both expert and non-expert users (e.g., deciding between models, assessing trust, improving or rejecting untrustworthy models, and getting meaningful insights into predictions) [291]. However, interpreting a model solely on a technical level is not the same as interpreting its decision in terms of the underlying biology and therapeutic consequences. It is nonetheless a good starting point for clinicians and patients to find explanations and gain trust in the almost inevitable AI model predictions [292]. An important aspect that might be considered as an extension is the combination of AI approaches with traditional research-oriented mechanistic models (e.g., in vivo mouse models and in vitro cell experiments), which are used to identify the origin of a disease and not only to predict its outcome, because reliable decisions require a proper investigation of the causes [293].

5.2 Changing the physician-patient relationship with AI supported decisions

The great aim is that an integrative AI will allow clinicians to spend more time on personal discussions with patients, while leaving time-consuming statistical calculations and predictions to the CDSS [294]. Having more time at the patient's side could lead to better care, which enhances the patient trust that is foundational to the relationship between medical practitioners and patients [295]. However, physicians also need to ensure that the AI-assisted CDSS does not obstruct the patient-physician relationship, as they have to realize that the legal and moral responsibility for the decisions made still rests with them. Thus, implementers may need to ensure that physicians are adequately trained in the benefits and pitfalls of AI-assisted CDSS and apply them in practice to augment, rather than replace, their clinical decision-making capabilities and duties to patients [295].

Figure 2: A suitable AI-assisted decision-making process based on the joint predictive performance of the clinician and the Clinical Decision Support System (CDSS) in evaluating and integrating heterogeneous patient data. The key parameters range between 0 (no or low information yield) and + (high information yield), representing the varying performance of the individual clinician and the CDSS.

To be successful and accepted, a full degree of information transparency should be provided to patients about the features involved, the limitations, and the suggestions of the AI systems that assist clinicians with their decision-making [296]. This would be an extension of current classic formulations of informed consent, reflecting a disclosure of all relevant information during the decision process (e.g., the information at hand to accept or reject a diagnosis and to consent to a proposed therapy plan).

However, realizing these benefits will require a free and rapid flow of information from the EHR to the CDSS platform and into reportable outputs that can be validated and disseminated to others outside the patient-physician relationship. This will require fundamental trade-offs with the control and supervision that patients have over the information contained in the EHR [295]. To circumvent this, researchers and administrators could use aggregated, de-identified data for their analyses. However, it must be noted that no data can be truly de-identified, especially in the era of high-quality imaging and molecular deep sequencing [297].

5.3 Improved patient data security is needed

For the digital transformation of medicine to build stronger bridges between healthcare teams and patients, there needs to be a heightened emphasis on data security and transparency [294]. If patients cannot trust that their data will be secure, transparent, and accessible to them, patient data collection might increase suspicion of healthcare systems rather than enhance trust. Simply providing technological advancement with no societal oversight to ensure that powerful tools are used to improve well-being carries considerable risk.

The digitalization and accessibility of EHR data, which are used extensively by AI methods, are not trivial [206]. Harrer et al. see both tasks as challenging for contrary reasons: on the one hand, a lack of regulatory frameworks on data collection causes EHR formats to differ widely, to be incompatible with each other or not digital at all, and to reside in a decentralized ecosystem without established data exchange or access gateways. On the other hand, in their opinion, a strongly regulated legal environment strictly limits third-party access to patient data and even makes it difficult for patients themselves to access their own data. This “EHR interoperability dilemma” is recognized as a major hurdle to making healthcare systems more efficient, which is why governments and medical institutions are investing heavily in overcoming it [298]. In addition, the EU General Data Protection Regulation, as an example of a legal framework, continues to evolve as a governing and protecting institution for sensitive health data, which quickly becomes an increasingly complex endeavour in the growing network of devices, data owners, and service providers [299].

Since there is no central hospital management and security system, all medical data still depend on the individual clinic's data security protocol [300]. The resulting medical cybersecurity capabilities include a variety of programs, behaviours, and technologies that a hospital employs to improve cyber resiliency; these are not yet standardized, not all of them are self-sustaining, and they may erode over time [300]. If properly adopted, implemented, and maintained, however, they will have a positive impact on the hospital's data security, into which a CDSS could be easily incorporated and protected.

5.4 Humans plus or versus machines?

AI will not be a panacea, and if used improperly, these systems can replicate or amplify bad practices rather than improve clinical decisions [301]. Each AI application potentially used in the clinic has to be investigated and thoroughly questioned in order to reveal its huge potential in the many clinical areas where high-dimensional and non-linear data are mapped to a simple classification and for which datasets are potentially stable over extended periods [296].

One of the biggest concerns being expressed is that AI might become more intelligent than humans, reaching a state called “superintelligence” [302]. This could lead to accelerated technological advancement that surpasses human control, with goals that may not fit current societal norms, which is why there is a growing branch of AI safety concerned with AI containment [303]. Another concerning ethical problem is who will set the rules and choose the data that guide AI decisions: because DL algorithms work from vast repositories of information, the final rationale for their decisions might never be completely understood, much like our own intuition [304,305].

Healthcare involves quick decisions that both machines and humans can make, and both of course generate errors, so iterative systems are needed to detect, prevent, and correct these. The interplay of humans and smart algorithms can therefore be a great opportunity for such a system. We should all be willing to question and change our own practice, improve standardization, and adhere to guidelines where possible [303]. It will be inevitable for healthcare professionals and patients to become familiar with upcoming and already existing AI technologies to ensure that CDSS and medical “traffic-light systems” are used appropriately and that the decisions made are as transparent as possible.