### 1. Sample

To examine the magnitude of 16 risk factors on cost accumulation, we used a random sample (n = 363) of AUD patients identified through EHRs based on alcohol-related ICD-10 (the International Statistical Classification of Diseases and Related Health Problems, 10th revision) codes.

Figure 1 presents the research flow. The study cohort was randomly sampled from the regional EHR system in the North Karelia region in Finland based on the following alcohol-related ICD-10 diagnosis codes: G312, G405, G4050, G4051, G4052, G621, I426, K292, F100, F101, F102, F103, F104, F105, F106, F108, F109, K860, K700, K701, K702, K703, K704, K709, T510, T511, T512, T513, T518, T519, X45, and X69 (see

Appendix 1 for more detailed information). Retrospective sampling included the years 2011 and 2012. Of the identified overall AUD population of (n = 6,246) individuals, we first formed a random cohort of 396 individuals by using Excel random sampling, and their health service use cost data were retrieved from the EHRs for the years from 2011 to 2016. We then excluded individuals who died or remitted in 2011 because we were not able to explicitly identify which costs in 2011 were caused before remission and which were caused after. Thus, the final study sample included 363 individuals. Based on the manual assessment of the EHR data conducted by two reviewers, the principal researcher and research assistant, we identified that the cohort represented individuals with a severe form of AUD. AUD was defined according to the Diagnostic and Statistical Manual of Mental Disorders (DSM-V) and ICD-10 to include both harmful use and alcohol dependence.

### 2. Measurement

The examined outcome was the total cost of care. Data on the cost of care were used for the years from 2012 to 2016, and cost data from 2011 were used as

*a priori* information. Specialized care costs were retrieved from the hospital EHR system, including all hospitalizations, outpatient visits and admissions, and their costs derived from the hospital’s cost accounting systems. Primary care costs were retrieved from the outpatient EHR system but were underestimates of the true costs, as the primary care database did not include e.g. private health service use costs (see

Appendix 2 for more information). Total costs were discretized to quartiles. In the assessment of the causal effect of AUD remission on patients’ costs of care, those with continual AUD were used as a reference, and those who died before the year 2012 were excluded from analysis.

We identified 16 factors associated with AUD trajectories and their costs based on the literature [

14,

15,

22], including socioeconomic variables encompassing age, gender, marital status, unemployment status, and social problems like homelessness, illicit drug use, criminal record, and drunk driving. Data on drinking status and socioeconomic variables were manually collected from EHRs and the municipal social services database mainly as dichotomous variables. In addition, clinical variables included the number of ICD-10 diagnoses of chronic conditions (i.e., permanent diagnoses). Diagnoses were classified into three groups, according to number: (1) none, (2) one, and (3) two or more. Mental health diagnoses included ICD-10 codes F00 to F99 (mental and behavioral disorders), excluding F10 codes. Drinking status was defined as continual AUD or stable AUD remission. Stable remission was defined as sustained abstinence or managed use that lasted until the end of the follow-up period, with a minimum duration of 6 months. Time estimate in AUD remission was based on health professionals’ objective notes and diagnosis information. Individuals with any shorter abstinence periods were included in the continual AUD group.

### 4. Statistical Analysis

We performed the statistical analysis using the Bayesian network approach with the BayesiaLab 9.0 tool [

23]. The visual form of a Bayesian network is a directed acyclic graph (DAG), from which direct and indirect effects, common causes, and effects can be discovered and mathematically expressed. A DAG consists of nodes presenting random variables

*X*_{i}, and arcs or lines presenting associations between a pair of variables. A DAG defines a factorization of joint probability of a Bayesian network into a product of local probability distributions, one for each variable:

where pa_{Xi} are parents of a variable *X*_{i}. This type of representation enables both deductive and abductive inference from the model, allowing fixing of (controlling for) one or several variables’ probability distributions for inference of the direct or total effect of the variables of interest on the target variable.

Bayesian networks are used for both non-causal (predictive or explanatory) and causal modeling. In the non-causal model, the arc describes probabilistic relationships between the parent variable(s) and the child variable(s), whereas in the causal model it describes the existence of a direct causal dependence between two variables. A Bayesian network structure is constructed by using a bottom-up modeling approach (i.e., using structural and parameter learning from data), a top-down approach (i.e., manual construction based on existing expert knowledge), or a hybrid of the bottom-up and top-down methods. Multiple algorithms exist for structural learning. A supervised learning method with a minimum description length (MDL) score [

24] uses a naive structure, such as augmented naive Bayesian (ANB) and tree augmented naive Bayesian (TAN), whereas an unsupervised learning method uses greedy search (e.g., maximum spanning tree, taboo, and hill climbing) with MDL scoring to construct a non-naive Bayesian network. Supervised learning is used mainly for predictive modeling, and unsupervised learning is adapted for clustering and for the construction of a causal Bayesian network. However, human intervention is required to verify the correctness of causal directions.

The MDL score optimizes the model complexity against the model fit to data and can be expressed at a high level as

where BN is a Bayesian network including parameters, DL is a description length in bits, G is the graph part of a BN, CPTs are conditional probability tables for each variable

*X*_{i} in the model, and SC is the structural coefficient. With the SC, the effect of the complexity of the network to the score can be increased (SC < 1) or decreased (SC > 1). A more detailed level MDL equation is provided in

Appendix 3.

A true causal network between the variables and the target variables is hard to estimate. Especially in settings with numerous variables, information of a complete causal structure is often unknown. Nevertheless, causality can be estimated by applying van der Weele and Shiptser’s modified disjunctive confounder criteria (DCC) for calculating the direct causal effect of a variable on the target variable from a non-causal Bayesian network [

25,

26]. According to the DCC, correctly selected confounders are the key for successful blocking of all backdoor and frontdoor paths between the treatment and the target variables in a Bayesian network. Van der Weele and Shiptser [

25] defined the original DCC as “controlling for each variable that is a cause of the treatment, or of the target, or both”. Van der Weele [

26] added two additional qualifications to the DCC for practical use for confounder controlling and re-named it the modified disjunctive confounding criterion, called in this article modified DCC. Additional definitions are (1) discarding any variable known to be an instrumental variable and (2) including variables that do not satisfy criteria but are good proxies for unmeasured common causes of treatment.

Continuous variables were discretized using a convenience distribution for the variable age with 10-year intervals. The variables implying costs were discretized to quarters having 25% of observations in each class. The outcome variable was cumulative healthcare costs (totalcost_2012–2016), which was discretized into equal quartiles, qualitatively described as “low cost” (≤€4,486.54), “medium cost” (€4,486.55–€15,746.10), “high cost” (€15,746.11–€46,864.36), and “very high cost” (€46,864.37–€1,180,863.75).

Supervised ANB learning was used in the study to construct a Bayesian network. To find the optimal complexity of the model in the ANB learning phase, an SC analysis was performed as part of MDL scoring, and the value SC = 0.6 was used in the analysis.

The result was a non-causal ANB network model with 16 independent variables. BayesiaLab allows every variable and their combinations to be fixed to certain values. For example, the variable “status2012” can be fixed to the value “remitted” = fixed to 100%. Then the model gives the values of the outcome in that hypothetical case that individuals had an AUD remission. We analyzed the probabilistic effect of independent variables by fixing each variable’s values separately to be 100%.

Following the modified DCC, we examined the effect of AUD remission in 2012 (continuous drinking vs. remission) by fixing marginal distributions of all other independent variables except drinking status (status2012). We analyzed the variables associated with the variable status2012 (continuous AUD/AUD remission) by a semi-structured search with status2012 as the target. The following variables were associated with status2012: drug use (strongest effect), homelessness, criminal background, gender, marital status, and income support, fulfilling the criteria of the DCC. In a similar analysis, we found the following variables to be associated with the outcome totalcost_2012–2016: number of somatic diagnoses, age, income support, municipality, and specialized care costs 2011. The variable “income support” was the only variable associated with both the outcome and index variable status2012. We also used the variable “number of psychiatric diagnoses” in ANB modeling for a measurement of psychiatric background.

We used Jouffe’s proprietary likelihood matching (PLM) algorithm, which implements the modified DCC and allowed us to estimate the independent variables’ causal effect on the target while holding others constant [

27].

An SA among variables allows the identification of combinations of variable values that have the maximum or minimum effect on the target variable. SA was performed twice using hard evidence, showing first the maximum and then the minimum effect on costs (target variable totalcost_2012–2016).

A tornado diagram is a design for SA. The diagram consists of two-sided horizontal bars to visualize the factors with the largest impact (positive or negative) on the outcome variable. The widest bar showing the largest impact is placed at the top. Bars to the right of the midline show the positive effect on the outcome variable, whereas bars to the left represent a negative effect. The diagram is presented separately for each value of the outcome variable.