# Using Internally Developed Risk Models to Assess Heterogeneity in Treatment Effects in Clinical Trials

## Abstract

**Background—**Recent proposals suggest that risk-stratified analyses of clinical trials be routinely performed to better enable tailoring of treatment decisions to individuals. Trial data can be stratified using externally developed risk models (eg, Framingham risk score), but such models are not always available. We sought to determine whether internally developed risk models, developed directly on trial data, introduce bias compared with external models.

**Methods and Results—**We simulated a large patient population with known risk factors and outcomes. Clinical trials were then simulated by repeatedly drawing from the patient population assuming a specified relative treatment effect in the experimental arm, which either did or did not vary according to a subject’s baseline risk. For each simulated trial, 2 internal risk models were developed on either the control population only (internal controls only) or the whole trial population blinded to treatment (internal whole trial). Bias was estimated for the internal models by comparing treatment effect predictions to predictions from the external model. Under all treatment assumptions, internal models introduced only modest bias compared with external models. The magnitude of these biases was slightly smaller for internal whole trial models than for internal controls only models. Internal whole trial models were also slightly less sensitive to bias introduced by overfitting and less sensitive to falsely identifying the existence of variability in treatment effect across the risk spectrum compared with internal controls only models.

**Conclusions—**Appropriately developed internal models produce relatively unbiased estimates of treatment effect across the spectrum of risk. When estimating treatment effect, internally developed risk models using both treatment arms should, in general, be preferred to models developed on the control population.

## Introduction

Randomized controlled trials are generally considered positive if the mean outcome in treated patients is superior to the mean outcome in the control arm. Although this approach has the virtue of simplicity, it has been criticized for offering limited guidance on whether individual patients ought to be treated. In some strongly positive trials, it is possible that a majority of patients receive little to no benefit, or even harm, from treatment.^{1,2} Unfortunately, patients have so many characteristics that might potentially influence the benefit of therapy that examining each using traditional subgroup analyses is often not informative. In addition to the well-appreciated issues of multiplicity leading to false-positives and low power leading to false-negatives, traditional subgroup analyses may fail to detect clinically relevant differences in treatment effects between groups that can be identified only when a combination of clinical variables is considered.^{3–5}

It is increasingly recognized that a patient’s baseline risk is a fundamental determinant of treatment benefit. Indeed, many treatment guidelines have adopted risk-sensitive treatment recommendations. Perhaps the most notable examples within the field of preventive cardiology are recommendations to initiate statin^{6,7} or aspirin^{8} therapy contingent on the baseline risk of cardiovascular disease. Although there are many examples in other fields as well, empirical evidence supporting these guidelines is not always clear, and details on how to implement guidelines are often lacking.^{9} A better evidence base for risk-based treatment recommendations might be available if clinical trials were analyzed using multivariable risk prediction methods, and it has recently been proposed that such methods be routinely applied.^{10} This approach can provide better estimates of how the degree of benefit or harm of a medical intervention varies across patients in the trial based on their overall risk of the study’s outcomes. The technique relies on a modeling approach that first estimates each study subject’s baseline outcome risk, typically, although not necessarily,^{11,12} using an independently derived predictive model (eg, the Framingham risk score for the prediction of stroke/myocardial infarction risk), and then estimates treatment effect as a function of this baseline risk. Such a model tests whether lower-risk subjects receive similar proportional benefit as higher-risk subjects using a treatment–baseline risk interaction term.
Although this approach does not explore every potentially clinically relevant hypothesis regarding heterogeneity of treatment effect (HTE), it examines the influence of a highly relevant mathematical determinant of treatment effect that integrates many patient characteristics in a multivariable framework. It thereby greatly mitigates some of the limitations of conventional subgroup analyses by minimizing the risk of multiple comparisons producing spurious findings,^{5,13} and it delivers individualized estimates of treatment effect.^{14–17} An additional virtue of this approach is that even when the proportional treatment effect does not differ across the spectrum of baseline risk (ie, relative risk is constant across the baseline risk spectrum), it is well suited to showing how absolute treatment benefit varies with baseline risk, which is often the most relevant metric for clinical decision making.

One proposal suggests that externally developed risk prediction tools be routinely used for multivariable risk assessment of HTE.^{10} However, some clinical trials lack validated pre-existing risk models for their main outcome (particularly composite outcomes), or the data may in other respects not be fully compatible with a published model (eg, model variables may be differently defined). In cases where no well-accepted external model exists to estimate baseline outcome risk, it is tempting to use the randomized controlled trial data to design a baseline risk model and examine treatment effect across risk defined by that internal risk model, an approach recently applied to the Justification for the Use of Statins in Prevention: An Intervention Trial Evaluating Rosuvastatin (JUPITER) study.^{17} Yet whether internally developed risk tools may introduce bias in the estimation of treatment effect across different risk categories has not been formally evaluated. An additional important question is what are the relative merits and risks of developing internal models on the whole trial population, including those patients who receive treatment, versus on the untreated (control) arm only. Deriving a model using only the control arm has intuitive appeal because arguably such a model best represents the patients’ pretreatment baseline risk, yet this approach may also lead to differences in model fitness between trial arms, thus inducing bias. Using simulation analyses based on cardiovascular disease prevention trials, we sought to examine whether using internal models, developed either on the whole trial or in the controls only, compared with using an external model, would result in biased estimates of the relationship between baseline risk and treatment effect.

## Methods

### Overview

To compare internal and external model bias and determine whether internal whole trial (IWT) or internal controls only (ICO) models lead to less bias and under what circumstances, we compared ICO, IWT, and external models under a variety of treatment effect scenarios. Specifically, we designed 4 different treatment effect scenarios that varied the magnitude of treatment effect and whether there was a main treatment effect (ie, average positive effect in the study population representing a constant relative risk reduction), risk-stratified HTE (ie, treatment effect varied as a function of a subject’s baseline risk for the study’s outcome), or both. For each of the 4 scenarios, we simulated a large study population with risk factors that predicted outcomes in a prespecified pattern (true risk factors), other potential risk factors that correlated with those risk factors and with outcomes (confounders), and prespecified treatment effects. Thereby, we were able to create an external model with optimal validity (ie, a model that would result from a large unbiased sample of the parent population) to estimate baseline risk for each scenario. Next, we repeatedly randomly sampled, with replacement, from each of the 4 parent populations to determine the statistical power and potential estimation biases in examining risk-stratified HTE in trials of varying sizes if we used an IWT model to predict baseline risk, versus an ICO model, versus an optimized valid external model.

We chose a simulation approach for this analysis instead of estimating these effects in an actual trial because of the intrinsic challenge of determining ground truth when applications of different models arrive at discordant conclusions. Given that this simulation approach requires some assumptions, we also applied a similar analytic approach to an actual trial where a widely used external risk model could be compared with an internal model. Specifically, we compared the Framingham risk score^{18} (external model) to internal models in the Lipid Research Clinics Coronary Primary Prevention Trial (LRC-CPPT)^{19} to determine whether widely different interpretations may arise, thereby calling into question simulation assumptions (see the Data Supplement). The LRC-CPPT analysis plan was reviewed and approved by the Tufts Institutional Review Board. The authors had complete access to all LRC-CPPT data.

### Simulated Treatment Scenarios/Patient Populations

Treatment scenarios were conceptually formulated as medication-based cardiovascular disease prevention trials. In all of the scenarios, patients were at risk for cardiovascular events, and the magnitude of this risk could be predicted based on the presence or absence of 6 specific binary risk factors. In each parent population of 200 000 simulated patients, these risk factors independently predicted cardiovascular events with odds ratios between 1.5 and 3.0, approximating the associations for binary variables in commonly used risk prediction models.^{18,20} Treated patients had a fixed low risk of competing treatment-related adverse outcomes that was independent of all risk factors. Finally, 6 additional variables designed to simulate potential confounders were generated, which had variable correlations with both the true risk factors and cardiovascular outcomes.
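
A sketch of this setup is below; the parent population is generated with a logistic outcome model. The specific odds ratios, risk-factor prevalence, baseline log-odds, and confounder construction are illustrative assumptions, not the study's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_population(n=200_000, base_log_odds=-3.5, prevalence=0.3):
    """Simulate a parent population: 6 binary risk factors with odds
    ratios spanning 1.5-3.0 for the cardiovascular outcome, plus 6
    confounders built as noisy copies of the true risk factors.
    All numeric parameters here are illustrative assumptions."""
    odds_ratios = np.array([1.5, 1.8, 2.1, 2.4, 2.7, 3.0])
    X = rng.binomial(1, prevalence, size=(n, 6))                  # true risk factors
    Z = (X + rng.normal(0.0, 1.0, size=(n, 6)) > 0.5).astype(int)  # confounders
    log_odds = base_log_odds + X @ np.log(odds_ratios)
    p_true = 1.0 / (1.0 + np.exp(-log_odds))                      # untreated risk
    y = rng.binomial(1, p_true)                                   # untreated outcome
    return X, Z, y, p_true
```

Because the true risk-generating process is known, a model fit to a very large sample from this population plays the role of the optimized valid external model.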

The 4 treatment scenarios were differentiated based on the prespecified treatment effects. Two different types of treatment effects were specified. First, there was a mean overall treatment effect—whether the treatment group had, on the whole, favorable outcomes compared with the control group. This effect represented the standard definition of whether a trial is positive. Second, we specified HTE—varying treatment benefit based on baseline risk. The 4 treatment scenarios explored different combinations of these treatment effects (Table 1). Treatment scenario populations were developed using previously described methods based on Monte Carlo simulation.^{5}

### Trial Simulations

Individual trials, and their results, were simulated by repeated random draws with replacement from each scenario’s large parent population. In the base case, we drew 3000 intervention subjects and 3000 control subjects for each of the 4 scenarios. This sample size was selected for 2 reasons. First, it represented a well-powered study for detecting a main absolute outcome difference of 1.5% between the control and treatment arms. Second, it conformed to an oft-cited heuristic for adequate sample size to enable development of an ICO model without significant overfitting because it included ≈10 events per predictor variable (EPV)^{21} in the control arm. For each treatment scenario, we performed 1000 simulated trials and analyzed each trial for the treatment’s main effect and its interaction with baseline risk using treatment outcome models developed in each trial.
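
The trial-drawing step can be sketched as repeated sampling with replacement from the parent population's true untreated risks; the constant relative risk of 0.75 below is an assumed effect size for illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_trial(p_untreated, relative_risk=0.75, n_per_arm=3000):
    """Draw one simulated trial by sampling subjects with replacement
    from an array of true untreated risks; the treated arm's risk is
    scaled by a constant relative risk (one of the scenario types)."""
    n_pop = len(p_untreated)
    ctrl = rng.integers(0, n_pop, n_per_arm)   # control-arm subject indices
    trt = rng.integers(0, n_pop, n_per_arm)    # treatment-arm subject indices
    y_ctrl = rng.binomial(1, p_untreated[ctrl])
    y_trt = rng.binomial(1, np.clip(relative_risk * p_untreated[trt], 0.0, 1.0))
    return ctrl, trt, y_ctrl, y_trt

# A risk-stratified HTE scenario could instead make relative_risk a
# function of p_untreated before sampling the treated outcomes.
```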

### Model Development

Before performing the simulated trials, we developed an optimized valid external baseline risk model for the large parent population using only the 6 true risk factors in each of the treatment scenario populations using logistic regression. Results generated using this external model serve as the gold standard for comparison against results generated using the internal models. Then within each of the simulated trials, we used logistic regression to develop ICO and IWT baseline risk models predicting cardiovascular outcomes using both the 6 true risk factors as well as the 6 confounders as predictor variables. The ICO model was derived on the control trial population only, whereas the IWT model was derived on the whole trial—both the control and treatment arms. For each subject in each simulated trial, we were then able to develop separate risk estimates from each of the 3 models: external, IWT, and ICO.
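
A minimal sketch of the model-development step, using a hand-rolled Newton-Raphson logistic fit in place of a statistics package (variable names in the trailing comment are hypothetical):

```python
import numpy as np

def fit_logistic(X, y, n_iter=30, ridge=1e-6):
    """Minimal Newton-Raphson logistic regression with intercept; the
    tiny ridge term keeps the Hessian invertible. A stand-in for the
    logistic regression used in the paper."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (y - p) - ridge * beta
        H = (Xd * (p * (1 - p))[:, None]).T @ Xd + ridge * np.eye(len(beta))
        beta = beta + np.linalg.solve(H, grad)
    return beta

def predict_risk(beta, X):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

# ICO: fit_logistic(X_all[is_control], y[is_control])  -- control arm only
# IWT: fit_logistic(X_all, y)  -- whole trial, blinded to arm
# where X_all stacks the 6 true risk factors and 6 confounders.
```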

### Estimating Treated and Untreated Outcomes

Mean treatment effect and HTE were evaluated using logistic regression including 3 terms: predicted risk from the baseline risk model (IWT, ICO, external), a treatment indicator variable, and an interaction between treatment and baseline risk. The treatment indicator coefficient measures whether, and to what extent, a constant relative risk reduction exists across the risk spectrum, whereas the treatment–baseline risk interaction term measures whether differential treatment benefit exists across parts of the risk spectrum (eg, higher relative risk reduction in high- versus low-risk patients). These models enabled the estimation of an individual’s risk of the overall outcome with and without treatment for each of the 3 baseline risk models. For analyses that compare regression coefficients, baseline risk information was included in each treatment model using the percentile rank of risk from each baseline risk model to standardize baseline risks across models. Without such standardization, baseline risk would be systematically lower or higher for the IWT group compared with the external and ICO models depending on the direction of treatment effect.
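
The 3-term treatment outcome model, including the percentile-rank standardization of baseline risk, might be sketched as follows (the logistic fit is again a minimal Newton-Raphson stand-in):

```python
import numpy as np

def percentile_rank(r):
    """Map predicted risks to percentile ranks in [0, 1] so that
    baseline risk is on a common scale across the three models."""
    order = np.argsort(np.argsort(r))
    return order / max(len(r) - 1, 1)

def fit_logistic(X, y, n_iter=30):
    # Minimal Newton-Raphson logistic regression with intercept.
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        H = (Xd * (p * (1 - p))[:, None]).T @ Xd + 1e-6 * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(H, Xd.T @ (y - p))
    return beta

def treatment_effect_model(baseline_risk, treated, y):
    """Fit the three-term model: percentile-ranked baseline risk, a
    treatment indicator, and their interaction."""
    risk_pct = percentile_rank(baseline_risk)
    X = np.column_stack([risk_pct, treated, risk_pct * treated])
    beta = fit_logistic(X, y)
    # beta[2] = main treatment effect; beta[3] = treatment-risk interaction
    return beta
```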

### Estimating Model Bias and Statistical Power

Bias was estimated for each of the 4 trial scenarios with all 3 models, using 3 separate approaches. First, the proportion of cases where the internal model outcome risk confidence interval included the optimized valid external point estimate was tabulated. Second, the average point estimate was computed for each model type within each risk decile and compared with the estimated actual risk reduction (untreated risk minus treated risk) across the series of simulated trials. Finally, the treatment and HTE regression coefficients and the degree of confidence interval coverage of the internal models were compared with the external model.
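
The first bias metric, confidence interval coverage of the external point estimate, reduces to a simple tabulation; a sketch:

```python
import numpy as np

def ci_coverage(external_point, internal_lo, internal_hi):
    """Proportion of predictions whose internal-model confidence
    interval contains the external-model point estimate (the first of
    the three bias metrics described above)."""
    external_point = np.asarray(external_point)
    inside = (np.asarray(internal_lo) <= external_point) & \
             (external_point <= np.asarray(internal_hi))
    return float(np.mean(inside))
```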

Power was estimated by the percentage of statistically significant main effects and interaction effects, when these effects were present in the parent population. Sensitivity, specificity, and intermodel reliability using Cohen κ were estimated by comparing the classifications of both coefficients of the ICO and IWT models to those using the optimized valid external model.
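
Cohen κ for agreement between two models' significance calls can be computed directly from its definition; a minimal sketch over binary calls (1 = significant, 0 = not significant):

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length binary sequences, eg,
    significant/not-significant calls across simulated trials."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                 # marginal rates
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)         # chance agreement
    if p_exp == 1.0:
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)
```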

### Sensitivity to Overfitting

Our base case trial scenarios were designed so that the ratio of EPV included in the model was slightly >1:10 to limit bias due to overfitting.^{21} In real-world scenarios, it may be that there are a large number of potentially important predictors or a relatively small number of outcomes, and thus the models will be more susceptible to overfitting. To test how ICO and IWT models perform in these contexts, we set up a separate series of trials by varying the number of patients to adjust the EPV from 5 to 15.
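
The sample sizes for these sensitivity analyses follow from simple arithmetic; a sketch assuming 12 candidate predictors (6 true risk factors plus 6 confounders) and an illustrative 5% control-arm event rate:

```python
def controls_needed(epv, n_predictors=12, event_rate=0.05):
    """Control-arm sample size needed to reach a target events-per-
    variable (EPV) ratio: EPV * predictors expected events, divided by
    the event rate. The 5% event rate is an assumed value."""
    return int(round(epv * n_predictors / event_rate))
```

Under these assumptions, EPVs of 5, 10, and 15 correspond to control arms of 1200, 2400, and 3600 patients, respectively.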

## Results

### Predictiveness of Baseline Risk Models

The predictiveness of the 3 baseline risk models is displayed in Table 2. As expected, the IWT and ICO models were slightly more predictive of outcomes in the simulated trials compared with the external model when <10 events per predictor variable were available. This increased predictiveness, however, is merely an indication of modest overfitting in the IWT and ICO models because the external model represents an optimized valid model, and predictiveness for both IWT and ICO models decreased as overfitting was reduced.

### Model Bias—Treatment Effect Comparisons Across Model Types

In the base case, which had 10 control EPVs, we found minimal bias in treatment outcome risk predictions when comparing the true treatment effects (those found using an optimized valid external model), compared with either internal model. The confidence intervals for point estimates from both internal models included the treatment outcome predicted by the external models for every treatment scenario in nearly all cases—≥99.7% of ICO and 99.9% of IWT outcome risk predictions included the external risk point estimate within their confidence intervals.

Both ICO and IWT models produced, on average, minimal bias in the estimation of mean treatment effect compared with the optimal valid external model. The distribution of mean IWT biases was slightly narrower than the distribution of mean ICO biases, and these results over the 1000 simulations are displayed in Figure 1.

The minimally biased predicted treatment effects of IWT and ICO models hold across the risk spectrum—neither IWT nor ICO resulted in substantial bias in either high- or low-risk patients regardless of treatment status or the presence/absence of HTE. IWT models predicted a fairly similar treatment effect size across the risk spectrum compared with external models. Conversely, ICO models tended to slightly underpredict treatment effect in lower-risk patients and overpredict treatment effect size in the highest-risk patients. This effect arises because the model fit is better in the control group, potentially inducing a spurious risk-by-treatment interaction. However, the magnitude of both effects is modest in our simulations (Figure 2). Bias was further quantified by comparing the regression treatment–risk interaction and treatment indicator coefficients from the internal treatment outcome models to the external models, and the results are summarized in the Data Supplement.

### Model Power

Because the overall treatment effect and HTE were specified in each of our treatment scenarios, we were able to determine how often the 3 modeling approaches correctly identified significant treatment indicator variables and treatment–interaction effects over the 1000 simulated trials. Both IWT and ICO models had excellent agreement with external models on the treatment indicator variables (all κ values >0.80). For the treatment–risk interaction, however, ICO models had higher sensitivity (ie, a lower false-negative rate) but at the cost of a higher false-positive rate (ie, finding a significant effect when there was no HTE). For example, the false-negative rate in scenario 1, in which HTE was present, was only 25.3% by ICO versus 47.3% by IWT, but in scenario 3, in which there was no HTE, ICO models had a false-positive HTE rate of 14% compared with 5.4% for IWT models (Table 3).

### Sensitivity to Overfitting

Risk of overfitting is an intrinsic weakness of ICO models compared with IWT models, given that they are derived on a smaller population. Potentially more important, differential fitting between treatment arms might induce or exaggerate HTE compared with an IWT model developed blinded to treatment. To explore the magnitude of these effects, we built IWT and ICO models with varying degrees of overfitting, ranging from highly overfit models (5 EPVs) to relatively conservative models (15 EPVs). For both models, the treatment effect bias was more tightly distributed around zero as the EPV increased. For all scenarios and for a given EPV, there was a slightly wider distribution of bias in ICO models than in IWT models (Figure 3). As anticipated, the direction of this bias tended to exaggerate HTE, with more benefit seen in higher-risk patients because of increased risk heterogeneity in the control arm.

## Discussion

This study has 2 key findings. First, we found that when internal models are developed appropriately (having ≥10 control group EPVs), they produced relatively unbiased estimates of treatment effect across the spectrum of baseline risk. Second, we found that IWT models should, in general, be preferred to ICO models as they offer a narrower distribution of treatment effect biases, slightly less sensitivity to overfitting, and less susceptibility to false-positive identification of HTE.

Under a range of treatment effect scenarios, both ICO and IWT models arrived at similar estimates of the study population’s heterogeneity of baseline risk and how the relative and absolute treatment effect varied as a function of baseline risk (risk-based HTE), just as long as model development considered no more than 1 risk factor for every 10 outcomes present in the control group (10 EPVs). This implies that the development of internal models is a practical strategy to test for HTE. This has the potential to greatly increase the degree to which clinical trial results can be used to better inform individual patient treatment decisions in instances in which external risk models do not already exist for a study’s outcome measure. We would still argue that when valid externally developed prediction tools are available, their use should be preferred, not because of greater internal validity but because the use of an external tool is a better test of the external validity of HTE results, likely offers superior calibration, and permits easier translation into practice through a variety of techniques.^{10,22} In particular, to make risk-based treatment decisions, the risk model must be well calibrated to the target population. Given the differences between patients that enroll in trials and patients in the broader population,^{23} this is an important limitation of directly applying internal risk models to the broader target population. Consequently, if clinically important HTE is discovered with an internally developed model, it should motivate the more difficult step of developing and validating an appropriate risk model for clinical translation.

When developing internal models, this study suggests that IWT models be favored over ICO models. This finding arose mainly because IWT models are less sensitive to overfitting, whereas ICO models that are differentially overfit across study arms show more bias in the magnitude of risk-based HTE. Furthermore, even in the absence of overfitting, ICO models are intrinsically more susceptible to falsely identifying risk-based HTE when none exists because of an underestimation of the true standard error.

In addition, this study illustrates the importance of avoiding overfitting by ensuring an adequate number of events per variable in the baseline risk model. We observed an increase in bias in both IWT and ICO models as EPV decreased. Our data suggest that applying the conventional rule of thumb (≥10 outcome events for each predictor variable in the data set) introduces minimal bias into treatment effect estimates across the spectrum of baseline risk, although this should ideally be tested in other scenarios. Although this might create a problem in some small studies or studies with a small number of outcomes, for most clinical trials, following this rule of thumb will simply require restricting consideration to risk factors for which there are strong empirical or theoretical reasons to suspect a substantial association with the risk of the outcome.

This study is limited by the assumptions of our simulation environment. Although this study explored a wide variety of treatment effects, in order to maintain a comprehensible degree of complexity, we made some potentially limiting assumptions in setting up our treatment scenarios. Most notably, we modeled a disease–treatment context with relatively low outcome risk, relying on moderately predictive models in the context of moderate degrees of confounding and assuming a constant risk of baseline adverse effects. Although assuming a constant absolute risk of adverse effects across patients at different outcome risks may seem like a substantive limitation, it is a commonly applied assumption,^{24,25} and studies that have modeled differential risks of benefits and harms have not found substantial divergence.^{26,27} Regardless, when modeling actual trials, the possibility of differential harms across risk strata should be explicitly explored empirically. Although it is uncertain whether our conclusions would hold under differing assumptions, these parameters were chosen because they were thought to be applicable to a wide variety of real-world clinical contexts.

Although simulation studies require potentially limiting assumptions, those assumptions were necessary in this context because of the intrinsic challenges of studying this question using data from actual trials. If internal and external models applied to an actual trial were discordant, it would be impossible to determine whether that disagreement reflected limited calibration of the external model in the trial population or biases in the internal model. Thus, this study provides important evidence that using carefully developed internal models results in similar estimates of treatment effect size and similar estimates of how treatment effect varies as an effect of risk compared with optimized valid external models. Internal models are powerful tools to estimate treatment effect across the risk spectrum. Because these models seem to introduce minimal bias compared with external models, they should have an important role in targeting cardiovascular treatments to appropriate treatment populations when external models are not available.

#### WHAT IS KNOWN

- For many treatments, there is known variability in the response to treatment.
- Even in some positive clinical trials, many patients receive little to no benefit, and some may even receive net harm.
- Identifying which patients are most likely to benefit from therapy is challenging.
- Multivariable risk prediction is a promising technique to determine whether treatment effect differs across patients (ie, whether heterogeneity of treatment effect exists); it relies on estimating treatment effect based on an individual’s baseline outcome risk and therefore requires a model to estimate a patient’s baseline risk.

#### WHAT THE STUDY ADDS

- In many cases, a valid external risk model may not exist for a given outcome, and it is tempting to develop such a model within a trial.
- In this study, we found that under a variety of treatment assumptions, internal risk models that are appropriately developed within a clinical trial introduce little bias compared with optimal external models.
- Internal models should have an important role in targeting cardiovascular treatments to appropriate treatment populations when external models are not available.

## Acknowledgments

This article was prepared using LRC-CPPT research materials obtained from the National Heart, Lung, and Blood Institute Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the LRC-CPPT or the National Heart, Lung, and Blood Institute. We also thank Tim Hofer for his assistance in developing the code to develop the simulated populations.

## Sources of Funding

Funded by a Patient Centered Outcome Research Institute Methodology Grant. J.F.B. was funded by an Advanced Fellowship from the Department of Veterans Affairs and by National Institute of Neurological Disorders and Stroke grant number 1K08NS082597.

## Disclosures

None.

## Footnotes

The Data Supplement is available at http://circoutcomes.ahajournals.org/lookup/suppl/doi:10.1161/CIRCOUTCOMES.113.000497/-/DC1.

- Received July 23, 2013.
- Accepted December 6, 2013.

- © 2014 American Heart Association, Inc.

## References

- 1.↵ Kent DM, Hayward RA, Griffith JL, Vijan S, Beshansky JR, Califf RM, Selker HP
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵ Grundy SM, Cleeman JI, Merz CN, Brewer HB Jr., Clark LT, Hunninghake DB, Pasternak RC, Smith SC Jr., Stone NJ
- 7.↵ Robson J
- 8.↵ Becker RC, Meade TW, Berger PB, Ezekowitz M, O’Connor CM, Vorchheimer DA, Guyatt GH, Mark DB, Harrington RA. (8th Edition). Chest. 2008;133(6 suppl):776S–814S.
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵ Kent DM, Ruthazer R, Griffith JL, Beshansky JR, Concannon TW, Aversano T, Grines CL, Zalenski RJ, Selker HP
- 16.↵ Kent DM, Selker HP, Ruthazer R, Bluhmki E, Hacke W
- 17.↵ Dorresteijn JA, Visseren FL, Ridker PM, Wassink AM, Paynter NP, Steyerberg EW, van der Graaf Y, Cook NR
- 18.↵ Wilson PW, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB
- 19.↵
- 20.↵ Ridker PM, Paynter NP, Rifai N, Gaziano JM, Cook NR
- 21.↵
- 22.↵ Ridker PM, MacFadyen JG, Fonseca FA, Genest J, Gotto AM, Kastelein JJ, Koenig W, Libby P, Lorenzatti AJ, Nordestgaard BG, Shepherd J, Willerson JT, Glynn RJ
- 23.↵
- 24.↵ Dorresteijn JA, Boekholdt SM, van der Graaf Y, Kastelein JJ, LaRosa JC, Pedersen TR, DeMicco DA, Ridker PM, Cook NR, Visseren FL
- 25.↵ Dorresteijn JA, Visseren FL, Ridker PM, Paynter NP, Wassink AM, Buring JE, van der Graaf Y, Cook NR
- 26.↵ Sussman JB, Vijan S, Choi H, Hayward RA
- 27.↵ Whiteley WN, Adams HP Jr., Bath PM, Berge E, Sandset PM, Dennis M, Murray GD, Wong KSL, Sandercock PAG

Burke JF, Hayward RA, Nelson JP, Kent DM. Using Internally Developed Risk Models to Assess Heterogeneity in Treatment Effects in Clinical Trials. Circulation: Cardiovascular Quality and Outcomes. 2014;7:163–169. Originally published January 21, 2014. https://doi.org/10.1161/CIRCOUTCOMES.113.000497