Clinical risk scores can cheat the AUC when the time at which an adverse event occurs in a positive-class stay tends to exceed the duration of a negative-class stay. Risk scores should therefore be compared with uniformly random scores generated at the same times.
Informally, cheating the area under the receiver operating characteristic curve (AUC) occurs when risk scores that are i.i.d. standard uniform random variables yield an AUC greater than $0.5$.
AUC cheating
Imagine that, at the beginning of every hour of each patient’s hospital stay, you estimate the probability that they experience sepsis onset during their stay (a positive-class stay). At the first time that this estimate exceeds a predetermined threshold $t \in [0,1]$, you produce an alert. If this estimate never exceeds $t$ before the end of their stay, then you never produce an alert.
Define $\tau_{onset}$ to be the time of sepsis onset, which is $\infty$ if the patient does not get sepsis during their stay (a negative-class stay), and $\tau_{end}$ to be the time that the stay ends, both measured in hours so that they count the scores generated. If you estimate the probability of sepsis onset by generating an i.i.d. sequence of standard uniform random variables, then the true and false positive rates are
$$ TPR (t) = 1 - E_+ (t^{\tau_{onset}}) \quad \text{and} \quad FPR (t) = 1 - E_- (t^{\tau_{end}}),$$
where $E_+$ and $E_-$ denote expectation with respect to $P_+$ and $P_-$, the conditional distributions of positive- and negative-class stays. A straightforward calculation shows that
$$ AUC = E_{\tau_{onset} \sim P_+, \tau_{end} \sim P_-} \left( \frac{\tau_{onset}}{\tau_{onset} + \tau_{end}} \right).$$
In other words, the AUC is the expected ratio of $\tau_{onset}$ to $\tau_{onset} + \tau_{end}$, where $\tau_{onset}$ is the sepsis onset time of a random positive-class stay and $\tau_{end}$ is the stay length of a random negative-class stay.
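One way to fill in the calculation is sketched here. It assumes that $\tau_{onset}$ and $\tau_{end}$ count the hourly scores generated (so both are at least $1$), that positive- and negative-class stays are drawn independently, and that expectation and integration can be interchanged. Both rates decrease from $1$ to $0$ as $t$ goes from $0$ to $1$, so integrating the true positive rate against the false positive rate gives

$$ AUC = \int_0^1 TPR (t) \left( - \frac{d}{dt} FPR (t) \right) dt = \int_0^1 \left( 1 - E_+ (t^{\tau_{onset}}) \right) E_- \left( \tau_{end} \, t^{\tau_{end} - 1} \right) dt.$$

Since $\int_0^1 \tau_{end} \, t^{\tau_{end} - 1} \, dt = 1$ and $\int_0^1 \tau_{end} \, t^{\tau_{onset} + \tau_{end} - 1} \, dt = \frac{\tau_{end}}{\tau_{onset} + \tau_{end}}$, this becomes

$$ AUC = 1 - E_{\tau_{onset} \sim P_+, \tau_{end} \sim P_-} \left( \frac{\tau_{end}}{\tau_{onset} + \tau_{end}} \right) = E_{\tau_{onset} \sim P_+, \tau_{end} \sim P_-} \left( \frac{\tau_{onset}}{\tau_{onset} + \tau_{end}} \right).$$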
Example. If sepsis onset always occurs at the beginning of the $n$th hour of a positive-class stay and every negative-class stay lasts $m$ hours, then hourly, i.i.d., standard uniform risk scores result in an AUC of $\frac{n}{n+m},$ which exceeds $0.5$ when $n>m$.
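The example can be checked numerically. The following is a minimal simulation sketch, not a reference implementation: the function name `simulate_random_score_auc` and its parameters are hypothetical, and it assumes that exactly $n$ scores are generated for each positive-class stay and $m$ for each negative-class stay. It relies on the fact that an alert fires at threshold $t$ exactly when the largest score of the stay exceeds $t$, so each stay can be summarised by its maximum score.

```python
import numpy as np


def simulate_random_score_auc(n_pos_scores, n_neg_scores, n_stays=20_000, seed=0):
    """Estimate the stay-level AUC of hourly i.i.d. uniform scores by simulation.

    n_pos_scores: scores generated per positive-class stay (n in the example).
    n_neg_scores: scores generated per negative-class stay (m in the example).
    """
    rng = np.random.default_rng(seed)
    # An alert fires at threshold t iff some score exceeds t, i.e. iff the
    # maximum score of the stay exceeds t.
    pos_max = rng.random((n_stays, n_pos_scores)).max(axis=1)
    neg_max = rng.random((n_stays, n_neg_scores)).max(axis=1)
    # AUC = P(positive max > negative max), estimated over all pairs of stays:
    # for each positive stay, count the negative stays with a smaller maximum.
    neg_sorted = np.sort(neg_max)
    n_neg_below = np.searchsorted(neg_sorted, pos_max, side="left")
    return float(n_neg_below.mean() / n_stays)


auc = simulate_random_score_auc(n_pos_scores=36, n_neg_scores=12)
print(f"simulated AUC: {auc:.3f}, formula n / (n + m): {36 / (36 + 12):.3f}")
```

With $n = 36$ and $m = 12$, the simulated value should land close to $\frac{36}{48} = 0.75$, even though the scores carry no information about the patient.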
Some observations:
- The AUC can be cheated when the number of scores generated before the time of an adverse event (in a positive-class stay) tends to be greater than the number of scores generated before the end of a negative-class stay.
- When the opposite is true, uniformly random risk scores result in an alert system with an AUC below $0.5$. (Note that, in this case, the curve cannot be “flipped” along the diagonal to give an alert system with AUC above $0.5$.)
- AUC cheating is not possible when at most one score is generated for each stay.
Practical implications
There are several practical implications for the development and evaluation of risk scores.
- The evaluation of a dynamic risk score should include a comparison with i.i.d. standard uniform risk scores generated at the same times.
- Alert systems that can generate more than one score per stay may have a structural advantage or disadvantage relative to those that generate at most one score per stay, in terms of AUC.
  - For example, it may be better to generate a score at most once, at a carefully chosen time, than it is to generate the score hourly.
  - There are situations in which a “smart” score, generated at most once, performs worse than a “dumb” score, generated many times.
- Since inclusion criteria affect the distributions of onset times and stay lengths, they can affect the potential for AUC cheating.
  - For example, suppose that you switch from excluding stays shorter than 24 hours to excluding stays shorter than 12 hours. This could disproportionately affect negative-class stays, making them relatively shorter and thereby increasing the potential for AUC cheating.
- Differences between internal and external validation performance may be attributable to differences in the distributions of onset times and stay lengths.
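The comparison with i.i.d. standard uniform scores recommended above does not require simulation, because the AUC of such scores is given by the expected-ratio formula. Below is a minimal sketch that evaluates that formula over every pairing of a positive-class onset time with a negative-class stay length. The function name `random_baseline_auc` and the cohort values are hypothetical, and the sketch assumes one score per hour, so that each time in hours equals the number of scores generated.

```python
import numpy as np


def random_baseline_auc(onset_hours, end_hours):
    """AUC of hourly i.i.d. uniform scores for an observed cohort.

    onset_hours: onset times (in hours) of the positive-class stays.
    end_hours:   stay lengths (in hours) of the negative-class stays.
    """
    onset = np.asarray(onset_hours, dtype=float)
    end = np.asarray(end_hours, dtype=float)
    # Average onset / (onset + end) over all (positive, negative) pairs,
    # i.e. the expectation under the empirical distributions.
    ratios = onset[:, None] / (onset[:, None] + end[None, :])
    return float(ratios.mean())


# Hypothetical cohort, for illustration only (not real data).
onset_hours = [30, 48, 72]
end_hours = [12, 24, 36, 20]
baseline = random_baseline_auc(onset_hours, end_hours)
print(f"random-score baseline AUC: {baseline:.3f}")
```

A risk score evaluated on the same cohort would then be judged against this baseline rather than against $0.5$. Recomputing the baseline after changing the inclusion criteria, or on an external validation cohort, gives a rough sense of how much of a change in AUC could be attributable to timing alone.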