An Investigation of Confidence Intervals for Binary Prediction Performance with Small Sample Size
Xiaoyu Jiang, Steve Lewitzky, Martin Schumacher
BioMarker Development, Novartis Institutes for BioMedical Research
Ascona Workshop 2013: Statistical Genomics and Data Integration for Personalized Medicine | May 14, 2013

BioMarker Development at Novartis
Identify, develop, and deliver biomarkers for customized therapy (General Medicine).
Biomarker uses
• Predict patient outcome
• Monitor patient progress
• Identify disease subtypes
• Understand drug mechanism
Patient outcomes of interest
• Drug response and patient stratification
• Risk of adverse drug reaction (ADR)
• Optimal dosing
Biomarker types
• Genetic (DNA)
• RNA (mRNA, miRNA)
• Protein
• Metabolite
• Imaging
• Others (e.g., clinical chemistry)
Statistics methodologies
• Statistical predictive modeling
• Resampling

Overview of Workflow
Binary predictive performance estimation (when an independent sample is unavailable), via resampling:
• Repeatedly divide the dataset into a training set and a test set
• Perform feature selection on the training set
• Train the classifier on the training set
• Predict the test samples
• Evaluate the predictions

Immediate Issues for Estimating Predictive Performance
In the context of small proof-of-concept (PoC) trials for biomarker discovery:
Is there a robust method for generating point estimate(s) of predictive performance?
• Many methods: leave-one-out cross-validation, repeated k-fold cross-validation, bootstrapping (Molinaro et al. 2005; Varma and Simon 2006; Jiang and Simon 2007)
• Mostly for misclassification rate
How should the variability of such point estimate(s) be evaluated (confidence intervals)?
• Fewer methods
• Only for misclassification rate

Binary Predictive Model Performance Evaluation: Moving Parts
Classifier: linear discriminant analysis (LDA)
Feature selection: univariate Welch t-test
Moving parts affecting predictive performance:
• Sample size
• Ratio of class sizes
• Number of truly predictive markers
• Number of candidate markers
• Effect size (ES)
• Resampling method
Objectives of the simulation study:
• Understand the behavior of the confidence interval estimation method
• Identify a resampling method for point estimates in our setting

Resampling Methods for Point Estimate: Repeated Stratified k-fold Cross-Validation
[Figure: n = 10 subjects shuffled and partitioned into 5 folds; each fold serves once as the test set]
• Leave-one-out cross-validation (LOOCV): k = n
• The LOOCV estimate is unbiased but has large variance
• Published studies have shown that repeated 5-fold and 10-fold CV produce good point estimates of the misclassification error rate

Resampling Methods for Point Estimate: Stratified Bootstrapping with 0.632 Correction
[Figure: subjects drawn with replacement form the training set; subjects never drawn form the test set]
P(a given subject appears in the bootstrap sample) = lim_{n→∞} [1 − (1 − 1/n)^n] = 1 − e^(−1) ≈ 0.632
0.632 correction: θ̂_0.632 = 0.632 · θ̂_Bootstrap + 0.368 · θ̂_Resubstitution

Resampling Method for Confidence Intervals: Bootstrap Case Cross-Validation with Bias Correction (Jiang, Varma and Simon 2007)
Double resampling is needed to estimate the sampling distribution of θ̂: an outer bootstrap loop, with LOOCV run inside each bootstrap sample.
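As a quick numerical check of the 0.632 correction above, here is a minimal Python sketch (function names are illustrative, not from the talk): it verifies that the probability of a subject appearing in a bootstrap sample approaches 1 − e^(−1), and applies the weighted average of the bootstrap and resubstitution estimates.

```python
import math

def prob_in_bootstrap(n):
    """Probability that a given subject appears at least once in a
    bootstrap sample of size n drawn with replacement from n subjects."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def theta_632(theta_bootstrap, theta_resub):
    """0.632-corrected estimate: weighted average of the (pessimistic)
    bootstrap estimate and the (optimistic) resubstitution estimate."""
    return 0.632 * theta_bootstrap + 0.368 * theta_resub

# The inclusion probability is already close to the limit at modest n.
print(prob_in_bootstrap(40))        # close to 0.632 for n = 40
print(1.0 - math.exp(-1.0))         # the limiting value, 1 - e^(-1)

# Illustrative values only: an optimistic resubstitution AUC of 0.85
# pulled toward a pessimistic bootstrap AUC of 0.70.
print(theta_632(theta_bootstrap=0.70, theta_resub=0.85))
```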
[Schematic: from the data, draw B bootstrap samples and run LOOCV within each, yielding estimates θ̂_1, …, θ̂_B]
Bias correction: shift the bootstrap estimates by θ̂_LOOCV − θ̄_BCCV (the LOOCV estimate on the full data minus the mean of the bootstrap-loop estimates).
(1 − 2α) empirical confidence interval: (θ̂_α, θ̂_(1−α)), the α and 1 − α quantiles of the bias-corrected estimates.

Simulation Study: Predictive Performance Metrics for Binary Prediction
Predictive performance measures
• Positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity
• Overall measures: accuracy, area under the ROC curve (AUC)
Evaluation of point estimates: bias, variance, RMSE
Evaluation of confidence intervals: empirical coverage probability

Simulation Study: Simulating Continuous Biomarker Data
n samples from class 1 and class 0 with ratio r
• n = 20, 30, 40, …
• r = 1:1, 1:2, 1:3, …
p = 1000 continuous biomarker variables follow a multivariate normal (MVN) distribution: x_{p×1} ~ MVN(μ_{p×1}, Σ_{p×p})
• 10 predictive markers with mean 1 for class 1 subjects and mean 0 for class 0 subjects; all other markers have mean 0 for all subjects
• The top 2 markers from feature selection are included in the model
• The diagonal elements of Σ are varied to change the effect size of the predictive markers
• Predictive markers are correlated (correlation coefficient 0.4); the other markers have low to no correlation
• 500 simulations

Simulation Study: Calculating the True Value of Predictive Performance Metrics
What is the true AUC of a predictive model built on a dataset of sample size n = 40 with a class 1 to class 0 ratio of 1:1?
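The biomarker simulation scheme above can be sketched in pure Python. The shared-factor construction used here to induce the 0.4 pairwise correlation among the predictive markers is an assumption for illustration (the slides state the correlation but not how Σ is built), and effect size is fixed at 1 rather than varied via the diagonal of Σ.

```python
import random

RHO = 0.4           # pairwise correlation among the predictive markers (from the slides)
N_PREDICTIVE = 10   # number of truly predictive markers

def simulate_subject(label, n_markers=1000, rho=RHO):
    """One simulated subject: the first N_PREDICTIVE markers are
    equicorrelated with pairwise correlation `rho` (shared-factor
    construction, unit variance) and mean `label` (1 for class 1,
    0 for class 0); the remaining markers are independent N(0, 1)."""
    shared = random.gauss(0.0, 1.0)
    x = []
    for j in range(n_markers):
        if j < N_PREDICTIVE:
            # var = rho + (1 - rho) = 1; cov between two such markers = rho
            noise = rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * random.gauss(0.0, 1.0)
            x.append(float(label) + noise)
        else:
            x.append(random.gauss(0.0, 1.0))
    return x

random.seed(0)
# n = 40 subjects with a 1:1 class ratio, as in the true-AUC question above
cohort = [simulate_subject(label=i % 2) for i in range(40)]
```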
To assess the bias of the resampling-based estimates:
• Generate an independent dataset D_large of large size (n = 10000) from the same simulation scheme
• Train the predictive model on the dataset of n = 40
• Predict the class labels of D_large
• Evaluate the predictions against the true class labels of D_large

Simulation Results: Point Estimate
[Figure: RMSE of the AUC estimate for LOOCV, 5-fold CV, and 0.632 bootstrap (B632), with varying sample size (ratio 1:1, ES = 1.4), varying class ratio (n = 80, ES = 1.4), and varying effect size (n = 80, ratio 1:1)]
• The three methods perform similarly in terms of point estimate bias.
• 0.632 bootstrapping has the smallest variance and hence the smallest RMSE.

Simulation Results: Empirical Coverage Probability with Varying Sample Size
[Figure: coverage for n = 20, 40, 60, 80, 120 at ratio 1:1, ES = 1.4]

Simulation Results: Empirical Coverage Probability with Varying Ratio
[Figure: coverage for class ratios C1:C0 = 1:1, 1:2, 1:3, 1:4, 1:5 at n = 80, ES = 1.4]

Simulation Results: Empirical Coverage Probability with Varying Effect Size
[Figure: coverage for ES = 0.8, 1, 1.4, 2 at n = 80, ratio 1:1]

Summary and Discussion
• 0.632-corrected bootstrapping has the lowest RMSE for point estimation.
• As expected, confidence interval coverage is best for larger sample sizes and a class size ratio of 1:1; coverage appears reasonable for sample size ≥ 60.
• A small effect size leads to over-coverage; a large effect size leads to under-coverage.
Need to better understand:
• The different behavior of AUC relative to the other metrics
• The BCCV-BC resampling method
Challenges and opportunities for new statistical methods:
• A robust way of integrating different marker modalities
• Exploring new methods for confidence interval estimation
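The empirical coverage probability used throughout the results can be illustrated with a toy example. Rather than the BCCV-BC interval for AUC (which requires the full double-resampling machinery), this sketch checks the coverage of a plain percentile bootstrap interval for a sample mean; all names and parameter choices are illustrative assumptions, not from the talk.

```python
import random

def percentile_ci(samples, stat, n_boot=200, alpha=0.025):
    """(1 - 2*alpha) empirical-percentile interval (theta_alpha, theta_{1-alpha})
    from n_boot bootstrap replicates of `stat`."""
    reps = sorted(
        stat([random.choice(samples) for _ in range(len(samples))])
        for _ in range(n_boot)
    )
    return reps[int(alpha * n_boot)], reps[int((1 - alpha) * n_boot) - 1]

def mean(xs):
    return sum(xs) / len(xs)

random.seed(2013)
true_value, hits, n_sim = 0.0, 0, 200
for _ in range(n_sim):
    # One simulated study of n = 40 subjects; count whether the
    # interval captures the known true value.
    data = [random.gauss(true_value, 1.0) for _ in range(40)]
    lo, hi = percentile_ci(data, mean)
    hits += lo <= true_value <= hi

coverage = hits / n_sim
print(coverage)  # nominal level is 0.95; finite-sample coverage is typically a bit below
```

Repeating this over a grid of sample sizes, class ratios, and effect sizes, with the interval method swapped for BCCV-BC and the statistic for AUC, is exactly the shape of the coverage experiments reported above.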