Sudat, S., UC Berkeley. ProQuest ID: Sudat_berkeley_0028E_12729. Merritt ID: ark:/13030/m5xw4pw5. Retrieved from https://escholarship.org/uc/item/9z03362s, 2012 Jun 01
Causal inference-inspired semi-parametric methods of measuring variable importance are well designed to answer questions of interest in health settings. Unlike traditional regression approaches, such variable importance measures are based on causal parameters that have straightforward real-world definitions, regardless of the approach used to estimate them.
Parameters of regression models, in contrast, are not at all straightforward to interpret in real-world settings, because their definition relies completely on the correctness of the pre-specified model. Prediction-focused machine learning methods can avoid the issues of model pre-specification, but still do not provide estimates of variable importance that can be easily interpreted; the set of predictors chosen can also be highly variable. Semi-parametric methods combine the best of both approaches, and are able to utilize data-adaptive estimation algorithms while still returning a parameter estimate that is meaningful and can be simply understood.
In this dissertation, semi-parametric methods to assess variable importance are applied to three real-world health applications: the relationship between types of water contact and the prevalence of schistosomiasis infection in rural China; HIV-1 treatment regimen genotype susceptibility scores and their relationship with the rate of virologic suppression; and, for heart failure patients, the impact of a telemanagement program on rates of hospital readmission, together with the association of multiple risk factors with those rates. Emphasized are (1) the choice of parameter of interest as motivated by the research question, (2) estimator choice based on a consideration of theoretical properties and performance under non-ideal conditions, and (3) the use, during estimation, of machine learning algorithms and of algorithms that utilize multiple candidate models. Four different causal parameters are defined and described, and multiple estimators are considered.
Each data analysis presents different opportunities to investigate aspects of causal inference-based semi-parametric methods. In the schistosomiasis analysis, a traditional regression approach is compared with semi-parametric methods. Estimator performance is compared in the HIV analysis, particularly in the context of the observed extreme violations of the experimental treatment assignment (ETA) assumption. This comparison includes the G-computation estimator, the inverse-probability-of-censoring-weighted (IPCW) estimator, its double-robust counterpart (DR-IPCW), and the targeted maximum likelihood estimator (TMLE). The heart failure analysis addresses differences in causal parameter definition for a community-level treatment, and the related assumptions that must be added to the typical theoretical framework. Also included in this analysis is a comparison of super learning with traditional regression in terms of predictive performance.
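To make the contrast between estimators concrete, the following is a minimal sketch, not the dissertation's analysis. It assumes a simulated point-treatment setting with a single binary confounder `W`, a binary exposure `A`, and a binary outcome `Y`, and it uses a simple inverse-probability-of-treatment-weighted estimator in place of the censoring-weighted (IPCW) estimator named above. The data-generating values and the function names `simulate`, `g_computation`, and `ipw` are hypothetical, chosen only for illustration. The target is the marginal risk difference, which in this simulation equals 0.2 by construction.

```python
import numpy as np


def simulate(n, seed=0):
    """Simulated data (an assumption for illustration, not real study data)."""
    rng = np.random.default_rng(seed)
    w = rng.binomial(1, 0.5, size=n)                 # binary confounder
    a = rng.binomial(1, np.where(w == 1, 0.7, 0.3))  # exposure depends on W
    y = rng.binomial(1, 0.2 + 0.2 * a + 0.3 * w)     # true risk difference is 0.2
    return w, a, y


def g_computation(w, a, y):
    """Average the stratum-specific risk differences over the distribution of W."""
    estimate = 0.0
    for level in (0, 1):
        in_stratum = w == level
        risk_exposed = y[in_stratum & (a == 1)].mean()
        risk_unexposed = y[in_stratum & (a == 0)].mean()
        estimate += in_stratum.mean() * (risk_exposed - risk_unexposed)
    return estimate


def ipw(w, a, y):
    """Weight each observation by the inverse of its estimated exposure probability."""
    g = np.where(w == 1, a[w == 1].mean(), a[w == 0].mean())  # estimated P(A=1 | W)
    return np.mean(a * y / g) - np.mean((1 - a) * y / (1 - g))


w, a, y = simulate(20000)
naive = y[a == 1].mean() - y[a == 0].mean()  # unadjusted comparison, confounded by W
gcomp = g_computation(w, a, y)
ipw_est = ipw(w, a, y)
print(f"naive={naive:.3f}  g-computation={gcomp:.3f}  IPW={ipw_est:.3f}")
```

In this sketch the two adjusted estimates coincide up to floating-point error, because both the outcome regression and the exposure mechanism are saturated in the single binary confounder. With richer covariates or misspecified working models they would generally differ, and weights based on exposure probabilities near 0 or 1 (the ETA violations discussed above) can make the weighted estimator unstable.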