I am an incoming assistant professor of biostatistics in the Vaccine and Infectious Disease Division of the Fred Hutchinson Cancer Center. I obtained my PhD from the Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, under the supervision of Dr. Dylan Small. I am broadly interested in the design and analysis of observational studies, instrumental variables, and the application of causal inference and statistical modeling in medicine and public health. I collaborate actively with researchers across multiple disciplines, in particular on cardiac surgery and on cerebral malaria in sub-Saharan Africa.
Download my resumé.
Link to my Google Scholar page.
PhD in Statistics, 2022
University of Pennsylvania
BS in Mathematics, 2017
Harvey Mudd College
In malaria endemic areas, a high proportion of children have detectable parasitemia but show no clinical symptoms. When comatose from a cause other than malaria, this group confounds the cerebral malaria (CM) definition, making accurate diagnosis challenging. One important biomarker of CM is malarial retinopathy, a set of specific features visible in the ocular fundus. In this study, we quantified the contribution of malarial retinopathy in discriminating malaria-caused coma from non-malaria-caused coma. We estimated that 10% of our study cohort of n = 1,192 patients who met the WHO clinical definition of CM in Malawi had non-malarial coma based on a Gaussian mixture model using the parasite protein Plasmodium falciparum histidine-rich protein-2 (PfHRP2). A classification based on platelets, white blood cells and retinopathy significantly improved the discriminative power of a previously established model including only platelets plus white blood cells (AUROC: 0.89 vs. 0.75, p-value < 0.001). We conclude that malarial retinopathy is highly predictive of malaria-caused vs. non-malaria-caused coma and recommend an ocular funduscopic examination to determine malarial retinopathy status be included in the assessment of parasitemic comatose African children.
The systolic, diastolic, and mean CBFVs and the Windkessel notch explain more than 98% of the total variation in the TCD data. A statistical learning algorithm identified 7 phenotypic clusters in the MCA TCD data. The phenotypic clusters are not associated with demographics, clinical features, or laboratory measurements. TCD measurements are associated with survival hazard rates after adjusting for age and gender.
One central goal of the design of observational studies is to embed non-experimental data into an approximate randomized controlled trial using statistical matching. Researchers then make the randomization assumption in their downstream outcome analysis. For a matched-pair design, the randomization assumption states that the treatment assignments across all matched pairs are independent, and that the probability of the first subject in each pair receiving treatment and the other control is the same as that of the first receiving control and the other treatment. In this article, we develop a novel framework for testing the randomization assumption based on solving a clustering problem with side-information using modern statistical learning tools. Our testing framework is nonparametric, finite-sample exact, and distinct from previous proposals in that it can be used to test a relaxed version of the randomization assumption called the biased randomization assumption. One important by-product of our testing framework is a quantity called the residual sensitivity value (RSV), which quantifies the level of minimal residual confounding due to observed covariates not being well matched. We advocate taking the RSV into account in the downstream primary analysis. The proposed methodology is illustrated by re-examining a famous observational study concerning the effect of right heart catheterization (RHC) in the initial care of critically ill patients.
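To make the randomization assumption concrete, here is a minimal sketch of the kind of downstream analysis that relies on it. This is not the testing framework proposed in the article. It is a standard sign-flip randomization test of Fisher's sharp null for matched pairs. The function name, the choice of test statistic, and the example differences are all hypothetical.

```python
import random


def paired_randomization_pvalue(diffs, n_draws=10000, seed=0):
    """Monte Carlo two-sided p-value for Fisher's sharp null in matched pairs.

    diffs: treated-minus-control outcome differences, one per matched pair.
    Under the randomization assumption and the sharp null of no effect,
    each difference is equally likely to carry either sign, independently
    across pairs, so the null distribution is obtained by flipping signs.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_draws):
        # Flip each pair's sign with probability 1/2, independently.
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value valid (never zero).
    return (hits + 1) / (n_draws + 1)


if __name__ == "__main__":
    # Hypothetical treated-minus-control differences for 8 matched pairs.
    example = [1.2, 0.4, -0.3, 2.1, 0.8, 1.5, -0.1, 0.9]
    print(paired_randomization_pvalue(example))
```

If the matching leaves residual imbalance in observed covariates, the equal-probability sign flips above are no longer justified, which is the concern the testing framework and the RSV address.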
When drawing causal inference from observational data, there is almost always concern about unmeasured confounding. One way to tackle this is to conduct a sensitivity analysis. One widely used sensitivity analysis framework hypothesizes the existence of a scalar unmeasured confounder U and asks how the causal conclusion would change were U measured and included in the primary analysis. Works along this line often make various parametric assumptions on U for the sake of mathematical and computational simplicity. In this article, we substantively further this line of research by developing a valid sensitivity analysis that leaves the distribution of U unrestricted. Our semiparametric estimator has three desirable features compared to many existing methods in the literature. First, our method allows for a larger and more flexible family of models, and mitigates observable implications (Franks et al., 2019). Second, our method works seamlessly with any primary analysis that models the outcome regression parametrically. Third, our method is easy to use and interpret. We construct both pointwise confidence intervals and confidence bands that are uniformly valid over a given sensitivity parameter space, thus formally accounting for unknown sensitivity parameters. We apply our proposed method to an influential yet controversial study of the causal relationship between war experiences and political activeness using observational data from Uganda.
Among 261,860 patients (123,702 valve and 138,158 isolated CABG), the GLMM analysis demonstrated that the strongest predictor of intraoperative TEE use was the hospital where the surgery occurred (MOR for TEE of 2.57 in valve and 4.16 in isolated CABG). The TEE staffing variable reduced the previously unexplained across-hospital variability by 9% in valve and 21% in isolated CABG, and hospitals with anesthesiologist TEE staffing (vs mixed) were more likely to use TEE in both valve and isolated CABG surgery (MOR for TEE of 1.21 in valve and 1.84 in isolated CABG). Hospital practice was the strongest predictor of TEE use overall, and in isolated CABG surgery, anesthesiologist TEE staffing was a primary predictor of TEE use.
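For readers unfamiliar with the median odds ratio (MOR), the sketch below uses the standard definition for a random-intercept logistic model, MOR = exp(sqrt(2 * variance) * z), where z is the 75th percentile of the standard normal distribution. The abstract does not state exactly how the MOR was computed, so treating it as this textbook quantity is an assumption, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745).
Z75 = NormalDist().inv_cdf(0.75)


def median_odds_ratio(var_u):
    """MOR implied by the hospital random-intercept variance var_u (logit scale)."""
    return math.exp(math.sqrt(2.0 * var_u) * Z75)


def variance_from_mor(mor):
    """Invert the formula: random-intercept variance implied by a given MOR."""
    return (math.log(mor) / Z75) ** 2 / 2.0


if __name__ == "__main__":
    # MOR values quoted in the abstract, mapped back to implied variances
    # under the assumed random-intercept logistic model.
    for label, mor in [("valve", 2.57), ("isolated CABG", 4.16)]:
        print(label, round(variance_from_mor(mor), 3))
```

An MOR of 1 corresponds to no between-hospital variation. Larger values mean that two otherwise identical patients at two randomly chosen hospitals tend to have more different odds of receiving TEE.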
Subclassification and matching are often used to adjust for observed covariates in observational studies; however, they are largely restricted to relatively simple study designs with a binary treatment. One important exception is Lu et al. (2001), who considered optimal pair matching with a continuous treatment dose. In this article, we propose two criteria for optimal subclassification/full matching based on subclass homogeneity with a continuous treatment dose, and propose an efficient polynomial-time algorithm that is guaranteed to find an optimal subclassification with respect to one criterion and serves as a 2-approximation algorithm for the other criterion. We discuss how to incorporate treatment dose and use appropriate penalties to control the number of subclasses in the design. Via extensive simulations, we systematically examine the performance of our proposed method, and demonstrate that combining our proposed subclassification scheme with regression adjustment helps reduce model dependence for parametric causal inference with a continuous treatment dose. We illustrate the new design and how to conduct randomization-based statistical inference under the new design using Medicare and Medicaid claims data to study the effect of transesophageal echocardiography (TEE) during CABG surgery on patients' 30-day mortality rate.
Cerebral malaria is still a major cause of death in children in sub-Saharan Africa. Among survivors, debilitating neurological sequelae can leave children with permanent cognitive impairments and societal stigma, resulting in taxing repercussions for their families. This study investigated the effect of delay in presentation to medical care on outcome in children with cerebral malaria in Malawi.
Social distancing is widely acknowledged as an effective public health policy combating the novel coronavirus. But extreme forms of social distancing like isolation and quarantine have costs and it is not clear how much social distancing is needed to achieve public health effects. In this article, we develop a design-based framework to test the causal null hypothesis and make inference about the dose-response relationship between reduction in social mobility and COVID-19 related public health outcomes. We first discuss how to embed observational data with a time-independent, continuous treatment dose into an approximate randomized experiment, and develop a randomization-based procedure that tests if a structured dose-response relationship fits the data. We then generalize the design and testing procedure to accommodate a time-dependent treatment dose in a longitudinal setting. Finally, we apply the proposed design and testing procedures to investigate the effect of social distancing during the phased reopening in the United States on public health outcomes using data compiled from sources including Unacast, the United States Census Bureau, and the County Health Rankings and Roadmaps Program. We rejected a primary analysis null hypothesis that stated the social distancing from April 27, 2020, to June 28, 2020, had no effect on the COVID-19-related death toll from June 29, 2020, to August 2, 2020 (p-value < 0.001), and found that it took more reduction in mobility to prevent exponential growth in case numbers for non-rural counties compared to rural counties.
We examine the role of textual data as study units when conducting causal inference by drawing parallels between human subjects and organized texts. We elaborate on key causal concepts and principles, and expose some ambiguities and occasional fallacies. To facilitate better framing of a causal query, we discuss two strategies: (i) shifting from immutable traits to perceptions of them, and (ii) shifting from some abstract concept/property to its constituent parts, i.e., a constructivist perspective of an abstract concept. We hope this article will raise awareness of the importance of articulating and clarifying fundamental concepts before delving into developing methodologies when drawing causal inference using textual data.
Nearly 150,000 patients undergo open cardiac valve or aortic surgery each year in the US. Intraoperative transesophageal echocardiography (TEE) is used frequently during cardiac surgery, but there is a lack of evidence associating TEE use with improved clinical outcomes. This matched, retrospective cohort study used national registry data from the Society of Thoracic Surgeons (STS) Adult Cardiac Surgery Database (ACSD) from 2011 to 2019 to compare clinical outcomes among patients undergoing cardiac valve or aortic surgery with vs without intraoperative TEE. Statistical analyses consisted of multiple matched comparisons (including within-hospital and within-surgeon matches), a negative control outcome analysis, and sensitivity analyses.
Multivariate matching has two goals: (i) to construct treated and control groups that have similar distributions of observed covariates, and (ii) to produce matched pairs or sets that are homogeneous in a few key covariates. When there are only a few binary covariates, both goals may be achieved by matching exactly for these few covariates. Commonly, however, there are many covariates, so goals (i) and (ii) come apart, and must be achieved by different means. As is also true in a randomized experiment, similar distributions can be achieved for a high-dimensional covariate, but close pairs can be achieved for only a few covariates. We introduce a new polynomial-time method for achieving both goals that substantially generalizes several existing methods; in particular, it can minimize the earthmover distance between two marginal distributions. The method involves minimum cost flow optimization in a network built around a tripartite graph, unlike the usual network built around a bipartite graph. In the tripartite graph, treated subjects appear twice, on the far left and the far right, with controls sandwiched between them, and efforts to balance covariates are represented on the right, while efforts to find close individual pairs are represented on the left. In this way, the two efforts may be pursued simultaneously without conflict. The method is applied to our on-going study in the Medicare population of the relationship between superior nursing and sepsis mortality. The match2C package in R implements the method.
Coronary artery bypass graft (CABG) surgery is the most widely performed cardiac surgery in the United States. Transesophageal echocardiography (TEE) is frequently used in a variety of cardiac surgical procedures, but its clinical benefit in isolated CABG surgery is unclear, and guidelines remain indeterminate. The aim of this study was to compare clinical outcomes among patients undergoing isolated CABG surgery with versus without TEE in order to test the hypothesis that TEE would be associated with improved clinical outcomes after CABG surgery.
This paper proposes to embed a class of observational IV data into a cluster-randomized encouragement experiment using nonbipartite matching. Potential outcomes and causal assumptions underpinning the design are formalized and examined. Testing procedures for two commonly used estimands, Fisher’s sharp null hypothesis and the pooled effect ratio (PER), are extended to the current setting. We then introduce a novel cluster-heterogeneous proportional treatment effect model and the relevant estimand: the average cluster effect ratio. This new estimand allows treatment heterogeneity, and is advantageous over the PER estimand in that it does not suffer from Simpson’s paradox. We develop an asymptotically valid randomization-based testing procedure for this new estimand based on solving a mixed-integer quadratically constrained optimization problem.
We propose a general framework for approaching the problem of estimating optimal individualized treatment rules (ITRs) when a valid IV is allowed to only partially identify the treatment effect. We introduce a novel notion of optimality called ‘IV-optimality’. A treatment rule is said to be IV-optimal if it minimizes the maximum risk with respect to the putative IV and the set of IV identification assumptions. We derive a bound on the risk of an IV-optimal rule that illuminates when an IV-optimal rule has favourable generalization performance. We propose a classification-based statistical learning method that estimates such an IV-optimal rule, design computationally efficient algorithms, and prove theoretical guarantees.
How many healthcare workers have lost their lives fighting coronavirus disease (COVID-19)? We estimate this number using the capture–recapture method.
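As a rough illustration of the capture–recapture idea, the sketch below shows the classical two-source Lincoln–Petersen estimator and Chapman's small-sample correction. The abstract does not say which estimator or how many sources the analysis used, so the two-list setup is an assumption, and the counts in the example are made up.

```python
def lincoln_petersen(n1, n2, m):
    """Two-source estimate of total population size.

    n1: number of cases recorded in the first list
    n2: number of cases recorded in the second list
    m:  number of cases appearing in both lists
    """
    if m <= 0:
        raise ValueError("need at least one case common to both lists")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's correction, which is less biased when the overlap is small."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


if __name__ == "__main__":
    # Hypothetical counts: two independent lists of recorded deaths.
    n1, n2, m = 200, 150, 50
    print(lincoln_petersen(n1, n2, m))
    print(chapman(n1, n2, m))
```

Both estimators assume the two lists capture cases independently and that records can be matched across lists without error. Violations of either assumption bias the estimate.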
A commonly used sensitivity analysis for matched observational studies adopts a worst-case perspective, which assumes that, in each matched set, the unmeasured confounder U is allocated to make the bias worst. This worst-case allocation of U does not correspond to any realistic distribution of U in the population and is difficult to compare with observed covariates. We propose a new sensitivity analysis method that addresses these concerns. We apply the new method to a study of second-hand smoking and blood lead levels in children and find that, to explain away the association between second-hand smoke exposure and blood lead levels as non-causal, the unmeasured confounder would have to be a bigger confounder than any measured confounder.
It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum’s sensitivity analysis model. We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. We consider how to, among many treatment rules, select the best rules and select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups. Supplementary materials for this article are available online.
Cerebral malaria (CM) remains a leading cause of mortality and morbidity in children in sub-Saharan Africa. Recent studies using brain magnetic resonance imaging have revealed increased brain volume as a major predictor of death. Similar morphometric predictors of morbidity at discharge are lacking. The aim of this study was to investigate the utility of serial cranial cisternal cerebrospinal fluid (CSF) volume measurements in predicting morbidity at discharge in pediatric CM survivors.