Research Projects

Current work

Selective inference

Principal component analysis identifies directions of maximum variation in the data. We provide inference guarantees on the extent of this variation, including when the number of principal components is selected using data-driven methods such as the “elbow rule”.
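As a rough illustration of data-driven selection, the sketch below implements one common reading of the elbow rule: keep the components that come before the largest drop in the sorted eigenvalues. The function names and the example eigenvalues are hypothetical, and other definitions of the elbow exist. Because the number of components is chosen from the same data, a naive summary of the retained variance ignores that selection step.

```python
def elbow_rank(eigenvalues):
    """Number of components kept by a largest-gap elbow heuristic (an assumed definition)."""
    vals = sorted(eigenvalues, reverse=True)
    if len(vals) < 2:
        return len(vals)
    # Drop between each pair of consecutive eigenvalues.
    gaps = [vals[i] - vals[i + 1] for i in range(len(vals) - 1)]
    # Keep every component that precedes the largest drop.
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


def explained_variance_ratio(eigenvalues, k):
    """Share of total variance captured by the k largest eigenvalues."""
    vals = sorted(eigenvalues, reverse=True)
    return sum(vals[:k]) / sum(vals)


# Made-up eigenvalues of a sample covariance matrix, for illustration only.
eigenvalues = [5.0, 4.2, 0.6, 0.5, 0.4]
k = elbow_rank(eigenvalues)
ratio = explained_variance_ratio(eigenvalues, k)
```

Here the largest drop falls between the second and third eigenvalues, so two components are kept.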

Numerous selective inference methods symmetrically widen classical confidence intervals to provide valid inference. We demonstrate that this approach is sub-optimal compared to various conditional approaches.
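One familiar instance of symmetric widening, used here purely for illustration and not necessarily among the methods we compare, is a Bonferroni-style adjustment: the normal quantile is enlarged to account for the number of candidates considered during selection. The function names and the normal approximation are assumptions of this sketch.

```python
from statistics import NormalDist


def classical_interval(estimate, std_error, alpha=0.05):
    """Two-sided normal-theory interval that ignores any selection step."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


def widened_interval(estimate, std_error, n_candidates, alpha=0.05):
    """Symmetrically widened interval using a Bonferroni-adjusted quantile."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_candidates))
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical estimate selected as the most promising of 10 candidates.
naive = classical_interval(1.2, 0.5)
adjusted = widened_interval(1.2, 0.5, n_candidates=10)
```

The adjusted interval stays centred on the estimate and grows equally on both sides. A conditional approach is not constrained to that symmetric form.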

Miscellaneous

Investigators often wish to understand the robustness of their analyses to unobserved variables. Here, we provide insight into when and how an unobserved variable can make a previously insignificant result significant.

Prior work

Causal discovery from multi-environment data

Causal graphs are typically identifiable only up to an equivalence class under i.i.d. data. We prove nonparametric identifiability from heterogeneous data with natural (unknown) distribution shifts, provided that causal mechanism shifts are sparse.

Random forests

Posterior probabilities from machine learning classifiers are typically overconfident. We study multiple calibration approaches for the random forest classifier across the OpenML-CC18 datasets, focusing on honest random forests, for which we provide multiclass consistency guarantees and applications to high-dimensional hypothesis testing via mutual information estimation.

Although random forest classifiers are extremely successful on tabular data, they are not state of the art on structured data. We develop a random forest algorithm better suited to structured data such as images and time series by using structured projections of features that take the data geometry into account.

fMRI data analysis

Neuroscience collaborators wished to determine whether there were any differences between novice and expert meditators across meditation tasks and resting state. We provided (i) computationally efficient dimensionality reduction approaches via generalized CCA to reduce the spatial time series to interpretable spatial gradients and (ii) high-dimensional distance correlation hypothesis tests with novel permutation strategies to account for implicit multilevel dependencies between scans of the same subject.

As part of this project, we found no reliable existing code for the multiview methods we needed. We therefore developed an open-source Python package for multiview machine learning methods, featuring a unified API and easy integration with scikit-learn.

University projects

Also, for fun, check out some of the interesting projects completed as part of my university classes.