Note: Page last updated October 2022
Current
Principal Component Pursuit (PCP) for Environmental Epidemiology
I am working to adapt and extend Principal Component Pursuit (PCP), a robust dimensionality reduction technique from computer vision, for pattern recognition in environmental mixtures data. The goal is for PCP to aid epidemiological studies by decomposing an exposure matrix into: (1) a low-rank component encoding consistent patterns of exposure; and (2) a sparse component capturing unique and outlying exposure events not explained by those patterns. These two components can then be used in health models to assess possible associations between environmental exposures and health outcomes. In this work, we develop and apply PCP across a variety of public health domains, including ambient air pollution, exposomic, and metabolomic data. We are exploring both convex and non-convex optimization approaches to PCP. We are also developing an open-source R package so other researchers can apply PCP in their own studies (see code linked below).
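To make the low-rank plus sparse decomposition concrete, here is a minimal sketch of the standard convex PCP formulation, solved with an alternating-directions (ADMM) scheme in NumPy. It is an illustration only, not the R package mentioned above: the function names (`pcp`, `soft_threshold`, `singular_value_threshold`) are hypothetical, and the default values of `lam` and `mu` are common choices from the PCP literature rather than settings taken from this project.

```python
import numpy as np


def soft_threshold(X, tau):
    """Elementwise shrinkage toward zero (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Shrink the singular values of X by tau (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def pcp(D, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split D into a low-rank part L and a sparse part S with D ~ L + S.

    Minimizes ||L||_* + lam * ||S||_1 subject to L + S = D.
    Assumes D is a nonzero 2-D array with no missing values.
    """
    D = np.asarray(D, dtype=float)
    n, p = D.shape
    # Assumed defaults: widely used choices from the PCP literature.
    lam = 1.0 / np.sqrt(max(n, p)) if lam is None else lam
    mu = n * p / (4.0 * np.abs(D).sum()) if mu is None else mu

    L = np.zeros_like(D)
    S = np.zeros_like(D)
    Y = np.zeros_like(D)  # Lagrange multipliers for the constraint L + S = D
    norm_D = np.linalg.norm(D)

    for _ in range(max_iter):
        L = singular_value_threshold(D - S + Y / mu, 1.0 / mu)
        S = soft_threshold(D - L + Y / mu, lam / mu)
        residual = D - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_D:
            break
    return L, S
```

In an exposure setting, `D` would be an observations-by-exposures matrix; the rows of `L` would carry the shared exposure patterns and the nonzero entries of `S` would flag outlying events. Real environmental data usually also require handling of missing values and values below the limit of detection, which this sketch does not attempt.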
Bayesian Non-parametric Ensemble Model (BNE) for Uncertainty Characterization in PM2.5 Predictions
I am developing a Bayesian Non-parametric Ensemble model for uncertainty characterization in PM2.5 predictions across the contiguous United States. For more information, see the earlier paper by some of my colleagues linked below.
Computer Vision for the Assessment of Policy Impacts on Urban Communities
I am using computer vision algorithms and convolutional neural network (CNN) architectures to quantitatively characterize changes in urban communities in response to policy developments and the COVID-19 pandemic. For example, has the COVID-19 pandemic affected vehicle traffic patterns in the city? Have policies like NYC's Open Streets program changed how and when pedestrians use the streets? And can we answer these kinds of questions with computer vision? Code to come soon.
Past
Evidence-Based Automatic Fact-Checking with RoBERTa
In this work, I helped develop, train, and evaluate a RoBERTa-based fact-checking model to combat misinformation surrounding: (1) COVID-19; (2) climate change; and (3) the 2020 US presidential election. To train and test the COVID-19-specific model, I compiled a dedicated dataset by crawling the web and scraping millions of online news articles for claims regarding COVID-19, then mapping those claims to relevant scientific papers. My colleagues and I then wrote an IRB protocol to receive approval for human annotators to tag the dataset with the information needed by the fact-checking model. Once our annotators were cleared to begin work, I trained and supervised the annotation team. I also assisted in the development, implementation, and maintenance of a user-friendly, web-based annotation interface backed by the NoSQL database MongoDB. My primary focus was exploring NLP approaches to the fact-checking problem and pipeline in Python, including BERT-based architectures, claim detection, unsupervised data augmentation, semi-supervised learning, transfer learning, named entity recognition, TF-IDF, and few-shot learning.
Scraping Georgia Jails for Georgia Get Out The Vote
I completed this project as a volunteer for Georgia's Get Out the Vote campaign. I wrote a Python web crawler to scrape several of Georgia's county jail websites for relevant contact information, in order to help send voter registration and ballot information to every incarcerated person in Georgia before Georgia's 2020-21 U.S. Senate runoff elections.
RoBERTa for Claim Detection
In this project, I wrote ClaimDetective, a Python class that lets the user rank a list of sentences (potential claims) from most to least check-worthy, that is, by the priority with which they should be fact-checked. Under the hood, ClaimDetective uses a deep learning model that fine-tunes RoBERTa to identify and rank claims that are worth fact-checking. I implemented ClaimDetective with PyTorch and scikit-learn. This work was done during my time as a natural language processing research assistant.
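The ranking step can be sketched independently of the model behind it. The example below is a hypothetical illustration, not ClaimDetective's actual API: `ClaimRanker`, `RankedClaim`, and `toy_scorer` are names invented here, and the scorer is a deliberately crude placeholder (share of tokens containing a digit) standing in for the check-worthiness probability that a fine-tuned RoBERTa classifier would supply.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class RankedClaim:
    sentence: str
    score: float


class ClaimRanker:
    """Hypothetical interface: order sentences by a check-worthiness score."""

    def __init__(self, scorer: Callable[[Sequence[str]], Sequence[float]]):
        # `scorer` maps a batch of sentences to one score per sentence.
        # In practice this would wrap a trained classifier.
        self.scorer = scorer

    def rank(self, sentences: Sequence[str]) -> List[RankedClaim]:
        scores = self.scorer(sentences)
        if len(scores) != len(sentences):
            raise ValueError("scorer must return one score per sentence")
        ranked = [RankedClaim(s, float(x)) for s, x in zip(sentences, scores)]
        # Highest score first; ties keep their original order (stable sort).
        return sorted(ranked, key=lambda r: r.score, reverse=True)


def toy_scorer(sentences: Sequence[str]) -> List[float]:
    """Placeholder heuristic, NOT a trained model: fraction of tokens with a digit."""
    scores = []
    for sentence in sentences:
        tokens = sentence.split()
        numeric = sum(any(ch.isdigit() for ch in tok) for tok in tokens)
        scores.append(numeric / max(len(tokens), 1))
    return scores


ranker = ClaimRanker(toy_scorer)
for claim in ranker.rank([
    "I love sunny days.",
    "The vaccine reduced hospitalizations by 90% in 2021.",
    "What a game!",
]):
    print(f"{claim.score:.2f}  {claim.sentence}")
```

Keeping the scorer as an injected callable means the same ranking code works whether the scores come from a heuristic, a fine-tuned transformer, or a cached batch of predictions.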
Automatic Diagnosis of COVID-19 Chest X-rays with Neural Nets
At the beginning of the COVID-19 pandemic, I designed, trained, and evaluated a convolutional neural network (CNN) that classified chest X-ray images into one of four classes: Viral Pneumonia, Bacterial Pneumonia, COVID-19, and Healthy. I implemented the CNN in Python with TensorFlow and Keras. I also wrote a 31-page report in LaTeX summarizing the results of the final model. The report included a detailed error analysis and an interpretability section intended to help clinicians and front-line workers understand which aspects of the chest X-rays the model relied on when making its diagnoses.
SARS-CoV-2 Sequence Analysis
I performed a preliminary RNA secondary structure analysis of the spike (S) gene coding region of the SARS-CoV-2 genome across inter- and intraspecies datasets to identify conserved structures that could be targeted in vaccine development. My analysis identified 15 potentially conserved structures in the interspecies dataset and 43 at the intraspecies level. For the project, I wrote a custom algorithm that performs the RNA secondary structure analysis and identifies conserved sites. I also wrote a Bioinformatics-style Applications Note in LaTeX to summarize the project.