- Extracted billions of tweets and replies from the Twitter API using SQL, Python, and web scraping techniques
- Transformed, cleansed, and normalized the data with Python, Pandas, and regular expressions
- Optimized data collection and transformation tasks using parallel processing, indexing, and caching
- Automated the data collection process to streamline data management for multiple collaborators
- Ensured data reliability and pipeline stability by developing logging and alerting mechanisms to handle errors
- Collaborated with a multidisciplinary team, providing insights and recommendations based on key findings
- Conducted exploratory data analysis using data visualization tools such as ggplot2 and plotly
- Identified an experiment-breaking error in survey distribution that compromised random assignment and necessitated reissuing the survey
- Implemented data transformation techniques including variable recoding, data aggregation, and normalization
- Presented research findings in a poster at the Annual Conference for Political Methodology, with visualizations, a summary, and Q&A
- **Variable Recoding**: Recoded categorical variables to ensure consistency and relevance for analysis. For example, political party affiliations were standardized across datasets.
- **Data Aggregation**: Aggregated data at different levels, such as individual respondent level or ad level, to facilitate various types of analyses.
- **Normalization**: Applied normalization techniques to scale numerical variables, ensuring they were on a comparable scale.
- **Missing Data Imputation**: Employed multiple imputation techniques to handle missing data, minimizing potential biases in the analysis.
- **Model Specification**: Defined the appropriate statistical models, including linear and logistic regression models, to analyze the data.
- **Model Fitting**: Fitted the models to the data using the `lm` and `glm` functions in R, ensuring appropriate handling of predictor variables.
- **Diagnostic Checks**: Conducted diagnostic checks to validate the assumptions of the models, including checking for multicollinearity, heteroscedasticity, and influential observations (a brief sketch follows this list).
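As a rough illustration of the fitting and diagnostic steps above (not the project's actual code), the sketch below assumes a hypothetical data frame `ads` with placeholder variable names:

```r
library(car)      # vif() for multicollinearity
library(lmtest)   # bptest() for heteroscedasticity

# Linear model for a continuous outcome (variable names are placeholders)
fit_lm  <- lm(political_rating ~ source + content + orientation, data = ads)

# Logistic model for a binary outcome
fit_glm <- glm(rated_as_political ~ source + content + orientation,
               family = binomial(link = "logit"), data = ads)

# Diagnostic checks
car::vif(fit_lm)                               # multicollinearity
lmtest::bptest(fit_lm)                         # Breusch-Pagan test for heteroscedasticity
which(cooks.distance(fit_lm) > 4 / nrow(ads))  # flag potentially influential observations
```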
- Formulated a compelling hypothesis on motivated reasoning and logical argument evaluation in political science
- Designed 2 large-n survey experiments generating a robust and insightful data set
- Secured ethics approval, upholding the highest research standards
- Conducted advanced data analysis with R packages, revealing key insights on argument evaluation and objectivity interventions
- Can individuals distinguish between strong (logically consistent) and weak (logically flawed) arguments?
- Are evaluations of argument quality biased by individuals’ pre-existing beliefs?
- Can priming a competing goal of objectivity reduce this bias?
- Designed and implemented a beginner-friendly curriculum, tailored for students with no prior programming experience.
- Fostered an engaging and collaborative learning atmosphere by utilizing GitHub Classroom, Jupyter Lab, and the university’s LMS.
- Facilitated student comprehension by providing real-world examples with publicly available data.
- Conducted in-class lectures, live coding sessions, and hands-on programming exercises to facilitate student learning
- Provided personalized feedback and support to students to enhance their comprehension and performance
- Developed and administered quizzes and assignments to evaluate student progress and adjust teaching strategies
Quantitative political methodology II (2020)
- Advanced course focused on sophisticated statistical analysis methods for computational scientists.
- Emphasized maximum likelihood estimation in settings including cross-sectional and time-series data, as well as non-parametric bootstrapping.
- Materials: All of Statistics: A Concise Course in Statistical Inference, Larry Wasserman; R Programming for Data Science, Roger D. Peng; R for Data Science, Garrett Grolemund and Hadley Wickham; Statistical Inference (2nd Edition), George Casella and Roger L. Berger; Bayesian Data Analysis (Third Edition), Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin; Taught by Jacob Montgomery.
Computational social science (2020)
- Explored various data types in social science, including networks, text, audio, images, and videos.
- Focused on both mechanistic and probabilistic approaches to supervised and unsupervised learning.
- Materials: Pattern Recognition and Machine Learning, Christopher Bishop; A Course in Machine Learning, Hal Daumé; The Elements of Statistical Learning, Jerome Friedman, Trevor Hastie, Robert Tibshirani; Taught by Christopher Lucas.
Maximum likelihood estimation (2019)
- In-depth focus on MLE principles, including probability theory, likelihood functions, and properties of estimators like consistency and efficiency.
- Comprehensive study of generalized linear models using MLE, covering exponential family distributions, link functions, logistic and Poisson regression.
- Advanced MLE topics: handling categorical data, overdispersion in count data, model selection criteria (AIC, BIC), model fit assessment and diagnostics.
- Materials: Generalized Linear Models, Peter K. Dunn, Gordon K. Smyth; Taught by Christopher Lucas.
Causal inference (2019)
- Deep exploration of causal inference theories, focusing on counterfactual reasoning, potential outcomes, and causal diagrams.
- Study of experimental design principles, including randomized trials, natural and field experiments.
- Exploration of observational techniques: propensity score matching, regression discontinuity, difference-in-differences, instrumental variables.
- Advanced statistical methods for causal estimation: structural equation modeling, mediation analysis, sensitivity analysis
- Taught by Julia Park.
Applied statistical programming (2018)
- Introduced object-oriented programming, functional programming paradigms, and efficient data manipulation.
- Covered topics such as debugging and profiling, as well as package development and contributing to open-source projects.
- Emphasized statistical meta-skills like data cleaning, transformation, visualization, and implementation of various statistical models and algorithms.
- Materials: R for Dummies, de Vries and Meys; Advanced R, Hadley Wickham; Taught by Jacob Montgomery.
Theories of Individual and Collective Choice I (2018)
- Study of rational choice theory, delving into strategic decision-making processes, utility maximization, and behavioral strategy.
- Game-theoretic models: extensive and normal form games, Nash equilibrium concepts, repeated and dynamic games.
- Analysis of cooperative game theory, focusing on coalition formation, bargaining theories, and the Shapley value.
- Advanced topics: evolutionary game theory, Bayesian games, and information asymmetry in strategic interactions.
- Materials: Game Theory: An Introduction, Steven Tadelis; Taught by Keith Schnakenberg.
Quantitative political methodology I (2017)
- Explored the mathematical underpinnings of linear regression models in both scalar and matrix representations.
- Covered estimation techniques, inference methods, assumptions of linear models, diagnostic procedures, and the implementation of these concepts in statistical computation.
- Special focus on understanding the Gauss-Markov theorem, least squares estimation, multicollinearity, heteroskedasticity, and model specification errors.
- Materials: Linear Models with R, Julian Faraway; Taught by Guillermo Rosas.
Mathematical modeling (2017)
- Explored advanced mathematical concepts, particularly matrix algebra and calculus, within the framework of economic modeling.
- Topics included matrix operations, determinants, eigenvalues and eigenvectors, and their applications in solving linear systems.
- Covered single-variable and multivariate calculus, including a detailed study of limits, continuity, differentiation, and integration.
- Materials: Mathematics for Economists, Pemberton and Rau; Taught by Randy Calvert.
Research design (2017)
- Explored the application of the philosophy of science in the social sciences.
- Topics included research methodologies, hypothesis formation and testing, the structure of scientific inquiry, and the principles of logical reasoning.
- Addressed the challenges of causality, including the design of experiments and observational studies, and the use of statistical methods for causal inference.
- Materials: Political Science and the Logic of Representations, Kevin A Clarke and David M Primo; The Logic of Real Arguments, Alec Fisher; Taught by Matt Gabel.
Main Contributions
Summary
The purpose of this project was to explore how discourse varies depending on the news outlet reporting. To do so, we collected tweets shared by news organizations, along with their replies, dating back to 2017. The data yielded (soon to be published) insights into variations in sentiment and similarity, and invited the use of techniques such as topic modeling and interrupted time series analyses surrounding the events of January 6, 2021. We sought to understand patterns in the emotional charge and sentiment of text and how these varied across news outlets and topics.
Methods
My major contribution to this project was the development of a reliable ETL pipeline that enabled non-methods researchers to easily load and explore the data. To do so, I first created a Python script to parse and normalize the data. This involved pagination to fetch large volumes of data and error handling for API rate limits and other potential issues. The data retrieved included tweet-level information, media-level information, and context annotations. Logging was implemented to track the progress of the script and help diagnose issues arising during data collection. A key design goal was to allow multiple collaborators to divide the work and run the script in parallel without duplicating effort.
I then developed a program that allowed users to automatically extract useful metrics from the data, store the cleansed data in a SQL database, and prevent duplicate data processing. Preprocessing included removing unnecessary characters, renaming columns, extracting links, tokenization, lemmatization, and calculating text-based metrics such as average word length and total word count. Other features included an unsupervised learning algorithm for obtaining vector representations of words and computing subjectivity, polarity, and sentiment scores for each tweet.
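The pipeline itself was written in Python; the R sketch below is only an illustration of the clean-compute-store-deduplicate pattern it followed, with hypothetical table and column names rather than the project's actual schema.

```r
library(DBI)
library(RSQLite)
library(stringr)

store_tweets <- function(tweets, db_path = "tweets.sqlite") {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con))

  # Basic cleaning and simple text-based metrics
  tweets$text_clean <- str_squish(str_remove_all(tweets$text, "http\\S+"))
  words <- str_split(tweets$text_clean, "\\s+")
  tweets$word_count <- lengths(words)
  tweets$avg_word_length <- vapply(words, function(w) mean(nchar(w)), numeric(1))

  # Skip tweets that are already stored (deduplicate on tweet id)
  if (dbExistsTable(con, "tweets")) {
    seen <- dbGetQuery(con, "SELECT id FROM tweets")$id
    tweets <- tweets[!tweets$id %in% seen, ]
  }
  dbWriteTable(con, "tweets", tweets, append = TRUE)
}
```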
Main Contributions
Summary
In this project, we investigated social media users’ perceptions of digital political ads. We measured users’ opinions on how platforms should design political ad UX and policies, with the goal of establishing a baseline understanding of user opinions, including the permissibility of political ads and microtargeting, and transparency in ad funding.
The primary objective of this research was to understand which features of ads (and of users themselves) contribute to perceptions of how 'political' a given digital ad is. To do this, we conducted a conjoint experiment asking respondents to compare artificial Facebook ads in which we varied the source, content, and political orientation. This conjoint design allowed us to isolate the independent effects of each component on perceptions of the political.
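As an illustration of how such a design can be analyzed (not the project's actual code; object and column names are placeholders), uniformly randomized attributes can be regressed directly on the outcome to obtain AMCE-style estimates:

```r
library(dplyr)

# Hypothetical long-format data: one row per respondent-profile pair
cj <- conjoint_long %>%
  mutate(across(c(source, content, orientation), as.factor))

# OLS on the randomized attribute factors recovers AMCE-style estimates;
# standard errors would typically be clustered by respondent.
amce_fit <- lm(rated_political ~ source + content + orientation, data = cj)
summary(amce_fit)  # each coefficient is a difference from that attribute's baseline level
```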
We also conducted a within-between experiment asking respondents to evaluate real ads drawn from the Facebook Ad Library (collected by a co-author). In this portion of the project, we randomly assigned respondents to view either a political or non-political advertisement and asked them to rate how political they perceived it to be. Respondents rated multiple ads (within-subject variation), but the exact composition of the ads was randomized for each respondent (between-subject variation).
Overall, our conjoint analysis strongly supported our original research hypotheses, showing that the source, strength, and orientation of the message all matter. In the real-ad study, candidate ads appeared to be viewed as inherently political, whereas for sources such as politically active companies and advocacy organizations, message strength mattered far more for an ad to be considered political. This differs from the conjoint analysis, where ads from companies and advocacy organizations were viewed as equally political.
Methods
I was brought into this project after the research design and implementation stages of the surveys had taken place and was tasked with maintaining and overseeing the data for the project. I quickly acquired a working understanding of the mathematical principles and methodologies behind conjoint experiments, a less common analytical approach in my field. Upon examining the data and methods, I identified discrepancies in the expected number of profiles and informed my collaborators of the error, which had compromised random assignment. Consequently, the survey distributor rectified the parameters and redistributed the survey, ensuring the project’s successful progression.
I implemented analyses in this project in R, using libraries such as dplyr, magrittr, and tidyverse to analyze political ad data and examine the impact of ad orientation on political preferences. I developed an R script to clean and process the data in order to create relevant variables and handle missingness. I implemented advanced data manipulation techniques and reshaped the datasets to make them more manageable for further analysis.
I then conducted a comprehensive analysis on political advertisement data, encompassing four novel datasets. Utilizing weighted confidence intervals and an array of statistical techniques, I visualized the findings through point-range plots, effectively conveying the political nature of the ad content. Additionally, I carried out a follow-up study to further investigate the perceived political content of various advertisements, expanding the project’s scope and providing a more in-depth understanding of the relationship between ad content and political affiliation.
Initial Analyses and Descriptive Statistics
The initial analysis involved several core datasets, including `CJ.csv`, `datc.csv`, `pooled.csv`, `correction.csv`, `data.csv`, `original.csv`, and `ra.csv`. The primary focus was to understand the distribution and characteristics of the data. Descriptive statistics and exploratory data analysis were performed using the `dplyr` and `ggplot2` libraries in R.
For instance, summary statistics were computed to identify the central tendencies and dispersion of key variables. Visualization techniques, such as histograms and scatter plots, were employed to examine the data distributions and potential outliers. This initial step was crucial to ensure the quality and reliability of the data before proceeding with more complex analyses.
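A minimal sketch of this kind of check, using one of the files named above (the column name is illustrative):

```r
library(dplyr)
library(ggplot2)

pooled <- read.csv("pooled.csv")

# Central tendency and dispersion of a key variable
pooled %>%
  summarise(mean_rating = mean(political_rating, na.rm = TRUE),
            sd_rating   = sd(political_rating, na.rm = TRUE),
            n           = n())

# Distribution check for skew and outliers
ggplot(pooled, aes(x = political_rating)) +
  geom_histogram(bins = 30)
```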
Data Cleaning and Transformations
Data cleaning involved handling missing values, correcting inconsistencies, and transforming variables to suitable formats for analysis. The script `rep_cleaning-data.R` was utilized to perform these tasks. Key transformations included recoding categorical variables for consistency, aggregating data to the respondent and ad levels, normalizing numerical variables, and multiple imputation of missing values.
Additionally, advanced data manipulation techniques were applied using `tidyverse` functions to reshape the data, such as `spread` and `gather` functions for pivoting data frames, making them suitable for subsequent analyses.
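A small sketch of that reshaping step (column names are placeholders; in current tidyr, `pivot_wider()` and `pivot_longer()` supersede `spread()` and `gather()`):

```r
library(tidyverse)

# Long -> wide: one column per rated ad
wide <- ratings_long %>%
  spread(key = ad_id, value = rating)

# Wide -> long: gather the ad columns back into rows for modeling
long <- wide %>%
  gather(key = "ad_id", value = "rating", -respondent_id)
```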
Statistical Modeling and Visualization
The statistical analysis involved using regression models to understand the relationship between ad characteristics and political perceptions. The script `rep_main-models.R` was used to build and evaluate these models. Key steps included specifying linear and logistic regression models, fitting them with `lm` and `glm`, and running diagnostic checks for multicollinearity, heteroscedasticity, and influential observations.
Visualization of the results was performed using `ggplot2` and `plotly` libraries. The script `rep_main-plots.R` was employed to create detailed visualizations, such as point-range plots and interaction plots, to effectively communicate the findings.
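For illustration, a point-range plot of model estimates might be drawn as follows; the `estimates` data frame and its columns are placeholders for the actual model output:

```r
library(ggplot2)

# `estimates`: one row per attribute level, with estimate, lower, and upper bounds
ggplot(estimates, aes(x = estimate, y = level)) +
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Effect on perceived 'politicalness'", y = NULL)
```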
Main Contributions
Summary
This data science project investigated the influence of motivated reasoning on individuals’ evaluation of logical arguments, addressing three key questions: whether individuals can distinguish between strong (logically consistent) and weak (logically flawed) arguments, whether evaluations of argument quality are biased by pre-existing beliefs, and whether priming a competing goal of objectivity can reduce that bias.
Utilizing R, I designed and conducted two large-n survey experiments, finding that individuals can distinguish between strong and weak arguments, but exhibit a bias favoring statements aligned with their preferences. This bias persisted across strong and weak arguments, political and non-political topics, and multiple issue areas.
The project also evaluated the effectiveness of priming objectivity goals in reducing biases in argument evaluation. The first study suggested potential improvements in weak argument evaluation accuracy, while the second study showed no measurable effect.
This research revealed the pervasiveness of argument congruency bias and demonstrated that individuals’ biases influence, but do not entirely overwhelm, their ability to accurately rate argument quality. By exploring the potential of priming objectivity as an intervention, this project contributed valuable insights into argument evaluation and strategies for reducing such biases.