DATA, RESEARCH, AND SOFTWARE
ECONOMETRICS, STATISTICS, AND DATA COLLECTION
PREDICTION STABILITY
🌱 (Don't) Forget About It: Forgetting Penalized Supervised Learning
🗃 Replication Materials
When supervised models are retrained on the same task, updates can introduce model regression: examples that a previous model classified correctly become misclassified. In many deployments these reversals are costly even if aggregate accuracy improves, because they change which cases are hard, disrupt downstream workflows, and can force organizations to reconfigure human effort. We capture this with a simple economic view: a new model is attractive only when its reduction in expected downstream operating costs (e.g., error handling, staffing, oversight) outweighs the cost of changing behavior on previously well-served examples, which we illustrate with a routing-and-staffing example but which applies more broadly.
🌱 Stable CART: Lower Bootstrap Prediction Variance CART
Standard CART decision trees are unstable—small changes in training data can produce substantially different tree structures and predictions. Stable CART addresses this by trading a small amount of accuracy for lower cross-bootstrap prediction variance through techniques like honest estimation (using separate data subsets for learning tree structure vs. estimating leaf values), lookahead search (considering multiple future splits before committing rather than making greedy single-step decisions), and bootstrap-aware split selection (penalizing or filtering out splits that are unstable across resampled datasets).
Bagged FSR: Rehabilitating Forward Stepwise Regression
Forward Stepwise Regression (FSR) is hardly used today, mostly because regularization offers a better framework for variable selection. But part of the reason for its disuse is that FSR is a greedy optimization strategy with unstable paths: jigger the data a little, and the search path, the variables in the final set, and the performance of the final model can all change dramatically. The same issues afflict another greedy optimization strategy, CART. The insight that rehabilitated CART was bagging: building multiple trees on resampled rows (and, in random forests, random subspaces of columns) and averaging the results. What works for CART should in principle also work for FSR. If you are using FSR for prediction, you can build multiple FSR models using random subspaces and random samples of rows and then average the results. If you are using it for variable selection, you can pick the variables with the highest batting average (n_selected/n_tried). (LASSO will beat it on speed, but there is little reason to expect that it will beat it on results.)
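The bagged-FSR recipe is easy to sketch. Everything below is illustrative, not any package's API: `forward_stepwise`, the subspace fraction, and the batting-average bookkeeping are names and defaults of our choosing (numpy only).

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: repeatedly add the variable that most reduces RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(min(max_vars, p)):
        best_j, best_rss = None, np.inf
        for j in remaining:
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

def bagged_fsr_batting_average(X, y, n_boot=50, subspace_frac=0.5,
                               max_vars=5, seed=0):
    """Batting average = n_selected / n_tried across bootstrap + subspace fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_tried, n_selected = np.zeros(p), np.zeros(p)
    k = max(1, int(subspace_frac * p))
    for _ in range(n_boot):
        rows = rng.integers(0, n, n)                  # bootstrap rows
        cols = rng.choice(p, size=k, replace=False)   # random subspace
        n_tried[cols] += 1
        sel = forward_stepwise(X[np.ix_(rows, cols)], y[rows], max_vars)
        n_selected[cols[sel]] += 1                    # map back to original indices
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(n_tried > 0, n_selected / n_tried, 0.0)
```

On simulated data with two strong signal variables among noise, the signal variables' batting averages sit near 1 while noise variables' stay well below.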
🌱 Bootstrap Consistency Regularization for Stable Neural Network Predictions
🗃 Replication Materials
Neural networks trained on the same data with different random seeds or slightly perturbed training sets can produce substantially different predictions for individual examples, even when aggregate accuracy is similar. We add a consistency penalty to the training loss that penalizes prediction disagreement across bootstrap resamples of the training data, encouraging the network to find solutions whose individual-level predictions are stable under resampling.
🌱 Selecting for Stability: Choose the Model Closest to the Ensemble
Ensembles reduce prediction variance but are expensive to deploy. Rather than averaging all models at inference time, select the single model whose predictions are closest to the ensemble average. This gives you most of the ensemble's stability benefit at the cost of a single model, with a principled selection criterion that avoids arbitrary choices among models with similar aggregate performance.
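The selection rule is a few lines over a matrix of member predictions; this is a minimal sketch with names of our choosing, not a published implementation.

```python
import numpy as np

def closest_to_ensemble(pred_matrix):
    """pred_matrix: (n_models, n_examples) predictions from each ensemble member.
    Returns the index of the single model whose predictions are nearest
    (in L2 distance) to the ensemble average."""
    ens_mean = pred_matrix.mean(axis=0)
    dists = np.linalg.norm(pred_matrix - ens_mean, axis=1)
    return int(np.argmin(dists))
```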
CALIBRATION & WEIGHTING
🌱 Calibration Where It Counts: Cost- and Data-Informed Isotonic Regression
📦 calibre: Advanced Calibration Models
Thresholded decisions turn probability errors into utility losses. Pooling-based monotone calibrators such as isotonic regression are flexible and reliable but can collapse distinct scores into the same probability on wide plateaus. We adopt a decision-economic view: choose and tune a calibrator that reduces deployment cost by improving reliability where it affects decisions and by preserving discrimination where it matters.
📦 fairlex: leximin calibration
Standard calibration minimizes average calibration error, which can hide large errors for minority groups. Leximin calibration instead minimizes the worst-off group's calibration error first, then the second-worst, and so on—applying the Rawlsian leximin criterion to the distribution of calibration quality across groups.
🌱 Rank-Preserving Calibration for Multiclass Classification
📦 Rank Preserving Calibration of Multiclass Probabilities
Multiclass calibration methods can reorder predicted class probabilities, so the class a calibrated model ranks first may differ from the class the original model ranked first. This is problematic when downstream decisions depend on the ranking, not just the probabilities. We develop calibration methods that guarantee the within-example class ranking is preserved while still improving probability reliability.
🌱 First-Order Entropy Balancing via Dual Gradient Descent
🗃 Replication Materials
Entropy balancing finds survey weights that satisfy exact moment constraints while staying close to uniform weights in KL divergence. Standard implementations solve this via Newton's method on the dual, which requires computing and inverting a Hessian at each step. We show that first-order methods—multiplicative weight updates and dual gradient descent—converge reliably, scale to high-dimensional constraint sets, and avoid the numerical instabilities that plague second-order solvers when constraints are near-collinear.
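A minimal dual gradient descent sketch, assuming the standard entropy-balancing dual (softmax weights in the dual variable, gradient equal to the moment gap); the step size, iteration cap, and tolerance are illustrative, not the paper's settings.

```python
import numpy as np

def entropy_balance_gd(X, targets, lr=0.5, n_iter=2000, tol=1e-8):
    """Entropy balancing via gradient descent on the dual.
    X: (n, k) covariates; targets: (k,) desired weighted means.
    Returns nonnegative weights w summing to 1 with w @ X ~= targets.
    The dual solution has w_i proportional to exp(lambda . x_i), and the
    dual gradient is the gap between the weighted means and the targets."""
    n, k = X.shape
    lam = np.zeros(k)
    w = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        z = X @ lam
        z -= z.max()              # stabilize the softmax
        w = np.exp(z)
        w /= w.sum()
        grad = w @ X - targets    # dual gradient = moment gap
        if np.max(np.abs(grad)) < tol:
            break
        lam -= lr * grad
    return w
```

No Hessian is formed or inverted: each iteration is a matrix-vector product, which is what lets the method scale to high-dimensional constraint sets.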
🌱 Streaming Calibration With MWU and SGD
📦 Python Package
Survey raking and probability calibration are traditionally batch procedures: collect all the data, then solve for weights or a calibration map. When data arrive in a stream—or the target distribution drifts—batch methods require repeated re-computation. We develop online versions of raking and calibration using multiplicative weight updates (MWU) and stochastic gradient descent (SGD), processing one observation at a time with convergence guarantees.
From Scores to Signs: Pairwise Win-Rate Estimation with Calibrated LLM Judges
LLM-as-judge systems produce numerical scores, but downstream decisions often require pairwise comparisons: which response is better? Converting scores to win rates requires calibration—the mapping from score differences to win probabilities. We develop calibrated estimators for pairwise win rates from cardinal LLM judge scores, accounting for judge miscalibration and non-transitivity in preferences.
RECORD LINKAGE
🌱 Inference With Fuzzy-Joined Data
Data products built on fuzzy joins ship a single canonical linkage, hiding the many-to-many candidate-match graph from which it was constructed. The two obvious things to do with that graph—expand it into a regression dataset or average covariates across candidates—both produce biased estimates. The expanded join attenuates; the collapsed estimator has a nonclassical errors-in-variables structure whose sign is indeterminate without ground truth. Multiple imputation over candidate assignments avoids both pathologies. Rubin's rules propagate matching uncertainty, and the between-imputation share of variance tells you whether matching uncertainty matters.
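Rubin's rules themselves are compact. A sketch, with the between-imputation variance share reported alongside the pooled estimate (function and variable names are ours):

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Rubin's rules for M point estimates (one per imputed match assignment)
    and their sampling variances. Returns the pooled estimate, total variance,
    and the between-imputation share of total variance; a large share means
    matching uncertainty dominates sampling uncertainty."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    qbar = estimates.mean()
    ubar = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    total = ubar + (1 + 1 / M) * b
    return qbar, total, (1 + 1 / M) * b / total
```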
📦 setjoin: Record Linkage That Preserves Group Structure
Standard record linkage matches individuals optimally but ignores group structure. When household members should stay together, Hungarian matching might send them to different target households because it maximizes individual scores. setjoin uses two-level assignment—first assigning groups to groups, then matching within—achieving 4x better group coherence while also improving person-level accuracy in simulations with realistic ambiguity.
📦 tether: High-Precision Record Linkage
A 7-step record linkage pipeline—preprocess, deduplicate, block, score, filter, decide, inspect—with multi-pass support for progressively relaxed thresholds. Implements Hungarian, greedy, and row-sequential decision rules over pairwise string similarity scores.
📦 BloomJoin: Bloom Filter Based Joins
An R package implementing Bloom filter-based joins for improved performance with large datasets. Bloom filters provide a probabilistic test for set membership that can dramatically reduce the number of expensive exact comparisons needed during a join.
DATA COLLECTION
The Micro-Task Market for "Lemons": Collecting Data on Amazon's Mechanical Turk
With Doug Ahler and Carrie Roush. Political Science Research and Methods, 2021.
🗃 Replication Materials
While Amazon's Mechanical Turk (MTurk) has reduced the cost of collecting original data, in 2018 researchers noted the potential existence of a large number of bad actors on the platform. To evaluate data quality on MTurk, we fielded three surveys between 2018 and 2020. While we find no evidence of a "bot epidemic," significant portions of the data, between 25% and 35%, are of dubious quality. While the number of IP addresses that completed the survey multiple times or circumvented location requirements fell almost 50% over time, suspicious IP addresses are more prevalent on MTurk than on other platforms. Furthermore, many respondents appear to respond humorously or insincerely, and this behavior increased over 200% from 2018 to 2020. Importantly, these low-quality responses attenuate observed treatment effects by magnitudes ranging from approximately 10% to 30%.
Optimal Data Collection When Strata and Strata Variances Are Known
With Ken Cor.
When the population is divided into known strata with known variances, the optimal allocation of a fixed sample budget across strata depends on stratum size, variance, and sampling cost. We derive the optimal allocation and show how much efficiency is lost by common rules of thumb like proportional allocation.
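The cost-aware Neyman allocation, n_h proportional to N_h * S_h / sqrt(c_h), can be sketched directly. The comparison against proportional allocation below uses made-up strata, not results from the paper:

```python
import numpy as np

def optimal_allocation(N, S, cost, total_n):
    """Cost-aware Neyman allocation: n_h proportional to N_h * S_h / sqrt(c_h).
    N: stratum sizes, S: stratum standard deviations, cost: per-unit sampling
    costs, total_n: total sample size to allocate."""
    N, S, cost = map(np.asarray, (N, S, cost))
    score = N * S / np.sqrt(cost)
    return total_n * score / score.sum()

def stratified_var(N, S, n):
    """Variance of the stratified mean estimator under allocation n
    (ignoring finite-population corrections for brevity)."""
    W = np.asarray(N) / np.sum(N)
    return np.sum(W**2 * np.asarray(S)**2 / np.asarray(n))
```

With two equal-cost strata whose standard deviations differ by 3x, the optimal allocation puts three quarters of the sample in the noisy stratum and beats proportional allocation on variance.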
📦 Geo-sampling: Sampling Randomly From the Streets
With Suriyan Laohaprapanon.
Estimating quantities like average potholes per kilometer or pedestrian density requires randomly sampling street locations within a region. Geo-sampling addresses this by downloading street network data from OpenStreetMap for a specified administrative region, splitting each street into 0.5km segments (recording the lat/long of segment endpoints), building a database of all segments, and then drawing a random sample that can be exported as a CSV or visualized on a map for field data collection.
📦 Allocator: Optimal Itineraries For Spatially Distributed Tasks
With Suriyan Laohaprapanon.
Given a set of spatially distributed tasks (e.g., field survey locations, audit sites), Allocator computes optimal itineraries that minimize total travel time or distance, assigning tasks to enumerators and sequencing visits within each assignment.
📦 reporoulette: Randomly Sample GitHub Repositories
Randomly sample GitHub repositories, optionally filtered by language, creation date, or star count. Useful for constructing representative samples of open-source projects for empirical software engineering research.
🌱 Unbiased Regression with Costly Item Labels
📦 fewlab: fewest items to label for unbiased OLS on shares
When running OLS on per-row trait shares (e.g., fraction of items in a category), labeling every item is expensive. Random labeling wastes budget on items that barely affect the regression. fewlab identifies the items with the highest statistical leverage—those that most influence the coefficient estimates—and prioritizes them for labeling, achieving unbiased OLS with a fraction of the labeling cost.
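The leverage idea can be illustrated with the familiar hat-matrix diagonal. This sketch scores rows rather than items, so it shows the principle, not fewlab's actual item-level criterion:

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X': each row's influence
    on the OLS fit. Computed via a QR decomposition for numerical stability;
    the diagonal entries sum to the column rank of X."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def label_priority(X, k):
    """Indices of the k highest-leverage rows: label these first."""
    return np.argsort(-leverage_scores(X))[:k]
```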
CAUSAL INFERENCE
🌱 Inferring Treatment Compliance from Delivery-Window Data
In randomized experiments with imperfect compliance, the LATE requires observing treatment receipt, and the Wald estimator requires monotonicity. When receipt is unobserved but the experiment has a pre-treatment time series and a distinct delivery window, a structural break test applied to the delivery window can classify treated units into compliers, never-takers, and defiers. This yields an inferred compliance rate with closed-form bias correction, an empirical test of monotonicity via defier detection, and a characterization of the complier subpopulation that can be projected onto the control group.
🌱 Two Regressions and a Bootstrap: Regression Calibration for ML-Generated Covariates and the Nonlinear Boundary
When ML predictions are used as regressors in downstream models, prediction error is non-classical and a growing literature proposes purpose-built corrections (GMM, prediction-powered inference, IV, joint MLE). For linear downstream models, regression calibration—replacing the ML prediction with E[X | X̂, Z] estimated on a calibration sample—already eliminates the non-classical error structure under an exogeneity condition these methods also rely on. Two OLS regressions and a two-sample bootstrap give you consistent estimates with valid confidence intervals. For nonlinear downstream models (logistic, Poisson, any GLM), Jensen's inequality breaks the argument and heavier methods are genuinely needed. The linear/nonlinear boundary is the main result.
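A schematic of the two-regression procedure on simulated data; for brevity the simulated measurement error here is classical, though the note's point is that the same mechanics handle the non-classical case in linear downstream models. All names are ours.

```python
import numpy as np

def ols(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def regression_calibration(Xhat_cal, Z_cal, X_cal, Xhat_main, Z_main, Y_main):
    """(1) On the calibration sample, fit E[X | Xhat, Z] by OLS.
    (2) On the main sample, replace Xhat with the fitted value and run
    OLS of Y on it. Returns (intercept, coef on X, coef on Z). Standard
    errors would come from a two-sample bootstrap, omitted here."""
    A_cal = np.column_stack([np.ones(len(X_cal)), Xhat_cal, Z_cal])
    g = ols(A_cal, X_cal)                                   # first OLS
    X_imp = np.column_stack([np.ones(len(Y_main)), Xhat_main, Z_main]) @ g
    A_main = np.column_stack([np.ones(len(Y_main)), X_imp, Z_main])
    return ols(A_main, Y_main)                              # second OLS
```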
🌱 Partial Credit: Diagnosing Proxy Covariates with Validation Swaps
When a regression uses a proxy instead of the true covariate, how much does it matter—and where? The validation-swap framework answers both questions using an internal validation sample where both the true value and the proxy are observed. Swap validated truths into the proxy design matrix row by row and watch the coefficient move. The swap path traces the coefficient as a function of the fraction of validated rows swapped in: flat means the proxy is fine, steep means it isn't. A Shapley-value decomposition (SIM) identifies which rows drive the distortion, and a portable risk score defines a proxy-safe domain on the full sample.
🌱 Smooth Operator: Optimal Filtering of Event Study Estimates
Event study designs estimate period-specific treatment effects with known standard errors from two-way fixed effects regressions. We treat the coefficient sequence as observations from a local linear trend state-space model and apply the Rauch–Tung–Striebel (Kalman) smoother, using the known heteroskedastic regression standard errors as observation noise. The smoother adapts to local precision—trusting the trend model when a period's estimate is noisy and trusting the data when it is tight—reducing level MSE by 80% and derivative MSE by 98% versus raw estimates, while also providing a derivative-based parallel trends test with correct size.
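A sketch of the smoother under a local linear trend model, with each period's known standard error plugged in as observation noise; the innovation variances `q_level` and `q_slope` are illustrative tuning knobs, not values from the paper.

```python
import numpy as np

def rts_smooth(est, se, q_level=1e-6, q_slope=1e-2):
    """Smooth event-study coefficients `est` with known standard errors `se`
    using a local linear trend state-space model (state = level and slope)
    and the Rauch-Tung-Striebel smoother. Returns the smoothed levels."""
    T = len(est)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # level gains last slope
    Q = np.diag([q_level, q_slope])          # state innovation variance
    H = np.array([1.0, 0.0])                 # we observe the level only
    x, P = np.array([est[0], 0.0]), np.eye(2) * 1e2
    xf, Pf, xp, Pp = [], [], [], []
    for t in range(T):
        x_pred, P_pred = F @ x, F @ P @ F.T + Q
        S = H @ P_pred @ H + se[t] ** 2      # known heteroskedastic noise
        K = (P_pred @ H) / S
        x = x_pred + K * (est[t] - H @ x_pred)
        P = P_pred - np.outer(K, H @ P_pred)
        xp.append(x_pred); Pp.append(P_pred); xf.append(x); Pf.append(P)
    xs = [None] * T
    xs[-1] = xf[-1]
    for t in range(T - 2, -1, -1):           # RTS backward pass
        G = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + G @ (xs[t + 1] - xp[t + 1])
    return np.array([state[0] for state in xs])
```

On a noisy, slowly accelerating trend this cuts MSE against the truth relative to the raw estimates, consistent with the mechanism described above (the specific 80%/98% figures are the paper's, not this sketch's).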
🌱 Causal Debiasing for Robust Machine Learning
Spurious associations in training data arise because models learn correlations rather than causal relations. We propose a composite loss that incorporates causal and behavioral priors: an invariance penalty for label-preserving perturbations (e.g., swapping gendered pronouns should not change predictions), a directional penalty for perturbations that should change the label in a known direction, and a falsification penalty inspired by epidemiological negative controls that discourages reliance on features with no causal relationship to the outcome. The framework unifies ideas from CheckList-style behavioral testing, causal inference, and adversarial robustness into a single training objective.
OTHER
🌱 One Concept at a Time: Subspace-Constrained Causal Inference for High-Dimensional Treatments
Social scientists increasingly wish to reason causally about high-dimensional treatments (texts, images, prompts) while isolating the effect of a single latent concept. We treat a pretrained model as a measurement device and use minimal edit pairs to estimate a concept-tangent subspace in activation space, with diagnostics that can explicitly reject the existence of a clean subspace for a given encoder and concept. When diagnostics are favorable, activation-steering maps constrained to the subspace generate approximate minimal edits, and concept coordinates serve as treatments in a double/debiased ML estimator. The resulting estimand is a local, representation-dependent linear effect—not a universal causal effect of the underlying human concept.
📦 incline: Estimate Trend at a Particular Point in Time in a Noisy Time Series
Estimating the trend (derivative) at a specific point in a noisy time series is difficult because naive approaches like computing differences between consecutive observations amplify noise rather than reveal the underlying signal. Incline addresses this by first smoothing the time series using either Savitzky-Golay filters (local polynomial fitting) or smoothing splines, then estimating the first or second derivative of the smoothed function at chosen points in time. The difference between naive and smoothed estimates can be substantial: in the provided example, the correlation between them is -0.47, making the choice of method consequential for applications like detecting sudden cost increases or identifying rapidly changing patient health trajectories.
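The local-polynomial idea behind Savitzky-Golay derivative estimation can be sketched in a few lines; this is the principle, not incline's API.

```python
import numpy as np

def local_poly_slope(y, t, window=11, degree=2):
    """Estimate dy/dt at index t by fitting a polynomial on a window
    centered at t (the idea behind Savitzky-Golay filtering), then
    evaluating the fit's derivative at the center."""
    half = window // 2
    lo, hi = max(0, t - half), min(len(y), t + half + 1)
    x = np.arange(lo, hi) - t          # center so the slope is at x = 0
    coefs = np.polyfit(x, y[lo:hi], degree)
    return np.polyval(np.polyder(coefs), 0.0)
```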
📦 Analytic-Hessian Bandwidth Selection
Bandwidth selection for Nadaraya-Watson kernel regression and kernel density estimation typically relies on cross-validation, which is computationally expensive. This package implements an analytic bandwidth selector based on the Hessian of the cross-validation objective, giving a closed-form approximation that avoids the grid search.
📦 A Lightweight ALS Solver for Iterative GLS
Generalized least squares requires estimating and inverting the error covariance matrix, which is expensive when the matrix is large or unstructured. This package uses alternating least squares with a low-rank factor-analytic decomposition of the covariance, making iterative GLS feasible for problems where the full covariance is too large to invert directly.
🗃 Optimal Classification Cutoffs for F1-score, etc.
Classifiers produce continuous scores; deployment requires a threshold. The optimal threshold depends on the metric you care about—F1, balanced accuracy, or a custom cost function—and the score distribution. This script computes the exact threshold that maximizes a given stepwise metric, avoiding the grid-search approximation common in practice.
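One standard way to compute the exact maximizer is a single sweep over the sorted scores; the identity F1 = 2TP / (2TP + FP + FN) avoids recomputing precision and recall at each cut. This sketch (names ours) ignores tied scores for brevity.

```python
import numpy as np

def best_f1_threshold(scores, labels):
    """Exact F1-maximizing threshold: sort scores descending and evaluate
    F1 at every cut, predicting positive when score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)
    s, ylab = scores[order], labels[order]
    tp = np.cumsum(ylab)                   # true positives at each cut
    pred_pos = np.arange(1, len(s) + 1)    # TP + FP at each cut
    total_pos = ylab.sum()                 # TP + FN (constant)
    f1 = 2 * tp / (pred_pos + total_pos)   # 2TP / (2TP + FP + FN)
    best = int(np.argmax(f1))
    return s[best], f1[best]
```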
📦 pyppur: Projection Pursuit Dimension Reduction With Reconstruction Loss
pyppur is a Python package that implements projection pursuit methods for dimensionality reduction. Unlike traditional projection pursuit objectives geared toward finding 'interesting' (non-Gaussian) projections, pyppur focuses on finding non-linear projections by minimizing either reconstruction loss or distance distortion.
ONLINE SAFETY
Exposed: Shedding Blacklight On Online Privacy
With Lucas Shen
🗃 Replication Materials
To what extent are users surveilled on the web, by what technologies, and by whom? We answer these questions by combining passively observed, anonymized browsing data of a large, representative sample of Americans with domain-level data on tracking from Blacklight. We find that nearly all users (>99%) encounter at least one ad tracker or third-party cookie over the observation window. More invasive techniques like session recording, keylogging, and canvas fingerprinting are less widespread, but over half of the users visited a site employing at least one of these within the first 48 hours of the start of tracking. Linking trackers to their parent organizations reveals that a single organization, usually Google, can track over 50% of web activity of more than half the users. Demographic differences in exposure are modest and often attenuate when we account for browsing volume. However, disparities by age and race remain, suggesting that what users browse, not just how much, shapes their surveillance risk.
Pwned: How Often Are Americans' Online Accounts Breached?
With Ken Cor. ACM Web Science Conference, 2019
🗃 Replication Materials
RELATED: Bob Rudis Analyzes Exposure by Breach; I Have Been Pwned: Evidence from the Florida Voter Registration Data
News about massive data breaches is increasingly common. But what proportion of Americans are exposed in these breaches is still unknown. We combine data from a large, representative sample of American adults (n = 5,000), recruited by YouGov, with data from Have I Been Pwned to estimate a lower bound on the number of times Americans' private information has been exposed. We find that at least 82.84% of Americans have had their private information, such as account credentials, Social Security Numbers, etc., exposed. On average, Americans' private information has been exposed in at least three breaches. The better educated, the middle-aged, women, and Whites are more likely to have had their accounts breached than the complementary groups.
📦 Piedomains: Predict the Kind of Content Hosted by a Domain
With Rajashekar Chintalapati.
RELATED: Domain Knowledge: Predicting the Kind of Content Hosted by a Domain.
With Suriyan Laohaprapanon. Complex, Intelligent and Software Intensive Systems (CISIS), 2020.
The package infers the kind of content hosted by a domain using the domain name, the textual content, and the screenshot of the homepage. We use domain category labels from Shallalist and build our own training dataset by scraping and taking screenshots of the homepage.
Pass-Fail: Using a Password Generator to Improve Password Strength
With Rajashekar Chintalapati
How Often is Politicians' Data Breached? Evidence from HIBP
With Lucas Shen.
🗃 Replication Materials
Bad Domains: Exposure to Malicious Content Online
With Lucas Shen.
🗃 Replication Materials
📦 Know Your IP
With Suriyan Laohaprapanon
📦 virustotal: R Client for the VirusTotal Public API 2.0
Social Proof is in the Pudding: The (Non)-Impact of Social Proof on Software Downloads
With Lucas Shen.
🗃 Replication Materials
Open-source software is widely used in commercial applications. When choosing open-source software for a new problem, developers often use social proof as a cue. Together, these two facts raise the concern that bad actors can game social proof metrics to induce the use of malign software. We study the question using two field experiments. On the largest developer platform, GitHub, we buy ‘stars’ for a random set of GitHub repositories of new Python packages and estimate their impact on package downloads. We find no discernible impact. In another field experiment, we manipulate the number of human downloads for Python packages. Again, we find little effect.
TOOLS
📦 rmcp: R MCP Community Server
📦 StatQA: Extract Multimodal Stats Q/A from Tables With Provenance
StatQA is a modern Python framework for automatically extracting structured facts, statistical insights, and multimodal Q/A pairs from tabular datasets. It converts raw columns and values into clear, human-readable statements paired with rich visualizations, enabling rapid knowledge discovery, CLIP-style multimodal RAG corpus construction, and LLM training.
📦 Lost Years: Expected Number of Years Lost
With Suriyan Laohaprapanon.
Mortality rate is puzzling to mortals. A better number is the expected number of years lost. (A yet better number would be quality-adjusted years lost.) To make it easier to calculate the expected years lost, lost_years provides a convenient way to join to the SSA actuarial data, HLD data, and WHO life table data.
📦 repaper: convert a photo of a form to a web-based form or an editable PDF form
With Bhanu Teja.
📦 indicate: transliterate Indic languages to English
With Rajashekar Chintalapati.
LayoutLens: AI-Assisted UI Testing
📱 Adjacent — Related Repositories Recommender
📱 Advertiser: Promote Your GitHub Repositories on BlueSky
📦 🍠 tuber: Access YouTube API via R
META SCIENCE
A Benchmark For Benchmarks
Review of "Noise: A Flaw in Human Judgment"
With Andrew Gelman. Chance. 2024.
Significant Error: Citations to Research With Publicized Statistical Errors
With Ken Cor.
🗃 Replication Materials
Propagation of Error: Approving Citations to Problematic Research
With Ken Cor.
🗃 Replication Materials
📱 Get Notified When Cited Article is Retracted
Highlight Citations to Retracted Articles
Softverse: Auto-compute Citations to Software From Replication Files
🌐 softwarecite.com
user: Auto-compute Citations to Software From GitHub
RELATED: 📦 Python metrics
AutoSum: Summarize Publications Automatically and Discover Miscitations
By the Numbers: Toward More Precise Numerical Summaries of Results
With Andrew Guess. The Political Methodologist. 24(1): 2016
The Review: Production and Consumption of APSR Articles
superdf: Save Metadata with the Data in R and Python DataFrames
Not to Code: Evidence From Static Code Analysis of Replication Scripts
NAMES
Predicting Race and Ethnicity From Sequence of Characters in a Name
With Rajashekar Chintalapati and Suriyan Laohaprapanon. arXiv.org
RELATED: 📦 Python Package for implementing the method.
PRESS: InfoQ | AnacondaCON presentation (Video)
Sound Names: Classify Names Based on Sequence of Sounds
Graphic Names: Classify Names Using Google Image Search and Clarifai
📦 Naampy: Infer Sociodemographic Characteristics from Indian Names
With Rajashekar Chintalapati and Suriyan Laohaprapanon.
📦 Pranaam: Predict Religion From Name
With Rajashekar Chintalapati.
📦 naamkaran: a generative model for names
With Rajashekar Chintalapati.
📦 parsernaam: ML-assisted name parser
With Rajashekar Chintalapati.
📦 instate: predict spoken language from last name
With Rajashekar Chintalapati and Atul Dhingra.
RELATED: Instate: Predict the State of Residence from Last Name.
With Atul Dhingra.
DECISION MAKING
GROUP AFFECT
Inter-group Prejudice
Affect, Not Ideology: A Social Identity Perspective on Polarization
With Shanto Iyengar and Yphtach Lelkes. Public Opinion Quarterly. 76(3), 405–431, 2012.
🗃 Replication Materials
RELATED: Sort of Sorted But Definitely Cold, The Order of Feelings, Affectively Polarized?, Party Time
PRESS: The New York Times, The Washington Post, Mother Jones, Vox, etc.
The Parties in our Heads: Misperceptions About Party Composition and Their Consequences
With Doug Ahler. The Journal of Politics. 80(3), 964–981, 2018.
🗃 Replication Materials
PRESS: FiveThirtyEight, Vox, The Washington Post, The Washington Post (2), Christian Science Monitor, The Hill, PBS (Twin Cities)
RELATED: The Partisans in our Heads | Data and Scripts
Typecast: A Routine Mental Shortcut Causes Party Stereotyping
With Doug Ahler. Political Behavior. 2022.
🗃 Replication Materials | Appendix
PRESS: Heterodox Academy
All in the Eye of the Beholder: Partisan Affect and Ideological Accountability
With Shanto Iyengar.
In The Feeling, Thinking Citizen: Essays in Honor of Milton Lodge. 2018.
🗃 Replication Materials
RELATED: Still Close: Perceived Ideological Distance to Own and Main Opposing Party, 2012 Blog Post
PRESS: The New York Times
Coming to Dislike Your Opponents: The Polarizing Impact of Political Campaigns
With Shanto Iyengar.
PRESS: New York Times
Partisan Vision? Partisan Bias in Simple Visual Evaluations
With Carrie Roush and Alex Theodoridis.
🗃 Replication Materials
The Hostile Audience: The Effect of Access to Broadband Internet on Partisan Affect
With Yphtach Lelkes and Shanto Iyengar. American Journal of Political Science. 61(1): 5–20, 2017.
🗃 Replication Materials
PRESS: The Guardian
Holier Than Thou? No Large Partisan Gap in Consumption of Pornography Online
With Lucas Shen. Journal of Quantitative Description. 2024.
🗃 Replication Materials
Hidden Racial Prejudice? Impact of Social Desirability Pressures on Endorsement of Racial Stereotypes
With Jon Krosnick, Tobias Stark, and Floor van Maaren. Sociological Methods and Research. 51(2), 605–631, 2019.
🗃 Replication and Supplementary Materials
INFORMATION ENVIRONMENT
What and Who is on Network Television?
Working Women on Indian TV
With Asha Sood
The Face of Crime in Prime Time: Evidence from Law and Order
With Daniel Trielli.
🗃 Replication Materials
PRESS: The Washington Post
Extreme Recall: Which Politicians Come to Mind?
With Daniel Weitzel. Journal of Elections, Public Opinion and Parties. 2024.
🗃 Replication Materials
RELATED: Extreme Recall
DELIBERATION
What Would Dahl Say? An Appraisal of the Democratic Credentials of the Deliberative Polls and Other Mini-publics
With Ian O'Flynn. Deliberative Mini-Publics. 41–58, 2014. ECPR Press.
How Can You Think That?: Deliberation and the Learning of Opposing Arguments
🗃 Replication Materials
Deliberative Distortions? Homogenization, Polarization, and Domination in Small Group Deliberations
With Robert Luskin, Kyu Hahn, and James Fishkin. The British Journal of Political Science. 52(3), 1205–1225, 2022.
🗃 Replication Materials
What Future for Kirkuk? Evidence from a deliberative intervention
With Ian O'Flynn, Jalal Mistaffa, and Nahwi Saeed. Democratization. 26(7), 1299–1317, 2019.
🗃 Replication Materials | Supporting Information
OTHER
Problem Solving
Is an Uncertain Prospect Less Preferred Than Its Worst Possible Outcome? New Evidence on the Uncertainty Effect
With Doug Ahler.
🗃 Replication Materials
Mixed Signals: Movie Quality Assessments Across Platforms
Americans' Attitudes Toward The Affordable Care Act: Would Better Public Understanding Increase or Decrease Favorability?
With Wendy Gross, Tobias Stark, Jon Krosnick, Josh Pasek, Trevor Thompson, Jennifer Agiesta, and Dennis Junius.
PRESS: Forbes, Pacific Standard, The Dish, among other outlets.
Americans' Attitudes toward the Affordable Care Act: What Role Do Beliefs Play?
With Gabriel Miao Li, Josh Pasek, Jon Krosnick, Tobias H. Stark, Jennifer Agiesta, Trevor Tompson, and Wendy Gross. Annals of the American Academy of Political and Social Science. 2022.
Revisiting a Natural Experiment: Do Legislators With Daughters Vote More Liberally on Women's Issues?
With Don Green, Oliver Hyman-Metzger, and Michelle Zee. Journal of Political Economy Microeconomics. 2023.
🗃 Replication Materials | Supporting Information
PRESS: Phys.org
MISSING WOMEN
Son Bias in the US: Evidence from Business Names
With Walter Guillioli
🗃 Replication Materials
Which Women Are Missing? Adult Sex Ratio By Last Name
With Suriyan Laohaprapanon
Missing Women on the Streets
Epic Children: Sex Ratio of Children of Key Characters in Epics
Missing Daughters of Indian Politicians
NEWS
Not News: Provision of Apolitical News in the British News Media
With Suriyan Laohaprapanon.
🗃 Replication Materials
Strength in Numbers: Multiple Measures of Media Ideology
With Philip Habel.
🗃 Replication Materials
Measuring Agendas and Positions on Agendas
With Andrew Guess.
🗃 Replication Materials
📦 Notnews: Predict the Type of News Based on Story Text and URL
With Suriyan Laohaprapanon.
Unreadable News: How Readable is American News?
With Lucas Shen.
Follow Your Ideology: A Measure of Ideological Location of Media Sources
With Pablo Barberá.
The Supply of Media Slant Across Outlets and Demand for Slant Within Outlets: Evidence from US Presidential Campaign News
With Marcel Garz, Daniel Stone, and Justin Wallace. European Journal of Political Economy.
🗃 Replication Materials
Don't Expose Yourself: Discretionary Exposure to Political Information
With Yphtach Lelkes. Oxford Research Encyclopedia of Politics. 2018.
🗃 Replication Materials
RELATED: Categorizing the Content of Domains, Measuring Selective Exposure, The Fairest of All
The Good NYT: Provision of Apolitical News in the New York Times
Hard News: The Softening of Network Television News
With Daniel Weitzel.
🗃 Replication Materials
Partisan Imbalance in Politifact?
💾 Top News! URLs from News Feeds of Major National News Sites (2022-)
With Derek Willis
💾 CNN Transcripts 2000–2025
MEASURING LEARNING, KNOWLEDGE, AND MISINFORMATION
You Cannot be Serious: The Impact of Accuracy Incentives on Partisan Bias
With Markus Prior and Kabir Khanna. Quarterly Journal of Political Science. 10(4), 489–518, 2015.
🗃 Online Appendix; Replication Materials
PRESS: Washington Monthly, Pacific Standard, The New York Times
RELATED: Partisan Gaps in Retrospection are Highly Variable; Blog Post
Motivated Responding in Studies of Factual Learning
With Kabir Khanna. Political Behavior. 40(1): 79–101, 2018.
🗃 Replication Materials
RELATED: Blog Summarizing the Paper, The Innumerate American
A Gap in Our Understanding? Reconsidering the Evidence for Partisan Knowledge Gaps
With Carrie Roush. Quarterly Journal of Political Science. 18(1), 2023.
🗃 Replication Materials
RELATED: An Unclear Gap: How Vague Response Options Produce Partisan Knowledge Gaps
PRESS: Not Another Politics Podcast (U. Chicago)
The Waters of Casablanca: On Political Misinformation
With Robert Luskin.
Misinformation About Misinformation: Of Headlines and Survey Design
With Robert Luskin, Yul Min Park, and Joshua Blank.
🗃 Replication Materials
Misinformed About the Affordable Care Act? Leveraging Certainty to Assess the Prevalence of Misinformation
With Josh Pasek and Jon Krosnick. Journal of Communication. 65(4): 660–673, 2015
🗃 Supporting Information | Replication Materials
Guessing and Forgetting: A Latent Class Model for Measuring Learning
With Ken Cor. Political Analysis. 24(2): 226–242, 2016.
🗃 Replication Materials
REVIEW: '... a real contribution to the literature.' — Ed Haertel
RELATED: 📦 R Package for implementing the method.
Measuring Learning in Informative Processes
With Robert Luskin and Ariel Helfer.
A Measurement Gap? Effect of the Survey Instrument and Scoring on the Partisan Knowledge Gap
With Lucas Shen and Daniel Weitzel. Public Opinion Quarterly. 2025.
🗃 Replication Materials
Research suggests that partisan gaps in knowledge of facts with partisan implications
are wide and widespread. Using a series of experiments, we investigate the extent to
which partisan gaps in commercial surveys result from differences in beliefs rather than
motivated guessing. Knowledge items on commercial surveys often have features that
encourage guessing. We find that removing such features yields scales with greater
reliability and higher criterion validity. More substantively, partisan gaps on scales without
these “inflationary” features are roughly 40% smaller. Thus, contrary to Prior, Sood
and Khanna (2015), who find that the upward bias is explained by the knowledgeable
deliberately marking the wrong answer (partisan cheerleading), our data suggest, in
line with Bullock et al. (2015) and Graham and Yair (2023), that partisan gaps on
commercial surveys are strongly upwardly biased by motivated guessing by the ignorant.
Relatedly, we also find that partisans know less than the toplines of commercial
polls suggest.
An Unclear Gap: How Vague Response Options Produce Partisan Knowledge Gaps With Carrie Roush.
🗃 Replication Materials
Roush and Sood (2023) use a dataset of 162,083 responses to 187 items on 47 surveys
to show that partisan gaps are smaller and less frequent than commonly understood. The
average gap is a mere six and a half points, and gaps’ “signs” run counter to expectations roughly
30% of the time. However, one exception is the size of gaps on retrospection items on
the ANES, which are considerably bigger. These retrospection items use vague response
options, e.g., ‘About the same.’ Vague response options can inflate partisan gaps by offering
partisans the opportunity to interpret the same data differently. We test this conjecture
with a novel survey experiment. We present partisans with data indicating a small improvement
in economic indicators and manipulate the partisan tint of the change by manipulating who
is responsible for it. We find that significantly fewer partisans pick the option that
‘[things] got better’ when presented with an out-partisan cue than with a co-partisan cue. Our
findings suggest that vague options can induce knowledge gaps even when partisans have
the same information.
Measuring Perceptions of Numerical Strength of Salient and Stereotypical Groups
With Doug Ahler. Misinformation and Mass Audiences. 2018. University of Texas Press.
🗃 Appendix
PROVISION OF PUBLIC GOODS
StreetSense: Learning from Google Street View
With Suriyan Laohaprapanon and Kimberly Ortleb.
arXiv.org
🗃 Replication Materials
How good are public services and public infrastructure? Does their quality
vary by income? These are vital questions—they shed light on how well the government
is doing its job, the consequences of disparities in local funding, etc. But there is little
good data on many of these questions. We fill this gap by describing a scalable method
of getting data on one crucial piece of public infrastructure: roads. We assess the quality
of roads and sidewalks by exploiting data from Google Street View. We randomly sample
locations on major roads, query Google Street View images for those locations and code
the images using Amazon’s Mechanical Turk. We apply this method to assess the quality
of roads in Bangkok, Jakarta, Lagos, and Wayne County, Michigan. Jakarta’s roads have
nearly four times as many potholes as the roads of any other city. Surprisingly, the proportion of
road segments with potholes in Bangkok, Lagos, and Wayne is about the same, between
.06 and .07. Using the data, we also estimate the relation between the condition of the
roads and local income in Wayne, MI. We find that roads in more affluent census tracts
have somewhat fewer potholes.
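The sampling-and-query step of the pipeline can be sketched in a few lines. This is a minimal illustration, assuming straight-line road segments between two endpoints and the Google Street View Static API URL format; the coordinates, segment model, and `YOUR_KEY` placeholder are illustrative assumptions, not details from the paper.

```python
import random

STREETVIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"

def sample_points(segment_start, segment_end, n, seed=0):
    """Uniformly sample n (lat, lng) points along a straight road segment."""
    rng = random.Random(seed)
    (lat0, lng0), (lat1, lng1) = segment_start, segment_end
    points = []
    for _ in range(n):
        t = rng.random()  # position along the segment, in [0, 1)
        points.append((lat0 + t * (lat1 - lat0), lng0 + t * (lng1 - lng0)))
    return points

def streetview_url(lat, lng, api_key, size="640x640"):
    """Build a Street View Static API request URL for a sampled location."""
    return (f"{STREETVIEW_ENDPOINT}?size={size}"
            f"&location={lat:.6f},{lng:.6f}&key={api_key}")

# Sample five locations along one (hypothetical) Wayne County road segment.
points = sample_points((42.3314, -83.0458), (42.3400, -83.0500), n=5, seed=42)
urls = [streetview_url(lat, lng, api_key="YOUR_KEY") for lat, lng in points]
```

The fetched images would then be posted as Mechanical Turk tasks for pothole coding, a step omitted here.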
AutoSense: Automated Street Condition Assessment
Get in Line: Waiting Times at the DMV
With Noah Finberg.
CRICKET
Elo Ratings of International Cricket Teams By Format
With Derek Willis.
PRESS: The Hindu
WAR Ratings for Cricketers
With Derek Willis.
Fairly Random: The Effect of Winning the Toss on Winning the Match
With Apoorva Lal, Derek Willis, and Avidit Acharya. Journal of Sports Analytics. 2023.
🗃 Replication Materials
RELATED: Fairly Random: Impact of Winning the Toss on the Probability of Winning With Derek Willis. | arXiv.org
PRESS: ESPN: How much does the toss really matter?
RELATED: ESPN: Why replacing the toss with an auction is the fair thing to do
OTHER
Scaling ML Products at Startups: A Practitioner's Guide
With Atul Dhingra.
How do you scale a machine learning product at a startup? In particular, how do you
serve a greater volume, velocity, and variety of queries cost-effectively? We break down
costs into variable costs—the cost of serving the model and keeping it performant—and
fixed costs—the cost of developing and training new models. We propose a framework
for conceptualizing these costs, break them into finer categories, and describe ways to reduce them.
Lastly, since in our experience the most expensive fixed cost of a machine learning system is the
cost of identifying the root causes of failures and driving continuous improvement, we present a
way to conceptualize these issues and share our methodology for addressing them.
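The variable/fixed split lends itself to a back-of-the-envelope comparison between models. The function and all numbers below are hypothetical illustrations, not figures or methods from the paper.

```python
def total_cost(fixed_cost, cost_per_1k_queries, monthly_queries, months):
    """Fixed (development/training) plus variable (serving) cost over a horizon."""
    variable = cost_per_1k_queries * monthly_queries / 1000 * months
    return fixed_cost + variable

# Illustrative: a model that is more expensive to build but cheaper to serve
# (e.g., a distilled model) wins once query volume is high enough.
big = total_cost(fixed_cost=10_000, cost_per_1k_queries=2.00,
                 monthly_queries=5_000_000, months=12)
distilled = total_cost(fixed_cost=40_000, cost_per_1k_queries=0.50,
                       monthly_queries=5_000_000, months=12)
```

At this hypothetical volume the distilled model's extra $30,000 in fixed cost is more than recouped by lower serving cost over the year.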
The Effect of Gender Quotas on Some Qualities of Elites
The Limits of Electoral Gender Quotas in Rural Local Bodies
With Varun K. R.
The Older Half: Spousal Age Gap in India
With Suriyan Laohaprapanon
Unlanded: Distribution of Land in Bihar
With Lucas Shen
🗃 Replication Materials