ML-Enabled Econometrics with Unstructured Data

Paper Session

Sunday, Jan. 5, 2025 10:15 AM - 12:15 PM (PST)

Hilton San Francisco Union Square, Union Square 17 and 18
Hosted By: American Economic Association
  • Chair: Szymon Sacher, Stanford University

Debiasing Machine-Learning- or AI-Generated Regressors in Partial Linear Models

Jingwen Zhang, University of Washington
Wendao Xue, University of Washington
Yifan Yu, University of Texas-Austin
Yong Tan, University of Washington

Abstract

Researchers are increasingly leveraging machine learning (ML) or artificial intelligence (AI) technologies to predict feature variables and use them as regressors in subsequent econometric models. However, because ML/AI predictions are imperfect, these generated regressors inevitably contain measurement error. Using such regressors directly in subsequent econometric models can result in biased estimates and, ultimately, inaccurate conclusions. In light of this, we examine the problem of debiasing ML/AI-generated regressors in partial linear regression models. We propose estimators that use Two-Stage Least Squares (TSLS) and the Generalized Method of Moments (GMM) under the Double Machine Learning (DML) framework. We establish the consistency and asymptotic normality of our estimators. Moreover, we conduct extensive Monte Carlo simulations and empirical applications to show that our estimators outperform other methods. Our work advances causal inference by addressing measurement error problems arising from ML/AI-generated regressors in partial linear models, and hence has practical implications for designing experimental systems and overcoming ML/AI bias.
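
The measurement error problem described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator: it omits the nonparametric part of the partial linear model and the DML machinery, and it assumes classical (independent, mean-zero) prediction error plus a second independent noisy prediction that serves as an instrument. The function name `simulate` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate(n=200_000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare a naive OLS slope with a simple IV slope when the
    regressor is a noisy prediction of the true feature.

    Assumptions (illustrative only): linear outcome model, classical
    prediction error, and two independent noisy predictions.
    """
    rng = np.random.default_rng(seed)

    x = rng.normal(size=n)                 # true feature (unobserved in practice)
    y = beta * x + rng.normal(size=n)      # outcome generated from the true feature

    # Two "ML predictions": truth plus independent error.
    x_hat1 = x + noise_sd * rng.normal(size=n)   # used as the regressor
    x_hat2 = x + noise_sd * rng.normal(size=n)   # used as the instrument

    # Naive OLS slope of y on the noisy regressor: biased toward zero.
    naive = np.cov(x_hat1, y)[0, 1] / np.var(x_hat1, ddof=1)

    # Just-identified IV (equivalently TSLS) slope: Cov(z, y) / Cov(z, x_hat1).
    iv = np.cov(x_hat2, y)[0, 1] / np.cov(x_hat2, x_hat1)[0, 1]

    return naive, iv


if __name__ == "__main__":
    naive, iv = simulate()
    print(f"true beta = 2.0, naive OLS = {naive:.3f}, IV = {iv:.3f}")
```

With equal signal and noise variances, the naive slope is pulled toward roughly half of the true coefficient, while the instrumented slope stays close to it. The instrument works here only because the two prediction errors are independent by construction; whether such an instrument is available in a given application is a separate question.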

Inference for Regression with Variables Generated by AI or Machine Learning

Laura Battaglia, Oxford University
Timothy Christensen, University College London
Stephen Hansen, University College London
Szymon Sacher, Stanford University

Abstract

It has become common practice for researchers to use AI-powered information retrieval algorithms or other machine learning methods to estimate variables of economic interest, then use these estimates as covariates in a regression model. We show both theoretically and empirically that naively treating AI- and ML-generated variables as "data" leads to biased estimates and invalid inference. We propose two methods to correct bias and perform valid inference: (i) an explicit bias correction with bias-corrected confidence intervals, and (ii) joint maximum likelihood estimation of the regression model and the variables of interest. Through several applications, we demonstrate that the common approach generates substantial bias, while both corrections perform well.
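
As a point of reference, the textbook case of classical measurement error gives a sense of why treating a generated variable as data biases the estimates. The setup below is an assumption of this sketch and not necessarily the paper's setting: a single regressor $x_i$, a generated proxy $\hat{x}_i = x_i + u_i$, and an error $u_i$ that is mean-zero and independent of both $x_i$ and $\varepsilon_i$. Under those assumptions, the OLS slope from regressing $y_i$ on $\hat{x}_i$ converges to an attenuated value:

```latex
y_i = \beta x_i + \varepsilon_i, \qquad \hat{x}_i = x_i + u_i,
\qquad
\operatorname*{plim}_{n \to \infty} \hat{\beta}_{\mathrm{OLS}}
= \frac{\operatorname{Cov}(\hat{x}_i, y_i)}{\operatorname{Var}(\hat{x}_i)}
= \beta \, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}.
```

Here $\sigma_x^2$ and $\sigma_u^2$ denote the variances of $x_i$ and $u_i$. The factor multiplying $\beta$ is below one whenever $\sigma_u^2 > 0$, so standard errors computed as if $\hat{x}_i$ were observed data are centered on the wrong value.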

Demand Estimation with Text and Image Data

Giovanni Compiani, University of Chicago
Ilya Morozov, Northwestern University
Stephen Seiler, Imperial College London

Abstract

We propose a demand estimation method that allows researchers to estimate substitution patterns from unstructured image and text data. We first employ a series of machine learning models to measure product similarity from products' images and textual descriptions. We then estimate a nested logit model with product-pair specific nesting parameters that depend on the image and text similarities between products. Our framework does not require collecting product attributes for each category and can capture product similarity along dimensions that are hard to account for with observed attributes. We apply our method to a dataset describing the behavior of Amazon shoppers across several categories and show that incorporating texts and images in demand estimation helps us recover a flexible cross-price elasticity matrix.

JEL Classifications
  • C1 - Econometric and Statistical Methods and Methodology: General
  • C5 - Econometric Modeling