Generative AI and Machine Learning

Paper Session

Sunday, Jan. 4, 2026 10:15 AM - 12:15 PM (EST)

Loews Philadelphia Hotel, Commonwealth Hall B
Hosted By: American Finance Association
  • Chair: Junbo Wang, Louisiana State University

Chronologically Consistent Large Language Models

Songrun He, Washington University in St. Louis
Linying Lyu, Washington University in St. Louis
Asaf Manela, Washington University in St. Louis
Chun Ming Jimmy Wu, Washington University in St. Louis

Abstract

Large language models are increasingly used in social sciences, but their training data can introduce lookahead bias and training leakage. A good chronologically consistent language model requires efficient use of training data to maintain accuracy despite time-restricted data. Here, we overcome this challenge by training a suite of chronologically consistent large language models, ChronoBERT and ChronoGPT, which incorporate only the text data that would have been available at each point in time. Despite this strict temporal constraint, our models achieve strong performance on natural language processing benchmarks, outperforming or matching widely used models (e.g., BERT), and remain competitive with larger open-weight models. Lookahead bias is model and application-specific because even if a chronologically consistent language model has poorer language comprehension, a regression or prediction model applied on top of the language model can compensate. In an asset pricing application predicting next-day stock returns from financial news, we find that ChronoBERT's real-time outputs achieve a Sharpe ratio comparable to state-of-the-art models, indicating that lookahead bias is modest. Our results demonstrate a scalable, practical framework to mitigate training leakage, ensuring more credible backtests and predictions across finance and other social science domains.
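The paper's core device, training each model vintage only on text that existed at that point in time, can be illustrated with a minimal sketch; the corpus and field names here are hypothetical, not the authors' actual pipeline:

```python
from datetime import date

# Hypothetical corpus: each document carries its publication date.
corpus = [
    {"text": "Fed raises rates.", "published": date(2005, 6, 1)},
    {"text": "Dot-com bubble bursts.", "published": date(2000, 3, 10)},
    {"text": "GFC unfolds.", "published": date(2008, 9, 15)},
]

def training_slice(corpus, cutoff):
    """Keep only documents available on or before `cutoff`, so a model
    trained on the slice cannot leak future information into backtests."""
    return [doc["text"] for doc in corpus if doc["published"] <= cutoff]

# A model vintage trained as of 2006-01-01 never sees post-cutoff text.
print(training_slice(corpus, date(2006, 1, 1)))
```

Each cutoff yields a separate training set, and hence a separate chronologically consistent model, which is what makes the resulting return predictions free of lookahead bias by construction.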

Out of the (Black)Box: AI as Conditional Probability

Hui Chen, Massachusetts Institute of Technology
Antoine Didisheim, University of Melbourne
Luciano Somoza, ESSEC Business School

Abstract

The core technology powering modern Large Language Models (LLMs) estimates the distribution of probable answers conditional on the prompt. Using a financial news and returns dataset, we find that these conditional probabilities are interpretable and contain valuable economic information. Conversely, measures of declared confidence used in the literature are opaque, structurally biased, unstable, and more model-dependent, indicating that LLMs cannot assess their own confidence. Using conditional probabilities, we analyze LLM biases and provide insights into the internal mechanisms driving model decisions. Our results indicate that conditional probabilities provide a reliable and transparent reflection of LLM priors, particularly for economic applications.
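The distinction the paper draws, reading the model's conditional answer distribution rather than asking it to declare its own confidence, can be sketched in a few lines. The logits below are hypothetical; in practice they would come from an LLM's token log-probabilities for each candidate answer:

```python
import math

def conditional_probabilities(logits):
    """Softmax: convert raw model scores for candidate answers into a
    probability distribution conditional on the prompt."""
    m = max(logits.values())                       # stabilize the exponentials
    exp = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

# Hypothetical scores an LLM might assign to one-word answers about the
# direction of a stock after reading a news headline.
logits = {"up": 2.0, "down": 1.0, "flat": 0.0}
probs = conditional_probabilities(logits)
print(probs)  # "up" receives the largest share of probability mass
```

The resulting distribution is transparent and directly comparable across prompts, unlike a free-text confidence statement generated by the model itself.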

What Does ChatGPT Make of Historical Stock Returns? Extrapolation and Miscalibration in LLM Stock Return Forecasts

Shuaiyu Chen, Purdue University
Clifton Green, Emory University
Huseyin Gulen, Purdue University
Dexin Zhou, CUNY-Baruch College

Abstract

We examine how large language models (LLMs) interpret historical stock returns and price charts when prompted to forecast returns over short horizons. While stock returns exhibit short-term reversals, LLM forecasts overextrapolate, placing excessive weight on recent performance. Simulations indicate that LLM extrapolation is stronger for less persistent series, similar to humans, and difficult to eliminate through prompt engineering. LLM forecasts also appear optimistic relative to historical and future returns. When prompted for 80% confidence interval predictions, LLM forecasts are better calibrated than survey evidence. The findings suggest LLMs manifest common behavioral biases but are better at gauging risks than humans.

The Power of the Common Task Framework

Oliver Hellum, Copenhagen Business School
Theis Jensen, Yale University
Bryan Kelly, Yale University
Lasse Pedersen, Copenhagen Business School

Abstract

The “Common Task Framework” (CTF) is a collaborative and competitive process in which researchers solve a task using shared data, a predefined success metric, and a leaderboard. Using an economic model, we show that the CTF incentivizes effort, increases innovation, and curbs misrepresentation by reducing research costs and improving comparability. Historical examples from computer science underscore its effectiveness. To demonstrate its broader applicability, we propose a CTF for financial economics: a platform open to all researchers designed to identify the pricing kernel and systematically evaluate asset pricing models, from traditional factor-based approaches to modern machine learning techniques.

Discussant(s)
Alejandro Lopez-Lira, University of Florida
Rohit Allena, University of Houston
Shumiao Ouyang, University of Oxford
Winston Wei Dou, University of Pennsylvania
JEL Classifications
  • G1 - General Financial Markets