Generative AI and Machine Learning

Paper Session

Sunday, Jan. 4, 2026 10:15 AM - 12:15 PM (EST)

Loews Philadelphia Hotel, Commonwealth Hall B
Hosted By: American Finance Association
  • Chair: Junbo Wang, Louisiana State University

Chronologically Consistent Large Language Models

Songrun He, Washington University in St. Louis
Linying Lyu, Washington University in St. Louis
Asaf Manela, Washington University in St. Louis
Chun Ming Jimmy Wu, Washington University in St. Louis

Abstract

Large language models are increasingly used in social sciences, but their training data can introduce lookahead bias and training leakage. A good chronologically consistent language model requires efficient use of training data to maintain accuracy despite time-restricted data. Here, we overcome this challenge by training a suite of chronologically consistent large language models, ChronoBERT and ChronoGPT, which incorporate only the text data that would have been available at each point in time. Despite this strict temporal constraint, our models achieve strong performance on natural language processing benchmarks, outperforming or matching widely used models (e.g., BERT), and remain competitive with larger open-weight models. Lookahead bias is model and application-specific because even if a chronologically consistent language model has poorer language comprehension, a regression or prediction model applied on top of the language model can compensate. In an asset pricing application predicting next-day stock returns from financial news, we find that ChronoBERT's real-time outputs achieve a Sharpe ratio comparable to state-of-the-art models, indicating that lookahead bias is modest. Our results demonstrate a scalable, practical framework to mitigate training leakage, ensuring more credible backtests and predictions across finance and other social science domains.
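The paper's core device, training each model vintage only on text that existed at that point in time, can be illustrated with a minimal sketch; the corpus and field names here are hypothetical, not the authors' actual pipeline:

```python
from datetime import date

# Hypothetical corpus: each document carries its publication date.
corpus = [
    {"text": "Fed raises rates.", "published": date(2005, 6, 1)},
    {"text": "Dot-com bubble bursts.", "published": date(2000, 3, 10)},
    {"text": "GFC unfolds.", "published": date(2008, 9, 15)},
]

def training_slice(corpus, cutoff):
    """Keep only documents available on or before `cutoff`, so a model
    trained on the slice cannot leak future information into backtests."""
    return [doc["text"] for doc in corpus if doc["published"] <= cutoff]

# A model vintage trained as of 2006-01-01 never sees post-cutoff text.
print(training_slice(corpus, date(2006, 1, 1)))
```

Each cutoff yields a separate training set, and hence a separate chronologically consistent model, which is what makes the resulting return predictions free of lookahead bias by construction.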

Out of the (Black)Box: AI as Conditional Probability

Hui Chen, Massachusetts Institute of Technology
Antoine Didisheim, University of Melbourne
Luciano Somoza, ESSEC Business School

Abstract

The core technology powering modern Large Language Models (LLMs) estimates the distribution of probable answers conditional on the prompt. Using a financial news and returns dataset, we find that these conditional probabilities are interpretable and contain valuable economic information. Conversely, measures of declared confidence used in the literature are opaque, structurally biased, unstable, and more model-dependent, indicating that LLMs cannot assess their own confidence. Using conditional probabilities, we analyze LLM biases and provide insights into the internal mechanisms driving model decisions. Our results indicate that conditional probabilities provide a reliable and transparent reflection of LLM priors, particularly for economic applications.
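The distinction the paper draws, reading the model's conditional answer distribution rather than asking it to declare its own confidence, can be sketched in a few lines. The logits below are hypothetical; in practice they would come from an LLM's token log-probabilities for each candidate answer:

```python
import math

def conditional_probabilities(logits):
    """Softmax: convert raw model scores for candidate answers into a
    probability distribution conditional on the prompt."""
    m = max(logits.values())                       # stabilize the exponentials
    exp = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

# Hypothetical scores an LLM might assign to one-word answers about the
# direction of a stock after reading a news headline.
logits = {"up": 2.0, "down": 1.0, "flat": 0.0}
probs = conditional_probabilities(logits)
print(probs)  # "up" receives the largest share of probability mass
```

The resulting distribution is transparent and directly comparable across prompts, unlike a free-text confidence statement generated by the model itself.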

What Does ChatGPT Make of Historical Stock Returns? Extrapolation and Miscalibration in LLM Stock Return Forecasts

Shuaiyu Chen, Purdue University
Clifton Green, Emory University
Huseyin Gulen, Purdue University
Dexin Zhou, CUNY-Baruch College

Abstract

We examine how large language models (LLMs) interpret historical stock returns and price charts when prompted to forecast returns over short horizons. While stock returns exhibit short-term reversals, LLM forecasts overextrapolate, placing excessive weight on recent performance. Simulations indicate that LLM extrapolation is stronger for less persistent series, similar to humans, and difficult to eliminate through prompt engineering. LLM forecasts also appear optimistic relative to historical and future returns. When prompted for 80% confidence interval predictions, LLM forecasts are better calibrated than survey evidence. The findings suggest LLMs manifest common behavioral biases but are better at gauging risks than humans.

The Power of the Common Task Framework

Oliver Hellum, Copenhagen Business School
Theis Jensen, Yale University
Bryan Kelly, Yale University
Lasse Pedersen, Copenhagen Business School

Abstract

The “Common Task Framework” (CTF) is a collaborative and competitive process in which researchers solve a task using shared data, a predefined success metric, and a leaderboard. Using an economic model, we show that the CTF incentivizes effort, increases innovation, and curbs misrepresentation by reducing research costs and improving comparability. Historical examples from computer science underscore its effectiveness. To demonstrate its broader applicability, we propose a CTF for financial economics: a platform open to all researchers designed to identify the pricing kernel and systematically evaluate asset pricing models, from traditional factor-based approaches to modern machine learning techniques.

Discussant(s)
Alejandro Lopez-Lira, University of Florida
Rohit Allena, University of Houston
Shumiao Ouyang, University of Oxford
Winston Wei Dou, University of Pennsylvania
JEL Classifications
  • G1 - General Financial Markets