Deep Learning Methods for Data Curation in Economics

Paper Session

Sunday, Jan. 9, 2022 3:45 PM - 5:45 PM (EST)

Hosted By: Econometric Society
  • Chair: Melissa Dell, Harvard University

Partisanship and Economic Beliefs

Suproteem Sarkar, Harvard University
Johnny Tang, Harvard University

Abstract

A growing body of evidence suggests that partisanship affects how people interpret the world around them. Using text data from cable news broadcasts and corporate earnings calls, we study the influence of partisanship on how people make sense of economic developments. We find evidence suggesting that partisans focus on favorable narratives and selectively interpret new information. Finally, we discuss the relationships between partisan language and executives' business decisions.

Constructing a Historical Nordic Human Capital Database: An End-to-End Machine Learning Approach

Christian Dahl, University of Southern Denmark
Christian Westermann, University of Southern Denmark

Abstract

Automatic and robust transcription of scanned documents, especially those of lesser quality, remains a difficult problem. Despite recent and substantial advances in machine learning and deep learning, the large heterogeneity of historical document layouts limits the ability of these methods to generalize. We demonstrate that approaching the problem at a lower level, where generalization is far more feasible, yields robust and effective transcription of historical documents. We showcase an end-to-end pipeline of novel machine learning techniques, which we use to transcribe around 700 Norwegian and 450 Danish historical documents. The outcome is a Nordic human capital database with detailed individual-level data and broad research potential.
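The abstract does not detail the pipeline, but one reading of working "at a lower level" is to segment each scan into small word-level crops so the recognizer never sees a full, layout-dependent page. The sketch below illustrates that idea with standard OpenCV operations; the thresholds, kernel size, and the `recognizer` placeholder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' pipeline): segment a scanned page into
# small word-level crops, so a handwriting recognizer only ever sees short,
# layout-independent snippets. Thresholds are illustrative guesses.
import cv2

def extract_word_crops(image_path, min_area=150):
    """Return word-level crops from a scanned page, in rough reading order."""
    page = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarise with Otsu so that ink becomes foreground (white on black).
    _, binary = cv2.threshold(page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilate horizontally to merge characters of the same word into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    merged = cv2.dilate(binary, kernel, iterations=1)
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) >= min_area]
    boxes.sort(key=lambda b: (b[1], b[0]))   # sort top-to-bottom, left-to-right
    return [page[y:y + h, x:x + w] for (x, y, w, h) in boxes]

# Each crop would then be passed to a trained handwriting recognizer
# (hypothetical), e.g.:
# transcriptions = [recognizer(crop) for crop in extract_word_crops("scan.png")]
```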

Origins of Serial Sovereign Default

Sasha Indarte, University of Pennsylvania
Chenzi Xu, Stanford University

Abstract

Text-based methods and textual analysis have become increasingly common in economics research, lending novel insights into concepts that are difficult to observe and measure using only quantitative data. Applying these methods depends on having high-quality text, and economic historians often face barriers to using the available historical text because of mistakes and losses introduced in the OCR process. These errors can severely restrict the use of textual analysis even when the most readily available sources are qualitative. In this project, we study sovereign debt default and ask why some countries experience repeated cycles of borrowing and default. We rely on the universe of historical financial newspapers published in Britain to better understand investor views and motives, and we develop new machine learning tools to process the text and make it usable for our purposes. We explore how the political and economic environment in which countries borrow and default affects their likelihood of becoming a "serial defaulter" during the period from 1820 to 1939. We first categorize countries as serial defaulters based on the frequency and duration of their defaults. We then document how the circumstances of default and borrowing differ for serial defaulters. We aim to use our textual data to investigate how the political and economic circumstances of borrowing and default, together with investors' perceptions of the sovereign's motives, influence who becomes a serial defaulter.
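The abstract classifies countries by the frequency and duration of their defaults without stating the exact rule. The following is a minimal sketch of one such classification; the column names (`country`, `start_year`, `end_year`) and the cutoffs of three episodes or ten years in default are placeholders, not the paper's definitions.

```python
# Illustrative sketch only: one way to flag "serial defaulters" from a panel
# of default episodes, using frequency and duration cutoffs. The thresholds
# and column names are assumptions, not the paper's.
import pandas as pd

def flag_serial_defaulters(episodes, min_episodes=3, min_years_in_default=10):
    """episodes: DataFrame with columns country, start_year, end_year."""
    ep = episodes.copy()
    ep["duration"] = ep["end_year"] - ep["start_year"] + 1
    summary = ep.groupby("country").agg(
        n_defaults=("start_year", "count"),
        years_in_default=("duration", "sum"),
    )
    summary["serial_defaulter"] = (
        (summary["n_defaults"] >= min_episodes)
        | (summary["years_in_default"] >= min_years_in_default)
    )
    return summary

# Example usage with made-up data:
# episodes = pd.DataFrame({
#     "country": ["A", "A", "A", "B"],
#     "start_year": [1826, 1867, 1890, 1931],
#     "end_year": [1840, 1871, 1895, 1934],
# })
# print(flag_serial_defaulters(episodes))
```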

Applications of Machine Learning in Document Digitisation

Christian Dahl, University of Southern Denmark
Torben Johansen, University of Southern Denmark
Emil Sørensen, University of Bristol
Christian Westermann, University of Southern Denmark
Simon Wittrock, University of Southern Denmark

Abstract

Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that “large and detailed” usually implies “costly and difficult”, especially when the data are stored on paper and in books. Human operators and manual transcription have been the traditional approach for collecting historical data. We instead advocate the use of modern machine learning techniques to automate the digitisation process. We give an overview of the potential for applying machine digitisation to data collection through two illustrative applications. The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to construct a treatment indicator. Moreover, it allows an assessment of assignment compliance. The second application uses attention-based neural networks for handwritten text recognition to transcribe ages and birth and death dates from a large collection of Danish death certificates. We describe each step in the digitisation pipeline and provide implementation insights.
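As a concrete illustration of the first application's idea, the sketch below shows one simple way to perform unsupervised layout classification: downsample each raw scan to a low-resolution grayscale grid and cluster the resulting vectors with k-means. This is an assumed, simplified stand-in, not the authors' implementation; the file paths and cluster count are placeholders.

```python
# Minimal sketch (assumption, not the authors' method) of unsupervised layout
# classification: downsample each raw scan to a small grayscale grid and
# cluster the resulting vectors with k-means. Pages sharing a printed form
# layout tend to fall into the same cluster.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def layout_features(paths, size=(64, 64)):
    """Turn each scan into a flat, low-resolution intensity vector."""
    feats = []
    for p in paths:
        img = Image.open(p).convert("L").resize(size)
        feats.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
    return np.stack(feats)

def cluster_layouts(paths, n_layouts=4, seed=0):
    """Assign each scan a layout cluster label."""
    X = layout_features(paths)
    km = KMeans(n_clusters=n_layouts, random_state=seed, n_init=10)
    labels = km.fit_predict(X)
    return dict(zip(paths, labels))

# Example (placeholder file names):
# assignments = cluster_layouts(["scan_001.png", "scan_002.png"], n_layouts=2)
# A cluster label can then serve as a treatment indicator, as in the first
# application described above.
```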
JEL Classifications
  • C0 - General