%0 Journal Article %J International Conference on Learning Representations %D Forthcoming %T Noise-Robust De-Duplication at Scale %A Emily Silcock %A Luca D’Amico-Wong %A Jinglin Yang %A Melissa Dell %X Identifying near duplicates within large, noisy text corpora has myriad applications, ranging from de-duplicating training datasets, reducing privacy risks, and evaluating test set leakage to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a "re-rank" style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, de-duplicated RealNews and patent corpora, and the pre-trained models will facilitate further research and applications. %B International Conference on Learning Representations %G eng %0 Journal Article %J EMNLP Computational Social Science Workshop %D Forthcoming %T OLALA: Object-Level Active Learning Based Layout Annotation %A Zejiang Shen %A Jian Zhao %A Weining Li %A Yaoliang Yu %A Melissa Dell %X Layout detection is an essential step for accurately extracting structured contents from historical documents. The intricate and varied layouts present in these document images make it expensive to label the numerous layout regions that can be densely arranged on each page. Current active learning methods typically rank and label samples at the image level, where the annotation budget is not optimally spent due to the overexposure of common objects per image. Inspired by recent progress in semi-supervised learning and self-training, we propose OLALA, an Object-Level Active Learning framework for efficient document Layout Annotation. OLALA aims to optimize the annotation process by selectively annotating only the most ambiguous regions within an image, while using automatically generated labels for the rest. Central to OLALA is a perturbation-based scoring function that determines which objects require manual annotation. Extensive experiments show that OLALA can significantly boost model performance and improve annotation efficiency, facilitating the extraction of large volumes of structured text for downstream NLP applications.
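A minimal sketch of the bi-encoder retrieval step described in the Noise-Robust De-Duplication at Scale entry above, using a generic sentence-transformers model and a FAISS index as stand-ins for the paper's released model; the model name and similarity threshold below are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch: embed articles with a bi-encoder, then flag
# near-duplicate candidates via cosine similarity above a threshold.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

articles = ["Article text one ...", "Article text two ..."]  # corpus to de-duplicate

# Any contrastively trained bi-encoder can stand in here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(articles, normalize_embeddings=True, convert_to_numpy=True)

index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on unit vectors
index.add(emb.astype(np.float32))

scores, ids = index.search(emb.astype(np.float32), 5)  # nearest neighbors per article
THRESHOLD = 0.92  # illustrative; tune on labeled duplicate pairs
candidate_pairs = {
    (min(i, int(j)), max(i, int(j)))
    for i, (row_ids, row_scores) in enumerate(zip(ids, scores))
    for j, s in zip(row_ids, row_scores)
    if int(j) != i and s >= THRESHOLD
}
```

In the "re-rank" variant described in the abstract, a cross-encoder would then re-score these candidate pairs before a final duplicate decision.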
%B EMNLP Computational Social Science Workshop %P 2023 %G eng %0 Journal Article %J International Conference on Document Analysis and Recognition %D 2021 %T LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis %A Zejiang Shen %A Ruochen Zhang %A Melissa Dell %A Benjamin Lee %A Jacob Carlson %A Weining Li %X Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been ongoing efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkits, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-world use cases. The library is publicly available at https://layout-parser.github.io. %B International Conference on Document Analysis and Recognition %P 131-146 %8 2021 %G eng %U https://arxiv.org/pdf/2103.15348.pdf %0 Journal Article %J IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops %D 2020 %T A Large Dataset of Historical Japanese Documents with Complex Layouts %A Zejiang Shen %A Kaixuan Zhang %A Melissa Dell %X
Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exists for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts. A semi-rule-based method is developed to extract the layout elements, and the results are checked by human inspectors. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models. We also demonstrate the usefulness of the dataset on real-world document digitization tasks. The dataset is available at https://dell-research-harvard.github.io/HJDataset/.
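The LayoutParser and HJDataset entries above both concern layout detection on document scans. Below is a minimal sketch of the LayoutParser detection API; the specific model path, threshold, and label map follow the library's documented PubLayNet example and are assumptions about a suitable configuration, not the pipeline used in these papers.

```python
# Hedged sketch: detect layout regions on a scanned page with LayoutParser.
import cv2
import layoutparser as lp

image = cv2.imread("page_scan.png")[..., ::-1]  # BGR -> RGB

# A Detectron2-based model from LayoutParser's model zoo (illustrative choice).
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(image)                    # Layout of detected blocks
text_blocks = [b for b in layout if b.type == "Text"]
for block in text_blocks:
    x1, y1, x2, y2 = block.coordinates          # bounding box for downstream OCR
    print(block.type, block.score, (x1, y1, x2, y2))
```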
%B IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops %P 548-559 %G eng %U https://dell-research-harvard.github.io/HJDataset/ %0 Journal Article %J Review of Economic Studies %D 2020 %T The Development Effects of the Extractive Colonial Economy: The Dutch Cultivation System in Java %A Melissa Dell %A Benjamin Olken %B Review of Economic Studies %V 87 %P 164-203 %G eng %U https://www.dropbox.com/sh/xpfzjx5pzfzgktv/AABDJli9oT88-fTEscyVIdfKa?dl=0 %N 1 %0 Journal Article %J Conference on Neural Information Processing Systems Document Intelligence Workshop %D 2019 %T Information Extraction from Text Regions with Complex Tabular Structure %A Kaixuan Zhang %A Zejiang Shen %A Jie Zhou %A Melissa Dell %B Conference on Neural Information Processing Systems Document Intelligence Workshop %G eng %0 Journal Article %J American Economic Review: Insights %D 2019 %T The Violent Consequences of Trade-Induced Worker Displacement in Mexico %A Melissa Dell %A Benjamin Feigenberg %A Kensuke Teshima %B American Economic Review: Insights %V 1 %P 43-58 %G eng %N 1 %0 Journal Article %J Quarterly Journal of Economics %D 2018 %T Nation Building Through Foreign Intervention: Evidence from Discontinuities in Military Strategies %A Melissa Dell %A Pablo Querubin %B Quarterly Journal of Economics %V 133 %P 701-764 %G eng %N 2 %0 Journal Article %J Econometrica %D 2018 %T The Historical State, Local Collective Action, and Economic Development in Vietnam %A Melissa Dell %A Nathan Lane %A Pablo Querubin %B Econometrica %V 86 %P 2083-2121 %G eng %N 6 %0 Journal Article %J American Economic Review %D 2015 %T Trafficking Networks and the Mexican Drug War %A Melissa Dell %B American Economic Review %V 105 %P 1738-1779 %G eng %N 6 %0 Journal Article %J Journal of Economic Literature %D 2014 %T What Do We Learn from the Weather? The New Climate-Economy Literature %A Dell, M. %A Jones, B. %A Olken, B. %B Journal of Economic Literature %G eng %0 Journal Article %J American Economic Journal: Macroeconomics %D 2012 %T Temperature Shocks and Economic Growth: Evidence from the Last Half Century %A Dell, M. %A Jones, B. %A Olken, B. %B American Economic Journal: Macroeconomics %V 4 %P 66-95 %G eng %N 3 %0 Journal Article %J American Economic Journal: Macroeconomics %D 2010 %T Productivity Differences Between and Within Countries %A Dell, M. %A Acemoglu, D. %B American Economic Journal: Macroeconomics %V 2 %P 169-188 %G eng %N 1 %0 Journal Article %J Econometrica %D 2010 %T The Persistent Effects of Peru's Mining Mita %A Dell, M. %B Econometrica %V 78 %P 1863-1903 %G eng %N 6 %0 Journal Article %J American Economic Review Papers and Proceedings %D 2009 %T Temperature and Income: Reconciling New Cross-Sectional and Panel Estimates %A Dell, M. %A Jones, B. %A Olken, B. %B American Economic Review Papers and Proceedings %V 99 %P 198-204 %G eng %N 2 %0 Generic %D Working Paper %T LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models %A Abhishek Arora %A Melissa Dell %X Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular statistical software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages.
Our open-source package LinkTransformer aims to extend the familiarity and ease of use of popular string matching methods to deep learning. It is a general-purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks. %G eng %0 Generic %D Working Paper %T American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers %A Melissa Dell %A Jacob Carlson %A Tom Bryan %A Emily Silcock %A Abhishek Arora %A Zejiang Shen %A Luca D'Amico-Wong %A Quan Le %A Pablo Querubin %A Leander Heldring %X Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver-quality dataset for innovating multimodal layout analysis models and other multimodal applications.
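The LinkTransformer entry above describes linking data frames on noisy text fields in a few lines of code. A hedged sketch of what such a call could look like follows; the exact function signature and the choice of Hugging Face model are assumptions based on the package description, not verified against the released API.

```python
# Hedged sketch: link two tables on a noisy name field with LinkTransformer.
import pandas as pd
import linktransformer as lt  # assumed package import name

df_a = pd.DataFrame({"firm_name": ["Standard Oil Co.", "Gen. Electric"]})
df_b = pd.DataFrame({"firm_name": ["Standard Oil Company", "General Electric"]})

# Record linkage as text retrieval: each left row is matched to its
# semantically nearest right row under the chosen language model.
matched = lt.merge(
    df_a,
    df_b,
    merge_type="1:1",
    on="firm_name",
    model="sentence-transformers/all-MiniLM-L6-v2",  # any HF model, per the abstract
)
print(matched.head())
```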
%G eng %0 Generic %D Working Paper %T A Massive Scale Semantic Similarity Dataset of Historical English %A Emily Silcock %A Melissa Dell %X A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of the articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.
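One natural use of headline pairs like those in the HEADLINES entry above is contrastive fine-tuning of a semantic similarity model. A minimal sketch with the sentence-transformers library follows; the example pairs, model choice, and hyperparameters are illustrative assumptions, not the authors' setup.

```python
# Hedged sketch: fine-tune a bi-encoder on positive headline pairs
# with in-batch negatives (MultipleNegativesRankingLoss).
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Each pair: two headlines written by different papers for the same wire article.
pairs = [
    ("Senate passes farm bill", "Farm measure clears Senate"),
    ("Mayor opens new bridge", "New bridge dedicated by mayor"),
]

train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```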
%G eng %0 Generic %D Working Paper %T Quantifying Character Similarity with Vision Transformers %A Xinmei Yang %A Abhishek Arora %A Shao-Yu Jheng %A Melissa Dell %X Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists of the more likely substitutions, which improve the accuracy of string matching. However, such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR'ed documents, by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters - those with similar appearance, such as "O" and "0" - have similar vector representations. Using the cosine distance between characters' representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods, as OCR errors tend to be homoglyphic in nature. Homoglyphs can plausibly capture character visual similarity across any script, including low-resource settings. We illustrate this by creating homoglyph sets for 3,000-year-old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT captures relationships, noted in the archaeological literature, in how different abstract concepts were conceptualized by ancient societies. %G eng %0 Generic %D Working Paper %T Linking Representations with Multimodal Contrastive Learning %A Abhishek Arora %A Xinmei Yang %A Shao-Yu Jheng %A Melissa Dell %X

Many applications require grouping instances contained in diverse document datasets into classes. Most widely used methods do not employ deep learning and do not exploit the inherently multimodal nature of documents. Notably, record linkage is typically conceptualized as a string-matching problem. This study develops CLIPPINGS (Contrastively Linking Pooled Pre-trained Embeddings), a multimodal framework for record linkage. CLIPPINGS employs end-to-end training of symmetric vision and language bi-encoders, aligned through contrastive language-image pre-training, to learn a metric space where the pooled image-text representation for a given instance is close to representations in the same class and distant from representations in different classes. At inference time, instances can be linked by retrieving their nearest neighbor from an offline exemplar embedding index or by clustering their representations. The study examines two challenging applications: constructing comprehensive supply chains for mid-20th-century Japan by linking firm-level financial records - with each firm name represented by its image crop from the document and the corresponding OCR - and detecting which image-caption pairs in a massive corpus of historical U.S. newspapers came from the same underlying photo wire source. CLIPPINGS outperforms widely used string-matching methods by a wide margin and also outperforms unimodal methods. Moreover, a purely self-supervised model trained only on image-OCR pairs also outperforms popular string-matching methods without requiring any labels.
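A schematic sketch of the pooled image-text representation and nearest-neighbor linkage described in the CLIPPINGS entry above, using an off-the-shelf CLIP model as a stand-in for the paper's contrastively tuned encoders; the model name, file paths, and simple mean pooling are assumptions for illustration.

```python
# Hedged sketch: pool CLIP image and text embeddings for a record,
# then link by nearest neighbor among exemplar embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pooled_embedding(image_path: str, ocr_text: str) -> torch.Tensor:
    """Average the L2-normalized image and text embeddings for one record."""
    image = Image.open(image_path)
    inputs = processor(text=[ocr_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    pooled = (img + txt) / 2
    return pooled / pooled.norm(dim=-1, keepdim=True)

# Linking: cosine similarity of a query record against exemplar records.
query = pooled_embedding("firm_crop.png", "Mitsui & Co., Ltd.")
exemplars = torch.cat([pooled_embedding("ex1.png", "Mitsui and Company"),
                       pooled_embedding("ex2.png", "Sumitomo Corp.")])
best_match = int(torch.argmax(query @ exemplars.T))
```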

%G eng %0 Generic %D Working Paper %T Efficient OCR for Building a Diverse Digital History %A Jacob Carlson %A Tom Bryan %A Melissa Dell %X Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters' visual features, it is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history. %G eng %0 Generic %D Working Paper %T Path Dependence in Development: Evidence from the Mexican Revolution %A Dell, M. %G eng
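The Efficient OCR entry above frames OCR as character-level image retrieval with a contrastively trained vision encoder. A schematic sketch of that retrieval step follows; the encoder, the rendered reference set, and all file names are illustrative assumptions, not the paper's released models.

```python
# Hedged sketch: recognize a character crop by retrieving the nearest
# embedding among reference images of known characters.
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.models import resnet18

# Stand-in encoder; the paper trains its own contrastive vision encoder.
encoder = resnet18(weights=None)
encoder.fc = torch.nn.Identity()  # use penultimate features as embeddings
encoder.eval()

preprocess = T.Compose([T.Grayscale(num_output_channels=3),
                        T.Resize((64, 64)), T.ToTensor()])

def embed(img: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        v = encoder(preprocess(img).unsqueeze(0))
    return v / v.norm(dim=-1, keepdim=True)

# Reference index: one (or more) rendered image per character in the charset.
charset = ["a", "b", "c"]  # illustrative
reference_images = [Image.open(f"render_{c}.png") for c in charset]
reference = torch.cat([embed(im) for im in reference_images])

# Inference: a cropped character image is assigned the label of its
# nearest reference embedding by cosine similarity.
crop = Image.open("char_crop.png")
predicted = charset[int(torch.argmax(embed(crop) @ reference.T))]
```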