# Awesome Visually Rich Documents


*(Figure: sample visually rich documents)*

A lot of interesting work has been published on Visually Rich Documents (VRDs) in recent years. What often makes a task defined on a VRD more challenging than its plain-text counterpart is the document's multimodal nature. Here we provide a list of works on various problems related to VRDs, in the hope that this page will serve as a reference for researchers interested in this domain.

The following list presents, in no particular order, a number of interesting papers on VRDs published in recent years. You will also find a single BibTeX file (vrd.bib) containing entries for all the papers listed in this repository.


Disclaimer: Please note that this is **not** an exhaustive list. If you think a work should feature on this list, please submit a pull request or email me. We look forward to your contributions to keeping this page up to date with the latest developments in our community.


## Content Overview

We list papers from five different directions on this page. The quoted text following the + sign under each entry gives a brief description of the paper's main contributions.


## Information Extraction

  1. Fonduer: Knowledge base construction from richly formatted data, Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, and Christopher Ré SIGMOD 2018
    + "We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. "
  2. Visual segmentation for information extraction from heterogeneous visually rich documents, Ritesh Sarkhel, Arnab Nandi SIGMOD 2019
    + " We propose VS2, a generalized approach for information extraction from heterogeneous visually rich documents. There are two major contributions of this work. First, we propose a robust segmentation algorithm that decomposes a visually rich document into a bag of visually isolated but semantically coherent areas, called logical blocks. Document type agnostic low-level visual and semantic features are used in this process. Our second contribution is a distantly supervised search-and-select method for identifying the named entities within these documents by utilizing the context boundaries defined by these logical blocks."
  3. Improving Information Extraction from Visually Rich Documents using Visual Span Representations, Ritesh Sarkhel, Arnab Nandi VLDB 2021
    + " In this paper, we present Artemis - a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is two-fold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human-labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. It identifies the visual span(s) containing a named entity by leveraging this learned representation followed by an inference task. "
  4. CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web, Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar VLDB 2018
    + " In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a website and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. "
  5. Glean: Structured Extractions from Templatic Documents, Sandeep Tata, Navneet Potti, James B. Wendt, Lauro Beltrão Costa, Marc Najork, and Beliz Gunel VLDB 2021
    + " Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges : 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. "
  6. FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents, Bill Yuchen Lin, Ying Sheng, Nguyen Vo, and Sandeep Tata SIGKDD 2020
    + " Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. "
  7. One-shot Text Field Labeling using Attention and Belief Propagation for Structure Information Extraction, Mengli Cheng, Minghui Qiu, Xing Shi, Jun Huang, and Wei Lin MM 2020
    + " Existing learning based methods for text labeling task usually require a large amount of labeled examples to train a specific model for each type of document. However, collecting large amounts of document images and labeling them is difficult and sometimes impossible due to privacy issues. Deploying separate models for each type of document also consumes a lot of resources. Facing these challenges, we explore one-shot learning for the text field labeling task. Existing one-shot learning methods for the task are mostly rule-based and have difficulty in labeling fields in crowded regions with few landmarks and fields consisting of multiple separate text regions. To alleviate these problems, we proposed a novel deep end-to-end trainable approach for one-shot text field labeling, which makes use of attention mechanism to transfer the layout information between document images. We further applied conditional random field on the transferred layout information for the refinement of field labeling. "
  8. Combining visual and textual features for information extraction from online flyers, Emilia Apostolova, Noriko Tomuro EMNLP 2014
    + " Information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. In particular, genres such as marketing flyers and info-graphics often augment textual information by its color, size, positioning, etc. As a result, traditional text-based approaches to information extraction (IE) could underperform. In this study, we present a supervised machine learning approach to IE from online commercial real estate flyers. We evaluated the performance of SVM classifiers on the task of identifying 12 types of named entities using a combination of textual and visual features. "
  9. Chargrid: Towards Understanding 2D Documents, Anoop R Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul EMNLP 2018
    + " We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. "
  10. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents, Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao NAACL 2019
    + " In this paper, we introduce a graph convolution based model to combine textual and visual information presented in VRDs. Graph embeddings are trained to summarize the context of a text segment in the document, and further combined with text embeddings for entity extraction. "
  11. ZeroShotCeres: Zero-shot relation extraction from semi-structured webpages, Colin Lockard, Prashant Shiralkar, Xin Luna Dong, Hannaneh Hajishirzi ACL 2020
    + " In this work, we propose a solution for “zero-shot” open-domain relation extraction from webpages with a previously unseen template, including from websites with little overlap with existing sources of knowledge for distant supervision and websites in entirely new subject verticals. Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage and the relationships between them, enabling generalization to new templates. "
  12. Representation learning for information extraction from form-like documents, Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork ACL 2020
    + " We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. "
  13. End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks, Clément Sage, Alex Aussem, Véronique Eglin, Haytham Elghazel, and Jérémy Espinas Workshop on Structured Prediction, ACL 2020
    + " The predominant approaches for extracting key information from documents resort to classifiers predicting the information type of each word. However, the word level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. "
  14. Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models, Mengxi Wei, Yifan He, and Qiong Zhang SIGIR 2020
    + " We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents. We further introduce new fine-tuning objectives to improve in-domain unsupervised fine-tuning to better utilize large amount of unlabeled in-domain data. "
  15. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding, Timo I. Denk, Christian Reisswig Workshop on Document Intelligence, NeurIPS 2019
    + " Our novel BERTgrid, which is based on Chargrid by Katti et al. (2018), represents a document as a grid of contextualized word piece embedding vectors, thereby making its spatial structure and semantics accessible to the processing neural network. The contextualized embedding vectors are retrieved from a BERT language model. We use BERTgrid in combination with a fully convolutional network on a semantic instance segmentation task for extracting fields from invoices. "
  16. PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks, Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, and Rong Xiao *Pre-print*
    + " In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for Key Information Extraction by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. "
  17. Spatial Dependency Parsing for Semi-Structured Document Information Extraction, Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Sohee Yang, and Minjoon Seo *Pre-print*
    + " Information Extraction (IE) for semi-structured document images is often approached as a sequence tagging problem by classifying each recognized input token into one of the IOB (Inside, Outside, and Beginning) categories. However, such problem setup has two inherent limitations that (1) it cannot easily handle complex spatial relationships and (2) it is not suitable for highly structured information, which are nevertheless frequently observed in real-world document images. To tackle these issues, we first formulate the IE task as spatial dependency parsing problem that focuses on the relationship among text segment nodes in the documents. Under this setup, we then propose SPADE (SPAtial DEpendency parser) that models highly complex spatial relationships and an arbitrary number of information layers in the documents in an end-to-end manner. "
  18. Abstractive Information Extraction from Scanned Invoices (AIESI) using End-to-end Sequential Approach, Shreeshiv Patel, Dvijesh Bhatt *Pre-print*
    + " Abstract Information Extraction from Scanned Invoices (AIESI) is a process of extracting information like, date, total amount, payee name, and etc from scanned receipts. In this paper we proposed an improved method to ensemble all visual and textual features from invoices to extract key invoice parameters using Word wise BiLSTM. "
  19. VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach, Mohamed Kerroumi, Othmane Sayem, and Aymen Shabou *Pre-print*
    + " We introduce a novel approach for scanned document representation to perform field extraction. It allows the simultaneous encoding of the textual, visual and layout information in a 3D matrix used as an input to a segmentation model. We improve the recent Chargrid and Wordgrid models in several ways, first by taking into account the visual modality, then by boosting its robustness in regards to small datasets while keeping the inference time low. "
  20. Wordgrid: Extending Chargrid with Word-level Information, Timo I. Denk *Pre-print*
    + " For embedding words with semantically meaningful vectors, we propose a novel method for estimating dense word vectors, called word2vec-2d. It is a fork of word2vec that is trained on 2D document corpora rather than 1D text sequences. The notion of context is redefined to be the variablysized set of words that are spatially located within a certain distance to the center word. "
  21. Spatial Dual-Modality Graph Reasoning for Key Information Extraction, Hongbin Sun, Zhanghui Kuang, Xiaoyu Yue, Chenhao Lin and Wayne Zhang *Pre-print*
    + " In this paper, we propose an end-toend Spatial Dual-Modality Graph Reasoning method (SDMGR) to extract key information from unstructured document images. We model document images as dual-modality graphs, nodes of which encode both the visual and textual features of detected text regions, and edges of which represent the spatial relations between neighboring text regions. The key information extraction is solved by iteratively propagating messages along graph edges and reasoning the categories of graph nodes. In order to roundly evaluate our proposed method as well as boost the future research, we release a new dataset named WildReceipt, which is collected and annotated tailored for the evaluation of key information extraction from document images of unseen templates in the wild. "

## Language Model

  1. LayoutLM: Pre-training of Text and Layout for Document Image Understanding, Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou SIGKDD 2020
    + " In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. "
  2. Self-Supervised Representation Learning on Document Images, Adrian Cosma, Mihai Ghidoveanu, Michael Panaitescu-Liess, and Marius Popescu Workshop on Document Analysis Systems, 2020
    + " This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information. We propose two context-aware alternatives to improve performance on the Tobacco-3482 image classification task. We also propose a novel method for self-supervision, which makes use of the inherent multi-modality of documents (image and text). "
  3. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding, Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou *Pre-print*
    + " In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and textimage matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware selfattention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. "
  4. LAMBERT: Layout-Aware (Language) Modeling for information extraction, Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, and Filip Graliński *Pre-print*
    + " We introduce a new simple approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn the language semantics from scratch. We augment the input of the model only with the coordinates of token bounding boxes, avoiding the use of raw images. This leads to a layout-aware language model which can be then fine-tuned on downstream tasks. "
  5. Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning, Subhojeet Pramanik, Shashank Mujumdar, Hima Patel *Pre-print*
    + " In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation. We design the network architecture and the pretraining tasks to incorporate the multi-modal document information across text, layout, and image dimensions and allow the network to work with multi-page documents. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks. "
  6. LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding, Te-Lin Wu, Cheng Li, Mingyang Zhang, Tao Chen, Spurthi Amba Hombaiah, Michael Bendersky *Pre-print*
    + " We parse a document into content blocks (e.g. text, table, image) and propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document. Our LAMPreT encodes each block with a multimodal transformer in the lower-level, and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pretraining objectives where the lower-level model is trained with the standard masked language modeling (MLM) loss and the image-text matching loss, and the higher-level model is trained with three layout-aware objectives: (1) block-order predictions, (2) masked block predictions, and (3) image fitting predictions. "
  7. Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer, Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka *Pre-print*
    + " We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. "

## Layout Analysis

  1. Form2Seq: A Framework for Higher-Order Form Structure Extraction, Milan Aggarwal, Hiresh Gupta, Mausoom Sarkar, Balaji Krishnamurthy EMNLP 2020
    + " We propose Form2Seq, a novel sequenceto-sequence (Seq2Seq) inspired framework for structure extraction using text, with a specific focus on forms, which leverages relative spatial arrangement of structures. We discuss two tasks; 1) Classification of low-level constituent elements (TextBlock and empty fillable Widget) into ten types such as field captions, list items, and others; 2) Grouping lower-level elements into higher-order constructs, such as Text Fields, ChoiceFields and ChoiceGroups, used as information collection mechanism in forms. To achieve this, we arrange the constituent elements linearly in natural reading order, feed their spatial and textual representations to Seq2Seq framework, which sequentially outputs prediction of each element depending on the final task. We modify Seq2Seq for grouping task and discuss improvements obtained through cascaded end-to-end training of two tasks versus training in isolation. "
  2. DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding, Zilong Wang, Mingjie Zhan, Xuebo Liu, and Ding Liang EMNLP Findings 2020
    + " We consider the form structure as a tree-like or graph-like hierarchy of text fragments. The parent-child relation corresponds to the key-value pairs in forms. We utilize the state of-the-art models and design targeted extraction modules to extract multimodal features from semantic contents, layout information, and visual images. A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation. We adopt an asymmetric algorithm and negative sampling in our model as well. "
  3. Multi-Modal Association based Grouping for Form Structure Extraction, Milan Aggarwal, Mausoom Sarkar, Hiresh Gupta, and Balaji Krishnamurthy WACV 2020
    + " In this work, we present a novel multi-modal approach for form structure extraction. Given simple elements such as textruns and widgets, we extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups, which are essential for information collection in forms. To achieve this, we obtain a local image patch around each low-level element (reference) by identifying candidate elements closest to it. We process textual and spatial representation of candidates sequentially through a BiLSTM to obtain context-aware representations and fuse them with image patch features obtained by processing it through a CNN. Subsequently, the sequential decoder takes this fused feature vector to predict the association type between reference and candidates. These predicted associations are utilized to determine larger structures through connected components analysis. "
  4. Simplified DOM Trees for Transferable Attribute Extraction from the Web, Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, and Sandeep Tata The Web Conference, 2021
    + " In this paper, we propose a novel transferable method, Simplified DOM Trees for Attribute Extraction (SimpDOM), to tackle the problem by efficiently retrieving useful context for each node by leveraging the tree structure. We study two challenging experimental settings: (i) intra-vertical few-shot extraction, and (ii) cross-vertical fewshot extraction with out-of-domain knowledge, to evaluate our approach. "
  5. Few-shot prototype alignment regularization network for document image layout segmentation, Yujie Li, Pengfei Zhang, Xing Xu, Yi Lai, Fumin Shen, Lijiang Chen, and Pengxiang Gao Pattern Recognition 2021
    + " In this paper, we propose a novel method dubbed Few-Shot Prototype Alignment Regularization Network (FS-PARN). The FS-PARN method is inspired by recent studies in both metric learning and few-shot segmentation, which just need a few annotated images to solve the above two difficulties. Our FS-PARN method can make better use of the information of the support set by metric learning and have a better effect on image segmentation. It learns classification prototype within an embedding space and then completes pixel classification by matching each pixel on the query image with the learned prototype. "
  6. VIPS: a vision-based page segmentation algorithm, Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma Technical Report
    + " A new web content structure analysis based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception. Comparing to other existing techniques, our approach is independent to underlying documentation representation such as HTML and works well even when the HTML structure is far different from layout structure. "
  7. DocParser: Hierarchical Document Structure Parsing from Renderings, Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, and Stefan Feuerriegel *Pre-print*
    + " We developed “DocParser”: an end-to-end system for parsing the complete document structure – including all text elements, nested figures, tables, and table cell structures. Our second contribution is to provide a dataset for evaluating hierarchical document structure parsing. Our third contribution is to propose a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves the document structure parsing performance. "

## Classification

  1. Deterministic Routing between Layout Abstractions for Multi-Scale Classification of Visually Rich Documents, Ritesh Sarkhel, Arnab Nandi IJCAI 2019

    + " Classifying heterogeneous visually rich documents is a challenging task. Difficulty of this task increases even more if the maximum allowed inference turnaround time is constrained by a threshold. The increased overhead in inference cost, compared to the limited gain in classification capabilities make current multi-scale approaches infeasible in such scenarios. There are two major contributions of this work. First, we propose a spatial pyramid model to extract highly discriminative multi-scale feature descriptors from a visually rich document by leveraging the inherent hierarchy of its layout. Second, we propose a deterministic routing scheme for accelerating end-to-end inference by utilizing the spatial pyramid model. A depth-wise separable multi-column convolutional network is developed to enable our method. "

  2. Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification, Muhammad Zeshan Afzal, Andreas Kölsch, Sheraz Ahmed, and Marcus Liwicki ICDAR 2017

    + " The contribution of the paper is threefold: First, it investigates recently introduced very deep neural network architectures (GoogLeNet, VGG, ResNet) using transfer learning (from real images). Second, it proposes transfer learning from a huge set of document images, i.e. 400,000 documents. Third, it analyzes the impact of the amount of training data (document images) and other parameters to the classification abilities. "

  3. Real-time document image classification using deep CNN and extreme learning machines, Andreas Kölsch, Muhammad Zeshan Afzal, Markus Ebbecke, and Marcus Liwicki ICDAR 2017

    + " This paper presents an approach for real-time training and testing for document image classification. In production environments, it is crucial to perform accurate and time-efficient training. Existing deep learning approaches for classifying documents do not meet these requirements, as they require much time for training and fine-tuning the deep architectures. Motivated from Computer Vision, we propose a two-stage approach. The first stage trains a deep network that works as feature extractor and in the second stage, Extreme Learning Machines (ELMs) are used for classification. "

  4. Analysis of convolutional neural networks for document image classification, Chris Tensmeyer, Tony Martinez ICDAR 2017

    + " Convolutional Neural Networks (CNNs) are state-of-the-art models for document image classification tasks. However, many of these approaches rely on parameters and architectures designed for classifying natural images, which differ from document images. We question whether this is appropriate and conduct a large empirical study to find what aspects of CNNs most affect performance on document images. "

  5. Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks, Arindam Das, Saikat Roy, Ujjwal Bhattacharya, and Swapan K. Parui ICPR 2018

    + " The contribution of this work involves efficient training of region based classifiers and effective ensembling for document image classification. A primary level of `inter-domain' transfer learning is used by exporting weights from a pre-trained VGG16 architecture on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region based influence modelling, a secondary level of intra-domain transfer learning is used for rapid training of deep learning models for image segments. Finally, a stacked generalization based ensembling is utilized for combining the predictions of the base deep neural network models. "

  6. Multimodal Document Image Classification, Rajiv Jain, Curtis Wigington ICDAR 2019

    + " State-of-the-art methods for document image classification rely on visual features extracted by deep convolutional neural networks (CNNs). These methods do not utilize rich semantic information present in the text of the document, which can be extracted using Optical Character Recognition (OCR). We first study the performance of state-of-the-art text classification approaches when applied to noisy text obtained from OCR. We then show that fusing this textual information with visual CNN methods produces state-of-the-art results. "

  7. Two Stream Deep Network for Document Image Classification, Muhammad Nabeel Asim, Muhammad Usman Ghani Khan, Muhammad Imran Malik, Khizar Razzaque, Andreas Dengel, and Sheraz Ahmed ICDAR 2019

    + " This paper presents a novel two-stream approach for document image classification. The proposed approach leverages textual and visual modalities to classify document images into ten categories, including letter, memo, news article, etc. In order to alleviate dependency of textual stream on performance of underlying OCR (which is the case with general content based document image classifiers), we utilize a filter based feature-ranking algorithm. This algorithm ranks the features of each class based on their ability to discriminate document images and selects a set of top 'K' features that are retained for further processing. In parallel, the visual stream uses deep CNN models to extract structural features of document images.Finally, textual and visual streams are concatenated together using an average ensembling method. "

  8. Unsupervised exemplar-based learning for improved document image classification, Sherif Abuelwafa, Marco Pedersoli, and Mohamed Cheriet IEEE Access 2019

    + " In this paper, we present an approach for learning visual features for document analysis in an unsupervised way, which improves the document image classification performance without increasing the amount of annotated data. The proposed approach trains a neural network model on an auxiliary task in which every training example is associated with a different label (exemplar) and expanded to multiple images through a data augmentation technique. Thus, the learned model, which is trained in an unsupervised way, is used to boost the document classification performance. In fact, this learned model has proved to be consistently efficient in two different settings: i) as an unsupervised feature extractor to represent document images for an unsupervised classification task (i.e., clustering); and ii) in the parameters initialization of a supervised classification task trained with a small amount of annotated data. "

  9. Visual and Textual Deep Feature Fusion for Document Image Classification, Souhail Bakkali, Zuheng Ming, Mickael Coustaty, and Marçal Rusiñol CVPR Workshop 2020

    + " In this work, a two-stream neural architecture is proposed to perform the document image classification task. We conduct an exhaustive investigation of nowadays widely used neural networks as well as word embedding procedures used as backbones, in order to extract both visual and textual features from document images. Moreover, a joint feature learning approach that combines image features and text embeddings is introduced as a late fusion methodology. Both the theoretical analysis and the experimental results demonstrate the superiority of our proposed joint feature learning method comparatively to the single modalities. "

  10. Structural similarity for document image classification and retrieval, Jayant Kumar, Peng Ye, and David Doermann Pattern Recognition Letters 2014

    + " This paper presents a novel approach to defining document image structural similarity for the applications of classification and retrieval. We first build a codebook of SURF descriptors extracted from a set of representative training images. We then encode each document and model the spatial relationships between them by recursively partitioning the image and computing histograms of codewords in each partition. A random forest classifier is trained with the resulting features, and used for classification and retrieval. "


## Resource Papers

  1. DocBank: A Benchmark Dataset for Document Layout Analysis, Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou COLING 2020
    + " In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained tokenlevel annotations for document layout analysis. DocBank is constructed using a simple yet effective way with weak supervision from the LATEX documents available on the arXiv.com. With DocBank, models from different modalities can be compared fairly and multi-modal approaches will be further investigated and boost the performance of document layout analysis. We build several strong baselines and manually split train/dev/test sets for evaluation. "
  2. Evaluation of deep convolutional nets for document image classification and retrieval, Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis ICDAR 2015
    + " This work makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories. "
  3. ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts, Fotini Simistira, Manuel Bouillon, Mathias Seuret, Marcel Würsch, Michele Alberti, Rolf Ingold, and Marcus Liwicki ICDAR 2017
    + " This paper reports on the ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts (HisDoc-Layout-Comp) and provides further details and discussions. In this competition we introduce a new challenging dataset and state-of-the-art benchmark results for pixel-labelling and text line segmentation. The DIVA-HisDB comprises medieval manuscripts with complex layout in contrast to previous datasets, where rectangular text blocks and only a few decorative elements exist. In particular, the images of this competition contain many interlinear and marginal glosses as well as texts in various sizes and decorated letters. This makes the distinction of the four target labels (text, comment, decoration, and background) more difficult. In addition, to reflect the needs of scholars in the humanities, we request multi-labeling of certain regions (decorated text as text and decoration). Furthermore, we measure not just the accuracy, but the Intersection over Union (IU) of pixel sets, which better reflects the real performance. "
  4. DocVQA: A Dataset for VQA on Document Images, Minesh Mathew, Dimosthenis Karatzas, and C.V. Jawahar WACV 2021
    + " We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org. "
  5. ICDAR2019 competition on scanned receipt ocr and information extraction, Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar ICDAR 2019
    + " The ICDAR 2019 Challenge on "Scanned receipts OCR and key information extraction" (SROIE) covers important aspects related to the automated analysis of scanned receipts. The SROIE tasks play a key role in many document analysis systems and hold significant commercial potential. One of the key contributions of SROIE to the document analysis community is to offer a first, standardized dataset of 1000 whole scanned receipt images and annotations, as well as an evaluation procedure for such tasks. The Challenge is structured around three tasks, namely Scanned Receipt Text Localization (Task 1), Scanned Receipt OCR (Task 2) and Key Information Extraction from Scanned Receipts (Task 3). "
  6. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents, Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran ICDAR Workshop 2019
    + " We present a new dataset for form understanding in noisy scanned documents (FUNSD) that aims at extracting and structuring the textual content of forms. The dataset comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking. We also present a set of baselines and introduce metrics to evaluate performance on the FUNSD dataset, which can be downloaded at https://guillaumejaume.github.io/FUNSD. "
  7. Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout, Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemysław Biecek *Pre-print*
    + " State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages. To encourage progress on deeper and more complex Information Extraction (IE) we introduce a new task (named Kleister) with two new datasets. "
  8. WebSRC: A Dataset for Web-Based Structural Reading Comprehension, Lu Chen, Xingyu Chen, Zihan Zhao, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and Kai Yu *Pre-print*
    + " Web search is an essential way for human to obtain information, but it’s still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of web-based structural reading comprehension. Given a web page and a question about it, the task is to find an answer from the web page. This task requires a system not only to understand the semantics of texts but also the structure of the web page. Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. "
  9. VisualMRC: Machine Reading Comprehension on Document Images, Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida *Pre-print*
    + " . In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, wherein given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. Compared with existing visual question answering (VQA) datasets that contain texts in images, VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages. "
