LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding
author: Jiapeng Wang | institution: INTSIG-SCUT (Shanghai INTSIG Information) | conference: ACL 2022 | github: https://github.com/jpwang/lilt
Abstract
- Most existing related models can only handle document data in specific language(s)
- LiLT can be pretrained on the structured documents of a single language and then directly fine-tuned on other languages
Introduction
- When the layout structure remains unchanged, substituting the language of the text does not look obviously unnatural.
Figure 1: The substitution of language does not appear obviously unnatural when the layout structure remains unchanged, as shown in (a) a form and (b) a receipt. The detailed content has been re-synthesized to avoid leaking sensitive information. Best viewed zoomed in.
- Based on this inspiration, in this paper, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding.
- We propose BiACM to provide language-independent cross-modality interaction, along with an effective asynchronous optimization strategy for the textual and non-textual flows in pre-training.
- Moreover, we present two new pre-training objectives, namely Key Point Location (KPL) and Cross-modal Alignment Identification (CAI).
LiLT
Figure 2: The overall illustration of our framework. Text and layout information are separately embedded and fed into the corresponding flow. BiACM is proposed to accomplish the cross-modality interaction. At the model output, text and layout features are concatenated for self-supervised pre-training or downstream fine-tuning. $N_{l}$ is the number of Transformer layers. The red subscripts $M$/$R$ indicate randomly masked/replaced items for pre-training. $t$, $b$ and $r$ represent token, box and region, respectively.
Model Architecture
- The whole framework can be regarded as a parallel dual-stream Transformer.
- The layout flow shares a similar structure with the text flow, except for a reduced hidden size and intermediate size to achieve computational efficiency.
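A minimal sketch of this parallel dual-stream design, assuming illustrative hidden sizes (a RoBERTa-base-like text flow and a slimmer layout flow); the class name and the exact widths are assumptions, and the BiACM score exchange is omitted here:

```python
import torch.nn as nn

class DualStreamLayer(nn.Module):
    """One parallel Transformer layer: a text sub-layer plus a slimmer layout sub-layer.
    Sketch only; the hidden/intermediate sizes are illustrative, not the paper's exact values."""

    def __init__(self, text_hidden=768, layout_hidden=192, heads=12):
        super().__init__()
        # Text flow: a standard Transformer encoder layer.
        self.text_layer = nn.TransformerEncoderLayer(
            d_model=text_hidden, nhead=heads,
            dim_feedforward=4 * text_hidden, batch_first=True)
        # Layout flow: same structure, but with reduced hidden and intermediate sizes.
        self.layout_layer = nn.TransformerEncoderLayer(
            d_model=layout_hidden, nhead=heads,
            dim_feedforward=4 * layout_hidden, batch_first=True)

    def forward(self, text_states, layout_states):
        # In LiLT the two flows also exchange attention scores (BiACM, below);
        # here they simply run side by side to show the dual-stream shape.
        return self.text_layer(text_states), self.layout_layer(layout_states)
```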
Text Embedding
- All text strings in the OCR results are first tokenized and concatenated as a sequence $S_{t}$ by sorting the corresponding text bounding boxes from the top-left to bottom-right.
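A small sketch of this step, assuming each OCR item is a `(text, box)` pair with `box = (x0, y0, x1, y1)` and any subword tokenizer exposing a `tokenize()` method; giving each sub-token the box of its source text string is a common convention and an assumption here, not a detail stated above:

```python
def build_text_sequence(ocr_items, tokenizer):
    """Build the token sequence S_t from OCR output (sketch)."""
    # Sort text strings roughly from top-left to bottom-right:
    # primary key is the box's top coordinate, secondary key its left coordinate.
    ordered = sorted(ocr_items, key=lambda item: (item[1][1], item[1][0]))
    tokens, token_boxes = ["[CLS]"], [(0, 0, 0, 0)]          # special tokens get a dummy box
    for text, box in ordered:
        sub_tokens = tokenizer.tokenize(text)
        tokens.extend(sub_tokens)
        token_boxes.extend([box] * len(sub_tokens))          # each sub-token reuses its string's box
    tokens.append("[SEP]")
    token_boxes.append((0, 0, 0, 0))
    return tokens, token_boxes
```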
Layout Embedding
- We normalize and discretize all box coordinates to integers in the range [0, 1000], and use four embedding layers to generate x-axis, y-axis, height, and width features separately.
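A sketch of such a layout embedding; the four embedding tables and the [0, 1000] bucketing follow the description above, while the concatenate-then-project construction, the per-coordinate dimension split, and the hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    def __init__(self, layout_hidden=192, num_buckets=1001):
        super().__init__()
        dim = layout_hidden // 6                      # one slice per value of (x0, y0, x1, y1, h, w)
        self.x_emb = nn.Embedding(num_buckets, dim)   # shared by x0 and x1
        self.y_emb = nn.Embedding(num_buckets, dim)   # shared by y0 and y1
        self.h_emb = nn.Embedding(num_buckets, dim)
        self.w_emb = nn.Embedding(num_buckets, dim)
        self.proj = nn.Linear(6 * dim, layout_hidden)

    def forward(self, boxes, page_size):
        # boxes: (batch, seq_len, 4) pixel coordinates; page_size: (page_width, page_height).
        W, H = page_size
        x0 = (boxes[..., 0] * 1000 / W).long().clamp(0, 1000)
        y0 = (boxes[..., 1] * 1000 / H).long().clamp(0, 1000)
        x1 = (boxes[..., 2] * 1000 / W).long().clamp(0, 1000)
        y1 = (boxes[..., 3] * 1000 / H).long().clamp(0, 1000)
        h, w = (y1 - y0).clamp(0, 1000), (x1 - x0).clamp(0, 1000)
        feats = torch.cat([self.x_emb(x0), self.y_emb(y0), self.x_emb(x1),
                           self.y_emb(y1), self.h_emb(h), self.w_emb(w)], dim=-1)
        return self.proj(feats)                       # (batch, seq_len, layout_hidden)
```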
BiACM (Bi-directional Attention Complementation Mechanism)
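The notes above only state that BiACM provides the language-independent cross-modality interaction between the two flows. The sketch below reflects one reading of that mechanism: each flow computes its own scaled dot-product attention scores, and the two score maps are added to each other before the softmax; detaching the textual scores on the layout side during pre-training stands in for the asynchronous optimization strategy mentioned in the introduction. All names, shapes, and the `pretraining` flag are illustrative assumptions, not the repository's exact code.

```python
import math
import torch

def biacm_attention_scores(q_text, k_text, q_layout, k_layout,
                           head_dim_text, head_dim_layout, pretraining=True):
    """Sketch of BiACM-style attention-score sharing between the two flows.
    q_*/k_*: (batch, heads, seq_len, head_dim) query/key tensors of each flow."""
    # Per-flow scaled dot-product attention scores.
    scores_text = torch.matmul(q_text, k_text.transpose(-1, -2)) / math.sqrt(head_dim_text)
    scores_layout = torch.matmul(q_layout, k_layout.transpose(-1, -2)) / math.sqrt(head_dim_layout)

    # Cross-modality complementation: each flow also sees the other flow's scores.
    # Detaching the textual scores (assumed here for pre-training) keeps layout-side
    # gradients from flowing back into the text flow through this shared path.
    shared_text = scores_text.detach() if pretraining else scores_text
    new_layout_scores = scores_layout + shared_text
    new_text_scores = scores_text + scores_layout
    return new_text_scores, new_layout_scores
```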
Pre-training Tasks
Masked Visual-Language Modeling
- We mask 15% of the text tokens, among which 80% are replaced by the special token [MASK], 10% are replaced by random tokens sampled from the vocabulary, and 10% remain the same. Cross-entropy loss is adopted.
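A self-contained sketch of this 15% / 80-10-10 corruption scheme, using standard MLM conventions (label -100 marks positions that are not predicted); the function name and the omission of special-token handling are simplifications:

```python
import random

def mask_tokens_for_mvlm(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Corrupt a token-id sequence for masked visual-language modeling (sketch)."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                                     # ~85% of tokens are left untouched
        labels[i] = tok                                  # this position is predicted with cross-entropy
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = mask_id                       # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels
```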
Key Point Location
- KPL equally divides the entire layout into several regions (we set 7×7=49 regions by default) and randomly masks some of the input bounding boxes. The model is required to predict, with separate heads, which regions the key points (top-left corner, bottom-right corner, and center point) of each box belong to (see the sketch after this list).
- We mask 15% of the boxes, among which 80% are replaced by (0,0,0,0,0,0), 10% are replaced by random boxes sampled from the same batch, and 10% remain the same. Cross-entropy loss is adopted.
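A sketch of how the KPL classification targets can be derived for one box, assuming coordinates already discretized to [0, 1000] as in the layout embedding; the function name and the row-major region indexing are illustrative choices:

```python
def kpl_region_targets(box, grid=7, scale=1000):
    """Map one (x0, y0, x1, y1) box to the 7x7-grid region indices of its
    top-left corner, bottom-right corner, and center point (sketch)."""
    def region_of(x, y):
        col = min(x * grid // (scale + 1), grid - 1)
        row = min(y * grid // (scale + 1), grid - 1)
        return row * grid + col                          # region index in [0, grid * grid)

    x0, y0, x1, y1 = box
    return (region_of(x0, y0),                           # top-left corner
            region_of(x1, y1),                           # bottom-right corner
            region_of((x0 + x1) // 2, (y0 + y1) // 2))   # center point
```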
Cross-modal Alignment Identification
- We collect the encoded features of the token-box pairs that are masked and further replaced (misaligned) or kept unchanged (aligned) by MVLM and KPL, and build an additional head on top of them to identify whether each pair is aligned.
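A sketch of such an alignment head over the concatenated text and layout output features of the selected token-box pairs; the two-layer MLP shape, the hidden sizes, and the label convention are assumptions, only the binary aligned/misaligned target comes from the description above:

```python
import torch
import torch.nn as nn

class CAIHead(nn.Module):
    """Binary alignment classifier over selected token-box pair features (sketch)."""

    def __init__(self, text_hidden=768, layout_hidden=192):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_hidden + layout_hidden, 256),
            nn.GELU(),
            nn.Linear(256, 2),                       # assumed labels: 0 = misaligned, 1 = aligned
        )

    def forward(self, text_feats, layout_feats, pair_index):
        # text_feats/layout_feats: (batch, seq_len, hidden) model outputs;
        # pair_index: positions of the token-box pairs selected by MVLM/KPL.
        pooled = torch.cat([text_feats[:, pair_index],
                            layout_feats[:, pair_index]], dim=-1)
        return self.classifier(pooled)               # (batch, num_pairs, 2) logits for cross-entropy
```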