The DocBank Dataset

DocBank is a new large-scale dataset constructed using a weak supervision approach. It enables models to integrate both textual and layout information for downstream tasks. The current DocBank dataset includes 500K document pages in total: 400K for training, 50K for validation, and 50K for testing.

Here are example annotations from DocBank.

The colors of semantic structure labels are:

Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title

DocBank is a natural extension of the TableBank (repo, paper) dataset.
LayoutLM (repo, paper) is an effective pre-training method for text and layout, and it achieves the SOTA result on DocBank.
For more details, please refer to our GitHub page: https://github.com/doc-analysis/DocBank.

Download

To reduce the loss caused by download interruptions, we split "DocBank_500K_ori_img.zip" into 10 parts. After downloading all of them, use decompression software to extract them together.
File                           Size                        md5sum
DocBank_500K_txt.zip           3,167,771,976B (2.95GB)     f1e37183d43709b44b334385684fc343
DocBank_500K_ori_img.zip       50,907,670,187B (47.4GB)    -
  parts: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
MSCOCO_Format_Annotation.zip   208,973,824B (199MB)        02b77f1eed22a576bd0eef660823d511
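
The md5sum values listed for the downloads can be used to check that a file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5sum` and `verify`, the `EXPECTED_MD5` dictionary, and the assumption that the archives sit under their listed file names are illustrative and not part of the DocBank tooling.

```python
import hashlib

# Checksums copied from the download table above.
# No md5sum is listed for DocBank_500K_ori_img.zip, so it is omitted here.
EXPECTED_MD5 = {
    "DocBank_500K_txt.zip": "f1e37183d43709b44b334385684fc343",
    "MSCOCO_Format_Annotation.zip": "02b77f1eed22a576bd0eef660823d511",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5sum(path) == expected_hex.strip().lower()
```

For example, `verify("DocBank_500K_txt.zip", EXPECTED_MD5["DocBank_500K_txt.zip"])` should return `True` for an uncorrupted download. Reading in chunks keeps memory use constant even for multi-gigabyte archives.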

Annotation Format

Each line contains a token followed by this information about it:

  • bounding box ((x0, y0), (x1, y1)) -> (x0, y0, x1, y1)
  • color (R, G, B)
  • font
  • label
Index    0      1   2   3   4   5  6  7  8          9
Content  token  x0  y0  x1  y1  R  G  B  font name  label
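
The ten-field layout above can be read with a few lines of Python. This is a minimal sketch, not the official loader: it assumes the fields are tab-separated and that the coordinates and color channels are integers, neither of which is stated above. The names `Token`, `parse_line`, and `parse_lines` are hypothetical, and the sample values in the usage note are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1)
    color: Tuple[int, int, int]      # (R, G, B)
    font: str
    label: str


def parse_line(line: str, sep: str = "\t") -> Token:
    """Parse one annotation line into a Token.

    Assumption: fields are separated by `sep` (tab by default) and
    indices 1-7 hold integers.
    """
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    x0, y0, x1, y1, r, g, b = (int(v) for v in fields[1:8])
    return Token(fields[0], (x0, y0, x1, y1), (r, g, b), fields[8], fields[9])


def parse_lines(lines: Iterable[str], sep: str = "\t") -> Iterator[Token]:
    """Yield a Token for every non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_line(line, sep)
```

With a made-up line such as `"Hello\t10\t20\t30\t40\t0\t0\t0\tCMR10\tparagraph"`, `parse_line` yields a `Token` whose `box` is `(10, 20, 30, 40)` and whose `label` is `"paragraph"`. If the actual files use a different separator, pass it through the `sep` argument.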

Citation

If you use this dataset, please cite our paper:

DocBank: A Benchmark Dataset for Document Layout Analysis

Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, Ming Zhou

BibTeX format:

@misc{li2020docbank,
    title={DocBank: A Benchmark Dataset for Document Layout Analysis},
    author={Minghao Li and Yiheng Xu and Lei Cui and Shaohan Huang and Furu Wei and Zhoujun Li and Ming Zhou},
    year={2020},
    eprint={2006.01038},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}