The TableBank Dataset

TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet, contains 417K high-quality labeled tables.

Here are Example annotations of the TableBank.

Table Detection Task

A new benchmark dataset DocBank (repo, paper) is now available for document layout analysis.
Our paper has been accepted in LREC 2020.
For more details, please refer to our GitHub page: https://github.com/doc-analysis/TableBank.

Statistics on Train/Val/Test sets

Table Detection

Source Train Val Test
Latex 187199 7265 5719
Word 73383 2735 2281
Total 260582 10000 8000

Table Structure Recognition

Source Train Val Test
Latex 79486 6075 3036
Word 50977 3925 1964
Total 130463 10000 5000

Download

In order to reduce the loss caused by download interruption, we divided "TableBank.zip" into 5 parts, and after downloading all of them, use the decompression software to decompress them together.
File Size md5sum
TableBank.zip
[1] [2] [3] [4] [5]
24,897,840,399B (23.1GB) -

Annotation Format

The annotation of the Table Detection task uses the format of the MS COCO dataset. For specific format information, please refer to the website: https://cocodataset.org/#format-data. Besides, our data annotations can be loaded through COCO API.

The annotation of the Table Recognition task is HTML tag sequences. The tags are <tabular>, </tabular>, <thead>, </thead>, <tbody>, </tbody>, <tr>, </tr>, <td>, </td>, <tdy>, <tdn>.

Citation

If you use this dataset, please cite our paper:

TableBank: A Benchmark Dataset for Table Detection and Recognition

Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li

Bibtex format:

@misc{li2019tablebank,
    title={TableBank: A Benchmark Dataset for Table Detection and Recognition},
    author={Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou and Zhoujun Li},
    year={2019},
    eprint={1903.01949},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}