TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. It contains 417K high-quality labeled tables.
Here are example annotations from TableBank.
A new benchmark dataset, DocBank (repo, paper), is now available for document layout analysis.
Our paper has been accepted at LREC 2020.
For more details, please refer to our GitHub page: https://github.com/doc-analysis/TableBank.
Table Detection:

Source | Train | Val | Test |
---|---|---|---|
Latex | 187199 | 7265 | 5719 |
Word | 73383 | 2735 | 2281 |
Total | 260582 | 10000 | 8000 |

Table Recognition:

Source | Train | Val | Test |
---|---|---|---|
Latex | 79486 | 6075 | 3036 |
Word | 50977 | 3925 | 1964 |
Total | 130463 | 10000 | 5000 |

File | Size | md5sum |
---|---|---|
TableBank.zip [1] [2] [3] [4] [5] | 24,897,840,399B (23.1GB) | - |
The annotations for the Table Detection task use the MS COCO dataset format. For details of the format, please refer to https://cocodataset.org/#format-data. The annotations can also be loaded through the COCO API.
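As a rough illustration of what the COCO detection format looks like in practice, the sketch below groups bounding boxes by image using only the standard library. The `SAMPLE` dictionary, its file names, and the `boxes_by_image` helper are made up for this example and are not taken from TableBank; the only thing assumed about the real annotation files is that they follow the standard COCO layout (`images`, `annotations`, `categories`, with `bbox` stored as `[x, y, width, height]`).

```python
import json

# Hand-written sample in COCO detection format (hypothetical, not real TableBank data).
SAMPLE = {
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 596, "height": 842},
        {"id": 2, "file_name": "page_0002.jpg", "width": 596, "height": 842},
    ],
    "categories": [{"id": 1, "name": "table"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [85, 120, 430, 210], "area": 90300, "iscrowd": 0},
        {"id": 11, "image_id": 1, "category_id": 1,
         "bbox": [85, 400, 430, 150], "area": 64500, "iscrowd": 0},
    ],
}


def boxes_by_image(coco):
    """Map each image file name to its boxes as (x_min, y_min, x_max, y_max)."""
    names = {img["id"]: img["file_name"] for img in coco["images"]}
    result = {name: [] for name in names.values()}
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO stores [x, y, width, height]
        result[names[ann["image_id"]]].append((x, y, x + w, y + h))
    return result


# Round-trip through JSON to mimic reading an annotation file from disk.
coco = json.loads(json.dumps(SAMPLE))
tables = boxes_by_image(coco)
print(tables["page_0001.jpg"])
```

With a real annotation file, `json.load` on the opened file would replace the round-trip above. If `pycocotools` is installed, the same information should be reachable through `COCO(annotation_file)` together with `getAnnIds(imgIds=...)` and `loadAnns(...)`.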
The annotations for the Table Recognition task are HTML tag sequences. The tags are <tabular>, </tabular>, <thead>, </thead>, <tbody>, </tbody>, <tr>, </tr>, <td>, </td>, <tdy>, <tdn>.
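To make the tag-sequence format more concrete, here is a minimal sketch that turns such a sequence into a row-by-row grid. It rests on two assumptions that should be checked against the actual annotation files: that tags are separated by whitespace, and that `<tdy>` marks a cell with content while `<tdn>` marks an empty cell. The `rows_from_tags` helper and the example sequence are hypothetical.

```python
# The tag vocabulary listed in this README.
KNOWN_TAGS = {
    "<tabular>", "</tabular>", "<thead>", "</thead>", "<tbody>", "</tbody>",
    "<tr>", "</tr>", "<td>", "</td>", "<tdy>", "<tdn>",
}


def rows_from_tags(sequence):
    """Return one list per row: True for <tdy> (assumed non-empty), False for <tdn>."""
    rows = []
    current = None  # cells of the row being read, or None when outside a row
    for token in sequence.split():
        if token not in KNOWN_TAGS:
            raise ValueError(f"unknown tag: {token}")
        if token == "<tr>":
            current = []
        elif token == "</tr>":
            if current is None:
                raise ValueError("</tr> without a matching <tr>")
            rows.append(current)
            current = None
        elif token in ("<tdy>", "<tdn>"):
            if current is None:
                raise ValueError("cell tag outside of a row")
            current.append(token == "<tdy>")
    return rows


# Made-up example: a header row with three filled cells and one body row
# whose middle cell is empty.
example = (
    "<tabular> <thead> <tr> <tdy> <tdy> <tdy> </tr> </thead> "
    "<tbody> <tr> <tdy> <tdn> <tdy> </tr> </tbody> </tabular>"
)
grid = rows_from_tags(example)
print(grid)
```

A grid like this makes it easy to compare the number of rows and columns between a predicted and a reference sequence, though it ignores the `<thead>`/`<tbody>` distinction.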
If you use this dataset, please cite our paper:
TableBank: A Benchmark Dataset for Table Detection and Recognition
Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li
BibTeX format:
@misc{li2019tablebank,
title={TableBank: A Benchmark Dataset for Table Detection and Recognition},
author={Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou and Zhoujun Li},
year={2019},
eprint={1903.01949},
archivePrefix={arXiv},
primaryClass={cs.CV}
}