Semi-Automated Ground-Truth Data Collection and Annotation for Journal Figure Analysis.
Open-i is an online service provided by the National Library of Medicine that enables search and retrieval of abstracts and images from 1.2 million PubMed Central® articles. An important preprocessing step in building the Open-i backend is to automatically segment figures into panels and recognize panel labels, so that the captions of individual panels can be linked to the corresponding panels and more precise features can be extracted. Existing panel segmentation and label recognition algorithms [1,2,3] were developed on a small set of 448 figures. Due to the lack of training samples, these algorithms rely on many hand-crafted rules, which cannot accommodate the large variation among the figures that need to be processed.
This project creates a workflow pipeline intended to collect a significantly larger ground-truth annotated figure dataset efficiently. The annotation of a figure includes the style of the figure (single-panel, multi-panel, or stitched multi-panel), rectangular bounding boxes of the panels, rectangular bounding boxes of the panel labels, and the panel label text. The workflow starts by running automated methods, after which the automated annotations are reviewed and corrected by human annotators. To ensure annotation quality, a verification algorithm checks the consistency of the annotations, and the annotators then review the suspicious annotations it reports.
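To make the annotation schema and the verification step concrete, the sketch below shows one plausible way such a consistency check could work: each panel-label bounding box should fall inside some panel bounding box, and labels should not repeat within a figure. The data structures and rules here are illustrative assumptions, not the project's actual schema or verification algorithm.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle: top-left corner plus width and height."""
    x: int
    y: int
    w: int
    h: int

    def contains(self, other: "Box") -> bool:
        # True if `other` lies entirely inside this box.
        return (other.x >= self.x and other.y >= self.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


@dataclass
class FigureAnnotation:
    style: str                          # "single-panel", "multi-panel", or "stitched multi-panel"
    panels: list[Box]                   # one bounding box per panel
    labels: list[tuple[str, Box]]       # (label text, label bounding box) pairs


def verify(ann: FigureAnnotation) -> list[str]:
    """Return a list of human-readable consistency issues (empty if none found)."""
    issues = []
    if ann.style == "single-panel" and len(ann.panels) != 1:
        issues.append("single-panel figure should have exactly one panel box")
    for text, box in ann.labels:
        if not any(p.contains(box) for p in ann.panels):
            issues.append(f"label {text!r} lies outside every panel box")
    texts = [t for t, _ in ann.labels]
    if len(set(texts)) != len(texts):
        issues.append("duplicate panel labels")
    return issues
```

Annotations that fail any of these checks would be flagged as suspicious and routed back to the human reviewers, while clean annotations pass through without further review.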