Check Tags MeSH Terms Indexing Research Project.

Le D, Mork J

A report to the Applied Clinical Informatics Branch.


MEDLINE® is the largest bibliographic database of life sciences and biomedical information, created and maintained by the National Library of Medicine (NLM). The database contains over 30 million citations, which NLM indexes using about 30,000 Medical Subject Headings (MeSH®) terms. Among these, a small subset of the 40 most frequently indexed MeSH terms, known as Check Tags, identifies attributes mentioned in almost every article: age groups, human or animal subjects, male or female, historical periods, and pregnancy. This report describes an ongoing effort at the NLM to automate the indexing of the 40 Check Tag MeSH terms (CTMTs) from the titles and abstracts of MEDLINE citations, using techniques from Deep Learning, ensemble Random Forest Bagging machine learning, and Natural Language Processing.
For many years, MeSH indexing for MEDLINE was done mostly by highly trained human indexers, who read the full text of journal articles and assigned appropriate MeSH terms to them. In April 2022, NLM decided to fully automate indexing for all journals indexed for MEDLINE. The automated indexing of MEDLINE citations “provides users with timely access to MeSH indexing metadata and allow NLM to scale MeSH indexing for MEDLINE to the increasing volume of published biomedical literature” [1]. A recent NLM report on MeSH indexing for MEDLINE showed that the automated system had resolved the backlog of citations awaiting indexing, reduced the cost of indexing, and can add MeSH indexing to articles within 24 hours.
A distinctive feature of searching with MeSH terms is that users can find all articles related to a term's concept, regardless of the specific words used in the articles. This differs from Internet search engines such as Google, Microsoft Bing, or Yahoo, which search by matching the literal words of the query.
Automated indexing of MEDLINE citations with MeSH terms is a challenging multi-label classification problem because of the large number of labels (MeSH Headings) and the highly imbalanced datasets. A further challenge is the difference in input data: NLM human indexers have access to the full text, while automated indexing methods use only the title and abstract [2]. Several studies addressing these challenges have been reported in the Natural Language Processing (NLP) literature. For example, the well-known Medical Text Indexer (MTI) [3], a machine learning and rule-based automated indexing system developed by NLM, processes an article's title and abstract and recommends MeSH terms to human indexers. MetaLabeler [4] applied the MetaLabeler algorithm proposed by Tang et al. [5] to the MeSH indexing challenge. MeSH Now [6], MeSHLabeler [7], and DeepMeSH [8] incorporated learning-to-rank approaches to improve the results of automatic MeSH indexing. More recently, with the growing popularity of Deep Learning, AttentionMeSH [9], “Convolutional Neural Network for Automatic MeSH Indexing” [10], and MeSHProbeNet [11] applied Deep Learning neural network multi-label classification to automatic MeSH indexing. In this manuscript, we describe a project to index the 40 Check Tag MeSH terms using Deep Learning neural network multi-label classifiers, followed by Random Forest Bagging machine learning classifiers that combine the predictions of multiple neural network classifiers to improve the system's predictive performance. For this work the number of labels is only 40 CTMTs, but the other two challenges remain: the distribution of the CTMTs is highly imbalanced, and the automated indexing methods can access only titles and abstracts.
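To make the multi-label framing concrete, the following minimal Python sketch encodes each citation's Check Tags as a binary indicator vector and reports how often each tag is positive. The four tag names are real Check Tags, but the `CHECK_TAGS` subset, the tiny citation list, and the helper names are hypothetical illustrations, not the project's data or code.

```python
from collections import Counter

# Hypothetical subset of the 40 Check Tags, for illustration only.
CHECK_TAGS = ["Humans", "Female", "Male", "Pregnancy"]

# Made-up gold annotations for five citations (not real MEDLINE data).
citations = [
    {"Humans", "Female", "Male"},
    {"Humans", "Male"},
    {"Humans", "Female", "Pregnancy"},
    {"Humans"},
    set(),
]

def to_indicator(tags):
    """Encode a set of tags as a 0/1 vector aligned with CHECK_TAGS."""
    return [1 if tag in tags else 0 for tag in CHECK_TAGS]

def positive_rates(label_sets):
    """Return the fraction of citations that carry each tag."""
    counts = Counter(tag for tags in label_sets for tag in tags)
    return {tag: counts[tag] / len(label_sets) for tag in CHECK_TAGS}

# One indicator vector per citation: the multi-label training target.
targets = [to_indicator(tags) for tags in citations]

# Per-label positive rate: a quick view of how imbalanced the labels are.
rates = positive_rates(citations)
```

Even in this toy data, "Humans" is positive for most citations while "Pregnancy" is rare, which illustrates the kind of label imbalance a multi-label classifier has to cope with.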
The features used for the Deep Learning neural networks combine open-source document and sentence embedding vectors with vectors customized for this project. The open-source embeddings include Universal Sentence Encoder vectors, Sentence Transformers embedding vectors, and Biomedical Sentence Embeddings vectors. The customized vectors consist of MeSH entry term-based vectors and word-based dictionary vectors. All vectors are generated from the titles and abstracts of MEDLINE documents. Experiments conducted on several million MEDLINE citations show that our proposed approach, a two-level chained method of Deep Learning neural network classifiers followed by Random Forest Bagging machine learning classifiers, achieves competitive performance, with 86.0% precision, 81.1% recall, and an 83.5% F1-score. Note that in this manuscript the words "documents" and "articles" are used interchangeably.
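The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. A small Python check follows; the function name `f1_score` is illustrative, and the inputs are simply the precision and recall figures quoted above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported figures: 86.0% precision and 81.1% recall.
f1 = f1_score(0.860, 0.811)
print(f"{f1:.1%}")  # prints 83.5%
```

The result agrees with the 83.5% F1-score stated in the text.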
