Automated Indexing Research

Project information
Researchers: 

The Indexing Initiative (II) project investigates language-based and machine learning methods for the automatic selection of subject headings for use in both semi-automated and fully automated indexing environments at NLM. Its major goal is to facilitate the retrieval of biomedical information from textual databases such as MEDLINE.

Team members have developed an indexing system, Medical Text Indexer (MTI), based on two fundamental indexing methodologies. The first calls on the MetaMap program to map citation text to concepts in the UMLS Metathesaurus, which are then restricted to MeSH headings. The second uses the MeSH headings from PubMed related articles, which are precomputed by PubMed. Results from the two basic methods are combined into a ranked list of recommended indexing terms, incorporating aspects of MEDLINE indexing policy in the process. MTI is used by NLM Indexers, Cataloging, and the NLM History of Medicine book collection. Recently MTI also became the first-line indexer for a set of 23 journals.

The II team worked closely with an NLM Associate Fellow whose ongoing project was designed to investigate the feasibility of automating the creation of functional annotations about genes, known as Gene Reference into Function (geneRIF). We have developed a prototype, the Gene Indexing Assistant (GIA), and integrated it into the Data Creation Management System used by the Indexers for testing and evaluation.

MetaMap is a critical component of the MTI system and is used worldwide in bioinformatics research. It is also one of the NLM resources integrated into IBM’s Watson system for healthcare applications. Recent work has improved processing speed significantly, added XML (eXtensible Markup Language) output, implemented negation identification, and enabled users to supply their own list of acronyms and abbreviations. MetaMap is available on Windows, Macintosh and Linux platforms. Users can build their own data sets with the MetaMap Data File Builder and access their local version of MetaMap via either an embedded Java API (Application Programming Interface) or a UIMA (Unstructured Information Management Architecture) wrapper.

In the Word Sense Disambiguation (WSD) approach, the context words surrounding an ambiguous word are compared to a profile built from each of the UMLS concepts linked to that word. This approach has previously been used in the biomedical domain with the NLM WSD corpus. A concept profile vector has as its dimensions the tokens obtained from the concept's definition or definitions (if available), its synonyms, and its related concepts, excluding siblings. Stop words are discarded, and Porter stemming is used to normalize the tokens. In addition, token frequency is normalized by the inverted concept frequency, so that tokens repeated many times within the UMLS carry less weight. The context vector for an ambiguous term holds the term frequencies of the surrounding words; stop words are removed and the Porter stemmer is applied in the same way. Word order is lost in the conversion. The profile vectors of the candidate concepts linked to an ambiguous word are compared to the context vector using cosine similarity, and the concept with the highest cosine similarity is selected.
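
The profile comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the II team's implementation: the stop-word list, the concept identifiers, and the concept texts are invented for the example; the inverted concept frequency is computed over the supplied candidate profiles only, rather than over the whole UMLS; and the Porter stemming step is omitted to keep the block dependency-free (a stemmer such as NLTK's `PorterStemmer` could be applied inside `tokenize`).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the source does not name one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
              "is", "was", "had", "for", "with", "by"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words.

    Stemming is omitted here; the approach described above applies
    Porter stemming at this point.
    """
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_profiles(concept_texts):
    """Map each concept id to a weighted token vector.

    Weight = token frequency * log(N / concept frequency), where the
    concept frequency is the number of profiles containing the token.
    """
    counts = {cid: Counter(tokenize(text)) for cid, text in concept_texts.items()}
    n_concepts = len(counts)
    # Iterating a Counter yields each distinct token once per concept.
    concept_freq = Counter(tok for c in counts.values() for tok in c)
    return {
        cid: {tok: tf * math.log(n_concepts / concept_freq[tok]) for tok, tf in c.items()}
        for cid, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def disambiguate(context, profiles):
    """Return the concept id whose profile is most similar to the context."""
    context_vector = Counter(tokenize(context))  # bag of words: order is lost
    return max(profiles, key=lambda cid: cosine(context_vector, profiles[cid]))


if __name__ == "__main__":
    # Hypothetical candidate concepts for the ambiguous word "cold".
    profiles = build_profiles({
        "common_cold": "viral infection of the nose and throat causing cough and sneezing",
        "cold_temperature": "low temperature sensation absence of heat",
    })
    print(disambiguate("the patient had a cold with cough and sneezing", profiles))
```

In this toy setup the context shares the tokens "cough" and "sneezing" only with the first profile, so that concept is selected. Tokens that occur in every candidate profile receive a weight of zero, which mirrors the intent of down-weighting tokens that are widespread across concepts.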

Publications/Tools: 
Jimeno-Yepes A, Wilkowski B, Mork JG, Demner-Fushman D, Aronson AR. MeSH indexing: machine learning and lessons learned. ACM SIGHIT International Health Informatics Symposium, Miami, FL, USA, 2012.
Jimeno-Yepes A, Mork J, Demner-Fushman D, Aronson AR. Automatic algorithm selection for MeSH Heading indexing based on meta-learning. International Symposium on Languages in Biology and Medicine, Singapore, December, 2011.
Jimeno-Yepes A, Aronson AR. Self-training and co-training in biomedical word sense disambiguation. BioNLP 2011 Workshop, June 2011, 182-183.
Zhang X, Zou J, Le DX, Thoma GR. A structural SVM approach for reference parsing. BMC Bioinformatics. 2011 Jun 9;12 Suppl 3:S7. doi: 10.1186/1471-2105-12-S3-S7.
Jimeno-Yepes A, McInnes BT, Aronson AR. Collocation analysis for UMLS knowledge-based word sense disambiguation. BMC Bioinformatics. 2011 Jun 9;12 Suppl 3:S4. doi: 10.1186/1471-2105-12-S3-S4.
Mork J, Peters L, Jimeno-Yepes A, Aronson AR, Bodenreider O. MetaMap in the CALBC Workshop II. CALBC Workshop II, March 2011.
Jimeno-Yepes A, Wilkowski B, Mork J, van Lenten E, Demner-Fushman D, Aronson AR. A bottom-up approach to MEDLINE indexing recommendations. AMIA Annu Symp Proc. 2011;2011:1583-92. Epub 2011 Oct 22.
Jimeno-Yepes A, Aronson AR. Knowledge-based biomedical word sense disambiguation: comparison of approaches. BMC Bioinformatics. 2010 Nov 22;11:569. doi: 10.1186/1471-2105-11-569.
Mork JG, Aronson AR. The Medical Text Indexer (MTI) system for indexing biomedical literature. book chapter submitted to monograph Indexing Specialities: Medicine. Edited by L. Pilar Wyman. Medford, NJ: Information Today, Inc., in association with the American Society of Indexers, Phoenix, AZ. 2010.
Mork JG, Bodenreider O, Demner-Fushman D, Dogan RI, Lang FM, Lu Z, Névéol A, Peters L, Shooshan SE, Aronson AR. Extracting Rx information from clinical narrative. J Am Med Inform Assoc. 2010 Sep-Oct;17(5):536-9. doi: 10.1136/jamia.2010.003970.
