The SPECIALIST Lexicon

Lexicon, UTF-8, XML, ASCII, 2024 Release:

The SPECIALIST Lexicon is a large syntactic lexicon of biomedical and general English, developed to provide the lexical information needed by the SPECIALIST Natural Language Processing (NLP) System, which includes SemRep, MetaMap, and the Lexical Tools. It is intended to be a general English lexicon that includes many biomedical terms. Coverage includes both commonly occurring English words and biomedical vocabulary drawn from a variety of sources. These include, but are not limited to, MEDLINE citation records, terms in Dorland's Illustrated Medical Dictionary, the 10,000 most frequent words listed in the American Heritage Word Frequency Book, the 2,000 lexical items used in the controlled definitions of Longman's Dictionary of Contemporary English, and words in WordNet. The lexicon entry for each lexical item (word or term) records the syntactic, morphological (inflection and derivation), and orthographic (spelling variants) information needed by the SPECIALIST NLP System.
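As a rough illustration of the kind of information an entry records, the sketch below parses a record written in a brace-delimited `key=value` layout. The sample record, its field names (`base`, `spelling_variant`, `entry`, `cat`, `variants`), the placeholder identifier `E0000000`, and the helper `parse_records` are all assumptions made for illustration, not text copied from a release; the release documentation remains the authority on the actual unit lexical record format.

```python
def parse_records(text):
    """Parse brace-delimited key=value records into a list of dicts.

    Each record maps a field name to the list of values seen for it,
    since a field such as a spelling variant may repeat.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            # An opening brace starts a new record; a field may follow it.
            current = {}
            line = line[1:]
        if line == "}":
            # A closing brace ends the current record.
            if current is not None:
                records.append(current)
            current = None
            continue
        if current is None or "=" not in line:
            continue
        key, value = line.split("=", 1)
        current.setdefault(key, []).append(value)
    return records


# Hypothetical sample record (illustrative values only).
SAMPLE = """\
{base=anaesthetic
spelling_variant=anesthetic
entry=E0000000
    cat=noun
    variants=reg
}
"""

records = parse_records(SAMPLE)
for record in records:
    print(record["base"][0], record["cat"][0], record.get("spelling_variant", []))
```

Storing every field as a list keeps repeated fields (for example, several spelling variants) without special cases, at the cost of indexing `[0]` for single-valued fields.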

The SPECIALIST Lexicon (a file of unit lexical records), along with its relational files, has been released annually as one of the UMLS Knowledge Sources since 1994. In addition to its distribution with the UMLS, it is available as an open source resource subject to these terms and conditions. Numbers and number words, including cardinals, ordinals, and fractions, have been included in the Lexicon release since 2003. The XML format of the unit lexical record first became available in 2003 through LexAccess. The Lexicon migrated to Unicode and has been released in UTF-8 format since 2006. In addition, XML schemas and JAXB (Java Architecture for XML Binding) APIs are released. In 2009, a pure ASCII file, LEXICON.ascii, was added to the annual release for NLP projects that work only with ASCII. In 2013, all derivations in the Lexicon (including zeroD, suffixD, and prefixD), along with negation information, were added to the annual release (derivation.data, DM.DB) using a systematic methodology. In 2017, a new system was developed to add all synonymous terms in the Lexicon (lexSynonyms) to the synonym database file (SM.DB). In 2022, antonyms in the Lexicon were added to the antonym database file (AM.DB).

The SPECIALIST Lexical Tools use the SPECIALIST Lexicon data to provide a comprehensive toolset and Java APIs for fundamental NLP functions, including retrieval of syntactic categories, inflectional variants, spelling variants, abbreviations, acronyms, derivational variants, synonyms, and antonyms, as well as normalization, Unicode-to-ASCII conversion, tokenization, and stopword removal.
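To make three of these functions concrete (Unicode-to-ASCII conversion, tokenization, and stopword removal), here is a toy sketch built only on the Python standard library. It is not the Lexical Tools implementation and does not use their Java APIs: the function names, the regular expression, and the stopword list are assumptions chosen for illustration, and the real tools rely on Lexicon data and handle many more cases.

```python
import re
import unicodedata

# Illustrative stopword list (an assumption, not the list used by the Lexical Tools).
STOPWORDS = frozenset({"of", "the", "and", "with", "for", "to", "by", "on", "in", "at"})


def to_ascii(text):
    """Strip diacritics by decomposing characters and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simple_normalize(term):
    """Chain the steps: ASCII conversion, tokenization, stopword removal."""
    tokens = [t for t in tokenize(to_ascii(term)) if t not in STOPWORDS]
    return " ".join(tokens)


print(simple_normalize("Naïve T-cells of the thymus"))
```

Note that dropping non-ASCII code points after decomposition silently discards characters that have no ASCII base form (Greek letters, for example); a fuller conversion would map such characters explicitly.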