LMW Candidate Generation from WordNet
I. Introduction
WordNet is used to enhance the Lexicon with multiwords, derivations, synonyms, and antonyms. WordNet 3.0 and JWI (Java WordNet Interface) are used for this development.
II. Models
Several models are implemented in ${LMW_DIR}/WordNetMw/*.java
File | word count | Description |
---|---|---|
Words from WordNet | | |
WnWords.data.3.0 | 156,584 | Words from synset, root, unique in spelling and POS |
WnIndexWords.data.3.0 | 155,287 | |
Derivations from WordNet | | |
WnDPairs.data.3.0 | 42,475 | derivations |
Synonyms from WordNet | | |
WnSPairs.data.3.0 | 315,312 | |
Antonyms from WordNet | | |
WnAPairs.data.3.0 | 12,248 | |
Step | algorithm | Out file |
---|---|---|
words from WordNet | WnWords.data | |
0 | lexicon filter: filter out words that are already in the Lexicon | |
1 | general filters: filter out invalid words: pipe, punc, digit, number, stopword | |
2 | pattern filters: filter out invalid words: parAcr, indArt, colon, disChar, disPunc, incomplete, measure | |
3 | single words: filter out decade, ordinal, Roman, no CUI | |
4 | multiwords: filter out Ilt, Iet, Let, Vlt, Vet, no CUI | |
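The filtering steps above run in sequence, and a candidate survives only if every step accepts it. The sketch below illustrates that chaining for steps 0 and 1 only. It is not the LMW implementation: the class name `CandidateFilterSketch`, the method names, and the filter definitions (what counts as pipe, punc, digit, number, or stopword) are simplified assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative sketch only: all names and filter definitions here are
// assumptions, not the rules used by the code in WordNetMw/*.java.
final class CandidateFilterSketch {

    // Step 0 (assumed): accept a term only if its lowercased form is not in the Lexicon.
    static Predicate<String> notInLexicon(Set<String> lexiconLowercased) {
        return term -> !lexiconLowercased.contains(term.toLowerCase(Locale.ROOT));
    }

    // Step 1 (assumed definitions): reject terms that contain a pipe,
    // terms with no letters at all (pure punctuation, digits, numbers),
    // and terms that are stopwords.
    static Predicate<String> generalFilter(Set<String> stopwordsLowercased) {
        return term -> !term.contains("|")
                && term.chars().anyMatch(Character::isLetter)
                && !stopwordsLowercased.contains(term.toLowerCase(Locale.ROOT));
    }

    // Apply the filters in order; stop at the first filter that rejects a term.
    static List<String> apply(List<String> terms, List<Predicate<String>> filters) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            boolean accepted = true;
            for (Predicate<String> filter : filters) {
                if (!filter.test(term)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

Later steps (pattern filters, single-word filters, multiword filters) would be added to the same list as further predicates, in the order given in the table.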
Step | algorithm | Out file |
---|---|---|
words from WordNet | WnWords.data | |
1 | | verbComplement.data (1,708) |
Step | algorithm | Out file |
---|---|---|
0 | Categorize dPairs to zeroD, suffixD, prefixD, Others | |
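A minimal sketch of the categorization in step 0 follows. The document does not define the four categories, so the rules here are assumptions: zeroD means the two spellings are identical when lowercased, prefixD means the longer word ends with the shorter one, suffixD means the two words share a leading stem of at least `MIN_SHARED_STEM` characters (a hypothetical threshold), and everything else is Others. The class, enum, and method names are also hypothetical.

```java
import java.util.Locale;

// Illustrative sketch only: category rules and names are assumptions,
// not the definitions used by the LMW code.
final class DerivationCategorySketch {

    enum Category { ZERO_D, SUFFIX_D, PREFIX_D, OTHERS }

    // Hypothetical threshold: minimum shared leading characters for suffixD.
    static final int MIN_SHARED_STEM = 3;

    static Category categorize(String word1, String word2) {
        String a = word1.toLowerCase(Locale.ROOT);
        String b = word2.toLowerCase(Locale.ROOT);

        // zeroD (assumed): same spelling, so the pair differs only in POS.
        if (a.equals(b)) {
            return Category.ZERO_D;
        }

        String shorter = a.length() <= b.length() ? a : b;
        String longer = a.length() <= b.length() ? b : a;

        // prefixD (assumed): longer word = some prefix + shorter word.
        if (!shorter.isEmpty()
                && longer.length() > shorter.length()
                && longer.endsWith(shorter)) {
            return Category.PREFIX_D;
        }

        // suffixD (assumed): the words share a leading stem and differ at the end.
        if (sharedPrefixLength(a, b) >= MIN_SHARED_STEM) {
            return Category.SUFFIX_D;
        }

        return Category.OTHERS;
    }

    // Number of leading characters the two strings have in common.
    static int sharedPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }
}
```

Under these assumed rules, `categorize("happy", "unhappy")` yields `PREFIX_D` and `categorize("happy", "happiness")` yields `SUFFIX_D`.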
Zero derivations in WordNet are used to retrieve lexical multiword candidates. Precision is calculated by matching these candidates against valid multiwords on lowercased spelling, ignoring POS. The candidates generated by this model have high precision on valid multiwords (97.05%).
This study further categorized candidates into two groups: with and without UMLS CUIs (concept unique identifiers). The precisions are 100% and 95.53% for candidates with and without CUIs, respectively. Theoretically, all multiwords have meaning (mapped CUIs) by themselves. Interestingly, we observed no noticeable difference in precision between candidates with and without CUIs. Our inference is that the UMLS does not have complete concept coverage of all terms.
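The precision figures above compare candidates to valid multiwords by lowercased spelling without POS. A minimal sketch of that calculation is shown below, assuming both lists are available as plain strings. The class name `PrecisionSketch` and its methods are hypothetical, and the code does not reproduce the reported percentages, which depend on the actual candidate and validation files.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: names are hypothetical and the inputs are
// assumed to be plain term strings with POS already dropped.
final class PrecisionSketch {

    // Lowercase and deduplicate so that matching ignores case.
    static Set<String> normalize(Collection<String> terms) {
        Set<String> normalized = new HashSet<>();
        for (String term : terms) {
            normalized.add(term.toLowerCase(Locale.ROOT));
        }
        return normalized;
    }

    // precision = (unique candidates that are valid multiwords) / (unique candidates).
    // Returns NaN when there are no candidates.
    static double precision(Collection<String> candidates,
                            Collection<String> validMultiwords) {
        Set<String> uniqueCandidates = normalize(candidates);
        Set<String> valid = normalize(validMultiwords);
        if (uniqueCandidates.isEmpty()) {
            return Double.NaN;
        }
        long hits = uniqueCandidates.stream().filter(valid::contains).count();
        return (double) hits / uniqueCandidates.size();
    }
}
```

To compare the two groups, the candidate list would first be split into terms with and without a mapped CUI, and `precision` called once on each part.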
In conclusion,
Algorithm for this Model – multiwords from zero derivations in WordNet:
Candidate files:
Step | algorithm | Out file |
---|---|---|
0 | Categorize dPairs to zeroD, suffixD, prefixD, Others | |
TBD
TBD
III. Processes
${LMW_DIR}/bin/12.LexAbbAcrCand <YEAR>
IV. References