The SPECIALIST Lexicon

Evaluation on TtSet

We use the TtSet to derive and test criteria for the antonym generating models. Precision, recall, and F1 are used as metrics to measure performance. Retrieved instances are the aPairs produced by the antonym generating models, while relevant instances are the canonical aPairs annotated by linguists. The evaluations are described below.
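Given these definitions, the three metrics can be computed directly from the retrieved and relevant sets. A minimal sketch follows; the function and data names are ours for illustration, not from the evaluation pipeline:

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, and F1 for aPair retrieval.

    retrieved: set of aPairs produced by an antonym generating model
    relevant:  set of canonical aPairs annotated by linguists
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# aPairs are order-independent, so each pair is stored as a frozenset
retrieved = {frozenset(p) for p in [("increase", "decrease"),
                                    ("alive", "dead"), ("big", "large")]}
relevant = {frozenset(p) for p in [("increase", "decrease"),
                                   ("alive", "dead"), ("hot", "cold")]}
print(evaluate(retrieved, relevant))  # each metric is 2/3 here
```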

I. Derived criteria from training set and evaluation on test sets

There are 1000 aPairs collected in the TtSet. The TtSet was split randomly into an 80% training set and a 20% test set (process-52). As a result, the training and test sets include 799 and 201 aPair instances, respectively. We analyzed three properties, EUI, POS, and synonymy, on the training set (process-53). The results show:

  • All antonyms are in the Lexicon; that is, every antonym in the 799 aPairs has an EUI.
  • In 97.87% of the 799 aPairs, both antonyms have the same POS.
  • None (0.00%) of the antonym pairs are synonyms. This is consistent with the theory that the words in an antonym pair are similar in domain but opposite in polarity.

These three criteria were then evaluated on the test set. In this evaluation, aPairs were retrieved under four different criteria: must have EUI, must have the same POS, must not be synonyms, and the combination of all three (process-54). The results show that precision and F1 increased while recall was preserved by applying these criteria, as shown in the table below. We concluded that these three criteria are valid and that their combination should be used in antonym generating models.

Criteria                  | Precision | Recall | F1
None                      | 0.5124    | 1.0000 | 0.6775
1. Must have EUI          | 0.5150    | 1.0000 | 0.6799
2. Must have the same POS | 0.5337    | 1.0000 | 0.6959
3. Must not be synonyms   | 0.5124    | 1.0000 | 0.6776
Combination of 1, 2 & 3   | 0.5337    | 1.0000 | 0.6959
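The combined filter (criteria 1, 2, and 3) can be sketched as a predicate over a candidate pair. The lookup tables below are hypothetical miniature stand-ins for the Lexicon and a synonym resource; the EUI values are made up for illustration, not real SPECIALIST data:

```python
def passes_criteria(word1, word2, eui_of, pos_of, synonyms_of):
    """Keep a candidate aPair only if it satisfies all three criteria."""
    # Criterion 1: must have EUI (both words are Lexicon entries)
    if word1 not in eui_of or word2 not in eui_of:
        return False
    # Criterion 2: must have the same POS (share at least one part of speech)
    if not (pos_of[word1] & pos_of[word2]):
        return False
    # Criterion 3: must not be synonyms
    if word2 in synonyms_of.get(word1, set()):
        return False
    return True

# Hypothetical miniature resources (entries and EUIs are illustrative only)
eui_of = {"increase": "E0000001", "decrease": "E0000002",
          "big": "E0000003", "large": "E0000004"}
pos_of = {"increase": {"verb", "noun"}, "decrease": {"verb", "noun"},
          "big": {"adj"}, "large": {"adj"}}
synonyms_of = {"big": {"large"}, "large": {"big"}}

print(passes_criteria("increase", "decrease", eui_of, pos_of, synonyms_of))  # True
print(passes_criteria("big", "large", eui_of, pos_of, synonyms_of))          # False: synonyms
```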

II. Evaluation on instances with UMLS CUI

The scope of the antonym generation task is limited to concepts in the UMLS Metathesaurus because the Lexicon is one of the three major components supporting NLP research using the UMLS. Accordingly, one of the requirements is that antonyms must have valid CUIs. There are 545 aPairs in the TtSet that have CUIs. We conducted the same evaluation as above; the result is shown in the table below (process-55). The results confirm that these three criteria improve precision and F1 while preserving recall within the scope of our task (antonyms that have CUIs).

Criteria                  | Precision | Recall | F1
None                      | 0.4899    | 1.0000 | 0.6576
1. Must have EUI          | 0.4908    | 1.0000 | 0.6584
2. Must have the same POS | 0.4963    | 1.0000 | 0.6634
3. Must not be synonyms   | 0.4899    | 1.0000 | 0.6576
Combination of 1, 2 & 3   | 0.4963    | 1.0000 | 0.6634
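This CUI restriction amounts to a simple filter over the annotated set. A sketch with a hypothetical CUI lookup follows; the CUI values are placeholders, not real Metathesaurus identifiers:

```python
def restrict_to_umls(apairs, cui_of):
    """Keep only aPairs in which both antonyms map to a valid UMLS CUI."""
    return [pair for pair in apairs if all(word in cui_of for word in pair)]

# Placeholder CUI lookup for illustration only
cui_of = {"alive": "C0000001", "dead": "C0000002"}
print(restrict_to_umls([("alive", "dead"), ("foo", "bar")], cui_of))
# [('alive', 'dead')]
```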

III. Evaluation on instances with UMLS CUI on CC sources

Our goal is to find criteria that can be applied to the CC model to improve its performance. Thus, we shifted our focus to the instances in the TtSet that come from the CC source. There are 271 aPairs in the TtSet that have CUIs and are derived from CC. This set is used to evaluate criteria for the CC model.

We added a new criterion: aPairs must have the same STI (semantic type). These four criteria were evaluated, and the results are shown in the table below (process-56). The new criterion of requiring the same STI increases precision but lowers recall and F1. In practice, we applied all four criteria when generating antonym candidates from the CC model to increase precision.

Criteria                   | Precision | Recall | F1
None                       | 0.5129    | 1.0000 | 0.6780
1. Must have EUI           | 0.5148    | 1.0000 | 0.6797
2. Must have the same POS  | 0.5187    | 1.0000 | 0.6830
3. Must not be synonyms    | 0.5129    | 1.0000 | 0.6780
4. Must have the same STI  | 0.5497    | 0.7554 | 0.6364
Combination of 1, 2, 3 & 4 | 0.5556    | 0.7554 | 0.6402
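Criterion 4 can be expressed as one more predicate on top of the earlier filters. The semantic type assignments below are hypothetical examples, not actual UMLS Semantic Network data:

```python
def passes_sti(word1, word2, sti_of):
    """Criterion 4: both antonyms must share at least one semantic type (STI)."""
    return bool(sti_of.get(word1, set()) & sti_of.get(word2, set()))

# Hypothetical semantic type assignments for illustration
sti_of = {"alive": {"Organism Attribute"}, "dead": {"Organism Attribute"},
          "increase": {"Functional Concept"}}
print(passes_sti("alive", "dead", sti_of))      # True
print(passes_sti("increase", "alive", sti_of))  # False
```

Because a word with no recorded semantic type can never share one, this check trades recall for precision, which matches the drop in recall seen in the table above.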

IV. Evaluation of CC models (Co-occurrence in MEDLINE n-grams)

Antonyms often co-occur in corpora. The CC model uses this phenomenon to retrieve antonym candidates from a selected corpus. The MEDLINE n-gram set was used as the corpus for the collocates model. Because each antonym must be a single word, antonym collocates appear in 3-grams, 4-grams, and 5-grams. The table below shows examples of antonyms in the 3-grams, 4-grams, and 5-grams (process-57). A performance evaluation on N-grams (N = 3 to 5) was conducted on the TtSet because aPairs from the TtSet appear as collocates in the different N-grams (process-58), as shown in the last table. In general, the co-occurrence instances of 5-grams are a subset of those of 4-grams, and the co-occurrence instances of 4-grams are a subset of those of 3-grams. The MEDLINE 3-grams were chosen as the corpus in the CC model for the best recall and F1 performance.

Examples of antonym collocates in MEDLINE n-grams (frequency|n-gram):

aPair: increase|decrease
  3-grams:
    • 5934|increase or decrease
    • 1990|increase and decrease
    • 940|decrease or increase
    • 691|decrease and increase
    • 205|decrease with increase
    • ...
  4-grams:
    • 1662|increase or decrease in
    • 965|increase or decrease the
    • 775|an increase or decrease
    • 693|increase and decrease in
    • 662|to increase or decrease
    • ...
  5-grams:
    • 528|an increase or decrease in
    • 439|increase or decrease in the
    • 342|an increase or a decrease
    • 291|decrease with an increase in
    • 277|increase or a decrease in
    • ...

aPair: alive|dead
  3-grams:
    • 218|dead or alive
    • 198|alive or dead
    • 94|alive and dead
    • 45|dead and alive
    • ...
  4-grams: None
  5-grams: None

aPair: copy|original
  3-grams: None
  4-grams:
    • 417|copy of the original
  5-grams:
    • 387|copy of the original print
    • 387|scanned copy of the original
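Candidate retrieval in this style can be sketched as a scan over frequency|n-gram lines of the kind shown above. The helper and sample data below are illustrative, not the actual pipeline:

```python
def find_collocate_pairs(ngram_lines, candidate_pairs):
    """Count co-occurrences of candidate word pairs in frequency|n-gram lines.

    ngram_lines: iterable of strings like "5934|increase or decrease"
    candidate_pairs: iterable of (word1, word2) tuples; each word is a single token
    """
    counts = {}
    for line in ngram_lines:
        freq_str, _, text = line.partition("|")
        words = set(text.split())
        for pair in candidate_pairs:
            # Both words of the pair must appear in the same n-gram
            if pair[0] in words and pair[1] in words:
                counts[pair] = counts.get(pair, 0) + int(freq_str)
    return counts

lines = ["5934|increase or decrease", "1990|increase and decrease", "218|dead or alive"]
pairs = [("increase", "decrease"), ("alive", "dead")]
print(find_collocate_pairs(lines, pairs))
# {('increase', 'decrease'): 7924, ('alive', 'dead'): 218}
```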

Our antonym generation model includes the following sources:

  • lexical records with negative tag (LEX)
  • suffix derivations with negation (SD)
  • prefix derivations with negation (PD)
  • co-occurrence from a corpus (CC).

The CC model uses the MEDLINE 3-grams for better recall. In our test, 582 aPairs (58.2%) in the TtSet were not retrieved by our model. It is imperative to develop more models to achieve comprehensive coverage for antonym generation. In addition, we could use an existing antonym corpus with a semantic network (such as WordNet) to retrieve antonym candidates.

N-grams | Precision | Recall | F1
3-grams | 0.6029    | 0.4922 | 0.5419
4-grams | 0.6301    | 0.4258 | 0.5082
5-grams | 0.6544    | 0.3477 | 0.4547

Please see evaluation documents for more details.