The SPECIALIST Lexicon

Evaluation on TtSet

We use the TtSet to derive and test criteria for the antonym generating models. Precision, recall, and F1 are used as metrics to measure performance. Retrieved instances are the aPairs produced by the antonym generating models, while relevant instances are the canonical aPairs annotated by linguists. The evaluations are described below.
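Given these definitions, the three metrics can be computed directly from the retrieved and relevant sets. A minimal sketch follows; the function and data names are ours for illustration, not from the evaluation pipeline:

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, and F1 for aPair retrieval.

    retrieved: set of aPairs produced by an antonym generating model
    relevant:  set of canonical aPairs annotated by linguists
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# aPairs are order-independent, so each pair is stored as a frozenset
retrieved = {frozenset(p) for p in [("increase", "decrease"),
                                    ("alive", "dead"), ("big", "large")]}
relevant = {frozenset(p) for p in [("increase", "decrease"),
                                   ("alive", "dead"), ("hot", "cold")]}
print(evaluate(retrieved, relevant))  # each metric is 2/3 here
```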

I. Derived criteria from training set and evaluation on test sets

There are 1000 aPairs collected in the TtSet. The TtSet was split randomly into an 80% training set and a 20% test set (process-52). As a result, the training and test sets include 799 and 201 aPair instances, respectively. We analyzed three properties, EUI, POS, and synonymy, on the training set (process-53). The results show:

  • All antonyms are in the Lexicon; that is, every antonym in the 799 aPairs has an EUI.
  • In 97.87% of the 799 aPairs, both antonyms have the same POS.
  • None (0.00%) of the antonym pairs are synonyms. This is consistent with the theory that the words in an antonym pair are similar in domain but opposite in polarity.

These three criteria were then evaluated on the test set. In this evaluation, aPairs were retrieved under four different criteria: must have EUI, must have the same POS, must not be synonyms, and the combination of all three (process-54). The results show that precision and F1 increased while recall was preserved by applying these criteria, as shown in the table below. We concluded that these three criteria are valid and that their combination should be used in antonym generating models.

Criteria                  | Precision | Recall | F1
None                      | 0.5124    | 1.0000 | 0.6775
1. Must have EUI          | 0.5150    | 1.0000 | 0.6799
2. Must have the same POS | 0.5337    | 1.0000 | 0.6959
3. Must not be synonyms   | 0.5124    | 1.0000 | 0.6776
Combination of 1, 2 & 3   | 0.5337    | 1.0000 | 0.6959
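The combined filter (criteria 1, 2, and 3) can be sketched as a predicate over a candidate pair. The lookup tables below are hypothetical miniature stand-ins for the Lexicon and a synonym resource; the EUI values are made up for illustration, not real SPECIALIST data:

```python
def passes_criteria(word1, word2, eui_of, pos_of, synonyms_of):
    """Keep a candidate aPair only if it satisfies all three criteria."""
    # Criterion 1: must have EUI (both words are Lexicon entries)
    if word1 not in eui_of or word2 not in eui_of:
        return False
    # Criterion 2: must have the same POS (share at least one part of speech)
    if not (pos_of[word1] & pos_of[word2]):
        return False
    # Criterion 3: must not be synonyms
    if word2 in synonyms_of.get(word1, set()):
        return False
    return True

# Hypothetical miniature resources (entries and EUIs are illustrative only)
eui_of = {"increase": "E0000001", "decrease": "E0000002",
          "big": "E0000003", "large": "E0000004"}
pos_of = {"increase": {"verb", "noun"}, "decrease": {"verb", "noun"},
          "big": {"adj"}, "large": {"adj"}}
synonyms_of = {"big": {"large"}, "large": {"big"}}

print(passes_criteria("increase", "decrease", eui_of, pos_of, synonyms_of))  # True
print(passes_criteria("big", "large", eui_of, pos_of, synonyms_of))          # False: synonyms
```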

II. Evaluation on instances with UMLS CUI

The scope of the antonym generation task is limited to concepts in the UMLS Metathesaurus because the Lexicon is one of the three major components supporting NLP research using the UMLS. Accordingly, one of the requirements is that antonyms must have valid CUIs. There are 545 aPairs in the TtSet that have CUIs. We conducted the same evaluation as above; the result is shown in the table below (process-55). The results confirm that these three criteria improve precision and F1 while preserving recall within the scope of our task (antonyms that have CUIs).

Criteria                  | Precision | Recall | F1
None                      | 0.4899    | 1.0000 | 0.6576
1. Must have EUI          | 0.4908    | 1.0000 | 0.6584
2. Must have the same POS | 0.4963    | 1.0000 | 0.6634
3. Must not be synonyms   | 0.4899    | 1.0000 | 0.6576
Combination of 1, 2 & 3   | 0.4963    | 1.0000 | 0.6634
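This CUI restriction amounts to a simple filter over the annotated set. A sketch with a hypothetical CUI lookup follows; the CUI values are placeholders, not real Metathesaurus identifiers:

```python
def restrict_to_umls(apairs, cui_of):
    """Keep only aPairs in which both antonyms map to a valid UMLS CUI."""
    return [pair for pair in apairs if all(word in cui_of for word in pair)]

# Placeholder CUI lookup for illustration only
cui_of = {"alive": "C0000001", "dead": "C0000002"}
print(restrict_to_umls([("alive", "dead"), ("foo", "bar")], cui_of))
# [('alive', 'dead')]
```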

III. Evaluation on instances with UMLS CUI on CC sources

Our goal is to find criteria that can be applied to the CC model to improve its performance. Thus, we shifted our focus to the instances in the TtSet that come from the CC source. There are 271 aPairs in the TtSet that have CUIs and are derived from CC. This set is used to evaluate criteria for the CC model.

We added a new criterion: aPairs must have the same STI (semantic type). These four criteria were evaluated, and the results are shown in the table below (process-56). The new criterion of requiring the same STI increases precision but lowers recall and F1. In practice, we applied all four criteria when generating antonym candidates from the CC model to increase precision.

Criteria                   | Precision | Recall | F1
None                       | 0.5129    | 1.0000 | 0.6780
1. Must have EUI           | 0.5148    | 1.0000 | 0.6797
2. Must have the same POS  | 0.5187    | 1.0000 | 0.6830
3. Must not be synonyms    | 0.5129    | 1.0000 | 0.6780
4. Must have the same STI  | 0.5497    | 0.7554 | 0.6364
Combination of 1, 2, 3 & 4 | 0.5556    | 0.7554 | 0.6402
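Criterion 4 can be expressed as one more predicate on top of the earlier filters. The semantic type assignments below are hypothetical examples, not actual UMLS Semantic Network data:

```python
def passes_sti(word1, word2, sti_of):
    """Criterion 4: both antonyms must share at least one semantic type (STI)."""
    return bool(sti_of.get(word1, set()) & sti_of.get(word2, set()))

# Hypothetical semantic type assignments for illustration
sti_of = {"alive": {"Organism Attribute"}, "dead": {"Organism Attribute"},
          "increase": {"Functional Concept"}}
print(passes_sti("alive", "dead", sti_of))      # True
print(passes_sti("increase", "alive", sti_of))  # False
```

Because a word with no recorded semantic type can never share one, this check trades recall for precision, which matches the drop in recall seen in the table above.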

IV. Evaluation of CC models (Co-occurrence in MEDLINE n-grams)

Antonyms often co-occur in corpora. The CC model uses this phenomenon to retrieve antonym candidates from a selected corpus. The MEDLINE n-gram set was used as the corpus for the collocates model. Because each antonym must be a single word, antonym collocates appear in 3-grams, 4-grams, and 5-grams. The table below shows examples of antonyms in the 3-grams, 4-grams, and 5-grams (process-57). A performance evaluation on N-grams (N = 3 to 5) was conducted on the TtSet because aPairs from the TtSet appear as collocates in the different N-grams (process-58), as shown in the last table. In general, the co-occurrence instances of 5-grams are a subset of those of 4-grams, and the co-occurrence instances of 4-grams are a subset of those of 3-grams. The MEDLINE 3-grams were chosen as the corpus in the CC model for the best recall and F1 performance.

Examples of antonym collocates in MEDLINE n-grams (frequency|n-gram):

aPair: increase|decrease
  3-grams:
    • 5934|increase or decrease
    • 1990|increase and decrease
    • 940|decrease or increase
    • 691|decrease and increase
    • 205|decrease with increase
    • ...
  4-grams:
    • 1662|increase or decrease in
    • 965|increase or decrease the
    • 775|an increase or decrease
    • 693|increase and decrease in
    • 662|to increase or decrease
    • ...
  5-grams:
    • 528|an increase or decrease in
    • 439|increase or decrease in the
    • 342|an increase or a decrease
    • 291|decrease with an increase in
    • 277|increase or a decrease in
    • ...

aPair: alive|dead
  3-grams:
    • 218|dead or alive
    • 198|alive or dead
    • 94|alive and dead
    • 45|dead and alive
    • ...
  4-grams: None
  5-grams: None

aPair: copy|original
  3-grams: None
  4-grams:
    • 417|copy of the original
  5-grams:
    • 387|copy of the original print
    • 387|scanned copy of the original
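Candidate retrieval in this style can be sketched as a scan over frequency|n-gram lines of the kind shown above. The helper and sample data below are illustrative, not the actual pipeline:

```python
def find_collocate_pairs(ngram_lines, candidate_pairs):
    """Count co-occurrences of candidate word pairs in frequency|n-gram lines.

    ngram_lines: iterable of strings like "5934|increase or decrease"
    candidate_pairs: iterable of (word1, word2) tuples; each word is a single token
    """
    counts = {}
    for line in ngram_lines:
        freq_str, _, text = line.partition("|")
        words = set(text.split())
        for pair in candidate_pairs:
            # Both words of the pair must appear in the same n-gram
            if pair[0] in words and pair[1] in words:
                counts[pair] = counts.get(pair, 0) + int(freq_str)
    return counts

lines = ["5934|increase or decrease", "1990|increase and decrease", "218|dead or alive"]
pairs = [("increase", "decrease"), ("alive", "dead")]
print(find_collocate_pairs(lines, pairs))
# {('increase', 'decrease'): 7924, ('alive', 'dead'): 218}
```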

Our antonym generation model includes the following sources:

  • lexical records with negative tag (LEX)
  • suffix derivations with negation (SD)
  • prefix derivations with negation (PD)
  • co-occurrence from a corpus (CC).

The CC model uses the MEDLINE 3-grams for better recall. In our test, 582 aPairs (58.2%) in the TtSet were not retrieved by our model. It is imperative to develop more models to achieve comprehensive coverage for antonym generation. In addition, we could use an existing antonym corpus with a semantic network (such as WordNet) to retrieve antonym candidates.

N-grams | Precision | Recall | F1
3-grams | 0.6029    | 0.4922 | 0.5419
4-grams | 0.6301    | 0.4258 | 0.5082
5-grams | 0.6544    | 0.3477 | 0.4547

Please see evaluation documents for more details.