Evaluation on TtSet
We use the TtSet to train and test criteria for antonym generation models. Precision, recall, and F1 are used as metrics to measure performance. Retrieved instances are the aPairs produced by the antonym generation models, while relevant instances are the canonical aPairs annotated by linguists. The evaluations are described below.
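For clarity, the three metrics can be sketched in Python. This is a minimal illustration, not the actual evaluation pipeline; the aPairs here are toy examples encoded as frozensets of the two terms.

```python
def prf1(retrieved, relevant):
    """Precision, recall and F1 of retrieved aPairs against the
    linguist-annotated canonical (relevant) aPairs."""
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example; these aPairs are illustrative, not from the TtSet.
relevant = {frozenset(p) for p in [("alive", "dead"), ("increase", "decrease")]}
retrieved = {frozenset(p) for p in [("alive", "dead"), ("hot", "warm")]}
p, r, f = prf1(retrieved, relevant)  # p = 0.5, r = 0.5, f = 0.5
```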
I. Derived criteria from training set and evaluation on test sets
There are 1,000 aPairs collected in the TtSet. The TtSet is split randomly into an 80% training set and a 20% test set (process-52); as a result, the training and test sets include 799 and 201 aPair instances, respectively. We analyzed three properties (EUI, POS, and synonymy) on the training set (process-53). This analysis yielded the three criteria listed in the table below.
These three criteria were then evaluated on the test set. aPairs were retrieved under four conditions: requiring an EUI, requiring the same POS, excluding synonyms, and the combination of all three criteria (process-54). The results show that precision and F1 increase while recall is preserved when these criteria are applied, as shown in the table below. We conclude that the three criteria are valid and that their combination should be used in antonym generation models.
Criteria | Precision | Recall | F1 |
---|---|---|---|
None | 0.5124 | 1.0000 | 0.6775 |
1. Must have EUI | 0.5150 | 1.0000 | 0.6799 |
2. Must have the same POS | 0.5337 | 1.0000 | 0.6959 |
3. Must not be synonyms | 0.5124 | 1.0000 | 0.6776 |
Combination of 1, 2 & 3 | 0.5337 | 1.0000 | 0.6959 |
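A minimal sketch of how the three filters might be applied. The `APair` record layout and the EUI values are illustrative assumptions, not the actual data model or real Lexicon identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class APair:
    # Illustrative fields; the real record layout may differ.
    word1: str
    word2: str
    eui1: Optional[str]  # Lexicon unique identifier of word1, if any
    eui2: Optional[str]
    pos1: str            # part of speech
    pos2: str
    synonyms: bool       # True if the two words are also synonyms

def passes_criteria(p: APair) -> bool:
    has_eui = p.eui1 is not None and p.eui2 is not None  # criterion 1
    same_pos = p.pos1 == p.pos2                          # criterion 2
    not_syn = not p.synonyms                             # criterion 3
    return has_eui and same_pos and not_syn

# EUI strings below are placeholders, not real Lexicon EUIs.
pairs = [
    APair("alive", "dead", "E-demo-1", "E-demo-2", "adj", "adj", False),
    APair("big", "large", "E-demo-3", "E-demo-4", "adj", "adj", True),   # synonyms
    APair("fast", "slowly", "E-demo-5", None, "adv", "adv", False),      # no EUI
]
kept = [p for p in pairs if passes_criteria(p)]  # keeps only alive|dead
```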
II. Evaluation on instances with UMLS CUIs
The scope of the antonym generation task is concepts in the UMLS Metathesaurus, because the Lexicon is one of the three major UMLS components supporting NLP research. Accordingly, one requirement is that antonyms must have valid CUIs. There are 545 aPairs with CUIs in the TtSet. We conducted the same evaluation as above; the results are shown in the table below (process-55). They confirm that the three criteria improve precision and F1 while preserving recall within the scope of our task (antonyms with CUIs).
Criteria | Precision | Recall | F1 |
---|---|---|---|
None | 0.4899 | 1.0000 | 0.6576 |
1. Must have EUI | 0.4908 | 1.0000 | 0.6584 |
2. Must have the same POS | 0.4963 | 1.0000 | 0.6634 |
3. Must not be synonyms | 0.4899 | 1.0000 | 0.6576 |
Combination of 1, 2 & 3 | 0.4963 | 1.0000 | 0.6634 |
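Restricting the evaluation to aPairs whose terms both map to a UMLS CUI might look like the following sketch. The `cui_of` lookup is a hypothetical stand-in for a Metathesaurus query, and the CUI values are dummies, not real Metathesaurus identifiers.

```python
# Placeholder lookup standing in for a UMLS-Metathesaurus query;
# the CUI values are dummies, not verified Metathesaurus entries.
cui_of = {"alive": "C-demo-1", "dead": "C-demo-2", "fast": "C-demo-3"}

def has_valid_cuis(pair):
    """An aPair is in scope only if both terms have a CUI."""
    w1, w2 = pair
    return w1 in cui_of and w2 in cui_of

apairs = [("alive", "dead"), ("fast", "slowish")]
in_scope = [p for p in apairs if has_valid_cuis(p)]  # [("alive", "dead")]
```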
III. Evaluation on instances with UMLS CUIs from CC sources
Our goal is to find criteria that improve the performance of the CC model, so we shifted our focus to the instances in the TtSet that come from the CC source. There are 271 aPairs in the TtSet that have CUIs and are derived from CC; this set is used to evaluate criteria for the CC model.
We added a new criterion: the two terms of an aPair must have the same STI (semantic type). These four criteria were evaluated, and the results are shown in the table below (process-56). The new same-STI criterion increases precision, yet drops recall (to 0.7554) and F1 (by about 0.04). In practice, we apply all four criteria when generating antonym candidates from the CC model to increase precision.
Criteria | Precision | Recall | F1 |
---|---|---|---|
None | 0.5129 | 1.0000 | 0.6780 |
1. Must have EUI | 0.5148 | 1.0000 | 0.6797 |
2. Must have the same POS | 0.5187 | 1.0000 | 0.6830 |
3. Must not be synonyms | 0.5129 | 1.0000 | 0.6780 |
4. Must have the same STI | 0.5497 | 0.7554 | 0.6364 |
Combination of 1, 2, 3 & 4 | 0.5556 | 0.7554 | 0.6402 |
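The precision/recall trade-off of the STI criterion can be checked directly from the table values, since F1 is the harmonic mean of precision and recall (small last-digit differences from the table come from rounding of P and R):

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Criterion 4 alone (table row 4): precision rises, but the recall
# drop pulls F1 below the no-criterion baseline.
f1_sti = f1(0.5497, 0.7554)   # ~0.636
f1_none = f1(0.5129, 1.0000)  # ~0.678 baseline
```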
IV. Evaluation of CC models (Co-occurrence in MEDLINE n-grams)
Antonyms often co-occur in corpora, and the CC model exploits this phenomenon to retrieve antonym candidates from a selected corpus. The MEDLINE n-gram set was used as the corpus for the CC model. Because each antonym must be a single word, antonym collocates appear in 3-grams, 4-grams, and 5-grams. The table below shows examples of antonyms in the 3-grams, 4-grams, and 5-grams (process-57). A performance evaluation on N-grams (N = 3 to 5) was conducted on the TtSet, since aPairs from the TtSet appear as collocates in N-grams of different sizes (process-58), as shown in the last table of this section. In general, the co-occurrence instances in 5-grams are a subset of those in 4-grams, and those in 4-grams are a subset of those in 3-grams. The MEDLINE 3-grams were therefore chosen as the corpus for the CC model, giving the best recall and F1 performance.
[Table: example antonym collocates from MEDLINE 3-grams, 4-grams, and 5-grams for aPairs such as increase|decrease and alive|dead. The example n-grams did not survive extraction; "None" marked aPairs with no collocate at a given n-gram size.]
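A minimal sketch of harvesting collocate candidates from 3-grams. The toy n-gram list and the coordination patterns are illustrative assumptions; the actual CC model's extraction rules may differ.

```python
# Toy 3-gram list standing in for the MEDLINE 3-gram set.
trigrams = [
    ("increase", "and", "decrease"),
    ("alive", "or", "dead"),
    ("increase", "in", "mortality"),  # no coordinator: not a candidate
]

# Single-word terms co-occurring around a coordinating word.
COORDINATORS = {"and", "or", "versus", "vs"}

def collocate_candidates(ngrams):
    """Yield word pairs that co-occur around a coordinator."""
    for w1, mid, w2 in ngrams:
        if mid in COORDINATORS and w1 != w2:
            yield (w1, w2)

cands = list(collocate_candidates(trigrams))
# [('increase', 'decrease'), ('alive', 'dead')]
```

Such candidates would still need the four criteria above (EUI, same POS, not synonyms, same STI) applied before being proposed as antonyms.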
Our antonym generation model includes sources from:
The CC model applies the MEDLINE 3-grams for better recall. Even so, in our test, 582 aPairs (58.2%) in the TtSet are not retrieved by our model, so it is imperative to develop more models for comprehensive antonym coverage. In addition, we could use an existing antonym resource with a semantic network (such as WordNet) to retrieve antonym candidates.
N-grams | Precision | Recall | F1 |
---|---|---|---|
3-grams | 0.6029 | 0.4922 | 0.5419 |
4-grams | 0.6301 | 0.4258 | 0.5082 |
5-grams | 0.6544 | 0.3477 | 0.4547 |
Please see evaluation documents for more details.