Derive Criteria of aPairs from TtSet
I. Introduction
The antonyms collected in the training and test set (TtSet) are assumed to be representative of English antonyms overall and are used to identify generic properties of antonym pairs (aPairs). The aPairs in the TtSet are manually tagged for canonical status, domain, type, and negation. Computer programs are developed to:
II. Processes
A program is developed to calculate statistics for the properties described in the previous section. This program is run on two data sets: 1) 1000 aPairs from the TtSet; 2) 514 canonical aPairs from the TtSet.
APairs from the TtSet whose source is not [LEX|SD|PD] are temporarily assigned the source [TT]. These aPairs are then checked against the MEDLINE n-gram set to retag the source as [CC] or [SN]. There are two possibilities for aPairs with source [SN]:
III. Analysis
The analyses below summarize observations from the results of this program.
The table below shows the source distribution:
Set | LEX | SD | PD | CC | SN | Total |
---|---|---|---|---|---|---|
TtSet (candidates) | 10 (1.00%) | 7 (0.70%) | 79 (7.90%) | 322 (32.20%) | 582 (58.20%) | 1000 |
TtSet (canonical) | 10 (1.95%) | 3 (0.58%) | 71 (13.81%) | 170 (33.07%) | 260 (50.58%) | 514 |
The CC source accounts for about one third of both the antonym candidates (32.20%) and the canonical antonyms (33.07%). Currently, we have completed model development for antonym generation from the LEX, SD, and PD sources; tagging is complete for candidates from LEX and SD, while candidates from PD are still being tagged. It is imperative to develop antonym generation models for CC and SN to provide comprehensive coverage of antonyms.
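Each percentage in the source table is a source count divided by the set total. The tally can be sketched in Python as below, assuming each aPair carries exactly one source tag. The function name `source_distribution` and the list-of-tags input are illustrative assumptions, not the program used for this analysis.

```python
from collections import Counter

# Source tags in the order used by the table columns.
SOURCES = ["LEX", "SD", "PD", "CC", "SN"]

def source_distribution(tags):
    """Return {source: (count, percent)} for a list of source tags."""
    counts = Counter(tags)
    total = len(tags)
    return {s: (counts[s], round(100.0 * counts[s] / total, 2)) for s in SOURCES}

# Rebuild the candidate set from the counts reported in the table
# (illustrative input only; the real input would be the tagged aPairs).
candidate_tags = (
    ["LEX"] * 10 + ["SD"] * 7 + ["PD"] * 79 + ["CC"] * 322 + ["SN"] * 582
)
dist = source_distribution(candidate_tags)
for source in SOURCES:
    count, pct = dist[source]
    print(f"{source}: {count} ({pct:.2f}%)")
```

Recomputing the percentages from the counts this way is also a quick check that each row sums to its total.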
Antonyms must be in the Lexicon. The following table shows the percentage of antonyms from the TtSet in the Lexicon.
Set | Total | None | Ant-1 | Ant-2 | Both |
---|---|---|---|---|---|
TtSet (candidates) | 1000 | 0 (0.00%) | 1 (0.10%) | 0 (0.00%) | 999 (99.90%) |
TtSet (canonical) | 514 | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 514 (100.00%) |
The table below shows the percentage of aPairs with the same POS.
Set | Total | Different POS | Same POS |
---|---|---|---|
TtSet | 1000 | 25 (2.50%) | 975 (97.50%) |
TtSet (both have EUIs) | 999 | 24 (2.40%) | 975 (97.60%) |
TtSet (canonical) | 514 | 0 (0.00%) | 514 (100.00%) |
Among the 1000 most used aPairs and the canonical aPairs collected from the TtSet, both antonyms have CUIs in only about 51.95% to 55.18% of the pairs. However, our research scope is limited to concepts in the UMLS Metathesaurus, so we require that both antonyms have a valid CUI.
Our aPairs are therefore a more strictly defined (smaller) set than antonyms in general use. This is appropriate because we target precision when applying antonyms in NLP applications. Antonyms without CUIs cannot be mapped to any concept for further NLP processing anyway.
Set | Total | No CUI | Ant-1 has CUI | Ant-2 has CUI | Both have CUIs |
---|---|---|---|---|---|
TtSet | 1000 | 138 (13.80%) | 170 (17.00%) | 147 (14.70%) | 545 (54.50%) |
TtSet (with the same POS) | 975 | 132 (13.54%) | 163 (16.72%) | 142 (14.56%) | 538 (55.18%) |
TtSet (Canonical) | 514 | 91 (17.70%) | 90 (17.51%) | 66 (12.84%) | 267 (51.95%) |
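
The four columns of the CUI table correspond to a simple classification of each aPair. A minimal Python sketch follows, assuming a lookup from a term to its CUIs is available. The name `cui_lookup`, the helper functions, and the placeholder identifiers in the toy data are hypothetical and are not real UMLS CUIs.

```python
from collections import Counter

CATEGORIES = ["No CUI", "Ant-1 has CUI", "Ant-2 has CUI", "Both have CUIs"]

def cui_category(ant1, ant2, cui_lookup):
    """Classify an aPair by which of its antonyms has at least one CUI."""
    has1 = bool(cui_lookup.get(ant1))
    has2 = bool(cui_lookup.get(ant2))
    if has1 and has2:
        return "Both have CUIs"
    if has1:
        return "Ant-1 has CUI"
    if has2:
        return "Ant-2 has CUI"
    return "No CUI"

def cui_coverage(pairs, cui_lookup):
    """Return {category: (count, percent)} over a list of (ant1, ant2) pairs."""
    counts = Counter(cui_category(a, b, cui_lookup) for a, b in pairs)
    total = len(pairs)
    return {k: (counts[k], round(100.0 * counts[k] / total, 2)) for k in CATEGORIES}

# Toy data with made-up placeholder identifiers (assumption, not UMLS content).
toy_lookup = {
    "hot": ["CUI_hot"],
    "cold": ["CUI_cold"],
    "up": ["CUI_up"],
    "present": ["CUI_present"],
}
toy_pairs = [("hot", "cold"), ("up", "down"), ("absent", "present"), ("in", "out")]
coverage = cui_coverage(toy_pairs, toy_lookup)
for category in CATEGORIES:
    print(category, coverage[category])
```

Only pairs in the "Both have CUIs" category satisfy the valid-CUI requirement stated above.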
The table below shows the number of aPairs in which both antonyms share an STI.
Applying the semantic type criterion to aPairs reduces the antonym candidates to a smaller, higher-precision set than commonly used antonyms. This suits NLP applications that target higher precision, at the cost of lower recall. This analysis is investigated further for the CC source model.
Set | Total | Not share STI | Share STI |
---|---|---|---|
TtSet | 1000 | 680 (68.00%) | 320 (32.00%) |
TtSet (Canonical, both have CUIs) | 267 | 86 (32.21%) | 181 (67.79%) |
TtSet (Canonical, both have CUIs in [CC|SN]) | 228 | 75 (32.89%) | 153 (67.11%) |
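
The STI-sharing check can be read as a set intersection over the STIs of the two antonyms. The Python sketch below assumes a lookup from a term to its set of STIs and counts an antonym with no STI as not sharing. The names `sti_lookup`, `share_sti`, and `count_sharing`, and the placeholder STI values, are hypothetical.

```python
def share_sti(stis1, stis2):
    """True if the two antonyms have at least one STI in common."""
    return bool(set(stis1) & set(stis2))

def count_sharing(pairs, sti_lookup):
    """Return (not_shared, shared) counts over a list of (ant1, ant2) pairs.

    Assumption: a term missing from sti_lookup has no STI and cannot share one.
    """
    shared = sum(
        1
        for a, b in pairs
        if share_sti(sti_lookup.get(a, ()), sti_lookup.get(b, ()))
    )
    return len(pairs) - shared, shared

# Toy data with made-up placeholder STI values (assumption).
toy_sti = {
    "hot": {"STI_x"},
    "cold": {"STI_x", "STI_y"},
    "up": {"STI_y"},
    "down": {"STI_z"},
}
toy_sti_pairs = [("hot", "cold"), ("up", "down"), ("in", "out")]
not_shared, shared = count_sharing(toy_sti_pairs, toy_sti)
print(not_shared, shared)
```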
Antonyms cannot be synonyms. The table below shows that no aPair in the TtSet is also a synonym pair, which is consistent with the theory that antonyms and synonyms are similar in domain but different in polarity.
Set | Total | Not Synonym | Synonym |
---|---|---|---|
TtSet | 1000 | 1000 (100.00%) | 0 (0.00%) |
TtSet (Canonical) | 514 | 514 (100.00%) | 0 (0.00%) |
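
The criteria observed in the tables above (both antonyms in the Lexicon, same POS, both with CUIs, not synonyms) could be combined into a single check per aPair. The Python sketch below is one possible arrangement, not the project's implementation. The `APair` record and its field names are hypothetical, and the STI criterion is left out because it is still under investigation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class APair:
    """Hypothetical record for one antonym pair (field names are assumptions)."""
    ant1: str
    ant2: str
    pos1: str
    pos2: str
    in_lexicon1: bool
    in_lexicon2: bool
    cuis1: frozenset
    cuis2: frozenset
    is_synonym: bool

def failed_criteria(pair):
    """Return the list of criteria the pair fails; an empty list means it passes."""
    failures = []
    if not (pair.in_lexicon1 and pair.in_lexicon2):
        failures.append("not both in Lexicon")
    if pair.pos1 != pair.pos2:
        failures.append("different POS")
    if not (pair.cuis1 and pair.cuis2):
        failures.append("missing CUI")
    if pair.is_synonym:
        failures.append("synonyms")
    return failures

# Toy records with placeholder CUI values (assumption, not UMLS content).
good = APair("hot", "cold", "adj", "adj", True, True,
             frozenset({"CUI_hot"}), frozenset({"CUI_cold"}), False)
bad = APair("up", "descend", "adv", "verb", True, True,
            frozenset({"CUI_up"}), frozenset(), False)
print(failed_criteria(good))
print(failed_criteria(bad))
```

Returning the list of failed criteria, rather than a single boolean, keeps it visible which rule removed a candidate.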
The tagged results on the TtSet show that canonical aPairs are distributed across 10 domains.
Please see analysis documents for more details.