SPECIALIST Lexicon

Derive Criteria of aPairs from TtSet

I. Introduction

The antonyms collected in the training and test set (TtSet) are assumed to be representative of English antonyms in general and are used to identify generic properties of antonym pairs (aPairs). The aPairs in the TtSet are manually tagged for canonical status, domain, type, and negation. Computer programs are developed to:

- retrieve properties of these aPairs, such as EUIs, POSs, CUIs, STIs, and sources.
- compute statistics over these properties to identify generic criteria for antonyms. These criteria cover the EUI (Entry Unique Identifier), POS (part of speech), concepts (CUI, Concept Unique Identifier), semantic types (STI, Semantic Type Identifier), and synonyms. The identified criteria are then implemented in the antonym generation model to find antonym candidates from CC (collocates in MEDLINE).

II. Processes

A program is developed to calculate the statistics over the properties listed in the previous section. The program is run on two data sets: (1) the 1000 aPairs from the TtSet; (2) the 514 canonical aPairs from the TtSet.

APairs from the TtSet whose source is not [LEX|SD|PD] are temporarily assigned the source [TT]. These aPairs are then checked against the MEDLINE n-gram set and retagged as [CC] or [SN]. There are two possibilities for aPairs with the source [SN]:

They co-occur in other corpora, but not in MEDLINE. For example:
- seller|buyer: “seller market and buyer market” can be found in other corpora
- compliment|insult: these are collocates in the iWeb corpus (https://www.collocates.info/iweb.asp)
They do not co-occur in any corpus. For example:
- abominate|love may not appear in any corpus because abominate is such a rare word; some pairs of this kind may simply not be relevant to the collocate model.
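
The retagging step described above can be sketched as follows. This is only a minimal illustration, assuming the MEDLINE n-gram check can be reduced to a membership test on unordered word pairs; the function name `retag_source`, the set `medline_pairs`, and the increase|decrease entry are hypothetical and not taken from the actual program.

```python
from typing import FrozenSet, Set, Tuple

# Sources that are kept as-is (from the text: LEX, SD, PD).
KNOWN_SOURCES = {"LEX", "SD", "PD"}


def retag_source(
    pair: Tuple[str, str],
    source: str,
    medline_pairs: Set[FrozenSet[str]],
) -> str:
    """Return the final source tag for an aPair.

    Pairs that are not from LEX, SD, or PD are treated as temporary [TT]
    pairs and retagged: [CC] if the two words co-occur in the (assumed)
    MEDLINE n-gram lookup, [SN] otherwise.
    """
    if source in KNOWN_SOURCES:
        return source
    key = frozenset(pair)  # order of the two antonyms does not matter
    return "CC" if key in medline_pairs else "SN"


# Toy lookup; the entry below is an invented example, not real MEDLINE data.
medline_pairs = {frozenset(("increase", "decrease"))}

print(retag_source(("increase", "decrease"), "TT", medline_pairs))  # CC
print(retag_source(("seller", "buyer"), "TT", medline_pairs))       # SN
print(retag_source(("hot", "cold"), "LEX", medline_pairs))          # LEX
```

A real co-occurrence check would query the MEDLINE n-gram set rather than a hand-built Python set; the sketch only shows the order of the decisions.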

III. Analysis

The analyses below summarize observations from the results of this program.

Source analysis
The table below shows the source distribution:

Set	LEX	SD	PD	CC	SN	Total
TtSet (candidates)	10 (1.00%)	7 (0.70%)	79 (7.90%)	322 (32.20%)	582 (58.20%)	1000
TtSet (canonical)	10 (1.95%)	3 (0.58%)	71 (13.81%)	170 (33.07%)	260 (50.58%)	514
- Among the 1000 most used aPairs (candidates) collected from the TtSet, 90.40% are from CC (32.20%) and SN (58.20%).
- Among the 514 canonical aPairs (tagged) from the TtSet, 83.66% are from CC (33.07%) and SN (50.58%).
CC accounts for about one third of both the antonym candidates (32.20%) and the canonical antonyms (33.07%). Model development for antonym generation from the sources LEX, SD, and PD is complete; tagging is finished for the LEX and SD candidates and still in progress for the PD candidates. An antonym generation model for CC and SN is therefore needed to provide comprehensive antonym coverage.
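
The percentages in the source table can be reproduced from the raw counts. The short sketch below uses the counts reported above; the helper name `distribution` is introduced only for illustration.

```python
from typing import Dict


def distribution(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-source counts into percentages rounded to two decimals."""
    total = sum(counts.values())
    return {src: round(100.0 * n / total, 2) for src, n in counts.items()}


# Counts taken from the source distribution table.
candidates = {"LEX": 10, "SD": 7, "PD": 79, "CC": 322, "SN": 582}
canonical = {"LEX": 10, "SD": 3, "PD": 71, "CC": 170, "SN": 260}

for name, counts in (("candidates", candidates), ("canonical", canonical)):
    total = sum(counts.values())
    # Combined share of the two sources not yet covered by a generation model.
    cc_sn = 100.0 * (counts["CC"] + counts["SN"]) / total
    print(name, total, distribution(counts), f"CC+SN = {cc_sn:.2f}%")
```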
EUI analysis
Antonyms must be in the Lexicon. The following table shows the percentage of antonyms from the TtSet in the Lexicon.

Set	Total	None	Ant-1	Ant-2	Both
TtSet (candidates)	1000	0 (0.00%)	1 (0.10%)	0 (0.00%)	999 (99.90%)
TtSet (canonical)	514	0 (0.00%)	0 (0.00%)	0 (0.00%)	514 (100.00%)
POS analysis
The table below shows the percentage of aPairs with the same POS.
- Among the 1000 most used aPairs (candidates) collected from the TtSet, 97.50% have the same POS.
- Among those candidate aPairs in which both antonyms are in the Lexicon (have EUIs), 97.60% have the same POS.
- Among the 514 canonical aPairs (tagged) from the TtSet, 100% have the same POS.
Set	Total	Different POS	Same POS
TtSet	1000	25 (2.50%)	975 (97.50%)
TtSet (both have EUIs)	999	24 (2.40%)	975 (97.60%)
TtSet (canonical)	514	0 (0.00%)	514 (100.00%)

CUI analysis

Among the 1000 most used aPairs and the canonical aPairs collected from the TtSet, both antonyms have CUIs in only about 51.95% to 55.18% of pairs. However, our research scope is limited to concepts in the UMLS Metathesaurus, so we require that both antonyms have a valid CUI.

Our aPairs are a more strictly defined (and therefore smaller) set than antonyms in general use. This is appropriate because we target precision when applying antonyms in NLP applications. Antonyms without CUIs cannot be mapped to any concept for further NLP processing in any case.

Set	Total	No CUI	Ant-1 has CUI	Ant-2 has CUI	Both have CUIs
TtSet	1000	138 (13.80%)	170 (17.00%)	147 (14.70%)	545 (54.50%)
TtSet (with the same POS)	975	132 (13.54%)	163 (16.72%)	142 (14.56%)	538 (55.18%)
TtSet (Canonical)	514	91 (17.70%)	90 (17.51%)	66 (12.84%)	267 (51.95%)
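
The four columns of the CUI table (and of the EUI table earlier) correspond to a simple bucketing of each pair by which of its two members has an identifier. A minimal sketch follows, assuming a lookup predicate is available; `words_with_cui` and the toy pairs are invented for illustration and do not reflect real UMLS lookups.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def coverage_bucket(has_first: bool, has_second: bool) -> str:
    """Name the table column a pair falls into."""
    if has_first and has_second:
        return "Both"
    if has_first:
        return "Ant-1"
    if has_second:
        return "Ant-2"
    return "None"


def tabulate(
    pairs: Iterable[Tuple[str, str]],
    has_id: Callable[[str], bool],
) -> Counter:
    """Count pairs per bucket, given a predicate such as 'word has a CUI'."""
    return Counter(coverage_bucket(has_id(a), has_id(b)) for a, b in pairs)


# Toy data only: which words "have a CUI" here is an assumption.
words_with_cui = {"increase", "decrease", "hot"}
toy_pairs = [("increase", "decrease"), ("hot", "chilly"), ("abominate", "love")]

counts = tabulate(toy_pairs, lambda w: w in words_with_cui)
print(counts["Both"], counts["Ant-1"], counts["Ant-2"], counts["None"])
```

The same function applies to the EUI table by passing a predicate that checks for a Lexicon entry instead of a CUI.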

STI analysis
The table below shows:
- Among the 1000 most used aPairs collected from the TtSet, about 32.00% share the same semantic types.
- Among the canonical aPairs (tagged) from the TtSet in which both antonyms have CUIs, 67.79% share the same semantic types.
- Among the canonical aPairs (tagged) from the TtSet in which both antonyms have CUIs and the source is CC or SN, 67.11% share the same semantic types.
Applying the semantic type criterion to aPairs reduces the antonym candidates to a smaller, higher-precision set than commonly used antonyms. This suits NLP applications that target high precision, at the cost of lower recall. This analysis is investigated further for the CC source model.

Set	Total	Not share STI	Share STI
TtSet	1000	680 (68.00%)	320 (32.00%)
TtSet (Canonical, both have CUIs)	267	86 (32.21%)	181 (67.79%)
TtSet (Canonical, both have CUIs in [CC|SN])	228	75 (32.89%)	153 (67.11%)
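
One way to read "share STI" is that the two antonyms have at least one semantic type in common. The sketch below tabulates pairs under that assumption; the interpretation, the function names, and the placeholder labels such as "STI-A" are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def share_sti(stis_a: Set[str], stis_b: Set[str]) -> bool:
    """True if the two antonyms have at least one semantic type in common."""
    return not stis_a.isdisjoint(stis_b)


def sti_summary(
    pairs: Iterable[Tuple[Set[str], Set[str]]],
) -> Tuple[int, int, float]:
    """Return (not sharing, sharing, percent sharing) for a list of pairs."""
    pairs = list(pairs)
    shared = sum(1 for a, b in pairs if share_sti(a, b))
    total = len(pairs)
    pct = round(100.0 * shared / total, 2) if total else 0.0
    return total - shared, shared, pct


# Placeholder semantic type labels, not real STIs.
toy = [
    ({"STI-A"}, {"STI-A", "STI-B"}),  # overlap on STI-A
    ({"STI-A"}, {"STI-C"}),           # no overlap
    ({"STI-B"}, {"STI-B"}),           # identical
]
print(sti_summary(toy))  # (1, 2, 66.67)
```

A stricter reading (identical sets of semantic types) would replace the overlap test with `stis_a == stis_b`; which reading the original program uses is not stated here.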
Synonym analysis
Antonyms cannot be synonyms. The table below confirms this: no aPair in the TtSet is also a synonym pair, which is consistent with the theory that antonyms and synonyms are similar in domain but differ in polarity.

Set	Total	Not Synonym	Synonym
TtSet	1000	1000 (100.00%)	0 (0.00%)
TtSet (Canonical)	514	514 (100.00%)	0 (0.00%)
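
Taken together, the criteria derived in this section (both antonyms in the Lexicon, same POS, both with CUIs, not synonyms, and optionally a shared semantic type) could be applied to a candidate pair roughly as follows. This is a sketch under assumptions: the `Term` record, its field names, the `require_shared_sti` switch, and all identifier values are invented for illustration and are not the actual generation model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Term:
    """Hypothetical record holding the properties retrieved for one word."""
    word: str
    eui: Optional[str] = None          # None means not in the Lexicon
    pos: Optional[str] = None
    cuis: FrozenSet[str] = frozenset()
    stis: FrozenSet[str] = frozenset()
    synonyms: FrozenSet[str] = frozenset()


def passes_criteria(a: Term, b: Term, require_shared_sti: bool = False) -> bool:
    """Apply the criteria from this section to one candidate pair."""
    if a.eui is None or b.eui is None:      # EUI: both must be in the Lexicon
        return False
    if a.pos != b.pos:                      # POS: must be the same
        return False
    if not a.cuis or not b.cuis:            # CUI: both must have a concept
        return False
    if b.word in a.synonyms or a.word in b.synonyms:  # not synonyms
        return False
    if require_shared_sti and a.stis.isdisjoint(b.stis):  # optional STI check
        return False
    return True


# Placeholder identifiers only; none of these are real EUIs, CUIs, or STIs.
hot = Term("hot", "E-hot", "adj", frozenset({"C-hot"}), frozenset({"STI-A"}))
cold = Term("cold", "E-cold", "adj", frozenset({"C-cold"}), frozenset({"STI-A"}))
rare = Term("abominate", "E-abominate", "verb")

print(passes_criteria(hot, cold, require_shared_sti=True))  # True
print(passes_criteria(rare, hot))                           # False
```

The STI check is left optional because the text describes it as a precision and recall tradeoff still under investigation for the CC source model.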
Domain analysis
The tagged results on the TtSet show that canonical aPairs are distributed across 10 domains.

Please see analysis documents for more details.
