SPECIALIST Lexicon

Analysis on Antonyms based on TtSet and 2021 Data

I. Introduction

A program was developed to compute statistics for the tagged antonym candidates from TtSet. The program is generic, so it is also applied to all tagged antonym candidates to generate the same statistics. The latest antonym generation data, from 2021, are used as a subset to represent the overall antonyms in this analysis.

II. Implementation

Computer programs are implemented in the directory of TtSet:

GetStatsFromTagCand.java

III. Results

The results and analysis of the three files (TtSet, 2021 all aPair candidates and 2021 canonical aPairs) are described below:

Domains
TtSet and the 2021 data have the same 10 domains. This corresponds to our hypothesis of using TtSet as a representative set for overall antonyms.
Canonical aPairs
APairs from TtSet and the 2021 data are used to compare the canonical rate among antonym candidates. First, the 1000 antonyms are expanded to 1252 antonym candidates by adding their spelling variants. Only 45.77% of these antonym candidates are tagged as canonical antonyms, as shown in the table below. The canonical rate for the 2021 antonym candidates (55.71%) is higher than that for the most used antonyms (TtSet), which suggests that our antonym generation model is effective at generating antonym candidates. The canonical rate of antonym candidates from the 2021 data will be more accurate and more meaningful once PD is completely tagged.

Set	APair Candidates	Canonical	Not Canonical
TtSet Candidates	1252	573 (45.77%)	679 (54.23%)
2021 Candidates	3558	1982 (55.71%)	1576 (44.29%)
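
The canonical rate in the table above is the share of tagged candidates that are canonical. A minimal sketch of that arithmetic is shown below, using the 2021 counts from the table. The class and method names are hypothetical and are not taken from GetStatsFromTagCand.java.

```java
import java.util.Locale;

/** Illustrative only: names are assumptions, not the actual TtSet program. */
final class CanonicalRate {
    /** Returns count as a percentage of total. */
    static double percent(int count, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * count / total;
    }

    public static void main(String[] args) {
        // Counts for the 2021 candidates, taken from the table above.
        int candidates = 3558;
        int canonical = 1982;
        int notCanonical = candidates - canonical;
        // Prints: canonical: 55.71%, not canonical: 44.29%
        System.out.println(String.format(Locale.ROOT,
            "canonical: %.2f%%, not canonical: %.2f%%",
            percent(canonical, candidates),
            percent(notCanonical, candidates)));
    }
}
```
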
Estimated overall canonical aPairs by source of SD
Currently, aPairs from the source of suffix derivation (SD) have been completely generated and tagged for the 2021 data. The number of canonical aPairs from SD is 132. Since the percentage of SD is 0.58% (from Table 2), we estimate the total number of canonical aPairs at 22,758 (= 132/0.0058). Please note that we do not use the tagged canonical aPairs from LEX for this estimation because that number is rather static and does not grow with the corpora (the Lexicon). We will also estimate the total number of canonical aPairs from PD once PD is completely tagged, to confirm this estimate.
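
The extrapolation above divides the count from one source by that source's share of the total. A small sketch of the calculation follows. The class and method names are hypothetical, and rounding down is an assumption chosen because it reproduces the 22,758 figure quoted in the text.

```java
/** Illustrative only: names are assumptions, not the actual TtSet program. */
final class CanonicalEstimate {
    /**
     * Extrapolates a total from one source's count and that source's
     * share of the total (a fraction in (0, 1]), rounding down.
     */
    static long estimateTotal(long sourceCount, double sourceShare) {
        if (sourceShare <= 0.0 || sourceShare > 1.0) {
            throw new IllegalArgumentException("share must be in (0, 1]");
        }
        return (long) Math.floor(sourceCount / sourceShare);
    }

    public static void main(String[] args) {
        long canonicalFromSd = 132;  // canonical aPairs from SD (2021 data)
        double sdShare = 0.0058;     // 0.58%, the SD percentage cited from Table 2
        // Prints: 22758
        System.out.println(estimateTotal(canonicalFromSd, sdShare));
    }
}
```

The same method could be reused for the PD-based estimate once PD is completely tagged, by passing the PD count and the PD share instead.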

POS distribution

The table below shows the POS distribution for canonical aPairs from TtSet and the 2021 data. The 2021 data is incomplete (PD is not completely tagged), so its distribution is not fully representative. However, the top four POS for canonical aPairs (Adj, Noun, Verb and Adv) are the same in both sets and rank in the same order. Please note that canonical aPairs from the remaining POS (Modal, Pron, Aux, Prep, Det and Conj) are rather static and were retrieved from the Lexicon (because they are associated with negation tags in the Lexicon).

The agreement in the top four POS of canonical aPairs between TtSet and the 2021 data corresponds to our hypothesis of using TtSet as a representative set for overall antonyms.

POS	TtSet (Canonical)	2021 (Canonical)
Adj	42.06%	66.75%
Noun	26.35%	15.89%
Verb	22.16%	12.82%
Adv	5.76%	2.06%
Modal	0.35%	0.66%
Pron	1.22%	0.56%
Aux	0.00%	0.51%
Prep	1.39%	0.40%
Det	0.35%	0.20%
Conj	0.35%	0.15%
Compl	N/A (0.00%)	N/A (0.00%)
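
The top-four comparison can be checked mechanically by ranking each distribution and comparing the leading POS labels. The sketch below does this with the percentages from the table above (Compl is omitted because it has no entries). The class and method names are hypothetical and are not taken from GetStatsFromTagCand.java.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative only: names are assumptions, not the actual TtSet program. */
final class TopPos {
    /** Builds a POS-to-percentage map from two parallel arrays. */
    static Map<String, Double> distribution(String[] pos, double[] percent) {
        Map<String, Double> map = new LinkedHashMap<>();
        for (int i = 0; i < pos.length; i++) {
            map.put(pos[i], percent[i]);
        }
        return map;
    }

    /** Returns the n POS labels with the highest percentages, highest first. */
    static List<String> topN(Map<String, Double> dist, int n) {
        return dist.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] pos = {"Adj", "Noun", "Verb", "Adv", "Modal",
                        "Pron", "Aux", "Prep", "Det", "Conj"};
        double[] ttSet = {42.06, 26.35, 22.16, 5.76, 0.35,
                          1.22, 0.00, 1.39, 0.35, 0.35};
        double[] data2021 = {66.75, 15.89, 12.82, 2.06, 0.66,
                             0.56, 0.51, 0.40, 0.20, 0.15};

        List<String> topTtSet = topN(distribution(pos, ttSet), 4);
        List<String> top2021 = topN(distribution(pos, data2021), 4);

        System.out.println("TtSet top four: " + topTtSet);
        System.out.println("2021 top four:  " + top2021);
        System.out.println("Same ranking:   " + topTtSet.equals(top2021));
    }
}
```
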

Negation Distribution
The table below shows the negation distribution for antonym candidates from TtSet and the 2021 data. Both sets have similar negative and not-negative rates. Please note that negation is independent of the canonical property of an aPair. Accordingly, the negation distribution of the candidates (including both canonical and non-canonical aPairs) is used for broader sampling coverage.
The negation distributions of aPairs from TtSet and the 2021 data are similar. This corresponds to our hypothesis of using TtSet as a representative set for overall antonyms.

Negation	TtSet (Candidates)	2021 (Candidates)
True Negative	1.76%	1.69%
Broadly Negative	7.75%	5.45%
Not Negative	90.50%	92.86%
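
One simple way to quantify how similar the two negation distributions are is the largest gap, in percentage points, between matching categories. This measure is an assumption chosen for illustration and is not part of the original analysis. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Illustrative only: names and the similarity measure are assumptions. */
final class NegationDiff {
    /** Largest absolute gap between two distributions over the same categories. */
    static double maxAbsDiff(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("distributions must have the same length");
        }
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Order: True Negative, Broadly Negative, Not Negative (from the table above).
        double[] ttSet = {1.76, 7.75, 90.50};
        double[] data2021 = {1.69, 5.45, 92.86};
        // Prints: 2.36
        System.out.println(String.format(Locale.ROOT, "%.2f", maxAbsDiff(ttSet, data2021)));
    }
}
```
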

Please see analysis documents for more details.
