The SPECIALIST Lexicon

Analysis on Antonyms based on TtSet and 2021 Data

I. Introduction

A program was developed to compute statistics for the tagged antonym candidates in TtSet. The program is generic, so it is also applied to all tagged antonym candidates to generate the same statistics. In this analysis, the latest antonym generation data (2021) are used as a subset that represents the overall antonyms.

II. Implementation

The implementation resides in the TtSet directory:

  • GetStatsFromTagCand.java

III. Results

The results and analysis of the three files (TtSet, 2021 all aPair candidates and 2021 canonical aPairs) are described below:

  • Domains

    TtSet and the 2021 data share the same 10 domains. This is consistent with our hypothesis that TtSet can be used as a representative set for overall antonyms.

  • Canonical aPairs

    aPairs from TtSet and the 2021 data are used to compare the canonical rate among antonym candidates. First, the 1000 antonyms are expanded to 1252 antonym candidates by adding their spelling variants. Only 45.77% of these antonym candidates are tagged as canonical antonyms, as shown in the table below. The canonical rate for the 2021 antonym candidates (55.71%) is higher than that of the most used antonyms (TtSet). This suggests that our antonym generation model is effective at generating antonym candidates. The canonical rate of the 2021 antonym candidates will be more accurate and more meaningful once PD is completely tagged.

    Set                 aPair Candidates    Canonical        Not Canonical
    TtSet Candidates    1252                574 (45.77%)     679 (54.23%)
    2021 Candidates     3558                1982 (55.71%)    1576 (44.29%)
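The canonical rate in the table above is the number of canonical aPairs divided by the number of candidates. As a minimal sketch, the percentages in the 2021 row can be reproduced as follows. The class and method names are hypothetical and are not taken from GetStatsFromTagCand.java; only the counts come from the table.

```java
import java.util.Locale;

class CanonicalRate {
    // Share of `count` within `total`, as a percentage rounded to two decimals.
    static double ratePercent(int count, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return Math.round(10000.0 * count / total) / 100.0;
    }

    public static void main(String[] args) {
        // Counts from the 2021 row of the table.
        int candidates = 3558;
        int canonical = 1982;
        int notCanonical = candidates - canonical;

        System.out.println(String.format(Locale.ROOT,
                "canonical: %.2f%%, not canonical: %.2f%%",
                ratePercent(canonical, candidates),
                ratePercent(notCanonical, candidates)));
    }
}
```

The same helper can be applied to any row once its counts are known.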
  • Estimated overall canonical aPairs by source of SD

    Currently, we have completely tagged and generated aPairs from the suffix derivation (SD) source for the 2021 data. The number of canonical aPairs from SD is 132, and the percentage of SD is 0.58% (from Table 2). Thus, we estimate the total number of canonical aPairs to be 22,758 (= 132/0.0058). Please note that we do not use the tagged canonical aPairs from LEX for the estimation because that number is rather static and does not grow as the corpus (the Lexicon) grows. We will also estimate the total canonical aPairs from PD once PD is completely tagged, to confirm this estimate.
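The estimate above is a simple extrapolation: the count from one source is divided by that source's share of the total. A minimal sketch of the arithmetic is shown below; the class and method names are hypothetical, and only the values 132 and 0.58% come from the text. The quotient is about 22,758.6, which the text reports as 22,758.

```java
import java.util.Locale;

class CanonicalEstimate {
    // Extrapolates a total from one source's count and that source's
    // share of the total, given as a percentage.
    static double estimateTotal(int sourceCount, double sourceSharePercent) {
        if (sourceSharePercent <= 0.0) {
            throw new IllegalArgumentException("share must be positive");
        }
        return sourceCount / (sourceSharePercent / 100.0);
    }

    public static void main(String[] args) {
        // 132 canonical aPairs from SD; SD percentage of 0.58%.
        double total = estimateTotal(132, 0.58);
        System.out.println(String.format(Locale.ROOT,
                "estimated total canonical aPairs: %.1f", total));
    }
}
```

Because the divisor is small, the estimate is sensitive to the share: a change of 0.01 percentage points in the SD percentage moves the total by several hundred aPairs.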

  • POS distribution

    The table below shows the POS distribution for canonical aPairs from TtSet and the 2021 data. The 2021 data are incomplete (PD is not completely tagged), so their distribution is not fully representative. However, the top four POS (Adj, Noun, Verb and Adv) for canonical aPairs are the same in both sets. Please note that canonical aPairs from the remaining POS (Modal, Pron, Aux, Prep, Det and Conj) are rather static and were retrieved from the Lexicon (because they are associated with negation tags in the Lexicon).

    The top four POS of canonical aPairs are the same, and in the same order, for TtSet and the 2021 data. This is consistent with our hypothesis that TtSet can be used as a representative set for overall antonyms.

    POS      TtSet (Canonical)    2021 (Canonical)
    Adj      42.06%               66.75%
    Noun     26.35%               15.89%
    Verb     22.16%               12.82%
    Adv      5.76%                2.06%
    Modal    0.35%                0.66%
    Pron     1.22%                0.56%
    Aux      0.00%                0.51%
    Prep     1.39%                0.40%
    Det      0.35%                0.20%
    Conj     0.35%                0.15%
    Compl    N/A (0.00%)          N/A (0.00%)
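The claim that both sets share the same top four POS can be checked by ranking each column of the table. The sketch below is one way to do that; the class and method names are hypothetical, and the percentages are copied from the table (Compl is omitted because it is N/A).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PosRanking {
    // Returns the n POS labels with the largest share, highest first.
    static List<String> topN(Map<String, Double> shares, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(shares.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Builds a POS-to-percentage map from parallel arrays.
    static Map<String, Double> column(String[] pos, double[] percent) {
        Map<String, Double> shares = new LinkedHashMap<>();
        for (int i = 0; i < pos.length; i++) {
            shares.put(pos[i], percent[i]);
        }
        return shares;
    }

    static final String[] POS =
            {"Adj", "Noun", "Verb", "Adv", "Modal", "Pron", "Aux", "Prep", "Det", "Conj"};

    static Map<String, Double> ttSet() {
        return column(POS, new double[]
                {42.06, 26.35, 22.16, 5.76, 0.35, 1.22, 0.00, 1.39, 0.35, 0.35});
    }

    static Map<String, Double> data2021() {
        return column(POS, new double[]
                {66.75, 15.89, 12.82, 2.06, 0.66, 0.56, 0.51, 0.40, 0.20, 0.15});
    }

    public static void main(String[] args) {
        List<String> a = topN(ttSet(), 4);
        List<String> b = topN(data2021(), 4);
        System.out.println("TtSet top four: " + a);
        System.out.println("2021 top four:  " + b);
        System.out.println("same ranking:   " + a.equals(b));
    }
}
```

Note that this compares rankings only. The percentages themselves differ noticeably between the two sets (for example, Adj is 42.06% in TtSet and 66.75% in the 2021 data).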
  • Negation Distribution

    The table below shows the negation distribution for antonym candidates from TtSet and the 2021 data. Both sets have similar negative and not-negative rates. Please note that negation is independent of the canonical property of an aPair. Accordingly, the negation distribution of all candidates (both canonical and non-canonical aPairs) is used for broader sampling coverage.

    The negation distributions of aPairs in TtSet and the 2021 data are similar. This is consistent with our hypothesis that TtSet can be used as a representative set for overall antonyms.

    Negation            TtSet (Candidates)    2021 (Candidates)
    True Negative       1.76%                 1.69%
    Broadly Negative    7.75%                 5.45%
    Not Negative        90.50%                92.86%
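One simple way to put a number on "similar" is the largest absolute difference between the two columns, in percentage points. The sketch below assumes this measure for illustration only; the source does not state which measure, if any, was used, and the class and method names are hypothetical.

```java
import java.util.Locale;

class NegationGap {
    // Largest absolute difference, in percentage points, between two
    // distributions listed in the same category order.
    static double maxGap(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("distributions must have the same length");
        }
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Order: True Negative, Broadly Negative, Not Negative.
        double[] ttSet = {1.76, 7.75, 90.50};
        double[] data2021 = {1.69, 5.45, 92.86};
        System.out.println(String.format(Locale.ROOT,
                "largest gap: %.2f percentage points", maxGap(ttSet, data2021)));
    }
}
```

With the values in the table, the largest gap is in the Not Negative row (92.86% versus 90.50%).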

Please see analysis documents for more details.