The SPECIALIST Lexicon

Derive Criteria of aPairs from TtSet

I. Introduction

The antonyms collected in the training and test set (TtSet) are assumed to be representative of English antonyms in general and are used to identify generic properties of antonym pairs (aPairs). The aPairs in the TtSet are manually tagged for canonical status, domain, type, and negation. Computer programs are developed to:

  • retrieve properties of these aPairs, such as EUIs, POSs, CUIs, STIs, sources, etc.
  • compute statistics over these properties to identify generic criteria for antonyms. These criteria cover the EUI (Entry Unique Identifier), POS (part of speech), concepts (CUI, Concept Unique Identifier), semantic types (STI, Semantic Type Identifier), and synonyms. The identified criteria are then implemented in the antonym generation model to find antonym candidates from CC (collocates in MEDLINE).

II. Processes

A program is developed to calculate statistics over the properties listed in the previous section. This program is run on two data sets: 1) the 1000 aPairs from the TtSet; 2) the 514 canonical aPairs from the TtSet.

APairs from the TtSet whose source is not one of [LEX|SD|PD] are temporarily assigned the source [TT]. These aPairs are then checked against the MEDLINE n-gram set and retagged with the source [CC] or [SN]. There are two possibilities for aPairs with the source [SN]:

  • They co-occur in other corpora, but not in MEDLINE. For example:
    • seller|buyer: “seller market and buyer market” can be found in other corpora
    • compliment|insult: these are collocates in the iWeb corpus (https://www.collocates.info/iweb.asp)
  • They do not co-occur in any corpus. For example:
    • abominate|love might not co-occur in any corpus because abominate is such a rare word, so some of these pairs may simply not be relevant to the collocate model.
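
The retagging step can be sketched as follows. This is a minimal illustration, not the actual program: the function name `retag_source`, the representation of the n-gram set as a list of plain strings, the whitespace tokenization, and the rule "co-occurrence in one n-gram means [CC], otherwise [SN]" are all assumptions inferred from the description above.

```python
def retag_source(ant1, ant2, ngrams):
    """Return "CC" if both antonyms co-occur in any n-gram, else "SN".

    Hypothetical helper: assumes n-grams are plain strings and that
    co-occurrence means both words appear as tokens in the same n-gram.
    """
    a, b = ant1.lower(), ant2.lower()
    for ngram in ngrams:
        tokens = set(ngram.lower().split())
        if a in tokens and b in tokens:
            return "CC"
    return "SN"


# Toy n-grams for illustration only (not taken from MEDLINE).
sample_ngrams = ["acute and chronic pain", "seller market"]
print(retag_source("acute", "chronic", sample_ngrams))
print(retag_source("seller", "buyer", sample_ngrams))
```

A real n-gram set would be far too large for a linear scan per pair; an inverted index from token to n-gram would be the natural next step, but that is beyond this sketch.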

III. Analysis

The analyses below summarize observations from the results of this program.

  • Source analysis

    The table below shows the source distribution:

    Set                 LEX         SD         PD           CC            SN            Total
    TtSet (candidates)  10 (1.00%)  7 (0.70%)  79 (7.90%)   322 (32.20%)  582 (58.20%)  1000
    TtSet (canonical)   10 (1.95%)  3 (0.58%)  71 (13.81%)  170 (33.07%)  260 (50.58%)  514

    • among the 1000 most-used aPairs (candidates) collected from the TtSet, 90.40% are from CC (32.20%) and SN (58.20%)
    • among the 514 canonical aPairs (tagged) from the TtSet, 83.66% are from CC (33.07%) and SN (50.58%)

    CC accounts for about one third of both the antonym candidates (32.20%) and the canonical antonyms (33.07%). Model development for antonym generation from the sources LEX, SD, and PD is complete; tagging is complete for LEX and SD, while the antonym candidates from PD are still being tagged. Because CC and SN together account for most aPairs, an antonym generation model for these two sources is needed to provide comprehensive antonym coverage.
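
As a small check on the percentages above, the combined CC and SN shares can be recomputed from the counts in the source distribution table. The helper name `combined_share` is hypothetical; only the counts come from the table.

```python
# Counts copied from the source distribution table.
candidates = {"LEX": 10, "SD": 7, "PD": 79, "CC": 322, "SN": 582}
canonical = {"LEX": 10, "SD": 3, "PD": 71, "CC": 170, "SN": 260}


def combined_share(counts, sources):
    """Percentage of aPairs whose source is any of `sources`."""
    total = sum(counts.values())
    return 100.0 * sum(counts[s] for s in sources) / total


for name, counts in (("candidates", candidates), ("canonical", canonical)):
    # Total and the combined CC + SN share, rounded to two decimals.
    print(name, sum(counts.values()), round(combined_share(counts, ["CC", "SN"]), 2))
```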

  • EUI analysis

    Antonyms must be in the Lexicon. The following table shows the percentage of antonyms from the TtSet in the Lexicon.

    Set                 Total  None       Ant-1      Ant-2      Both
    TtSet (candidates)  1000   0 (0.00%)  1 (0.10%)  0 (0.00%)  999 (99.90%)
    TtSet (canonical)   514    0 (0.00%)  0 (0.00%)  0 (0.00%)  514 (100.00%)

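
The None / Ant-1 / Ant-2 / Both columns can be produced by a simple bucketing step, sketched below. The names `presence_bucket` and `tally`, the dictionary lookup, and the toy words and identifiers are all assumptions made for illustration; they are not taken from the Lexicon or from the actual program.

```python
from collections import Counter


def presence_bucket(has_first, has_second):
    """Map two booleans to the table columns None / Ant-1 / Ant-2 / Both."""
    if has_first and has_second:
        return "Both"
    if has_first:
        return "Ant-1"
    if has_second:
        return "Ant-2"
    return "None"


def tally(pairs, lookup):
    """Count aPairs per bucket; `lookup` maps a word to an identifier."""
    return Counter(
        presence_bucket(lookup.get(a) is not None, lookup.get(b) is not None)
        for a, b in pairs
    )


# Placeholder words and identifiers, for illustration only.
toy_euis = {"alpha": "EUI-1", "beta": "EUI-2", "gamma": "EUI-3"}
toy_pairs = [("alpha", "beta"), ("delta", "gamma"), ("alpha", "delta")]
counts = tally(toy_pairs, toy_euis)
print(counts["None"], counts["Ant-1"], counts["Ant-2"], counts["Both"])
```

The same bucketing applies to any per-word identifier, so a word-to-CUI lookup could be passed in place of the word-to-EUI lookup.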
  • POS analysis

    The table below shows the percentage of aPairs with the same POS.

    • Among the 1000 most-used aPairs (candidates) collected from the TtSet, 97.50% have the same POS.
    • Among those candidates in which both antonyms are in the Lexicon (have EUIs), 97.60% have the same POS.
    • Among the 514 canonical aPairs (tagged) from the TtSet, 100% have the same POS.

    Set                     Total  Different POS  Same POS
    TtSet                   1000   25 (2.50%)     975 (97.50%)
    TtSet (both have EUIs)  999    24 (2.40%)     975 (97.60%)
    TtSet (canonical)       514    0 (0.00%)      514 (100.00%)

  • CUI analysis

    Among the 1000 most-used aPairs and the canonical aPairs collected from the TtSet, both antonyms have CUIs in only about 51.95% to 55.18% of pairs. However, our research scope is limited to concepts in the UMLS Metathesaurus, so we require that antonyms have a valid CUI.

    Our aPairs are therefore a more strictly defined (smaller) set than antonyms in general use. This is appropriate because we are targeting precision when applying antonyms in NLP applications. In any case, antonyms without CUIs cannot be mapped to a concept for further NLP processing.

    Set                        Total  No CUI        Ant-1 has CUI  Ant-2 has CUI  Both have CUIs
    TtSet                      1000   138 (13.80%)  170 (17.00%)   147 (14.70%)   545 (54.50%)
    TtSet (with the same POS)  975    132 (13.54%)  163 (16.72%)   142 (14.56%)   538 (55.18%)
    TtSet (Canonical)          514    91 (17.70%)   90 (17.51%)    66 (12.84%)    267 (51.95%)

  • STI analysis

    The table below shows:

    • Among the 1000 most-used aPairs collected from the TtSet, about 32.00% share the same semantic types.
    • Among canonical aPairs (tagged) from the TtSet in which both antonyms have CUIs, 67.79% share the same semantic types.
    • Among canonical aPairs (tagged) from the TtSet in which both antonyms have CUIs and the source is CC or SN, 67.11% share the same semantic types.

    Applying the semantic type criterion to aPairs reduces the antonym candidates to a smaller, higher-precision set than antonyms in general use. This is appropriate for NLP applications that target higher precision (the tradeoff is lower recall). This analysis is investigated further for the CC source model.

    Set                                            Total  Not share STI  Share STI
    TtSet                                          1000   680 (68.00%)   320 (32.00%)
    TtSet (Canonical, both have CUIs)              267    86 (32.21%)    181 (67.79%)
    TtSet (Canonical, both have CUIs in [CC|SN])   228    75 (32.89%)    153 (67.11%)

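
The "share STI" test can be read as a non-empty intersection between the semantic types of the two antonyms. The sketch below assumes that reading; the function names and the placeholder type labels are hypothetical, and only the 181 of 267 figure comes from the table.

```python
def shares_sti(stis1, stis2):
    """True if the two antonyms have at least one semantic type in common."""
    return bool(set(stis1) & set(stis2))


def sti_table_row(pairs):
    """Return (total, not_shared, shared) for a list of (stis1, stis2) pairs."""
    shared = sum(1 for s1, s2 in pairs if shares_sti(s1, s2))
    total = len(pairs)
    return total, total - shared, shared


def pct(n, total):
    """Percentage rounded to two decimals, as printed in the tables."""
    return round(100.0 * n / total, 2) if total else 0.0


# Placeholder semantic type labels, for illustration only.
toy_pairs = [
    ({"STI-A"}, {"STI-A", "STI-B"}),  # one type in common
    ({"STI-A"}, {"STI-C"}),           # no type in common
    (set(), {"STI-B"}),               # first antonym has no semantic type
]
print(sti_table_row(toy_pairs))
print(pct(181, 267))  # canonical aPairs where both antonyms have CUIs
```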
  • Synonym analysis

    Antonyms cannot be synonyms. The table below is consistent with the theory that antonyms and synonyms are similar in domain but different in polarity: no aPair in the TtSet is also a synonym pair.

    Set                Total  Not Synonym     Synonym
    TtSet              1000   1000 (100.00%)  0 (0.00%)
    TtSet (Canonical)  514    514 (100.00%)   0 (0.00%)

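
Taken together, the EUI, POS, CUI, and synonym criteria from the analyses above could be combined into a single candidate filter. The sketch below is an assumption about how such a filter might look, not the actual antonym generation model: the `Term` record, its fields, and `meets_criteria` are hypothetical, and the semantic type criterion is left out because the text says it is still under investigation for the CC source.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Term:
    """Hypothetical record for one side of an aPair."""
    word: str
    eui: Optional[str] = None            # None if the word is not in the Lexicon
    pos: Optional[str] = None            # part of speech
    cuis: Set[str] = field(default_factory=set)
    synonyms: Set[str] = field(default_factory=set)


def meets_criteria(a, b):
    """Apply the EUI, POS, CUI, and synonym criteria to a candidate pair."""
    if a.eui is None or b.eui is None:   # both must be in the Lexicon
        return False
    if a.pos != b.pos:                   # both must have the same POS
        return False
    if not a.cuis or not b.cuis:         # both must have a valid CUI
        return False
    if b.word in a.synonyms or a.word in b.synonyms:  # must not be synonyms
        return False
    return True


# Placeholder values, for illustration only.
t1 = Term("alpha", eui="EUI-1", pos="adj", cuis={"CUI-1"})
t2 = Term("beta", eui="EUI-2", pos="adj", cuis={"CUI-2"})
t3 = Term("gamma", eui="EUI-3", pos="noun", cuis={"CUI-3"})
print(meets_criteria(t1, t2), meets_criteria(t1, t3))
```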
  • Domain analysis

    The tagging results for the TtSet show that canonical aPairs are distributed across 10 domains.

Please see analysis documents for more details.