The SPECIALIST Lexicon

TT Source Model - Training and Test Set of Antonym Collection

I. Introduction

A collection of antonym pairs (aPairs) was gathered from various internet sources to find the characteristics and patterns of antonyms. Some sources contain duplicate aPairs. For example, [absence|presence] and [presence|absence] are treated as the same aPair and counted as one unique aPair. In addition, antonyms in aPairs are lowercased and restricted to single words: multiword aPairs, such as [already|not yet] or [none of|a lot of], are removed from the collection. The source web sites, their URLs, and the number of unique aPairs in this training and test set are shown in Table 1.
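The normalization rules above (lowercase, single words only, order-insensitive duplicates) can be sketched as follows. This is a minimal illustration, not the code in CollectAntonyms.java: the class name APairUnifier, the method unify, and the "ant1|ant2" key format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the TtSet source code.
class APairUnifier {
    // Each input pair is a two-element array: {antonym, antonym}.
    // Returns the unique aPairs as "ant1|ant2" keys in alphabetical order.
    static List<String> unify(List<String[]> rawPairs) {
        Set<String> unique = new TreeSet<>();
        for (String[] pair : rawPairs) {
            String a = pair[0].trim().toLowerCase(Locale.ROOT);
            String b = pair[1].trim().toLowerCase(Locale.ROOT);
            // Drop multiword aPairs such as [already|not yet].
            if (a.contains(" ") || b.contains(" ")) {
                continue;
            }
            // Order the two antonyms so that [presence|absence] and
            // [absence|presence] produce the same key.
            String key = (a.compareTo(b) <= 0) ? a + "|" + b : b + "|" + a;
            unique.add(key);
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String[]> raw = Arrays.asList(
            new String[] {"absence", "presence"},
            new String[] {"Presence", "absence"},
            new String[] {"already", "not yet"});
        System.out.println(unify(raw)); // prints [absence|presence]
    }
}
```

The alphabetical key is used here only to detect duplicates. It does not decide which antonym is the negative one.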

ID   Source                   No. of unique aPairs
1    Sherwood School          449
2    Proof Reading Services   418
3    Enchanted Learning       324
4    7ESL                     339
5    English Grammar Here     321
6    Synonyms Antonyms        301
7    SLP Lesson Plans         251
8    ESL Forums               198
9    My English Tutors        170
10   Love To Know             167
11   Your Dictionary          159
12   Classic Thesaurus        100
13   Power Thesaurus          100
14   Smart Words              9

II. Design

A program is developed to:

  • collect aPairs from various antonym sources
  • unify aPairs from the above collections (to remove duplicates)
  • identify the source of antonyms
  • sort antonyms by source first, then in alphabetical order.

This antonym collection includes 1000+ unique aPairs.

Please see design documents for more details.

III. Implementation

The Java source code is located in the TtSet directory:

  • CollectAntonyms.java
  • GetAntCandFromTtSet.java
  • GetProperties.java
  • GetAntPropertyStats.java
  • GetPRFOnTtSet.java

Algorithm:
The source of each collected aPair is identified programmatically (AntObj.java) as follows:

  • LEX:
    • Lexical records with a POS in [adv|pron|aux|modals|prep|det|conj] and with the negative and broad negative tags are used as aPairs and tagged with the source LEX. For an aPair with the source LEX, the negative antonym is stored as ant2. For example, [with|without] is a LEX aPair and the negative antonym [without] is stored as ant2. If an aPair from TT is in the LEX aPair set, its source is automatically identified as LEX.
  • SD:

    The algorithm for identifying an SD (suffixD) aPair is described as follows:

    • if one and only one of the antonyms ends with suffix “-less”
    • the root of the antonym that ends with suffix “-less” is also the root of the other antonym.
    • set the antonym ending with suffix “-less” as ant2 (negative), such as [careful|careless]
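The three SD rules above can be sketched as follows. This is a hedged illustration and not the logic in AntObj.java: the class name SuffixDCheck and the method toSuffixD are hypothetical, and the "same root" test is approximated by checking that the other antonym starts with the root, which is an assumption that will accept some pairs a real morphological check would reject.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration; not the logic in AntObj.java.
class SuffixDCheck {
    static final String SUFFIX = "less";

    // Returns {ant1, ant2} with the "-less" antonym as ant2,
    // or null if the pair does not qualify as a suffixD aPair.
    static String[] toSuffixD(String a, String b) {
        boolean aLess = a.endsWith(SUFFIX);
        boolean bLess = b.endsWith(SUFFIX);
        // Rule 1: one and only one of the antonyms ends with "-less".
        if (aLess == bLess) {
            return null;
        }
        String negative = aLess ? a : b;
        String other = aLess ? b : a;
        String root = negative.substring(0, negative.length() - SUFFIX.length());
        // Rule 2 (assumed approximation): the other antonym shares the
        // root if it starts with it, e.g. root "care" and "careful".
        if (root.isEmpty() || !other.startsWith(root)) {
            return null;
        }
        // Rule 3: the "-less" antonym is stored as ant2 (negative).
        return new String[] {other, negative};
    }

    public static void main(String[] args) {
        // prints [careful, careless]
        System.out.println(Arrays.toString(toSuffixD("careless", "careful")));
    }
}
```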
  • PD:

    The algorithm for identifying a PD (prefixD) aPair is described as follows:

    • ant1 ≠ ant2
    • The prefix belongs to the set of: a-, an-, anti-, contra-, counter-, de-, dis-, dys-, il-, im-, in-, ir-, mis-, non-, un-, under-
    • ant1 is the root of ant2 or ant2 is the root of ant1.
    • Set the antonym with prefix to ant2 (negative), such as [possible|impossible]
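The PD rules above can be sketched as follows. This is a minimal illustration rather than the code in AntObj.java: the class name PrefixDCheck and the method toPrefixD are hypothetical, and "root" is interpreted here as an assumption that one antonym equals a listed prefix followed directly by the other antonym, with no hyphen.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration; not the logic in AntObj.java.
class PrefixDCheck {
    // Negative prefixes listed in the PD rules.
    static final String[] PREFIXES = {
        "a", "an", "anti", "contra", "counter", "de", "dis", "dys",
        "il", "im", "in", "ir", "mis", "non", "un", "under"
    };

    // Returns {ant1, ant2} with the prefixed antonym as ant2,
    // or null if the pair does not qualify as a prefixD aPair.
    static String[] toPrefixD(String a, String b) {
        // Rule 1: ant1 and ant2 must differ.
        if (a.equals(b)) {
            return null;
        }
        for (String prefix : PREFIXES) {
            // Rules 2-4: one antonym is a listed prefix plus the other;
            // the prefixed antonym is stored as ant2 (negative).
            if (b.equals(prefix + a)) {
                return new String[] {a, b};
            }
            if (a.equals(prefix + b)) {
                return new String[] {b, a};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints [possible, impossible]
        System.out.println(Arrays.toString(toPrefixD("impossible", "possible")));
    }
}
```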
  • CC:

    Co-occurrences in a corpus. These are aPairs retrieved by co-occurrence patterns from a corpus. Our first attempt is to use the terms co-occurring in MEDLINE.

    • Set ant1 and ant2 in alphabetical order, such as [accept|refuse]. In other words, ant2 is not necessarily the negative antonym.
  • SN:

    Semantic opposite in corpora. These are aPairs retrieved from a semantic network. If an aPair does not belong to the above sources, it is assigned as SN (semantic network). Patterns are yet to be developed.

    • Set ant1 and ant2 in alphabetical order, such as [admit|deny]. In other words, ant2 is not necessarily the negative antonym.
    • Sort all aPairs from the semantic network source by ant1|ant2.
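For CC and SN aPairs, the alphabetical ordering and sorting steps can be sketched as follows. This is an illustrative sketch only: the class name AlphaOrder and the methods orient and sortPairs are hypothetical, and plain String comparison is assumed to be an adequate stand-in for "alphabetical order" on lowercased single words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch for illustration; not part of the TtSet source code.
class AlphaOrder {
    // Returns {ant1, ant2} in alphabetical order, so ant2 is not
    // necessarily the negative antonym.
    static String[] orient(String a, String b) {
        return (a.compareTo(b) <= 0) ? new String[] {a, b} : new String[] {b, a};
    }

    // Returns a copy of the pairs sorted by ant1, then by ant2.
    static List<String[]> sortPairs(List<String[]> pairs) {
        List<String[]> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparing((String[] p) -> p[0])
                              .thenComparing(p -> p[1]));
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(orient("deny", "admit"));
        pairs.add(orient("refuse", "accept"));
        for (String[] p : sortPairs(pairs)) {
            System.out.println(p[0] + "|" + p[1]);
        }
        // prints accept|refuse, then admit|deny
    }
}
```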