SPECIALIST Lexicon

TT Source Model - Training and Test Set of Antonym Collection

I. Introduction

A collection of antonym pairs (aPairs) was gathered from various sources on the internet to find the characteristics and patterns of antonyms. Some sources contain duplicate aPairs. For example, aPairs [absence|presence] and [presence|absence] are considered the same aPair and counted as 1 unique aPair. In addition, antonyms in aPairs are lowercased and restricted to single words. Multiword aPairs, such as [already|not yet] or [none of|a lot of], are removed from the collection. The source web sites, their URLs, and the number of unique aPairs in this training and test set are shown in Table 1.

Table 1. Sources of the antonym training and test set

ID	Source	No. of unique aPairs
1	Sherwood School	449
2	Proof Reading Services	418
3	Enchanted Learning	324
4	7ESL	339
5	English Grammar Here	321
6	Synonyms Antonyms	301
7	SLP Lesson Plans	251
8	ESL Forums	198
9	My English Tutors	170
10	Love To Know	167
11	Your Dictionary	159
12	Classic Thesaurus	100
13	Power Thesaurus	100
14	Smart Words	9

II. Design

A program was developed to:

collect aPairs from various antonym sources
unify the collected aPairs to remove duplicates
identify the source of each aPair
sort aPairs by source first, then in alphabetical order.
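
The unification step can be illustrated with a minimal sketch. It assumes that an aPair is identified by its two lowercased words regardless of order, and that any entry containing whitespace counts as multiword. The class `APairUnifier` and its methods are hypothetical names for illustration and are not part of the TtSet sources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the TtSet sources.
final class APairUnifier {

    // Returns a canonical key "first|second" with both words lowercased and
    // placed in alphabetical order, or null if either antonym is empty or
    // contains whitespace (assumed here to mean "multiword").
    static String canonicalKey(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        if (x.isEmpty() || y.isEmpty()
                || x.matches(".*\\s.*") || y.matches(".*\\s.*")) {
            return null; // multiword or empty entries are dropped
        }
        return x.compareTo(y) <= 0 ? x + "|" + y : y + "|" + x;
    }

    // Collapses raw pairs into an alphabetically sorted list of unique keys,
    // so [absence|presence] and [presence|absence] are counted once.
    static List<String> unify(List<String[]> rawPairs) {
        Set<String> unique = new TreeSet<>();
        for (String[] pair : rawPairs) {
            String key = canonicalKey(pair[0], pair[1]);
            if (key != null) {
                unique.add(key);
            }
        }
        return new ArrayList<>(unique);
    }
}
```

With this sketch, `APairUnifier.unify` applied to [absence|presence], [presence|absence], and [none of|a lot of] yields the single key `absence|presence`. The alphabetical ordering here is only a deduplication device; the final ant1/ant2 assignment depends on the source rules described under Implementation.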

This antonym collection includes 1000+ unique aPairs.

Please see design documents for more details.

III. Implementation

The Java source code is in the TtSet directory:

CollectAntonyms.java
GetAntCandFromTtSet.java
GetProperties.java
GetAntPropertyStats.java
GetPRFOnTtSet.java

Algorithm:
The source of each collected aPair is identified programmatically (AntObj.java) as follows:

LEX:
- Lexical records with POSs of [adv|pron|aux|modals|prep|det|conj] and with the negative and broad negative tags are used as aPairs and tagged with the source LEX. For an aPair with the source LEX, the negative antonym is stored as ant2. For example, [with|without] is a LEX aPair and the negative antonym [without] is stored as ant2. If an aPair from TT is in the LEX aPair set, its source is automatically identified as LEX.
SD:
The algorithm for identifying an SD (suffixD) aPair is as follows:
- One and only one of the antonyms ends with the suffix “-less”.
- The root of the antonym that ends with the suffix “-less” is also the root of the other antonym.
- Set the antonym ending with the suffix “-less” as ant2 (negative), such as [careful|careless].
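
The SD rules above can be sketched as follows. The text does not define how the root is computed, so this sketch assumes the root is what remains after stripping "less" and that the other antonym shares the root if it starts with it (for example, "care" in [careful|careless]). `SuffixDDetector` is a hypothetical name, not the logic in AntObj.java.

```java
// Hypothetical sketch of the SD (suffixD) check; not the AntObj.java code.
final class SuffixDDetector {
    private static final String SUFFIX = "less";

    // Returns {ant1, ant2} with the "-less" word as ant2 (negative),
    // or null if the pair does not satisfy the SD rules.
    static String[] classify(String a, String b) {
        boolean aLess = a.endsWith(SUFFIX);
        boolean bLess = b.endsWith(SUFFIX);
        if (aLess == bLess) {
            return null; // exactly one antonym must end with "-less"
        }
        String negative = aLess ? a : b;
        String other = aLess ? b : a;
        // Assumption: root = the "-less" word with the suffix stripped.
        String root = negative.substring(0, negative.length() - SUFFIX.length());
        // Assumption: "same root" is approximated by a shared leading string.
        if (root.isEmpty() || !other.startsWith(root)) {
            return null;
        }
        return new String[] {other, negative};
    }
}
```

Under these assumptions, `SuffixDDetector.classify("careless", "careful")` returns [careful|careless], while [more|less] is rejected because stripping the suffix leaves an empty root. A prefix match is a rough stand-in for a real morphological root check and may need refinement.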
PD:
The algorithm for identifying a PD (prefixD) aPair is described as follows:
- ant1 ≠ ant2
- The prefix belongs to the set of: a-, an-, anti-, contra-, counter-, de-, dis-, dys-, il-, im-, in-, ir-, mis-, non-, un-, under-
- ant1 is the root of ant2 or ant2 is the root of ant1.
- Set the antonym with prefix to ant2 (negative), such as [possible|impossible]
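
The PD rules above can be sketched in the same way. This sketch assumes "ant1 is the root of ant2" means that one antonym equals a listed prefix concatenated directly with the other antonym; hyphenated forms are not handled. `PrefixDDetector` is a hypothetical name, not the logic in AntObj.java.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the PD (prefixD) check; not the AntObj.java code.
final class PrefixDDetector {
    // Prefix set as listed in the text, without the trailing hyphens.
    static final List<String> PREFIXES = Arrays.asList(
        "a", "an", "anti", "contra", "counter", "de", "dis", "dys",
        "il", "im", "in", "ir", "mis", "non", "un", "under");

    // True if word == prefix + root for some listed prefix.
    static boolean isPrefixed(String word, String root) {
        if (root.isEmpty()) {
            return false;
        }
        for (String prefix : PREFIXES) {
            if (word.equals(prefix + root)) {
                return true;
            }
        }
        return false;
    }

    // Returns {ant1, ant2} with the prefixed word as ant2 (negative),
    // or null if the pair does not satisfy the PD rules.
    static String[] classify(String a, String b) {
        if (a.equals(b)) {
            return null; // ant1 must differ from ant2
        }
        if (isPrefixed(b, a)) {
            return new String[] {a, b};
        }
        if (isPrefixed(a, b)) {
            return new String[] {b, a};
        }
        return null;
    }
}
```

For example, `PrefixDDetector.classify("impossible", "possible")` returns [possible|impossible], and [accept|refuse] is rejected because neither word is a prefixed form of the other.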
CC:
Co-occurrences in a corpus. These are aPairs retrieved by co-occurrence patterns from a corpus. Our first attempt uses terms co-occurring in MEDLINE.
- Set ant1 and ant2 in alphabetical order, such as [accept|refuse]. In other words, ant2 is not necessarily the negative antonym.
SN:
Semantic opposites. These are aPairs retrieved from a semantic network. If an aPair does not belong to any of the above sources, it is assigned SN (semantic network). Patterns are yet to be developed.
- Set ant1 and ant2 in alphabetical order, such as [admit|deny]. In other words, ant2 is not necessarily the negative antonym.
- Sort all aPairs from the semantic network source by ant1|ant2.
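
The alphabetical ant1/ant2 assignment for CC and SN pairs, together with the sort by source and then alphabetical order from the design, can be sketched as below. The enum order (LEX, SD, PD, CC, SN) simply follows the order in which the sources are listed above and is an assumption, as are the `APairSorter` and `APair` names and the `ant1|ant2|source` output form.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; class names, enum order, and output form are assumptions.
final class APairSorter {
    // Assumed sort order of sources: the order used in the text above.
    enum Source { LEX, SD, PD, CC, SN }

    static final class APair {
        final String ant1;
        final String ant2;
        final Source source;

        APair(String ant1, String ant2, Source source) {
            this.ant1 = ant1;
            this.ant2 = ant2;
            this.source = source;
        }

        @Override
        public String toString() {
            return ant1 + "|" + ant2 + "|" + source;
        }
    }

    // For CC and SN pairs: ant1 and ant2 are assigned alphabetically,
    // so ant2 is not necessarily the negative antonym.
    static APair alphabetical(String a, String b, Source source) {
        return a.compareTo(b) <= 0
            ? new APair(a, b, source)
            : new APair(b, a, source);
    }

    // Returns a copy sorted by source first, then by ant1, then by ant2.
    static List<APair> sorted(List<APair> pairs) {
        List<APair> out = new ArrayList<>(pairs);
        out.sort(Comparator.comparing((APair p) -> p.source)
            .thenComparing(p -> p.ant1)
            .thenComparing(p -> p.ant2));
        return out;
    }
}
```

In this sketch, `APairSorter.alphabetical("deny", "admit", APairSorter.Source.SN)` produces [admit|deny], and sorting a mixed list places PD pairs before CC pairs and CC pairs before SN pairs.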
