From Functional Similarity Among Gene Products to Dependence Relations Among Gene Ontology Terms
The Gene Ontology (GO) is a controlled vocabulary widely used for the annotation of gene products. GO is organized in three hierarchies for molecular functions, cellular components, and biological processes, but no relations are provided among terms across hierarchies. More generally, dependence relations both within and across the three hierarchies are not recorded in GO. Methods based on lexical similarity have been used to identify such relations. However, lexically similar words can have different meanings, and the same meaning can be expressed by lexically dissimilar words. These limitations, which are inherent to lexical methods, led us to explore different approaches. In the annotation databases, each gene product is described by a vector of GO terms. A vector space model (VSM) can thus be used to compute similarity among gene products, based on their annotations. Analogously, similarity can be computed among GO terms, based on the gene products with which these terms are associated. Figure 1 summarizes our approach. The vectors of gene products for each GO term are obtained by transposing the original matrix of gene products by GO terms. As is usual with vector space models (e.g., in information retrieval applications), the similarity between two GO terms is computed as the dot product of the corresponding vectors of genes, after normalization of these vectors. For normalized vectors, this dot product varies between 0 (no similarity) and 1 (perfect similarity). We applied this method to five annotation databases (FlyBase, the Human subset of GOA, MGI, SGD, and WormBase). Term-term similarity was computed pairwise for all GO terms present, resulting in a half-matrix for each model organism database. Relations with a similarity lower than 0.5 were ignored. As shown in Table 1, when restricted to relations across hierarchies, this method identified a total of 4,316 relations among GO terms in at least one annotation database.
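The computation described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary annotation vectors (a gene product either is or is not annotated with a term), in which case the dot product of two normalized vectors reduces to the number of shared gene products divided by the square root of the product of the two annotation counts. The gene identifiers, the toy annotations, and the function names `transpose`, `cosine`, and `related_terms` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy annotations: gene product -> set of GO terms.
ANNOTATIONS = {
    "gene_a": {"potassium channel activity", "potassium ion transport"},
    "gene_b": {"potassium channel activity", "potassium ion transport"},
    "gene_c": {"potassium ion transport"},
    "gene_d": {"hemoglobin complex", "oxygen transport"},
    "gene_e": {"oxygen transport", "potassium ion transport"},
}

# Hierarchy of each term: MF, BP, or CC.
HIERARCHY = {
    "potassium channel activity": "MF",
    "potassium ion transport": "BP",
    "hemoglobin complex": "CC",
    "oxygen transport": "BP",
}


def transpose(annotations):
    """Turn gene -> terms into term -> genes (the transposed matrix)."""
    term_to_genes = {}
    for gene, terms in annotations.items():
        for term in terms:
            term_to_genes.setdefault(term, set()).add(gene)
    return term_to_genes


def cosine(genes_a, genes_b):
    """Dot product of two normalized binary vectors, given as gene sets."""
    if not genes_a or not genes_b:
        return 0.0
    return len(genes_a & genes_b) / sqrt(len(genes_a) * len(genes_b))


def related_terms(annotations, hierarchy, threshold=0.5):
    """Return cross-hierarchy term pairs whose similarity is >= threshold."""
    term_to_genes = transpose(annotations)
    relations = []
    # combinations() visits each unordered pair once (the half-matrix).
    for term_a, term_b in combinations(sorted(term_to_genes), 2):
        if hierarchy[term_a] == hierarchy[term_b]:
            continue  # keep only relations across hierarchies
        similarity = cosine(term_to_genes[term_a], term_to_genes[term_b])
        if similarity >= threshold:
            relations.append((term_a, term_b, similarity))
    return relations


relations = related_terms(ANNOTATIONS, HIERARCHY)
for term_a, term_b, similarity in relations:
    print(f"{term_a} / {term_b}: {similarity:.3f}")
```

On this toy input, two cross-hierarchy pairs pass the 0.5 threshold. Real annotation databases would call for a sparse matrix representation, and the weighting scheme actually used may differ from the binary one assumed here.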
Examples of pairs of related terms include potassium channel activity (MF) / potassium ion transport (BP) and hemoglobin complex (CC) / oxygen transport (BP). Both lexical similarity and VSM similarity identified large numbers of dependence relations. However, only a few percent of these relations are common to both methods. Further evaluation (manual or against other methods) is needed to assess the validity of the relations identified. For further information on this research, the reader is referred to.