Skip to main content

GO Term analysis



1. Repository for annotation of  GO ID against genes


download file (gene_association.goa_human.gz) from link:
http://geneontology.org/gene-associations/gene_association.goa_human.gz
http://geneontology.org/page/download-annotations


2. How to extract GO names/description from database from GO ID


Option 1 : use the file

 http://purl.obolibrary.org/obo/go.obo

from   http://geneontology.org/page/download-ontology

Option 2 :  use R package

# load the GO library 
library(GO.db)  
write.table( goterms, sep="\t", file="goterms.txt")


3. How to do get all GO terms for a gene

Option 1: Use the annotation database [BEST option you can dowload updated file and parse it based on UNIPROT ACC No ]

http://geneontology.org/gene-associations/gene_association.goa_human.gz


Option 2 : R package Ontologyzer [It can be used to show annotation at child and parent level also , scrool down below to check command]

http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html


Option 3 : Using R package biomaRt [Very slow]
library(biomaRt)
databases=listMarts();
myDB = useMart("unimart") % "ensembl"
myDataset=listDatasets(myDB) % "hsapiens_gene_ensembl"

# for uniprot
martUni = useMart(biomart = "unimart", dataset = "uniprot")
results = getBM(attributes = c("go_id","go_name"), filters = "accession", values= c("P49023"), mart = mart)

# for ensemble
# martEn = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# attr = listAttributes(mart)
# filters = listFilters(mart)
results = getBM(attributes = c("go_id","name_1006"), filters = "refseq_mrna",values = c("NM_030621"), mart = mart)


 4. How to do GO term enrichment analysis


Option 1 : R package Ontologyzer [Best option]
http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

Works with any ID, I tested with UNIPROT ACCESSION no.

sample commands

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Bonferroni -d 0.05 -r 1000 -n -o outputfolder

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000 -n -o outputfolder

Option 2 : Using David Web Service Client:

  http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html

Just change following 3 variables:

            
String inputIds= new String("paxi_human,git1_human");
                String idType = new String("UNIPROT_ID");
                String listName = new String("make_up");
                int listType = 0;
                double addListOutput =0;

                //Set user defined categories
                // String category_names = new String("BBID,BIOCARTA,COG_ONTOLOGY,INTERPRO,KEGG_PATHWAY,OMIM_DISEASE,PIR_SUPERFAMILY,SMART,SP_PIR_KEYWORDS,UP_SEQ_FEATURE");
                String category_names = new String("GOTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT");
               

API for Parameters

http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html



Option 3 : R package GOFunction ( Only takes entrez ID)


http://bioconductor.org/packages/release/bioc/manuals/GOFunction/man/GOFunction.pdf

source("http://bioconductor.org/biocLite.R")
biocLite("GOFunction")
biocLite("org.Hs.eg.db")
biocLite("graph")
biocLite("Rgraphviz")
biocLite("SparseM")
library(GOFunction)

myDir="D:/research/GOAnalysis/"; 
                                
setwd(myDir);
data(exampledata)

# Only takes entrez ID
myIntetest=interestGenes[1:20]
myRef=refGenes[1:200]
sigTermBP = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="BP", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermBP") 


sigTermMF = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="MF", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermMF")


5. How to calculate p-values for GO terms related to a gene

1.Wilcoxon Rank Sum and Signed Rank Tests: 

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html


2. Find the confidence interval using bootstrap

library(boot)
fsimple  =  function(curdata, i){
    return(curdata[i])
}
mydata=rnorm(100,0,1)
bootsimple = boot(mydata, fsimple , R=500)
plot(bootsimple )
simpleCI=boot.ci(  bootsimple , conf=0.90 , type = "all")
simpleCI$normal


3. Find the cdf and calculate threshold (1-pvalue):

# Samples must be sorted ( e.g. 2,2,2, 3,3, 4,4,4,55 )
mysample=c( 2,2,2, 3,3, 4,4,4,5,5 )
mydf  = data.frame( mySamples=mysample    )
myCDF =  ( 1: length(mydf$mySamples   ))/length( mydf$mySamples )

# select minimum of sample value which match threshod
myThr=0.95
idOfMatchSamples = which( myCDF>= myThr)
mySampleAtThr = mysample[idOfMatchSamples[1]]






Other options : Must need to check





Which file to choose for annotation:


from http://geneontology.org/page/download-annotations


Option 1: gene_association.goa_ref_human  (based on UniProtKB reference proteomes) [I think it is good for standard protein analysis].

These files contains all GO annotations and protein information for a species

1. subset of proteins in the UniProt KnowledgeBase (UniProtKB) and
2. provide one protein per gene.
3. The protein accessions included in these files are the protein sequences annotated in Swiss-Prot or the longest TrEMBL transcript if there is no Swiss-Prot record.
4. If a particularprotein accession is not annotated with GO, then it will not appear in this file.
Option 2: gene_association.goa_human   (based on UniProtKB complete proteome)

These files contains all GO annotations and information for a species

1. subset of proteins in the UniProt KnowledgeBase (UniProtKB) and for entities other than proteins, e.g. macromolecular complexes (IntAct Complex Portal identifiers) or RNAs (RNAcentral identifiers).
2. These files may provide annotations to more than one protein per gene.
3. The protein accessions included in these files are all Swiss-Prot entries for that species plus any TrEMBL entries that have an Ensembl DR line. The TrEMBL entries are likely to overlap with the Swiss-Prot entries or their isoforms.
4. If a particular entity is not annotated with GO, then it will not appear in this file.








Comments

Popular posts from this blog

MATLAB cross validation

// use built-in function samplesize = size( matrix , 1); c = cvpartition(samplesize,  'kfold' , k); % return the indexes on each fold ///// output in matlab console K-fold cross validation partition              N: 10    NumTestSets: 4      TrainSize: 8  7  7  8       TestSize: 2  3  3  2 ////////////////////// for i=1 : k    trainIdxs = find(training(c,i) ); %training(c,i);  // 1 means in train , 0 means in test    testInxs  = find(test(c,i)       ); % test(c,i);       // 1 means in test , 0 means in train    trainMatrix = matrix (  matrix(trainIdxs ), : );    testMatrix  = matrix (  matrix(testIdxs  ), : ); end //// now calculate performance %%  calculate performance of a partition     selectedKfoldSen=[];selectedKfoldSpe=[];selectedKfoldAcc=[];     indexSen=1;indexSpe=1;indexAcc=1;     if ( kfold == (P+N) )% leave one out         sensitivity = sum(cvtp) /( sum(cvtp) + sum(cvfn) )         specificity = sum(cvtn) /( sum(cvfp) + sum(cvtn) )         acc

R tutorial

Install R in linux ============ In CRAN home page, the latest version is not available. So, in fedora, Open the terminal yum list R  --> To check the latest available version of r yum install R --> install R version yum update R --> update current version to latest one 0 find help ============ ?exact topic name (  i.e.   ?mean ) 0.0 INSTALL 3rd party package  ==================== install.packages('mvtnorm' , dependencies = TRUE , lib='/home/alamt/myRlibrary/')   #  install new package BED file parsing (Always use read.delim it is the best) library(MASS) #library(ggplot2) dirRoot="D:/research/F5shortRNA/TestRIKEN/Rscripts/" dirData="D:/research/F5shortRNA/TestRIKEN/" setwd(dirRoot) getwd() myBed="test.bed" fnmBed=paste(dirData, myBed, sep="") # ccdsHh19.bed   tmp.bed ## Read bed use read.delim - it is the  best mybed=read.delim(fnmBed, header = FALSE, sep = "\t", quote = &q