GO Term analysis

1. Repository for annotation of GO ID against genes

download file (gene_association.goa_human.gz) from link:
http://geneontology.org/gene-associations/gene_association.goa_human.gz
http://geneontology.org/page/download-annotations

2. How to extract GO names/description from database from GO ID

Option 1 : use the file

http://purl.obolibrary.org/obo/go.obo

from http://geneontology.org/page/download-ontology

Option 2 : use R package

# load the GO library

library(GO.db)

write.table( goterms, sep="\t", file="goterms.txt")

3. How to do get all GO terms for a gene

Option 1: Use the annotation database [BEST option you can dowload updated file and parse it based on UNIPROT ACC No ]

http://geneontology.org/gene-associations/gene_association.goa_human.gz

Option 2 : R package Ontologyzer [It can be used to show annotation at child and parent level also , scrool down below to check command]

http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

Option 3 : Using R package biomaRt [Very slow]
library(biomaRt)
databases=listMarts();
myDB = useMart("unimart") % "ensembl"
myDataset=listDatasets(myDB) % "hsapiens_gene_ensembl"

# for uniprot
martUni = useMart(biomart = "unimart", dataset = "uniprot")
results = getBM(attributes = c("go_id","go_name"), filters = "accession", values= c("P49023"), mart = mart)

# for ensemble
# martEn = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# attr = listAttributes(mart)
# filters = listFilters(mart)
results = getBM(attributes = c("go_id","name_1006"), filters = "refseq_mrna",values = c("NM_030621"), mart = mart)

4. How to do GO term enrichment analysis

Option 1 : R package Ontologyzer [Best option]

http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

Works with any ID, I tested with UNIPROT ACCESSION no.

sample commands

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Bonferroni -d 0.05 -r 1000 -n -o outputfolder

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000 -n -o outputfolder

Option 2 : Using David Web Service Client:

http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html

Just change following 3 variables:

String inputIds= new String("paxi_human,git1_human");

String idType = new String("UNIPROT_ID");

String listName = new String("make_up");

int listType = 0;

double addListOutput =0;

//Set user defined categories

// String category_names = new String("BBID,BIOCARTA,COG_ONTOLOGY,INTERPRO,KEGG_PATHWAY,OMIM_DISEASE,PIR_SUPERFAMILY,SMART,SP_PIR_KEYWORDS,UP_SEQ_FEATURE");

String category_names = new String("GOTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT");

API for Parameters
http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html

Option 3 : R package GOFunction ( Only takes entrez ID)

http://bioconductor.org/packages/release/bioc/manuals/GOFunction/man/GOFunction.pdf

source("http://bioconductor.org/biocLite.R")
biocLite("GOFunction")
biocLite("org.Hs.eg.db")
biocLite("graph")
biocLite("Rgraphviz")
biocLite("SparseM")
library(GOFunction)

myDir="D:/research/GOAnalysis/";

setwd(myDir);
data(exampledata)

# Only takes entrez ID
myIntetest=interestGenes[1:20]
myRef=refGenes[1:200]

sigTermBP = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="BP", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermBP")

sigTermMF = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="MF", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermMF")

5. How to calculate p-values for GO terms related to a gene

1.Wilcoxon Rank Sum and Signed Rank Tests:

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html

2. Find the confidence interval using bootstrap

library(boot)
fsimple = function(curdata, i){
return(curdata[i])
}
mydata=rnorm(100,0,1)
bootsimple = boot(mydata, fsimple , R=500)
plot(bootsimple )
simpleCI=boot.ci( bootsimple , conf=0.90 , type = "all")
simpleCI$normal

3. Find the cdf and calculate threshold (1-pvalue):

# Samples must be sorted ( e.g. 2,2,2, 3,3, 4,4,4,55 )
mysample=c( 2,2,2, 3,3, 4,4,4,5,5 )
mydf = data.frame( mySamples=mysample )
myCDF = ( 1: length(mydf$mySamples ))/length( mydf$mySamples )

# select minimum of sample value which match threshod
myThr=0.95

idOfMatchSamples = which( myCDF>= myThr)
mySampleAtThr = mysample[idOfMatchSamples[1]]

Other options : Must need to check

gProfileR: http://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf
GoProfiles : http://bioconductor.org/packages/release/bioc/vignettes/goProfiles/inst/doc/goProfiles.pdf
TopGO: http://www.bioconductor.org/packages/release/bioc/html/topGO.html
GOstat:http://bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf

Which file to choose for annotation:

from http://geneontology.org/page/download-annotations

Option 1: gene_association.goa_ref_human (based on UniProtKB reference proteomes) [I think it is good for standard protein analysis].

These files contains all GO annotations and protein information for a species

1. subset of proteins in the UniProt KnowledgeBase (UniProtKB) and
2. provide one protein per gene.
3. The protein accessions included in these files are the protein sequences annotated in Swiss-Prot or the longest TrEMBL transcript if there is no Swiss-Prot record.
4. If a particularprotein accession is not annotated with GO, then it will not appear in this file.
Option 2: gene_association.goa_human (based on UniProtKB complete proteome)

These files contains all GO annotations and information for a species

1. subset of proteins in the UniProt KnowledgeBase (UniProtKB) and for entities other than proteins, e.g. macromolecular complexes (IntAct Complex Portal identifiers) or RNAs (RNAcentral identifiers).
2. These files may provide annotations to more than one protein per gene.
3. The protein accessions included in these files are all Swiss-Prot entries for that species plus any TrEMBL entries that have an Ensembl DR line. The TrEMBL entries are likely to overlap with the Swiss-Prot entries or their isoforms.
4. If a particular entity is not annotated with GO, then it will not appear in this file.

SLURM tutorial : Basic commands

Main website for learning SLRUM http://slurm.schedmd.com/tutorials.html Submit a job with name and outputfile name(This will overwrite the parameters in shell file header ) sbatch -J job1 -o job1.out --partition=batch myscript.sh Basic shell script for job #!/bin/sh # #SBATCH --job-name=testJob #SBATCH --time=01:00:00 #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --partition=dragon-default # # Display all variables set by slurm env | grep "^SLURM" | sort # cd /projects/dragon/FANTOM5/processed_data_feature ## All my commands for job will go here date;time; mkdir t1 How to submit a batch job sbatch myscript.sh How to check the list of jobs of a user squeue -u user1 squeue -u user1 -l # it will show in details How to check the whole history and status of a job scontrol show job=JOBID How to use one particular node in interactive mode. Useful when all...

My Blog

Search This Blog

GO Term analysis

1. Repository for annotation of GO ID against genes

2. How to extract GO names/description from database from GO ID

3. How to do get all GO terms for a gene

4. How to do GO term enrichment analysis

5. How to calculate p-values for GO terms related to a gene

Which file to choose for annotation:

Comments

Post a Comment

Popular posts from this blog

R tutorial

MATLAB cross validation

SLURM tutorial : Basic commands