Tuesday, April 14, 2015

MicroRNA database : Mirbase naming convension ; Target Database


MirTarBase V 4.5


For human, It has  targets for 596 mature microRNA from mirBase.

Mirbase 20

Number of precursor micorRNA:  1871 ID , 1870 Name (mir-511 has two id)
Number of mature micorRNA:   2794 ID , 2576 Name.

ID 


Each precursor and mature microRNA has unique ID. If you work on ID that would be fine.

 Name  -- > ID

Precursor microRNA
For each primary ID  there is 1 primary name, except one case hsa-mir-511. This name is common for two primary transcript

chr10    .    miRNA_primary_transcript    17887107    17887193    .    +    .    ID=MI0003127;Alias=MI0003127;Name=hsa-mir-511

chr10    .    miRNA_primary_transcript    18134036    18134122    .    +    .    ID=MI0003127_2;Alias=MI0003127;Name=hsa-mir-511

mature microRNA
For each mature name there can be N mature RNA ID .


Target of matureRNA -> target of Precursor microRNA


Option 1
1. map matrueRNA name  : to matureRNA Name set (here you need to check mir-511 manually)


Option 2
1. map matrueRNA name  : to matureRNA ID set
2. map mature RNA ID set  : to precursor RNA ID set
3. include the targets for all member of primary RNA ID set .





Sunday, April 12, 2015

GO Term analysis



1. Repository for annotation of  GO ID against genes


download file (gene_association.goa_human.gz) from link:
http://geneontology.org/gene-associations/gene_association.goa_human.gz
http://geneontology.org/page/download-annotations


2. How to extract GO names/description from database from GO ID


Option 1 : use the file

 http://purl.obolibrary.org/obo/go.obo

from   http://geneontology.org/page/download-ontology

Option 2 :  use R package

# load the GO library 
library(GO.db)  
write.table( goterms, sep="\t", file="goterms.txt")


3. How to do get all GO terms for a gene

Option 1: Use the annotation database [BEST option you can dowload updated file and parse it based on UNIPROT ACC No ]

http://geneontology.org/gene-associations/gene_association.goa_human.gz


Option 2 : R package Ontologyzer [It can be used to show annotation at child and parent level also , scrool down below to check command]

http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html


Option 3 : Using R package biomaRt [Very slow]
library(biomaRt)
databases=listMarts();
myDB = useMart("unimart") % "ensembl"
myDataset=listDatasets(myDB) % "hsapiens_gene_ensembl"

# for uniprot
martUni = useMart(biomart = "unimart", dataset = "uniprot")
results = getBM(attributes = c("go_id","go_name"), filters = "accession", values= c("P49023"), mart = mart)

# for ensemble
# martEn = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# attr = listAttributes(mart)
# filters = listFilters(mart)
results = getBM(attributes = c("go_id","name_1006"), filters = "refseq_mrna",values = c("NM_030621"), mart = mart)


 4. How to do GO term enrichment analysis


Option 1 : R package Ontologyzer [Best option]
http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

Works with any ID, I tested with UNIPROT ACCESSION no.

sample commands

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Bonferroni -d 0.05 -r 1000 -n -o outputfolder

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000 -n -o outputfolder

Option 2 : Using David Web Service Client:

  http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html

Just change following 3 variables:

            
String inputIds= new String("paxi_human,git1_human");
                String idType = new String("UNIPROT_ID");
                String listName = new String("make_up");
                int listType = 0;
                double addListOutput =0;

                //Set user defined categories
                // String category_names = new String("BBID,BIOCARTA,COG_ONTOLOGY,INTERPRO,KEGG_PATHWAY,OMIM_DISEASE,PIR_SUPERFAMILY,SMART,SP_PIR_KEYWORDS,UP_SEQ_FEATURE");
                String category_names = new String("GOTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT");
               

API for Parameters

http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html



Option 3 : R package GOFunction ( Only takes entrez ID)


http://bioconductor.org/packages/release/bioc/manuals/GOFunction/man/GOFunction.pdf

source("http://bioconductor.org/biocLite.R")
biocLite("GOFunction")
biocLite("org.Hs.eg.db")
biocLite("graph")
biocLite("Rgraphviz")
biocLite("SparseM")
library(GOFunction)

myDir="D:/research/GOAnalysis/"; 
                                
setwd(myDir);
data(exampledata)

# Only takes entrez ID
myIntetest=interestGenes[1:20]
myRef=refGenes[1:200]
sigTermBP = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="BP", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermBP") 


sigTermMF = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="MF", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermMF")


5. How to calculate p-values for GO terms related to a gene

1.Wilcoxon Rank Sum and Signed Rank Tests: 

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html


2. Find the confidence interval using bootstrap

library(boot)
fsimple  =  function(curdata, i){
    return(curdata[i])
}
mydata=rnorm(100,0,1)
bootsimple = boot(mydata, fsimple , R=500)
plot(bootsimple )
simpleCI=boot.ci(  bootsimple , conf=0.90 , type = "all")
simpleCI$normal


3. Find the cdf and calculate threshold (1-pvalue):

# Samples must be sorted ( e.g. 2,2,2, 3,3, 4,4,4,55 )
mysample=c( 2,2,2, 3,3, 4,4,4,5,5 )
mydf  = data.frame( mySamples=mysample    )
myCDF =  ( 1: length(mydf$mySamples   ))/length( mydf$mySamples )

# select minimum of sample value which match threshod
myThr=0.95
idOfMatchSamples = which( myCDF>= myThr)
mySampleAtThr = mysample[idOfMatchSamples[1]]