Monday, October 19, 2015

SVN tutorial




SVN cheat sheet:

http://www.cheatography.com/davechild/cheat-sheets/subversion/

For more useful svn commands, check the following tutorial:
http://openoffice.apache.org/svn-basics.html



Server: check out a working copy of the SVN repository


svn co https://svn.apache.org/repos/asf/openoffice/trunk  mySVNrepository

Local machine: Make a repository and connect with server


Just on your local machine, run the following command:

Then, all current testing scripts will be added to your local machine. 

Add and commit your changes to server:


In order to share your work with us, inside your own folder (e.g. ChallengeScripts/Wail), create any folder, file, or script and run:
1- svn add folder1
2- svn commit folder1 -m "Comments about your changes"


Monday, October 12, 2015

Linux Tips and Tricks


Collected from CBRC wiki:
Original source: http://dragon.cbrc.kaust.edu.sa/wiki/index.php/Linux_Tips_and_Tricks



Useful commands

General commands

  • find Linux version
$ cat /etc/issue

Directories processing

$ # recursively delete backup files (null-delimited, so names with spaces are safe)
$ find ./ -name '*~' -print0 | xargs -0 rm
 
$ # Process all files in a directory
$ ls dir_path/*.xml | xargs -n 1 MyProcessingProgram 
 
$ # gzip/gunzip a directory:
$ tar czvf myfile.tar.gz mydir/*
$ tar xzvf filename.tar.gz (or: tar xzvf filename.tar.gz -C foldername/)
$ gunzip -c foo.tar.gz | tar xvf -
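As a quick sanity check, the pack/unpack cycle above can be run end-to-end in a throwaway directory (all file names here are illustrative):

```shell
cd "$(mktemp -d)"                  # scratch directory
mkdir mydir
echo "hello" > mydir/a.txt
tar czf myfile.tar.gz mydir        # create the compressed archive
mkdir restored
tar xzf myfile.tar.gz -C restored  # extract into another folder
cat restored/mydir/a.txt           # -> hello
```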

Files processing

$ # batch convert dos to unix type text files (e.g. .pl)
$ find . -name "*.pl" -type f -exec dos2unix {} \; 
 
$ # remove blank lines from a text file
$ sed '/^$/d' in.txt > out.txt
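A quick demonstration; note that `/^$/d` removes only truly empty lines, while `/^[[:space:]]*$/d` also drops whitespace-only lines:

```shell
cd "$(mktemp -d)"
printf 'a\n\nb\n \nc\n' > in.txt           # one empty line, one space-only line
sed '/^$/d' in.txt > out.txt               # keeps the space-only line
sed '/^[[:space:]]*$/d' in.txt > out2.txt  # drops both kinds of blank line
```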
 
$ # sort file from shortest to longest line
$ cat myfile.txt | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- > out.txt
$ # or from longest to the shortest
$ cat myfile.txt | awk '{ print length, $0 }' | sort -nr | cut -d" " -f2- > out.txt
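The prefix-sort-strip trick above can be checked on a tiny file: awk prepends each line's length, sort -n orders numerically, and cut removes the prefix again.

```shell
cd "$(mktemp -d)"
printf 'ccc\na\nbb\n' > myfile.txt
awk '{ print length, $0 }' myfile.txt | sort -n | cut -d' ' -f2- > out.txt
cat out.txt    # a, bb, ccc - shortest to longest
```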
 
$ # count occurrences of a word in a file (-o prints each match on its own line)
$ grep -ow 'searchword' filename.txt | wc -l
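The difference between counting matching lines and counting occurrences matters when a word appears more than once per line; a quick check (file contents are illustrative):

```shell
cd "$(mktemp -d)"
printf 'dog cat dog\ndog\n' > filename.txt
grep -cw 'dog' filename.txt            # counts matching lines
grep -ow 'dog' filename.txt | wc -l    # counts every occurrence
```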

Find/Replace

$ # find words listed in one file (e.g. dictionary) in another (e.g. free text)
$ grep -i -f dictionary.txt text.txt   # ignore case (-i) and print matching lines
$ grep -io -f dictionary.txt text.txt  # ignore case (-i) and print matching pattern
$ while read line; do grep -io -P "\Q$line\E" text.txt; done < dictionary.txt # if there are special chars
 
$ # replace char1 with char2 in a file, e.g comma with tab
$ tr ',' '\t' < in.csv > out.tsv
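For instance (note that tr does a blind character swap, so it will also split quoted CSV fields that themselves contain commas):

```shell
cd "$(mktemp -d)"
printf 'a,b,c\n1,2,3\n' > in.csv
tr ',' '\t' < in.csv > out.tsv   # every comma becomes a tab
```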

Processes

$ # print process name, pid and running time for the process "xyz"
$ ps -eo fname,pid,etime | awk '{if ($1=="xyz") print $1,$2,$3}' | more
$ # or show the pid for a process XYZ running for more than an hour
$ # (etime is [[DD-]HH:]MM:SS, so entries with an hours field have two colons)
$ ps --no-headers -C XYZ -o pid,etime | awk '$2 ~ /^([0-9]+-)?[0-9]+:[0-9]+:[0-9]+$/ {print $1}'

File System /Disk Management

Disk space report

$ ncdu
To install ncdu on ubuntu:
$ sudo apt-get install ncdu
To install ncdu on Scientific Linux you need to install Extra Packages for Enterprise Linux (or EPEL) repository. The easiest way to do it is from the Synaptic Package Manager: just search for: Extra Packages for Enterprise Linux. After the EPEL is installed, you can install ncdu:
$ sudo yum install ncdu

Resizing Shared Memory (/dev/shm)

You can use /dev/shm to improve the performance of application software or overall Linux system performance. Use 'df -h' to see the partition allocation. To permanently increase the size of shared memory, append or modify the /dev/shm entry in /etc/fstab as follows to set the size to 8G:
$ sudo nano /etc/fstab
# modify the following line by adding new size
none      /dev/shm        tmpfs   defaults,size=8G        0 0
# save and close the file. For the changes to take effect immediately remount /dev/shm:
$ sudo mount -o remount /dev/shm
# verify the same:
$ df -h


Resizing partitions CentOS

  1. display partition information
$ df -kh
Filesystem                      Size  Used  Avail  Use% Mounted on
/dev/mapper/vg_centos-lv_root    50G   3.6G   44G   8%    /
/tmpfs                          1.9G   228K  1.9G   1%   /dev/shm
/dev/sda1                       485M    68M  393M  15%   /boot
/dev/mapper/vg_centos-lv_home   2.0T   200M  1.9T   1%   /home
  2. Drop into the single-user terminal to avoid reading and writing to the partitions that will be resized
$ sudo telinit 1
if filesystem is used, try booting in a single user-mode (press any key to interrupt boot->letter 'e'->'kernel' option->letter 'e'->type single->letter 'b' to boot)
  3. Decrease /dev/mapper/vg_centos-lv_home
$ umount /dev/mapper/vg_centos-lv_home
$ e2fsck -f /dev/mapper/vg_centos-lv_home
$ resize2fs /dev/mapper/vg_centos-lv_home 200G
$ lvresize -L 200G /dev/mapper/vg_centos-lv_home
$ mount /home
$ mount -o remount /home
$ df -kh # check the size
  4. Increase /dev/mapper/vg_centos-lv_root
$ lvresize -l +100%FREE /dev/mapper/vg_centos-lv_root
$ # or, alternatively: lvresize -L 1800G /dev/mapper/vg_centos-lv_root
$ resize2fs /dev/mapper/vg_centos-lv_root
$ df -kh # check the size
$ shutdown -r now

Disk crash recovery

My Linux (kubuntu) suddenly crashed. The screen blinked several times and the keyboard stopped responding. A new reboot failed; it either hung or went to a recovery mode (that did not recover anything). The (almost) only solution in such situations is to run a disk-checking program (e.g. e2fsck) that can fix bad sectors; in this case the suspect was the Linux partition table, or the Linux boot sector. Therefore, this is what I did:
  1. Get the latest version of Ubuntu on a CD.
  2. Boot from it (on a MacBook, hold "C" during the reboot; that's how the MacBook knows that you want to boot from a CD). Go to a recovery mode.
  3. Start the partition management program parted. I needed to run this program to find out the device name of my Linux partition - and parted told me that it was /dev/sda5.
  4. Then I started e2fsck (with the found device /dev/sda5 as a parameter) to check and fix the disk. I answered yes (meaning: yes, fix it please) to all prompts (actually I started it with the -y option).

Mounting WebDAV Share

$ mkdir ~/<my mount point>
$ sudo mount -t davfs http://10.75.106.38:5005 ~/<my mount point>

Installing and configuring WebDAV for non-sudo on Ubuntu

  1. install davfs: sudo apt-get install davfs2
  2. Reconfigure davfs2 to enable non-sudo mount: sudo dpkg-reconfigure davfs2
  3. Edit /etc/davfs2/davfs2.conf to enable automatic credentials by uncommenting the line: secrets ~/.davfs2/secrets
  4. Edit the ~/.davfs2/secrets file to add credentials for the remote WebDAV directory. Add a line at the end of the file in the following style: http://
  5. Set the permission: chmod 600 ~/.davfs2/secrets
  6. Add a line to /etc/fstab about the remote WebDAV directory: http:// davfs user,noauto,file_mode=600,dir_mode=700 0 1
  7. Add your user to the davfs2 group: sudo vi /etc/group as follows: davfs2:x:134: (134 can be some other group number - don't change it)
  8. You can then use the following commands to mount/umount without being root:
$ mount <mount point>
$ umount <mount point>
You can also use nautilus to mount/umount the directory.

Thursday, August 13, 2015

GIT tutorial



Tutorial: 

http://gitready.com/
http://blog.osteele.com/posts/2008/05/my-git-workflow/

Basic structure







Image from: http://blog.osteele.com/posts/2008/05/my-git-workflow/

The following tips are collected from the CBRC git wiki (thanks to John and Craig).
Original link: http://dragon.cbrc.kaust.edu.sa/wiki/index.php/Cbrcgit

Cbrcgit


What is git

http://en.wikipedia.org/wiki/Git_%28software%29
Git is a free distributed revision control, or software source code management, project with an emphasis on being fast. Git was initially designed and developed by Linus Torvalds for Linux kernel development.
Every Git working directory is a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server.
Git has a very powerful command and option set. Some defaults need to be set in new projects to streamline the workflow.

CBRC git server

The central git server for CBRC runs at git.cbrc.kaust.edu.sa, or if you want a friendlier GitLab version, have a look at the Gitlab wiki.
Also, note that in communications with the cbrcgit server everyone should use the 'git' username. This minimizes the chance of conflicts between different users creating files on the server.
To access the server, you need to send the CBRC IT support your public ssh key (~/.ssh/id_rsa.pub). If your work computer does not have one, you can generate it:
 $  ssh-keygen -t rsa # with an empty pass phrase
You can find more info about this on FAQ pages.

Quick start

  • On your development computer (provided that git has been installed):
$ cd <my_project_directory>
$ git init
$ git config user.name 'My Name'
$ git config user.email 'My Email Address'
$ git remote add origin ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/My_Project_Name.git/
  • Login to the CBRC git repository and create your project.
$ ssh git@git.cbrc.kaust.edu.sa
$ mkdir /datapool/git/datagit/My_Project_Name.git
$ cd /datapool/git/datagit/My_Project_Name.git
$ git --bare init
$ vim description
$ exit
  • Back on your development machine, add files to the local and remote repository: 
A. Make your local version integrate everything from the server:
$ git pull ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/LDmotif.git
 
B. Now upload your changes to the server:

$ git add -A
$ git commit -am 'note about the commit'
$ git push origin master
Please do not forget to edit the description file. Three lines are enough, but you can add more:
   project name
   web location or short description
   contact

Using git

Start by telling git your contact details:
 $ git config --global user.name "Your Name"
 $ git config --global user.email "your.name@kaust.edu.sa"
Since git is a distributed version controlling system, you can use it for any directory. Simply run:
 $ git init
After that you can add subdirectories and files and modify existing ones. Note that git commit command takes note of only those files that have been 'add'ed.
 $ git add .
 $ git commit -a -m 'note about the commit'
Git contains powerful mechanisms to branch, merge versions and study older versions.
Important: If you want to rename a file or directory that is under git, do not use the normal 'mv' command. Git can keep track of the change, if you do:
 $ git mv filename newfilename
Similarly, remove a file from disk and git at the same time:
 $ git rm filename
Pushing to a remote git: recent versions of git do not let you push before you select a push method. The following command sets it globally for all your git repositories:
 $ git config --global push.default matching

See: man pages git, gittutorial


Creating a new remote repository

Once your public ssh key is installed on the git machine (see the procedure in the previous chapter), you can add a new project, e.g. 'myproj', into the git server.
 $ ssh git@git.cbrc.kaust.edu.sa
 $ mkdir /datapool/git/datagit/myproj.git
 $ cd /datapool/git/datagit/myproj.git
 $ git --bare init
 Initialized empty Git repository in /datapool/git/datagit/myproj.git/
 $ exit
Note that this remote directory name needs to end in '.git' for the connection to work.

Creating a new local git repository

You can create a local git repository and start using it to keep track of changes with a few simple commands:
 $ mkdir myproj
 $ cd myproj/
 $ git init
 # add files and content, e.g.:
 $ echo "text" > readme
 # tell git not to keep track of backup files, e.g. 
 $ echo '*~' > .gitignore
 # tell git to track all files and commit them for the first time
 $ git add .
 $ git commit -a -m 'first commit'
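The whole local-repository sequence above can be exercised in a throwaway directory; a minimal sketch (the identity values are placeholders, and note that a working repository is created with a plain git init, not --bare):

```shell
cd "$(mktemp -d)"
mkdir myproj && cd myproj
git init -q                        # working repository (not --bare)
git config user.name  'My Name'    # local identity so the commit works anywhere
git config user.email 'me@example.com'
echo "text" > readme
echo '*~' > .gitignore             # tell git not to track backup files
git add .                          # track all files
git commit -q -m 'first commit'
git rev-list --count HEAD          # -> 1
```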
The local git repository sits on your machine and reflects all changes in your files after you commit them. Remember, therefore: the commit command does not update anything outside of your machine (as it would with some other version control systems). If you need to share your project and its files with others, you need to propagate your changes to the CBRC central git server. Read on.

The following commands are needed only when you want to link the local repository to the CBRC central server:
 $ git remote add origin ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/myproj.git
 $ git push origin master
 

Important: tell git to use the master branch by default

 # tell git that by default you are working on the head of the master
 # branch in the remote repository
 $ git config branch.master.remote origin
 $ git config branch.master.merge refs/heads/master

Other CBRC members (who have their public ssh keys installed - see above) can now download their copy of the project with:
 $ git clone ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/myproj.git
From there on, before working on the project, you should first check if someone has updated the central repository:
 $ git pull
 # After setting the defaults above, this is same as:
 $ git pull ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/myproj.git master
While working with files, you can check the status any time with:
 $ git status
 $ git log (filename)
 $ git diff filename
 $ git blame filename
and commit any changes in logical sets.
After working with a set of tasks, you should push the changes to the central repository for others to see:
 $ git push origin master

Project repository

The current git repository location is: git@git.cbrc.kaust.edu.sa:/datapool/git/datagit/
CBRC Git repository browser: http://dragon.cbrc.kaust.edu.sa/cbrcwg/

See also

Best Tutorial


http://classic.scottr.org/presentations/git-in-5-minutes/

1. CREATE : Make a REMOTE repository on the server

 

  •  On your git  server machine

$ ssh git@git.cbrc.kaust.edu.sa
$ mkdir /datapool/git/datagit/My_Project_Name.git
$ cd /datapool/git/datagit/My_Project_Name.git
$ git --bare init
Initialized empty Git repository in /datapool/git/datagit/myproj.git/
$ vim description
$ exit

2. CREATE : Make a LOCAL repository and upload it to the server

  •  On your development computer (provided that git has been installed):
$ cd <my_local_project_directory>
$ git init
$ git config user.name 'My Name'
$ git config user.email 'My Email Address'
$ git remote add origin ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/My_Project_Name.git/


  • Add files to the local and remote repository:
$ git add -A     # tell git to track all files
$ git commit -am 'note about the commit'
$ git push origin master



3. DOWNLOAD : Clone an ALREADY EXISTING git repository from the server for the FIRST TIME

 $ git clone ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/myproj.git

4. WORK : ALREADY EXISTING git repository


  • From there on, before working on the project, you should first check if someone has updated the central repository:

 $ git pull
 # After setting the defaults above, this is same as:
 $ git pull ssh://git@git.cbrc.kaust.edu.sa/datapool/git/datagit/myproj.git master


  • While working with files, you can check the status any time with:

 $ git status
 $ git log (filename)
 $ git diff filename
 $ git blame filename

and commit any changes in logical sets.



  • After working with a set of tasks, you should push the changes to the central repository for others to see:

 $ git push origin master


If git is not sure which branch to use, use master:

 # tell git that by default you are working on the head of the master
 # branch in the remote repository
 $ git config branch.master.remote origin
 $ git config branch.master.merge refs/heads/master

 

5. Some useful commands

## Clone the remote repository to create your local copy for the first time


git clone <repository_url>


## Show the remote git repository

git remote -v

## Download from git repository


git pull


## Check the difference between a local file and the local git index
git diff   file.html

## Commit locally
git commit -m "uploading a single file"  help.html
git commit -m "uploading all changes"  .

## Push commits to the server
git push

Tuesday, April 14, 2015

MicroRNA database : miRBase naming convention ; Target Database


MirTarBase V 4.5


For human, it has targets for 596 mature microRNAs from miRBase.

miRBase 20

Number of precursor microRNAs: 1871 IDs, 1870 names (mir-511 has two IDs)
Number of mature microRNAs: 2794 IDs, 2576 names.

ID 


Each precursor and mature microRNA has a unique ID. If you work with IDs, that is fine.

 Name  -- > ID

Precursor microRNA
For each primary ID there is one primary name, except in one case, hsa-mir-511. This name is shared by two primary transcripts:

chr10    .    miRNA_primary_transcript    17887107    17887193    .    +    .    ID=MI0003127;Alias=MI0003127;Name=hsa-mir-511

chr10    .    miRNA_primary_transcript    18134036    18134122    .    +    .    ID=MI0003127_2;Alias=MI0003127;Name=hsa-mir-511

Mature microRNA
For each mature name there can be N mature RNA IDs.


Target of mature RNA -> target of precursor microRNA


Option 1
1. map mature RNA name : to mature RNA name set (here you need to check mir-511 manually)


Option 2
1. map mature RNA name : to mature RNA ID set
2. map mature RNA ID set : to precursor RNA ID set
3. include the targets for all members of the primary RNA ID set.





Sunday, April 12, 2015

GO Term analysis



1. Repository for annotation of  GO ID against genes


download file (gene_association.goa_human.gz) from link:
http://geneontology.org/gene-associations/gene_association.goa_human.gz
http://geneontology.org/page/download-annotations


2. How to extract GO names/descriptions from the database given a GO ID


Option 1 : use the file

 http://purl.obolibrary.org/obo/go.obo

from   http://geneontology.org/page/download-ontology

Option 2 :  use R package

# load the GO library
library(GO.db)
# build the GO ID -> term-name map; Term() is the standard GO.db/AnnotationDbi accessor
goterms = Term(GOTERM)
write.table( goterms, sep="\t", file="goterms.txt")


3. How to get all GO terms for a gene

Option 1: Use the annotation database [BEST option: you can download the updated file and parse it based on UNIPROT ACC number]

http://geneontology.org/gene-associations/gene_association.goa_human.gz


Option 2 : The Ontologizer tool [It can also show annotation at child and parent level; scroll down below to check the command]

http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html


Option 3 : Using the R package biomaRt [Very slow]
library(biomaRt)
databases = listMarts();
myDB = useMart("unimart")        # or "ensembl"
myDataset = listDatasets(myDB)   # e.g. "hsapiens_gene_ensembl"

# for uniprot
martUni = useMart(biomart = "unimart", dataset = "uniprot")
results = getBM(attributes = c("go_id","go_name"), filters = "accession", values = c("P49023"), mart = martUni)

# for ensembl
martEn = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# attr = listAttributes(martEn)
# filters = listFilters(martEn)
results = getBM(attributes = c("go_id","name_1006"), filters = "refseq_mrna", values = c("NM_030621"), mart = martEn)


 4. How to do GO term enrichment analysis


Option 1 : Ontologizer [Best option]
http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

It works with any ID; I tested with UNIPROT ACCESSION numbers.

sample commands

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Bonferroni -d 0.05 -r 1000 -n -o outputfolder

java -jar Ontologizer.jar -a gene_association.goa_human -g go.obo -s input.txt -p background.txt -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000 -n -o outputfolder

Option 2 : Using David Web Service Client:

  http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html

Just change following 3 variables:

            
String inputIds = new String("paxi_human,git1_human");
String idType = new String("UNIPROT_ID");
String listName = new String("make_up");
int listType = 0;
double addListOutput = 0;

// Set user defined categories
// String category_names = new String("BBID,BIOCARTA,COG_ONTOLOGY,INTERPRO,KEGG_PATHWAY,OMIM_DISEASE,PIR_SUPERFAMILY,SMART,SP_PIR_KEYWORDS,UP_SEQ_FEATURE");
String category_names = new String("GOTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT");
               

API for Parameters

http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html



Option 3 : R package GOFunction ( Only takes entrez ID)


http://bioconductor.org/packages/release/bioc/manuals/GOFunction/man/GOFunction.pdf

source("http://bioconductor.org/biocLite.R")
biocLite("GOFunction")
biocLite("org.Hs.eg.db")
biocLite("graph")
biocLite("Rgraphviz")
biocLite("SparseM")
library(GOFunction)

myDir="D:/research/GOAnalysis/"; 
                                
setwd(myDir);
data(exampledata)

# Only takes entrez ID
myIntetest=interestGenes[1:20]
myRef=refGenes[1:200]
sigTermBP = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="BP", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermBP") 


sigTermMF = GOFunction(myIntetest, myRef, organism="org.Hs.eg.db",
ontology="MF", fdrmethod="BY", fdrth=0.05, ppth=0.05, pcth=0.05, poth=0.05, peth=0.05, bmpSize=2000, filename="sigTermMF")


5. How to calculate p-values for GO terms related to a gene

1. Wilcoxon Rank Sum and Signed Rank Tests:

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html


2. Find the confidence interval using bootstrap

library(boot)
fsimple  =  function(curdata, i){
    return(curdata[i])
}
mydata=rnorm(100,0,1)
bootsimple = boot(mydata, fsimple , R=500)
plot(bootsimple )
simpleCI=boot.ci(  bootsimple , conf=0.90 , type = "all")
simpleCI$normal


3. Find the cdf and calculate threshold (1-pvalue):

# Samples must be sorted (e.g. 2,2,2, 3,3, 4,4,4, 5,5)
mysample = c( 2,2,2, 3,3, 4,4,4, 5,5 )
mydf = data.frame( mySamples = mysample )
myCDF = ( 1:length(mydf$mySamples) ) / length( mydf$mySamples )

# select the smallest sample value whose empirical CDF reaches the threshold
myThr = 0.95
idOfMatchSamples = which( myCDF >= myThr )
mySampleAtThr = mysample[ idOfMatchSamples[1] ]
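The same empirical-CDF lookup can be reproduced outside R, e.g. with a two-pass awk over the sorted sample (the file name is illustrative):

```shell
cd "$(mktemp -d)"
printf '2\n2\n2\n3\n3\n4\n4\n4\n5\n5\n' > sample.txt   # sorted sample, one value per line
# pass 1 counts N; pass 2 prints the first value whose CDF (i/N) reaches the threshold
awk -v thr=0.95 'NR==FNR {n++; next} FNR/n >= thr {print; exit}' sample.txt sample.txt   # -> 5
```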



Sunday, March 1, 2015



How to clone a repository


1. Go to the repository

2. click Fork

3. On the local machine, type the following command (here tanviralambd = user name)

git clone https://github.com/tanviralambd/intermine.git

Thursday, February 19, 2015

JAVA directory operation : List files in a folder , copy files and folder creation



List files in a folder

 import java.io.*;
 import java.util.*;   // Vector lives in java.util, not java.io
public static Vector<String> listFiles_Files_Dir(String absPathFolder)
    {
        Vector<String> vectAllFiles = new Vector<String>();


        File folder = new File(absPathFolder);
        File[] listOfFiles = folder.listFiles();

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
               
                System.out.println("File: " + listOfFiles[i].getName());
                vectAllFiles.add(listOfFiles[i].getName() );
               
            } else if (listOfFiles[i].isDirectory()) {
               
                System.out.println("Directory: " + listOfFiles[i].getName());
                vectAllFiles.add(listOfFiles[i].getName() );
               
            }
        }

        return vectAllFiles;
    }


Create folder 

    import java.io.*; 
    public static void create_new_folder(String absPathFolder)
    {
        File theDir = new File(absPathFolder);

        // if the directory does not exist, create it
        if (!theDir.exists()) {
            System.out.println("creating directory: " + absPathFolder);

            try{
                theDir.mkdir();
            } catch(SecurityException se){
                se.printStackTrace();
            }

        }
    }

Copy and paste files from a  folder to another folder

    import java.nio.file.Files;
    import java.io.*;
    public static void copy_paste_file(String pathCopy, String pathPaste)
    {
        try {
            File source = new File(pathCopy);
            File dest = new File(pathPaste);
            Files.copy(source.toPath(), dest.toPath());
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

Tuesday, February 3, 2015

ngsplot ngs.plot tutorial



#!/bin/sh
#
#SBATCH --job-name=testJob
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=dragon-default
#SBATCH --ntasks-per-node=1
#SBATCH --mem=8192
#
# Display all variables set by slurm
env | grep "^SLURM" | sort

## run with command: sbatch NGSPLOT_test.sh

#module load gcc/4.7.4
#module load  compilers-extra
#module load samtools/1.1
#module load bwtool/1.0

module purge
module load slurm
module load gcc/4.7.4
module load R/3.1.1
module load ngsplot

dataDir="/home/alamt/HMCan_test/"
workDir=/home/alamt/NGSPLOT/

cd $workDir


hg19size=hg19.chrom.sizes

tfName="MEF_2A"

bedFile="F5_robust_DPI_TSS_500_upstream_500_downstream_nonCoding_only.bed"


###############################
##### 1. Run ngs.plot.r
##### Mandatory parameters:
#  -G   Genome name. Use ngsplotdb.py list to show available genomes.
#  -R   Genomic regions to plot: tss, tes, genebody, exon, cgi, enhancer, dhs or bed
#  -C   Indexed bam file or a configuration file for multiplot
#  -O   Name for output: multiple files will be generated
##### Optional parameters related to configuration file:
#  -E   Gene list to subset regions OR bed file for custom region
#  -T   Image title
###############################

ngs.plot.r  -G hg19 -R bed  -C $dataDir/wgEncodeHaibTfbs/wgEncodeHaibTfbsGm12878Mef2aPcr1xAlnRep1.bam:$dataDir/wgEncodeSydhTfbs/wgEncodeSydhTfbsGm12878InputStdAlnRep1.bam  -O  MEF_2A_NGSPLOT.txt -T $tfName -E $bedFile

ggplot ggplot2 tutorial




Data format processing for ggplot2


You have to make a data frame to work with ggplot2

http://mundosubmundo.kaiux.com/?p=346

START WITH THIS SITE


http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_%28ggplot2%29/



 Official web site

http://docs.ggplot2.org/current/

Practical Tips


http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf

3 Steps to make any ggplot graph (Tanvir's 3-step approach; it will work for 90 percent of graphs)

 

1. Make a data frame

2. Make a ggplot object with aes( group = .. , color = .. , x = .. , y = .. )

Not every plot requires both x and y. For example, a density plot only requires x, so for a density plot provide only x in aes(). A boxplot requires a y value plus an x value to separate the groups, so provide both x and y in aes().

3. Use different function to make different type of plot

##### Step -1 
NoA = 10
u = rnorm(NoA)
NoB = 20
v = rnorm(NoB)

myfm =data.frame(genegroup = factor( rep(c("ccds","miRNA"), c(NoA,NoB) )  ) , exprVal = c(u,v) )

##### Step -2  and Step-3 [Densityplot : Y-value not required]
m = ggplot(myfm, aes(x=exprVal, colour=genegroup, group=genegroup))
m + geom_density(fill=NA)
   





 

##### Step-2 and Step-3 [Boxplot: X and Y values required]

m = ggplot(myfm, aes(y=exprVal,x=genegroup, colour=genegroup, group=genegroup))
 m + geom_boxplot()





Bar chart



NoA = 10
NoB = 20
myfm =data.frame(genegroup = factor( rep(c("lncRNA","miRNA"), c(NoA,NoB) )  )  )
m = ggplot(myfm , aes(     genegroup  )   )
m+ geom_bar( aes(fill = genegroup )) + theme( text = element_text(size=8),  axis.text.x = element_text(angle=90, vjust=1)) + ylab(" type ") +  xlab("Population")





Now some more complex data to make graphs

BoxPlot

 # Create a dataframe with two variables: Yvalues, Xvalues
val=as.matrix(valforplot)
cno=as.matrix(clusternoforplot)
dfForPlot=data.frame(Yvalues=val,Xvalues=cno)
colnames(dfForPlot)
# check the value of first variable
dfForPlot$Yvalues

# make the ggplot, grouping against Xvalues
ggplot(data=dfForPlot, aes(x=factor( as.matrix(dfForPlot$Xvalues)), y=as.matrix(dfForPlot$Yvalues) ) ) + geom_boxplot(  notch = FALSE,outlier.colour = "green", outlier.size = 3  ) +
xlab("Cluster no") +  ylab("Log Expr") 

Histogram

args  =commandArgs(TRUE)
fName= as.character(args[1]);
myTitle= as.character(args[2]);


## Step -1: Read file
 

baseMatH = read.table(fName, header=FALSE, row.names=1)
noRowH= dim(baseMatH)[1]
noColH= dim(baseMatH)[2]
goTermCount=baseMatH[,2]
maxCount= max(goTermCount)


## Step -2: Make Data Frame
myfm =data.frame(method = factor( rep(c("GOcount"), c(noRowH) )  ) , goCount = c(goTermCount) )

## Step -3 : Histogram from ggplot2
m = ggplot(myfm, aes(x=goCount, colour=method, group=method)) +  xlim(0,maxCount )
m + geom_histogram(binwidth = 1)  + labs(list(title = myTitle , x = "Number of GO association", y = "Total Gene")) + theme(text = element_text(size=14) )

 

Normal Bar chart


fname="../dataV1/F9_RA_Sox17.bed.100.100.summary"
The data looks like the following, with two columns:
TF1  500
TF2 700
TF3 900

# A . Read the table
myTable = read.table( fname , header= FALSE, sep="\t")

# B . Form the data frame. You must use factor with levels; otherwise ggplot will sort the first column
 mydf = data.frame(
 Commodity =  factor( myTable$V1, levels = myTable$V1  ),
 Production = c(myTable$V2)   
 )

# C . Call the ggplot 
ggplot(data=mydf, aes(x=mydf$Commodity, y=mydf$Production )) + geom_bar(stat="identity")  + theme(text = element_text(size=8),  axis.text.x = element_text(angle=90, vjust=1)) + ylab("Number of ENCODE peaks overlap with [-100,+100] region of peak summit") +  xlab("TF")




Normal Density


smallNumber=1e-6
finalX=NULL
if(T){
       
    for(curFile in allOutWithout ) {  
        myfname = paste(myData  , curFile, sep="")
        print(myfname)
        baseTable = read.table(myfname, header=T, row.names=1)
        myvarMat=as.matrix(baseTable)
        myvarRow=rowVars(myvarMat)
        myvarRow=as.matrix(log(myvarRow+smallNumber ))
        totSample = dim(myvarRow)[1]
        finalX=c(finalX,myvarRow)
       
    }
   
}

if(T)
{
    dat =data.frame(timepoint = factor(rep(c("24","28","36","48","72"), each=totSample)), deviationCAGE = finalX )
   
    ggplot(dat, aes(x=deviationCAGE, fill=timepoint)) + geom_density(alpha=0.2) + ylab("deviation density") +  xlab("log(CAGE expression deviation)")
    ggsave(file=paste("deviationCAGEdensity",'.png',sep="") , width = 140, height = 140, units = "mm", dpi = 400, scale = 2 )

## box plot
    ggplot(data=dat, aes(x=factor( timepoint), y=as.matrix(deviationCAGE) ) ) + geom_boxplot(  notch = FALSE,outlier.colour = "green", outlier.size = 3  ) + xlab("Time Points") +  ylab("Log (CAGE expression deviation)") 
    ggsave(file=paste("deviationCAGEboxplot",'.png',sep="") , width = 140, height = 140, units = "mm", dpi = 400, scale = 2 )


}



Multiple Plots in a single plot

 

 

 

if(TRUE){
    library(ggplot2)
    library(grid)
    library(gridExtra) # You need to use this library
}
 m1 = ggplot(myfm, aes(y=featureCount,x=location, colour=location, group=location))
m1 = m1 + geom_boxplot() + labs(list(title = "Transcripts", x = "", y = "")) + theme(legend.position = "none")


 m2 = ggplot(myfm, aes(y=featureCount,x=location, colour=location, group=location))
m2 = m2 + geom_boxplot() + labs(list(title = "Transcripts", x = "", y = "")) + theme(legend.position = "none")

 m3 = ggplot(myfm, aes(y=featureCount,x=location, colour=location, group=location))
m3 = m3 + geom_boxplot() + labs(list(title = "Transcripts", x = "", y = "")) + theme(legend.position = "none")



png(paste("myimagename", '.png', sep=""))
grid.arrange(m1, m2, m3, ncol=2, nrow=3);
dev.off()


Heatmap

if(TRUE){

    library(MASS)
    library(ggplot2)
    library(gplots)
    library(extrafont)
    library(stringr)
    library(plyr)
}
typeRes="Mtb Only"
fnmInput=paste(myData , "NONSTIM.table.txt.report.specific", sep="" ) ;
figureTitle=paste(typeRes , " specific", sep="" ) ;
myTable = read.table( fnmInput , header= TRUE, row.names=1, sep="\t")
myValues= myTable[,2:5];

png(file=paste(fnmInput, '_heatmap' ,'.png',sep=""),   bg="white")  ;  # create device

heatmap.2(as.matrix(myValues), col=redblue(50), scale="row", dendrogram = "row",Colv="NA" ,  key=TRUE, keysize=1.0,margins = c(8, 16), density.info="none",row.names=TRUE,  trace="none",labRow=rownames(myTable),labCol=colnames(myTable)[2:5], main = figureTitle ,ylab = "change after Mtb infection", xlab = "DE clusters")

dev.off()
# margins = c(8, 16) # to fit all labels





How to use a small font and angled text when there are a lot of x/y tick labels

your ggplot +
theme(text = element_text(size=8),  axis.text.x = element_text(angle=90, vjust=1))

 

How to set x- and y-axis labels

your ggplot +
ylab("Number of ENCODE peaks overlap with [-100,+100] region of peak summit") +  xlab("TF")

How to save image created by ggplot


ggsave supports the extensions eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (Windows only).

Suppose you have created a box plot using ggplot:

ggplot(data=dfForPlot, aes(x=factor( as.matrix(dfForPlot$clusterno)), y=as.matrix(dfForPlot$expr) ) ) + 
    geom_boxplot(notch = FALSE,outlier.colour = "green", outlier.size = 3   ) + 
    xlab("Cluster no") +  ylab("Log Expr")



Now save using following command

    ggsave(file=paste(outputPlot,'.png',sep="") , width = 140, height = 140, units = "mm", dpi = 400, scale = 2 )


Line graph (mostly used for drawing average lines)




x  = seq(0.01, .99, length=100)
df = data.frame(x = rep(x, 2), y = c(qlogis(x), 2 * qlogis(x)), genetype = rep(c("e_lncRNA","p_lncRNA"), each=100))
p  = ggplot(df, aes(x=x, y=y, group=genetype))

p + geom_line(aes(colour = genetype), linetype = 1)  # linetype = 2 gives a dashed line

How to use a confidence interval in a line graph


      xval  yval  groupname
ab       1    10  group1
ab       2    15  group1
c        1    10  group2
cd       2    15  group2


p = ggplot(df, aes(x=xval, y=yval, group=groupname, colour=factor(groupname)))  # df holds the data above
p + stat_smooth()


Customizing label, axis, legend

How to remove legends 

p + theme(legend.position = "none")
 
How to change x/y axis labels, title, etc.

p+  labs(list(title = "Gm12878", x = "Distance to TSS", y = "Avg. Signal"))
 

How to change axis-tick label 

http://www.cookbook-r.com/Graphs/Axes_%28ggplot2%29/#setting-tick-mark-labels

# discrete case
x <- c("cl", "t1", "t2")   # example break values
p + scale_x_discrete(breaks=x, labels=x)
 
# groupwise 
p + scale_x_discrete(breaks=c("cl", "t1", "t2"),labels=c("Control", "Treat1", "Treat2"))

# continuous case: specify tick marks directly
p + coord_cartesian(ylim=c(5, 7.5)) +  scale_y_continuous(breaks=seq(0, 10, 0.25))  # Ticks from 0-10, every .25

Wednesday, January 7, 2015

SLURM tutorial : Basic commands


Main website for learning SLURM


http://slurm.schedmd.com/tutorials.html

Submit a job with a name and output file name (this overrides the parameters in the shell script header)

sbatch   -J   job1  -o   job1.out  --partition=batch    myscript.sh

 

Basic shell script for a job

#!/bin/sh
#
#SBATCH --job-name=testJob
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=dragon-default
#
# Display all variables set by slurm
env | grep "^SLURM" | sort

#
cd /projects/dragon/FANTOM5/processed_data_feature

## All my commands for job will go here

date;time;
mkdir t1

How to submit a batch job

sbatch myscript.sh

How to check the list of jobs of a user

squeue -u user1
squeue -u user1 -l  # it will show in details
 

How to check the whole history and status of a job

 scontrol show job=JOBID

 

How to use one particular node in interactive mode (useful when all jobs are pending and you need to run a job)



srun --pty --time=5:00:00 bash


How to kill a job


  • Cancel job 1234 along with all of its steps:
        scancel 1234
  • Send SIGKILL to all steps of job 1235, but do not cancel the job itself:
        scancel --signal=KILL 1235
  • Send SIGUSR1 to the batch shell processes of job 1236:
        scancel --signal=USR1 --batch 1236
  • Cancel all pending jobs belonging to user "bob" in partition "debug":
        scancel --state=PENDING --user=bob --partition=debug
  • Cancel only array ID 4 of job array 1237:
        scancel 1237_4

How to start a node in interactive mode


srun --pty --nodes=1 --exclusive --partition=interactive bash -l

How to start a GUI on the cluster


You need an X server to access the GUI.

1. Login to cluster

 ssh -Y username@login.cbrc.kaust.edu.sa

Next, you'll need to start an interactive job, for example to get a whole node in the interactive queue:

2.  Open an interactive node

srun --pty --nodes=1 --exclusive --partition=interactive bash -l

3. Suppose you want to use MATLAB in the GUI; then run the following two commands:

module load matlab
matlab &

Full list of SLURM commands [ Source: http://www.tchpc.tcd.ie/node/129 ]

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case insensitive.
  • sacct is used to report job or job step accounting information about active or completed jobs.
  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
  • sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.
  • sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
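Most of these commands need a live cluster, but their text output can be post-processed with standard tools. As a sketch using simulated `squeue` output (the job lines below are made up for illustration), you can count a user's running jobs:

```shell
#!/bin/sh
# Simulated `squeue` output; on a real cluster you would pipe squeue itself.
cat > squeue.out <<'EOF'
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  101     batch     job1      bob  R       1:00      1 node01
  102     batch     job2      bob PD       0:00      1 (Resources)
  103     batch     job3    alice  R       2:00      1 node02
EOF

# Count bob's jobs whose state (ST column) is R (running)
RUNNING=$(awk '$4 == "bob" && $5 == "R" {n++} END {print n+0}' squeue.out)
echo "bob has $RUNNING running job(s)"
```

On a real cluster, replace the heredoc with `squeue -u bob` piped into the same awk filter.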

Handling multiple jobs with a Job Array:


 http://slurm.schedmd.com/job_array.html
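A minimal array-job script might look like the sketch below (the file `inputs.txt` and the sample data are hypothetical; `$SLURM_ARRAY_TASK_ID` is set by SLURM for each array task):

```shell
#!/bin/sh
#SBATCH --job-name=arrayJob
#SBATCH --array=1-3               # three tasks, with indices 1..3
#SBATCH --partition=dragon-default

# Outside SLURM (e.g. a local dry run) default the task index to 1
: "${SLURM_ARRAY_TASK_ID:=1}"

# Hypothetical input list: task N processes line N of inputs.txt
printf 'sample1.fa\nsample2.fa\nsample3.fa\n' > inputs.txt   # stand-in data
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "Task ${SLURM_ARRAY_TASK_ID} processing ${INPUT}"
```

Submit the script with `sbatch`; a single array element can be cancelled with `scancel JOBID_4` as shown earlier.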