MATLAB: check for unique strings in a file

function identifyDuplicate
clc;

uniqueSeq={};
dupSeq={};

index=1;
uniqueIndex=1;
dupIndex=1;
uniq=[];
dup=[];
isDuplicated = 0;
fid = fopen('1400M_from_287PS_287NS.ranked','r');
if fid == -1
    error('Could not open the input file.');
end


tline = fgetl(fid); % ******
 while ischar(tline)
    
     consensusSeq = fgetl(fid); % Consensus: AAACC

     % skip the label, keep the sequence, and normalize it to upper case
     curSeq = sscanf(consensusSeq,'%*s %s', [1, inf]);
     curSeq = upper(curSeq);

     fgetl(fid); % Threshold
     fgetl(fid); % Coverage
     fgetl(fid); % p-value
     fgetl(fid); % r1
     fgetl(fid); % r2
     fgetl(fid); % r3
     fgetl(fid); % r4
    
     % compare the current sequence against every unique sequence seen so far
     for en=1:uniqueIndex -1
         if strcmp(curSeq,uniqueSeq{en})
             isDuplicated = 1;
             break;
         end
     end
      
     if( isDuplicated == 1 ) % already seen
         dupSeq{dupIndex} = curSeq; % store the char vector itself, not a nested cell
         dupIndex = dupIndex + 1;
         dup = [dup;index];
     else % not found
         uniqueSeq{uniqueIndex} = curSeq;
         uniqueIndex = uniqueIndex + 1;
         uniq = [ uniq;index];
     end
      
   
     
     tline = fgetl(fid); % next ******
     index = index + 1;
     isDuplicated = 0;
    
    
 end


 dlmwrite('unique',uniq,'\t'); % record indices of the unique entries
 dlmwrite('dup'   ,dup   ,'\t'); % record indices of the duplicate entries


fclose(fid);
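
The loop above compares each sequence against every unique sequence found so far. If the sequences are first collected into a cell array, the built-in unique function with the 'stable' option can give the same split in a few lines. This is only a minimal sketch: the seqs cell array below is made-up sample data standing in for the consensus sequences from the file, and the names firstIdx and dupIdx are illustrative, not part of the original function.

```matlab
% Hypothetical sample data standing in for the consensus sequences
% read from the file (one entry per record, already upper-cased).
seqs = {'AAACC', 'GGTTA', 'AAACC', 'CCGTA', 'GGTTA'};

% 'stable' keeps the first occurrence of each sequence in file order.
% firstIdx(k) is the record index where the k-th unique sequence first appears.
[uniqueSeq, firstIdx] = unique(seqs, 'stable');

% Every record that is not a first occurrence is a duplicate.
dupIdx = setdiff(1:numel(seqs), firstIdx);

disp(uniqueSeq);     % unique sequences, in order of first appearance
disp(firstIdx(:)');  % record indices of the unique entries
disp(dupIdx);        % record indices of the duplicate entries
```

Here firstIdx plays the role of uniq and dupIdx the role of dup in the function above, so both could be written out with dlmwrite in the same way.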

