Bioinformatics is Fun: 2014

Friday, 17 October 2014

Bowtie

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, together with a novel quality-aware backtracking algorithm that permits mismatches. The index has a memory footprint of only about 1.3 gigabytes (GB) for the human genome, which allows Bowtie to run on a typical desktop computer with 2 GB of RAM. The index is also small enough to be distributed over the internet, stored on disk and re-used. Multiple processor cores can be used simultaneously to achieve even greater alignment speed.
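To give a feel for what a Burrows-Wheeler index does, here is a toy C++ sketch. It is not Bowtie's actual implementation: it builds the transform by sorting all rotations of the text (fine for short strings, far too slow for a genome) and counts exact occurrences of a read with FM-style backward search. The function names `bwt` and `count_occurrences` and the `$` end marker are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Burrows-Wheeler transform via sorted rotations (toy version).
// '$' is appended as an end marker and must not occur in the text.
std::string bwt(const std::string &text)
{
    const std::string t = text + "$";
    std::vector<std::string> rotations;
    for (std::size_t i = 0; i < t.size(); i++)
    {
        rotations.push_back(t.substr(i) + t.substr(0, i));
    }
    std::sort(rotations.begin(), rotations.end());
    std::string last_column;
    for (const std::string &r : rotations)
    {
        last_column.push_back(r.back());
    }
    return last_column;
}

// Count exact occurrences of pattern in text using backward search.
int count_occurrences(const std::string &text, const std::string &pattern)
{
    const std::string L = bwt(text);
    // C[c] = number of characters in the text that sort before c
    std::map<char, int> freq, C;
    for (char c : L) freq[c]++;
    int total = 0;
    for (const auto &kv : freq) { C[kv.first] = total; total += kv.second; }

    // occ(c, i) = number of times c appears in L[0, i)
    auto occ = [&L](char c, int i) {
        return static_cast<int>(std::count(L.begin(), L.begin() + i, c));
    };

    // [lo, hi) is the range of sorted rotations that start with the
    // part of the pattern processed so far (read right to left).
    int lo = 0, hi = static_cast<int>(L.size());
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it)
    {
        if (C.find(*it) == C.end()) return 0;
        lo = C[*it] + occ(*it, lo);
        hi = C[*it] + occ(*it, hi);
        if (lo >= hi) return 0;
    }
    return hi - lo;
}
```

For example, `count_occurrences("ACGTACGT", "ACG")` finds two hits. A real FM index stores sampled `occ` tables instead of rescanning the string, which is where the small memory footprint comes from.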

Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions.

Bowtie makes a number of compromises to achieve this speed, but these trade-offs are reasonable within the context of mammalian re-sequencing projects. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest-quality alignment. With its highest-performance settings, Bowtie may fail to align a small number of reads with valid alignments if those reads have multiple mismatches. If stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance.

Bowtie is open source: http://bowtie.cbcb.umd.edu

Thursday, 16 October 2014

Code to find ratio of AT and GC

#include <iostream>
#include <string>

using namespace std;

// Count the A, T, G and C bases in s (uppercase only) and compute the
// AT and GC content as percentages of the sequence length.
// All results are handed back through the reference parameters.
void count_bases(const string &s, int &a_count, int &t_count, int &g_count, int &c_count, float &at, float &gc)
{
    for (size_t i = 0; i < s.size(); i++)
    {
        if (s[i] == 'A')
        {
            a_count++;
        }
        else if (s[i] == 'T')
        {
            t_count++;
        }
        else if (s[i] == 'G')
        {
            g_count++;
        }
        else if (s[i] == 'C')
        {
            c_count++;
        }
    }
    // Avoid dividing by zero on an empty sequence
    if (s.empty())
    {
        return;
    }
    // Multiply by 100.0f first so the division is done in floating point
    at = (a_count + t_count) * 100.0f / s.size();
    gc = (c_count + g_count) * 100.0f / s.size();
}

int main()
{
    int a_count = 0;
    int t_count = 0;
    int g_count = 0;
    int c_count = 0;
    float at = 0.0;
    float gc = 0.0;
    string s;
    cout << "Enter a sequence: " << endl;
    cin >> s;
    if (s.empty())
    {
        cout << "No sequence entered." << endl;
        return 1;
    }
    cout << "The size of sequence is: " << s.size() << endl;
    count_bases(s, a_count, t_count, g_count, c_count, at, gc);
    cout << "The number of A: " << a_count << endl;
    cout << "The number of T: " << t_count << endl;
    cout << "The number of G: " << g_count << endl;
    cout << "The number of C: " << c_count << endl;
    cout << "Percentage of A: " << (a_count * 100.0f / s.size()) << endl;
    cout << "Percentage of T: " << (t_count * 100.0f / s.size()) << endl;
    cout << "Percentage of G: " << (g_count * 100.0f / s.size()) << endl;
    cout << "Percentage of C: " << (c_count * 100.0f / s.size()) << endl;
    cout << "Percentage of AT: " << at << endl;
    cout << "Percentage of GC: " << gc << endl;

    return 0;
}

Code to find ORF and genes in a string

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

void complement(string &seq);

// Split seq into codons for the given reading frame and append them to ivec.
// Frames 1-3 read the forward strand; frames 4-6 read the reverse
// complement, i.e. reading frames -1, -2 and -3.
void get_codon(string seq, vector<string> &ivec, int frame_no)
{
    if (frame_no < 1 || frame_no > 6)
    {
        return;
    }
    size_t offset = frame_no - 1;
    if (frame_no >= 4)
    {
        complement(seq);
        reverse(seq.begin(), seq.end());
        // frames 4, 5, 6 start at offsets 0, 1, 2 of the reverse complement
        offset = frame_no - 4;
    }
    // extract complete triplets only; leftover bases at the end are dropped
    for (size_t i = offset; i + 3 <= seq.size(); i += 3)
    {
        ivec.push_back(seq.substr(i, 3));
    }
}

// Replace every base in seq with its complement (output is uppercase).
// Characters other than A, T, G and C are left unchanged.
void complement(string &seq)
{
    for (size_t i = 0; i < seq.size(); i++)
    {
        if (seq[i] == 'A' || seq[i] == 'a')
            seq[i] = 'T';
        else if (seq[i] == 'T' || seq[i] == 't')
            seq[i] = 'A';
        else if (seq[i] == 'G' || seq[i] == 'g')
            seq[i] = 'C';
        else if (seq[i] == 'C' || seq[i] == 'c')
            seq[i] = 'G';
    }
}

// this function extracts genes from the codons of one reading frame:
// each gene runs from a start codon to the next stop codon :P
void codon_find(vector<string> &ivec, vector<string> &gvec)
{
    bool in_gene = false;
    string gene;
    const string start = "ATG";
    const string stop1 = "TAG";
    const string stop2 = "TAA";
    const string stop3 = "TGA";
    for (size_t i = 0; i < ivec.size(); i++)
    {
        const bool is_stop = (ivec[i] == stop1) || (ivec[i] == stop2) || (ivec[i] == stop3);
        // a start codon begins a gene; a second start codon before any stop
        // codon restarts the gene from that point, as in the original logic
        if (ivec[i] == start)
        {
            in_gene = true;
            gene = ivec[i];
        }
        // a stop codon only counts once a start codon has been seen;
        // the finished gene is pushed to the gvec vector
        else if (in_gene && is_stop)
        {
            gene = gene + ivec[i];
            gvec.push_back(gene);
            gvec.push_back("\n");
            gene = "";
            in_gene = false;
        }
        // any other codon inside a gene is simply appended
        else if (in_gene)
        {
            gene = gene + ivec[i];
        }
    }
}
int main()
{
    string seq= "GAAGTGTTTTATCTGACTTACACCCCTGAAGATGTTGAAGGGAATGTTCAGCTGGAAACTGGAGATAAAATAAACTTTGTAATTGATAACAATAAACATACTGGTGCTGTAAGTGCTCGTAATATTATGCTGTTGAAAAAGAAACAAGCTCGCTATCAGGGAGTAGTTTGTGCCATGAAAGAGGCATTTGGCTTTATTGAAAGAGGCGATATTGTAAAGGAGATATTCTTTCACTATAGTGAATTTAAAGGTGACTTAGAATCCTTACAGCCTGGAGATGACGTGGAATTCACAATCAAGGACCGAAATGGTAAAGAAGTTGCAACAGATGTCAGACTATTGCCTCAAGGAACAGTCATTTTTGAAGATATCAGCATTGAACATTTTGAAGGAACTGTAACCAAAGTTATCCCCAAAGTACCCAGTAAAAACCAGAATGACCCATTGCCAGGACGCATCAAAGTTGATTTTGTGATTCCTAAAGAACTTCCCTTTGGAGACAAAGATACAAAATCCAAGGTGACGCTGTTGGAAGGTGACCACGTTAGGTTTAATATTTCAACAGACCGTCGTGACAAATTAGAACGAGCAACCAACATAGAAGTTCTATCAAATACATTTCAGTTCACTAATGAAGCCAGAGAGATGGGTGTAATTGCTGCCATGAGAGATGGTTTTGGTTTCATCAAGTGTGTGGATCGTGATGCTCGTATGTTCTTCCACTTCAGTGAAATTCTGGATGGGAACCAGCTTCATATTGCAGATGAAGTAGAGTTTACTGTGGTTCCTGATATGCTCTCTGCCCAAAGAAATCATGCTATTAGGATTAAAAAACTTCCCAAGGGCACGGTTTCGTTCCACTCCCATTCAGATCATCGTTTTCTGGGCACTGTAGAAAAAGAGGCCACTTTTTCGAATCCTAAAACCACTAGCCCAAATAAAGGCAAAGAAAAGGAGGCTGAGGATGGCATTATTGCTTATGATGATTGTGGGGTGAAACTGACTATTGCTTTTCAAGCCAAGGATGTGGAAGGATCTACTTCTCCTCAAATAGGAGACAAGGTTGAATTTAGTATTAGTGACAAACAGAGGCCTGGACAGCAGATTGCAACTTGTGTGCGGCTCTTAGGTCGTAATTCAAACTCCAAGAGGCTCTTGGGTTATGTGGCAACTTTGAAGGATAATTTTGGATTTATTGAAACAGCCAATCATGATAAGGAAATCTTTTTCCATTACAGTGAGTTCTCTGGTGATGTTGATAGCCTGGAACTGGGGGACATGGTTGAGTACAGCTTGTCCAAAGGAAAAGGCAACAAAGTCAGTGCAGAAAAAGTGAACAAAACACACTCAGTGAATGGCATTACTGAGGAAGCTGATCCCACCATCTACTCTGGTAAAGTCATTCGCCCCTTGAGGAGTGTTGATCCAACACAGAATGAGTACCAAGGAATGATTGAGATCGTGGACGAAGGGGATATGAAAGGTGAGGTCTATCCATTTGGCATAGTTGGGATGGCCAACAAAGGGGATTGCCTACAGAAAGGGGAGAGTGTCAAGTTCCAGTTGTGTGTCCTGGGCCAAAATGCACAGACTATGGCCTACAACATCACACCCCTGCGTAGGGCTACAGTGGAGTGTGTGAAAGATCAGTTTGGCTTCATTAACTATGAAGTAGGAGATAGCAAGAAGCTCTTTTTCCACGTGAAAGAAGTTCAGGATGGCATTGAGCTACAGGCAGGAGATGAGGTGGAATTCTCAGTGATTCTTAATCAGCGCACTGGCAAGTGCAGTGCTTGTAATGTTTGGCGAGTCTGCGAGGGCCCCAAGGCTGTTGCAGCTCCACGACCTGATAGGTTGGTCAATCGCTTGAAGAATATCACCCTGGATGATGCCAGTGCTCCTCGCCTAATGGTTCTTCGTCAGCCAAGGGGACCAGATAACTCAATGGGATTTGGTGCAGAAAGAAAGATCCGTCAAGCTGGT"
                "GTCATTGACTAACCACATCCACAAAGCACATCATTAATCCACTATGATCAAGTTGGGGGGATTCTGGTGAAGGGTTCTGAATATCTCTCTCTTCATCCCTCCCAAAATCTGGAATACTTATTCTATTGAGCTATTACACCAGTTTTAACACCTTCC";
    vector<string>ivec;
    vector<string>gvec;
    //calling the function for ORF finder :P
    for(int j=1; j<=6;j++)
    {

        cout<<"ORF"<<j<<":-"<<endl;
        get_codon(seq,ivec,j);

        cout<<endl;
        for(int i = 0;i<ivec.size();i++)
        {
            cout<<ivec[i]<<" ";

        }
        cout<<endl;
    //calling the codon find function :P
        codon_find(ivec,gvec);
        vector<string>::iterator iter;
        cout << "Genes :- "<< endl;
        for(iter= gvec.begin();iter<gvec.end();iter++)
        {
            cout<<*iter;

        }
        cout<<endl;
        ivec.clear();
        gvec.clear();

    }
    return (0);

}

Models Models everywhere....

Modelling is the current leap in bioinformatics. A model is essential for data interpretation, but the fundamental question is: what level of model should be chosen? A model class should be selected according to the data requirements and the objectives of the modelling and analysis. This involves classical engineering trade-offs. For example, a “fine” model with many parameters will capture detailed “low-level” phenomena but will require large amounts of data for inference, otherwise the model risks being overfitted to the data, whereas a less complex “coarse” model with fewer parameters will capture only “high-level” phenomena but will need far less data. Within a chosen model class, according to Occam’s razor, the model should never be made more complex than is necessary to “explain the data”. There are numerous approaches to modelling gene regulatory networks: they range from linear models, Bayesian networks, neural networks, nonlinear ordinary differential equations and stochastic models to Boolean models, logical networks, Petri nets, graph-based models, grammars and process algebras.
So, which model will you choose for your biological data...?!!
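To make the "coarse" end of that list a little more concrete, here is a minimal C++ sketch of a synchronous Boolean network. The three genes A, B and C and their update rules are entirely hypothetical, invented only to show how few parameters such a model needs; they are not taken from any dataset.

```cpp
#include <array>
#include <vector>

// State of a hypothetical three-gene network: each gene is ON (true) or OFF (false).
using State = std::array<bool, 3>; // index 0 = A, 1 = B, 2 = C

// One synchronous update with made-up rules:
//   A' = NOT C    (C represses A)
//   B' = A        (A activates B)
//   C' = A AND B  (C needs both A and B)
State step(const State &s)
{
    return State{ !s[2], s[0], s[0] && s[1] };
}

// Iterate the network for n steps and record the trajectory,
// including the starting state.
std::vector<State> simulate(State s, int n)
{
    std::vector<State> trajectory{ s };
    for (int i = 0; i < n; i++)
    {
        s = step(s);
        trajectory.push_back(s);
    }
    return trajectory;
}
```

Starting from all genes OFF, this toy network cycles through five states and returns to where it began. The whole model is three logical rules, whereas a differential-equation model of the same three genes would need rate constants that have to be fitted from data.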

Monday, 13 October 2014

Gene ID conversion resources

I see a lot of my friends spending time searching through databases. There is a dire need to compile all the available databases in a useful and user-friendly way.

Here, let's see the currently available Gene ID conversion resources:

NetAffx: http://www.affymetrix.com/analysis/index.affx

It supports batch and NetAffx queries, provides a detailed description of each individual probeset, and allows the user to group probesets according to annotation type. It covers 10 species and has 17 possible conversions.

g:Profiler: http://biit.cs.ut.ee/gprofiler/gconvert.cgi

Input can be a mix of different ID types; the user only has to choose the desired output. It covers 31 species and has 14 possible conversions.

Babelomics ID Converter: http://babelomics3.bioinfo.cipf.es/

It accepts only GeneID as the input ID type. It covers 11 species, and 36 output ID types are available.

BioDBnet: http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

The most comprehensive and very easy to use database, in which we can give any ID as input and get the output in any convertible ID format.

DAVID: http://david.abcc.ncifcrf.gov/conversion.jsp

The user can upload more than one file of different ID types. A specific format is not available and it has 32 possible conversions.

Clone/Gene ID converter: http://idconverter.bioinfo.cnio.es/

Gene ontology, pathway and literature references are available. Provides chromosome location for human. Covers 3 species and has 25 possible conversions.

The Synergizer: http://llama.mshri.on.ca/synergizer/translate/

It is a repository of gene and protein identifier synonym relationships. The number of species covered changes based on the 'authority' used, and the number of input and output formats changes accordingly.

MUSIC

We all listen to music. It may be rap, rock, heavy metal or something melodious; any kind of music may be close to the heart.
But here is the bioinformatics music: MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data, available at music.gersteinlab.org.
MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, to identify enriched regions at multiple length scales.
This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods.
So, start listening to... oops, working on your MUSIC.
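To make the idea of median filtering at several scales concrete, here is a rough C++ sketch. It is not MUSIC's actual algorithm: the functions `median_filter` and `multiscale`, the window sizes and the toy read-depth values are all assumptions for illustration. A small window keeps narrow peaks, while a larger window smooths them away and leaves only broader enrichment.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sliding-window median filter. The window spans half_width positions on
// each side of position i and is truncated at the ends of the signal.
std::vector<double> median_filter(const std::vector<double> &signal, std::size_t half_width)
{
    std::vector<double> out(signal.size());
    for (std::size_t i = 0; i < signal.size(); i++)
    {
        const std::size_t first = (i >= half_width) ? i - half_width : 0;
        const std::size_t last = std::min(signal.size(), i + half_width + 1);
        std::vector<double> window(signal.begin() + first, signal.begin() + last);
        // nth_element puts the median in the middle slot; for an even-sized
        // (truncated) window this picks the upper of the two middle values
        std::nth_element(window.begin(), window.begin() + window.size() / 2, window.end());
        out[i] = window[window.size() / 2];
    }
    return out;
}

// Smooth the same signal at several scales, one output track per half-width.
std::vector<std::vector<double>> multiscale(const std::vector<double> &signal,
                                            const std::vector<std::size_t> &half_widths)
{
    std::vector<std::vector<double>> smoothed;
    for (std::size_t w : half_widths)
    {
        smoothed.push_back(median_filter(signal, w));
    }
    return smoothed;
}
```

With a half-width of 1, a one-position spike such as `{1, 1, 9, 1, 1}` is flattened to all ones, while a three-position plateau such as `{0, 0, 5, 5, 5, 0, 0}` passes through unchanged. Comparing the tracks from several half-widths is one simple way to see at which length scale a region stands out.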

OncoCis

Whole genome sequencing has enabled the identification of thousands of somatic mutations within non-coding genomic regions of individual cancer samples. However, identification of mutations that potentially alter gene regulation remains a major challenge. OncoCis is a new method that enables identification of potential cis-regulatory mutations using cell-type specific genome and epigenome-wide datasets along with matching gene expression data. OncoCis demonstrates that the use of cell-type specific information and gene expression can significantly reduce the number of candidate cis-regulatory mutations compared with existing tools designed for the annotation of cis-regulatory SNPs. The OncoCis webserver is freely accessible at https://powcs.med.unsw.edu.au/OncoCis/.

Friday, 5 September 2014

Edit the Genome

Bioinformatics Post Doctoral Position

http://www.genomeweb.com/node/1435336

Monday, 13 January 2014

Perl Code for Optimal alignment

# Needleman-Wunsch Algorithm
use strict;
use warnings;

# usage statement
die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;

# get sequences from command line
my ($seq1, $seq2) = @ARGV;

# scoring scheme
my $MATCH    =  1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP      = -1; # -1 for any gap

# initialization
my @matrix;
$matrix[0][0]{score}   = 0;
$matrix[0][0]{pointer} = "none";
for(my $j = 1; $j <= length($seq1); $j++) {
    $matrix[0][$j]{score}   = $GAP * $j;
    $matrix[0][$j]{pointer} = "left";
}
for (my $i = 1; $i <= length($seq2); $i++) {
    $matrix[$i][0]{score}   = $GAP * $i;
    $matrix[$i][0]{pointer} = "up";
}

# fill
for(my $i = 1; $i <= length($seq2); $i++) {
    for(my $j = 1; $j <= length($seq1); $j++) {
        my ($diagonal_score, $left_score, $up_score);

        # calculate match score
        my $letter1 = substr($seq1, $j-1, 1);
        my $letter2 = substr($seq2, $i-1, 1);                            
        if ($letter1 eq $letter2) {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
        }
        else {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
        }

        # calculate gap scores
        $up_score   = $matrix[$i-1][$j]{score} + $GAP;
        $left_score = $matrix[$i][$j-1]{score} + $GAP;

        # choose best score
        if ($diagonal_score >= $up_score) {
            if ($diagonal_score >= $left_score) {
                $matrix[$i][$j]{score}   = $diagonal_score;
                $matrix[$i][$j]{pointer} = "diagonal";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        } else {
            if ($up_score >= $left_score) {
                $matrix[$i][$j]{score}   = $up_score;
                $matrix[$i][$j]{pointer} = "up";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        }
    }
}

# trace-back

my $align1 = "";
my $align2 = "";

# start at last cell of matrix
my $j = length($seq1);
my $i = length($seq2);

while (1) {
    last if $matrix[$i][$j]{pointer} eq "none"; # ends at first cell of matrix

    if ($matrix[$i][$j]{pointer} eq "diagonal") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= substr($seq2, $i-1, 1);
        $i--;
        $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "left") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= "-";
        $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "up") {
        $align1 .= "-";
        $align2 .= substr($seq2, $i-1, 1);
        $i--;
    }    
}

$align1 = reverse $align1;
$align2 = reverse $align2;
print "$align1\n";
print "$align2\n";

Needleman-Wunsch explained in detail

Global alignment algorithm:

The alignment takes place in a two-dimensional matrix in which each cell corresponds to a pairing of one letter from each sequence. The alignment starts at the upper left and follows a mostly diagonal path down and to the right. When two letters are aligned, the path follows a diagonal trajectory. There are several places in which the letters from TCGCA are paired with gap characters. In this case, the path is followed horizontally. Although not shown here, the path may also be followed vertically when the letters from TCCA are paired with gap characters. Gap characters can never be paired with each other.

Initialization:
In the initialization phase, you assign values to the first row and column, on which the next stage of the algorithm depends. The score of each cell is set to the gap score multiplied by the distance from the origin. Gaps may be present at the beginning of either sequence, and their cost is the same as anywhere else. The arrows all point back to the origin, which ensures that alignments go all the way back to the origin (a requirement for global alignment). For example, if the gap penalty is -2, the scores will be -2, -4, -6 and so on.
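As a small C++ sketch of this step (the `Cell` struct and the `init_matrix` name are my own choices for illustration, not part of the Perl script), the first row and column could be filled like this:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One cell of the alignment matrix: a score and a pointer to the cell it came from.
struct Cell
{
    int score = 0;
    std::string pointer = "none";
};

// Build a (rows + 1) x (cols + 1) matrix and initialise its first row and column.
// cols is the length of sequence 1, rows is the length of sequence 2.
std::vector<std::vector<Cell>> init_matrix(std::size_t rows, std::size_t cols, int gap)
{
    std::vector<std::vector<Cell>> m(rows + 1, std::vector<Cell>(cols + 1));
    // first row: every cell is a run of gaps, pointing left towards the origin
    for (std::size_t j = 1; j <= cols; j++)
    {
        m[0][j].score = gap * static_cast<int>(j);
        m[0][j].pointer = "left";
    }
    // first column: the same, pointing up towards the origin
    for (std::size_t i = 1; i <= rows; i++)
    {
        m[i][0].score = gap * static_cast<int>(i);
        m[i][0].pointer = "up";
    }
    return m;
}
```

With a gap penalty of -2, the first row of `init_matrix(2, 3, -2)` holds the scores 0, -2, -4, -6, and the origin keeps the pointer "none".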

Fill:
In the fill phase (also called induction), the entire matrix is filled with scores and pointers using a simple operation that requires the scores from the diagonal, vertical, and horizontal neighbouring cells. You will compute three scores: a match score, a vertical gap score, and a horizontal gap score. The match score is the sum of the diagonal cell score and the score for a match (+1) or a mismatch (-1). The horizontal gap score is the sum of the cell to the left and the gap score (-1), and the vertical gap score is computed analogously. Once you've computed these scores, assign the maximum value to the cell and point the arrow in the direction of the maximum score. Continue this operation until the entire matrix is filled, and each cell contains the score and pointer to the best possible alignment at that point.
For example, in the first cell to be filled, the match score is the sum of the preceding diagonal cell (score = 0) and the score for aligning C to P (-1), so the total match score is -1. The horizontal gap score is the sum of the score to the left (-1) and the gap score (-1), and is therefore -2. The same is true for the vertical gap score. The maximum score is therefore the diagonal score (-1), and the pointer is set to the diagonal.
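The scoring of a single cell can be sketched in C++ as follows. The `Move` struct and `best_move` function are hypothetical names; the tie-breaking order (diagonal, then up, then left) is chosen to mirror the nested `if` statements in the Perl script.

```cpp
#include <string>

// Result of scoring one cell: the best score and where it came from.
struct Move
{
    int score;
    std::string pointer;
};

// Score one cell from its three neighbours.
// diagonal, up and left are the scores already stored in those cells.
Move best_move(int diagonal, int up, int left, bool letters_match,
               int match = 1, int mismatch = -1, int gap = -1)
{
    const int diagonal_score = diagonal + (letters_match ? match : mismatch);
    const int up_score = up + gap;
    const int left_score = left + gap;

    // pick the maximum; on ties prefer diagonal, then up, then left
    if (diagonal_score >= up_score && diagonal_score >= left_score)
    {
        return {diagonal_score, "diagonal"};
    }
    if (up_score >= left_score)
    {
        return {up_score, "up"};
    }
    return {left_score, "left"};
}
```

For the C-versus-P cell described above, `best_move(0, -1, -1, false)` gives a score of -1 with the pointer set to "diagonal".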

Trace-back:

The trace-back lets you recover the alignment from the matrix. Like the other parts of this algorithm, it's pretty simple. Start at the bottom-right corner and follow the arrows until you get to the beginning. To produce the alignment, at each cell, write out the corresponding letters or a hyphen for the gap symbol. Since you're following it from the end to the start, the alignment will be backward, and you just reverse it. The final alignment looks like this:
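As a rough C++ counterpart to the Perl script, here is a sketch of all three phases in one function. The function name `needleman_wunsch` and the single-character pointer codes are my own choices; the scoring defaults (+1 match, -1 mismatch, -1 gap) are the ones used in the Perl code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Global alignment of seq1 (columns) against seq2 (rows).
// Returns the two aligned strings, with '-' marking gaps.
std::pair<std::string, std::string> needleman_wunsch(const std::string &seq1, const std::string &seq2,
                                                     int match = 1, int mismatch = -1, int gap = -1)
{
    const std::size_t rows = seq2.size(), cols = seq1.size();
    std::vector<std::vector<int>> score(rows + 1, std::vector<int>(cols + 1, 0));
    // pointer codes: 'D' = diagonal, 'U' = up, 'L' = left, 'N' = none (origin)
    std::vector<std::vector<char>> pointer(rows + 1, std::vector<char>(cols + 1, 'N'));

    // initialization: first row and first column
    for (std::size_t j = 1; j <= cols; j++) { score[0][j] = gap * static_cast<int>(j); pointer[0][j] = 'L'; }
    for (std::size_t i = 1; i <= rows; i++) { score[i][0] = gap * static_cast<int>(i); pointer[i][0] = 'U'; }

    // fill: each cell takes the best of diagonal, up and left
    for (std::size_t i = 1; i <= rows; i++)
    {
        for (std::size_t j = 1; j <= cols; j++)
        {
            const int d = score[i - 1][j - 1] + (seq1[j - 1] == seq2[i - 1] ? match : mismatch);
            const int u = score[i - 1][j] + gap;
            const int l = score[i][j - 1] + gap;
            if (d >= u && d >= l) { score[i][j] = d; pointer[i][j] = 'D'; }
            else if (u >= l)      { score[i][j] = u; pointer[i][j] = 'U'; }
            else                  { score[i][j] = l; pointer[i][j] = 'L'; }
        }
    }

    // trace-back: follow the pointers from the last cell to the origin
    std::string align1, align2;
    std::size_t i = rows, j = cols;
    while (pointer[i][j] != 'N')
    {
        if (pointer[i][j] == 'D')      { align1 += seq1[j - 1]; align2 += seq2[i - 1]; i--; j--; }
        else if (pointer[i][j] == 'L') { align1 += seq1[j - 1]; align2 += '-'; j--; }
        else                           { align1 += '-'; align2 += seq2[i - 1]; i--; }
    }
    // the alignment was collected end to start, so reverse it
    std::reverse(align1.begin(), align1.end());
    std::reverse(align2.begin(), align2.end());
    return {align1, align2};
}
```

Calling `needleman_wunsch("TCGCA", "TCCA")` with the default scores returns TCGCA aligned against TC-CA: the G has no partner in the shorter sequence, so it is paired with a gap.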