Bioinformatics is Fun: October 2014

Friday, 17 October 2014

Bowtie

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. Bowtie employs a Burrows-Wheeler index with a novel quality-aware backtracking algorithm that permits mismatches and it is based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome. The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM. The index is small enough to be distributed over the internet and to be stored on disk and re-used. Multiple processor cores can be used simultaneously to achieve even greater alignment speed.

Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions.

Bowtie makes a number of compromises to achieve this speed, but these trade-offs are reasonable within the context of mammalian re-sequencing projects. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment. With its highest performance settings, Bowtie may fail to align a small number of reads with valid alignments, if those reads have multiple mismatches. If the stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance.

Bowtie is open source http://bowtie.cbcb.umd.edu

Thursday, 16 October 2014

Code to find ratio of AT and GC

using namespace std;

#include<iostream>
#include<string>

int count( string &s, int& a_count, int& t_count, int& g_count, int& c_count, float &at , float &gc)
{
for(int i=0;i<s.size();i++)
{

    if(s[i]=='A')
    {
        a_count++;
    }
    else if(s[i]=='T')
    {
        t_count++;
    }
    else if(s[i]=='G')
    {
        g_count++;
    }
    else if(s[i]=='C')
    {
        c_count++;
    }
}
at= ((a_count+t_count)*100/s.size());
gc= ((c_count+g_count)*100/s.size());
return (a_count);
return (t_count);
return (g_count);
return (c_count);
return at;
return gc;

}

int main()
{
    int a_count=0;
    int t_count=0;
    int g_count=0;
    int c_count=0;
    float at=0.0;
    float gc=0.0;
    string s;
    cout<<"Enter a sequence: "<<endl;
    cin>>s;
    cout<<"The size of sequence is: "<<s.size()<<endl;
    count(s,a_count,t_count,g_count,c_count,at,gc);
    cout<<"The number of A: "<<a_count<<endl;
    cout<<"The number of T: "<<t_count<<endl;
    cout<<"The number of G: "<<g_count<<endl;
    cout<<"The number of C: "<<c_count<<endl;
    cout<<"Ratio of A "<< ((a_count)/s.size())*100)<<endl;
    cout<<"Ratio of T "<< ((t_count)/s.size())*100)<<endl;
    cout<<"Ratio of G "<< ((g_count)/s.size())*100)<<endl;
    cout<<"Ratio of C "<< ((c_count)/s.size())*100)<<endl;
    cout<<"Ratio of AT " << at<<endl;
    cout<<"Ratio of GC " << gc<<endl;

    return 0;
}

Code to find ORF and genes in a string

#include<iostream>
#include<cstring>
#include<vector>
#include<algorithm>

using namespace std;
// this is the function for collection a raw string as a input and converting into a vector
int complement(string &seq);
int get_codon(string seq,vector<string>&ivec,int frame_no)
{
    string s1;
    //this is for forward frame code extraction
    if(frame_no ==1 ||frame_no==2 || frame_no==3 )
    {
        //this is to extract the sequence into the vector called ivec
        for(int i = frame_no;i<seq.size()-1;i++)
        {
            s1="";
            s1.push_back(seq[i-1]);
            s1.push_back(seq[i]);
            s1.push_back(seq[i+1]);
            //this is for extracting the string as codons to a string in triplets
            i = i+2;
            if(s1.size()>=3)
            {
                ivec.push_back(s1);
            }
            else
            {
                break;
            }
        }
    }
    //this is for reverse open reading frame ie -1,-2,-3
    else if (frame_no ==4 || frame_no ==5|| frame_no == 6)
    {
        complement(seq);
        reverse(seq.begin(),seq.end());
        for(int i = frame_no; i<seq.size()-1;i++)
        {
            s1="";
            s1.push_back(seq[i-1]);
            s1.push_back(seq[i]);
            s1.push_back(seq[i+1]);
            //xtraction of the codes into triplets as s1 string
            i =i+2;
            if(s1.size()>=3)
            {
                ivec.push_back(s1);
            }
            else
            {
                break;
            }
        }
    }
}
int complement(string &seq)
{
    for(int i = 0; i < seq.size();i++)
    {
        if(seq[i]=='A'||seq[i]=='a')
            seq[i]='T';
        else if(seq[i]=='T'||seq[i]=='t')
            seq[i]='A';
        else if(seq[i]=='G'||seq[i]=='g')
            seq[i]='C';
        else if(seq[i]=='C'||seq[i]=='c')
            seq[i]='G';
        else
            continue;
    }
}
// this function is for extraction of gene from given orf frame which is having start and stop flags :P

void codon_find(vector<string>&ivec,vector<string>&gvec)
{
    int start_ind = 0;
    int stop_ind =0;
    string gene;
    string start = "ATG";
    string stop1 = "TAG";
    string stop2 = "TAA";
    string stop3 = "TGA";
    for(int i = 0;i<ivec.size();i++)
    {
    //this is statement for the begining of the a gene. We are iterating through the vector in search for
    //start string and then we are adding it to the gene string ;-)
        if ((ivec[i]==start)&&(start_ind==0)&&(start_ind==0))
        {
            start_ind=1;
            gene = gene+ivec[i];
        }
    //this condition is when the genecomes across the start codon whichc is
    // already having a start codon before
        else if((ivec[i]==start)&&(start_ind == 1)&& (stop_ind==0))
        {
            gene= "";
            gene = gene+ivec[i];
        }
    //when program comes across the terminating codon
    //Pushing all the codons which are present in gene string to gvec vector :P
        else if ((ivec[i]==stop1)||(ivec[i]==stop2)||(ivec[i]==stop3)&&(start_ind==1)&&(stop_ind==0))
        {
            gene = gene+ivec[i];
            gvec.push_back(gene);
            gvec.push_back("\n");
            gene = "";
            start_ind = 0;
            start_ind = 0;

        }
    // if we doesnt come across any of the special cases mentioned above then.......
        else if ((start_ind == 1 )&& (stop_ind == 0))
        {
            gene = gene+ivec[i];
        }

    }
}
int main()
{
    string seq= "GAAGTGTTTTATCTGACTTACACCCCTGAAGATGTTGAAGGGAATGTTCAGCTGGAAACTGGAGATAAAATAAACTTTGTAATTGATAACAATAAACATACTGGTGCTGTAAGTGCTCGTAATATTATGCTGTTGAAAAAGAAACAAGCTCGCTATCAGGGAGTAGTTTGTGCCATGAAAGAGGCATTTGGCTTTATTGAAAGAGGCGATATTGTAAAGGAGATATTCTTTCACTATAGTGAATTTAAAGGTGACTTAGAATCCTTACAGCCTGGAGATGACGTGGAATTCACAATCAAGGACCGAAATGGTAAAGAAGTTGCAACAGATGTCAGACTATTGCCTCAAGGAACAGTCATTTTTGAAGATATCAGCATTGAACATTTTGAAGGAACTGTAACCAAAGTTATCCCCAAAGTACCCAGTAAAAACCAGAATGACCCATTGCCAGGACGCATCAAAGTTGATTTTGTGATTCCTAAAGAACTTCCCTTTGGAGACAAAGATACAAAATCCAAGGTGACGCTGTTGGAAGGTGACCACGTTAGGTTTAATATTTCAACAGACCGTCGTGACAAATTAGAACGAGCAACCAACATAGAAGTTCTATCAAATACATTTCAGTTCACTAATGAAGCCAGAGAGATGGGTGTAATTGCTGCCATGAGAGATGGTTTTGGTTTCATCAAGTGTGTGGATCGTGATGCTCGTATGTTCTTCCACTTCAGTGAAATTCTGGATGGGAACCAGCTTCATATTGCAGATGAAGTAGAGTTTACTGTGGTTCCTGATATGCTCTCTGCCCAAAGAAATCATGCTATTAGGATTAAAAAACTTCCCAAGGGCACGGTTTCGTTCCACTCCCATTCAGATCATCGTTTTCTGGGCACTGTAGAAAAAGAGGCCACTTTTTCGAATCCTAAAACCACTAGCCCAAATAAAGGCAAAGAAAAGGAGGCTGAGGATGGCATTATTGCTTATGATGATTGTGGGGTGAAACTGACTATTGCTTTTCAAGCCAAGGATGTGGAAGGATCTACTTCTCCTCAAATAGGAGACAAGGTTGAATTTAGTATTAGTGACAAACAGAGGCCTGGACAGCAGATTGCAACTTGTGTGCGGCTCTTAGGTCGTAATTCAAACTCCAAGAGGCTCTTGGGTTATGTGGCAACTTTGAAGGATAATTTTGGATTTATTGAAACAGCCAATCATGATAAGGAAATCTTTTTCCATTACAGTGAGTTCTCTGGTGATGTTGATAGCCTGGAACTGGGGGACATGGTTGAGTACAGCTTGTCCAAAGGAAAAGGCAACAAAGTCAGTGCAGAAAAAGTGAACAAAACACACTCAGTGAATGGCATTACTGAGGAAGCTGATCCCACCATCTACTCTGGTAAAGTCATTCGCCCCTTGAGGAGTGTTGATCCAACACAGAATGAGTACCAAGGAATGATTGAGATCGTGGACGAAGGGGATATGAAAGGTGAGGTCTATCCATTTGGCATAGTTGGGATGGCCAACAAAGGGGATTGCCTACAGAAAGGGGAGAGTGTCAAGTTCCAGTTGTGTGTCCTGGGCCAAAATGCACAGACTATGGCCTACAACATCACACCCCTGCGTAGGGCTACAGTGGAGTGTGTGAAAGATCAGTTTGGCTTCATTAACTATGAAGTAGGAGATAGCAAGAAGCTCTTTTTCCACGTGAAAGAAGTTCAGGATGGCATTGAGCTACAGGCAGGAGATGAGGTGGAATTCTCAGTGATTCTTAATCAGCGCACTGGCAAGTGCAGTGCTTGTAATGTTTGGCGAGTCTGCGAGGGCCCCAAGGCTGTTGCAGCTCCACGACCTGATAGGTTGGTCAATCGCTTGAAGAATATCACCCTGGATGATGCCAGTGCTCCTCGCCTAATGGTTCTTCGTCAGCCAAGGGGACCAGATAACTCAATGGGATTTGGTGCAGAAAGAAAGATCCGTCAAGCTGGTGTCATTGACTAACCACATCCACAAAGCACATCATTAATCCACTATGATCAAGTTGGGGGGATTCTGGTGAAGGGTTCTGAATATCTCTCTCTTCATCCCTCCCAAAATCTGGAATACTTATTCTATTGAGCTATTACACCAGTTTTAACACCTTCC";
    vector<string>ivec;
    vector<string>gvec;
    //calling the function for ORF finder :P
    for(int j=1; j<=6;j++)
    {

        cout<<"ORF"<<j<<":-"<<endl;
        get_codon(seq,ivec,j);

        cout<<endl;
        for(int i = 0;i<ivec.size();i++)
        {
            cout<<ivec[i]<<" ";

        }
        cout<<endl;
    //calling the codon find function :P
        codon_find(ivec,gvec);
        vector<string>::iterator iter;
        cout << "Genes :- "<< endl;
        for(iter= gvec.begin();iter<gvec.end();iter++)
        {
            cout<<*iter;

        }
        cout<<endl;
        ivec.clear();
        gvec.clear();

    }
    return (0);

}

Models Models everywhere....

The current bioinformatics leap is on modelling. A model is very much essential for data interpretation but, the fundamental question is: what levels of models should be chosen? A model class should be selected according to the data requirements and the objectives of the modeling
and analysis. This involves classical engineering tradeoffs. For example, a “fine” model with many parameters will capture detailed “low-level” phenomena, but will require large amounts of data for the inference, for fear of the model being “over fitted ” to the data, whereas a less complex “coarse” model with fewer parameters will capture “high-level” phenomena, but will require small amounts of data. Within a chosen model class, according to Occam’s Razor principle, the model should never be made more complex than what is necessary to “explain the data”. There are numerous approaches for modelling gene regulatory networks: it goes from linear models, Bayesian networks, neural networks, non linear ordinary differential equations, and stochastic models to Boolean models, logical networks, Petri nets, graph-based models, grammars, and process algebras.
So, which model will you choose for your biological data...?!!

Monday, 13 October 2014

Gene ID conversion resources

I see lot of my friends spending time on researching upon databases. There is a dire requirement for compiling all the available databases in a useful and user friendly way.

Here, let's see the currently available Gene ID conversion resources:

NetAffx: http://www.affymetrix.com/analysis/index.affx

It has batch and net affx query, provides detailed description of individual probeset and allows user to group probsets according to annotation type. Covers 10 species and has 17 possible conversions.

g:Profiler: http://biit.cs.ut.ee/gprofiler/gconvert.cgi

Input can be a mix of different ID types. User has to choose only the desired output. It covers 31 species and has 14 possible conversions.

Babelomics ID Converter: http://babelomics3.bioinfo.cipf.es/

GeneID alone as input ID type. Covers 11 species and 36 output ID types are available.

BioDBnet:http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

The most comprehensive and very easy to use database wherein we can give any ID as input and get the output in any convertible ID format.

DAVID: http://david.abcc.ncifcrf.gov/conversion.jsp

User can upload more than one file of different ID types. A specific format is not available and it has 32 possible conversion.

Clone/Gene ID converter:http://idconverter.bioinfo.cnio.es/

Gene ontology, pathway and literature references are available. Provides chromosome location for human. Covers 3 species and has 25 possible conversions.

The Synergizer: http://llama.mshri.on.ca/synergizer/translate/

It is a repository of gene and protein identifier synonym relationships. The number of species covered changes based on the 'authority' used and accordingly the number of input and output format also change.

MUSIC

We all listen to music. It may be RAP, Rock, Hard metal or melodious any kind of music form may remain close to heart.
But, here is the Bioinformatics music. MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data, available at music.gersteinlab.org.
MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, identifying enriched regions at multiple length scales.
This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods.
So, start listening Oops working upon your MUSIC.

OncoCis

Whole genome sequencing has enabled the identification of thousands of somatic mutations within non-coding genomic regions of individual cancer samples. However, identification of mutations that potentially alter gene regulation remains a major challenge. OncoCis is a new method that enables identification of potential cis-regulatory mutations using cell-type specific genome and epigenome-wide datasets along with matching gene expression data. OncoCis demonstrates that the use of cell-type specific information and gene expression can significantly reduce the number of candidate cis-regulatory mutations compared with existing tools designed for the annotation of cis-regulatory SNPs. The OncoCis webserver is freely accessible at https://powcs.med.unsw.edu.au/OncoCis/.