Wednesday 25 December 2013

Sequence homology versus sequence similarity

An important concept in sequence analysis is sequence homology. When two
sequences are descended from a common evolutionary origin, they are said to have a homologous relationship or share homology.
A related but distinct term is sequence similarity, the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity.


Generally, if the sequence similarity level is high enough, a common evolutionary relationship can be inferred. In real research problems, however, the similarity level at which a homologous relationship can be inferred is not always clear; the answer depends on the type of sequences being examined and on their lengths.



Shorter sequences require higher cut-offs for inferring homologous relationships than longer sequences. For two protein sequences, for example, if both are aligned over their full length of about 100 residues, an identity of 30% or higher can safely be regarded as indicating close homology; such pairs are sometimes said to be in the "safe zone". If the identity level falls between 20% and 30%, determining a homologous relationship becomes less certain. This is the region often called the "twilight zone", where remote homologs mix with randomly related sequences.
Below 20% identity, where a high proportion of unrelated sequences is present, homologous relationships cannot be reliably determined; this is the "midnight zone".
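
To make the thresholds concrete, here is a minimal Python sketch (the function names are my own) that computes the percent identity of two already-aligned protein sequences and labels the result with the rough zones described above. The cut-offs are only guidelines for full-length protein alignments of roughly 100 residues or more; shorter alignments need higher cut-offs.

def percent_identity(aln1, aln2):
    # Percent of aligned (non-gap) columns with identical residues.
    assert len(aln1) == len(aln2), "aligned sequences must be the same length"
    columns = [(a, b) for a, b in zip(aln1, aln2) if a != '-' and b != '-']
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)

def homology_zone(identity):
    # Rough zone labels used in the text (protein sequences, ~100 residues).
    if identity >= 30.0:
        return "safe zone: homology can be inferred"
    if identity >= 20.0:
        return "twilight zone: homology uncertain"
    return "midnight zone: homology cannot be reliably inferred"

a = "MKT-AYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
b = "MKTEAYIAKQRQISFVKSHFSRQLEERLGMIEVQ"
pid = percent_identity(a, b)
print("identity = %.1f%% -> %s" % (pid, homology_zone(pid)))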





Applications and limitations of bioinformatics

Applications:
Bioinformatics has not only become essential for basic genomic and molecular biology research, but is having a major impact on many areas of biotechnology and biomedical sciences. It has applications, for example, in knowledge-based drug design,
forensic DNA analysis, and
agricultural biotechnology.
Computational studies of protein–ligand interactions provide a rational basis for the rapid identification of novel leads for synthetic drugs. Knowledge of the three-dimensional structure of a target protein allows molecules to be designed that can bind its receptor site with high affinity and specificity. This informatics-based approach significantly reduces the time and cost needed to develop drugs with higher potency, fewer side effects, and less toxicity than could be achieved through the traditional trial-and-error approach.
In forensics, results from molecular phylogenetic analysis have been accepted as evidence in criminal courts, and sophisticated Bayesian and likelihood-based statistical methods for DNA analysis have been applied to establish forensic identity.
It is worth mentioning that genomics and bioinformatics are now poised to revolutionize healthcare through personalized and customized medicine. High-speed genome sequencing coupled with sophisticated informatics will allow a doctor in a clinic to quickly sequence a patient's genome, detect potentially harmful mutations, and engage in early diagnosis and effective treatment of disease. Bioinformatics tools are being used in agriculture as well: plant genome databases and gene expression profile analyses have played an important role in the development of new crop varieties with higher productivity and more resistance to disease.



Limitations:
Bioinformatics predictions are not formal proofs of any concept. They do not replace the traditional experimental methods of actually testing hypotheses. In addition, the quality of bioinformatics predictions depends on the quality of the data and the sophistication of the algorithms being used. Sequence data from high-throughput analyses often contain errors, and if the sequences or annotations are incorrect, the results of downstream analyses will be misleading as well.

Needleman and Wunsch algorithm

Needleman and Wunsch (1970) performed progressive building of an alignment by comparing two amino acids at a time. They started at the end of each sequence and then moved ahead one amino acid pair at a time, allowing for various combinations of matched pairs, mismatched pairs, or extra amino acids in one sequence (insertion or deletion). In computer science, this approach is called dynamic programming.
The Needleman and Wunsch approach generated:
(1) every possible alignment, each one including every possible combination of match, mismatch, and single insertion or deletion, and
(2) a scoring system for scoring the alignments.
The object was to determine which alignment was best by finding the one with the highest score. Every match in a trial alignment was given a score of 1, every mismatch a score of 0, and each gap a penalty score. These numbers were then added across the alignment to obtain a total score, and the alignment with the highest possible score was defined as the optimal alignment.
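
As a small illustration, the scoring scheme just described can be written as a short Python function. The gap penalty of -1 is only an illustrative choice; the text leaves the penalty value unspecified.

def alignment_score(aln1, aln2, match=1, mismatch=0, gap=-1):
    # Sum of per-column scores for a given (gapped) trial alignment.
    assert len(aln1) == len(aln2)
    score = 0
    for a, b in zip(aln1, aln2):
        if a == '-' or b == '-':
            score += gap        # extra residue in one sequence (indel)
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score

print(alignment_score("GA-TTACA", "GCATT-CA"))   # prints 3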



The procedure for generating all of the possible alignments is to move sequentially through all of the matched positions within a matrix, much like the dot matrix graph (see the dot plot introduction in the post below), starting at the positions that correspond to the ends of the sequences. At each position in the matrix, the highest score that can be achieved up to that point is recorded, allowing for all possible starting points in either sequence and any combination of matches, mismatches, insertions, and deletions.
The best alignment is then found by locating the highest-scoring position in the matrix and tracing back through the path of positions that produced it; the sequences are aligned so that the characters along this path are matched.
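
Putting the matrix fill and traceback together, here is a compact, self-contained Python sketch of the dynamic programming procedure. It reuses the illustrative scores from above (match = 1, mismatch = 0, gap = -1); real aligners use substitution matrices and affine gap penalties instead.

def needleman_wunsch(s1, s2, match=1, mismatch=0, gap=-1):
    n, m = len(s1), len(s2)
    # score[i][j] = best score of aligning the first i characters of s1
    # with the first j characters of s2
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # match/mismatch
                              score[i - 1][j] + gap,        # gap in s2
                              score[i][j - 1] + gap)        # gap in s1
    # Trace back from the bottom-right corner along the moves that
    # produced each cell's score.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append('-'); i -= 1
        else:
            a1.append('-'); a2.append(s2[j - 1]); j -= 1
    return ''.join(reversed(a1)), ''.join(reversed(a2)), score[n][m]

top, bottom, best = needleman_wunsch("GATTACA", "GCATGCU")
print(top)
print(bottom)
print("score:", best)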





Tuesday 24 December 2013

The dot matrix / Dot plot - Introduction

In 1970, A.J. Gibbs and G.A. McIntyre described a new method for comparing two amino acid or nucleotide sequences in which a graph was drawn with one sequence written across the page and the other down the left-hand side.
Whenever the same letter appeared in both sequences, a dot was placed at the intersection of the corresponding sequence positions on the graph.
The resulting graph was then scanned for a series of dots that formed a diagonal, which revealed similarity, or a string of the same characters, between the sequences. Long sequences can also be compared in this manner on a single page by using smaller dots.
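
Here is a toy Python sketch of the idea, printing the comparison matrix as text rather than drawing dots on a page:

def dot_plot(seq1, seq2, mark="*", blank="."):
    # seq1 runs across the top, seq2 down the side; a mark is printed
    # wherever the two characters are identical.
    print("   " + " ".join(seq1))
    for b in seq2:
        row = [mark if a == b else blank for a in seq1]
        print(b + "  " + " ".join(row))

dot_plot("ATGCTAGCTA", "ATGCTTGCTA")
# A run of marks along the main diagonal shows the shared stretches.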


The dot matrix method quite readily reveals the presence of insertions or deletions between sequences because they shift the diagonal horizontally or vertically by the amount of change. Comparing a single sequence to itself can reveal repeats of the same sequence in the same (direct repeat) or reverse (inverted repeat or palindrome) orientation. This method of self-comparison can reveal several features, such as similarity between chromosomes, tandem genes, repeated domains in a protein sequence, regions of low sequence complexity where the same characters are often repeated, or self-complementary sequences in RNA that can potentially base-pair to form a double-stranded structure. Because diagonals may not be apparent when the similarity is weak, Gibbs and McIntyre counted the matches on all possible diagonals and compared these counts with those of random sequences to identify the most significant alignments.
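
This significance test can be sketched roughly as follows; the shuffling-based comparison below is an illustrative stand-in for Gibbs and McIntyre's statistical treatment rather than their exact procedure.

import random
from collections import Counter

def diagonal_counts(seq1, seq2):
    # Number of matching positions on each diagonal (indexed by offset j - i).
    counts = Counter()
    for i, a in enumerate(seq1):
        for j, b in enumerate(seq2):
            if a == b:
                counts[j - i] += 1
    return counts

def mean_best_random_diagonal(seq1, seq2, trials=100):
    # Average of the best diagonal count when seq2 is randomly shuffled.
    best = []
    for _ in range(trials):
        shuffled = ''.join(random.sample(seq2, len(seq2)))
        best.append(max(diagonal_counts(seq1, shuffled).values(), default=0))
    return sum(best) / trials

s1, s2 = "ATGCTAGCTAGGCTA", "TTATGCTAGCTAGGA"
observed = max(diagonal_counts(s1, s2).values())
expected = mean_best_random_diagonal(s1, s2)
print("best observed diagonal: %d, random expectation: %.1f" % (observed, expected))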

The first sequences to be collected were those of proteins

The development of protein-sequencing methods (Sanger and Tuppy 1951) led to the sequencing of representatives of several of the more common protein families such as cytochromes from a variety of organisms. Margaret Dayhoff (1972, 1978) and her collaborators at the National Biomedical Research Foundation (NBRF), Washington, DC, were the first to assemble databases of these sequences into a protein sequence atlas in the 1960s, and their collection center eventually became known as the Protein Information Resource (PIR).

Dayhoff and her coworkers organized the proteins into families and superfamilies based on the degree of sequence similarity. Tables that reflected the frequency of changes observed in the sequences of a group of closely related proteins were then derived. Proteins that were less than 15% different were chosen to avoid the chance that the observed amino acid
changes reflected two sequential amino acid changes instead of only one. From aligned sequences, a phylogenetic tree was derived showing graphically which sequences were most related and therefore shared a common branch on the tree.

Once these trees were made, they were used to score the amino acid changes that occurred during evolution of the genes for these proteins in the various organisms from which they originated.

Subsequently, a set of matrices (tables) showing the probability that one amino acid changes into any other in these trees was constructed: the percent accepted mutation, or PAM, tables. They show which amino acids are most conserved at corresponding positions in two sequences. These tables are still used to measure similarity between protein sequences and in database searches to find sequences that match a query sequence. The rule is that the more identical and conserved amino acids there are in two sequences, the more likely they are to have been derived from a common ancestral gene during evolution. If the sequences are very much alike, the proteins probably have the same biochemical function and the same three-dimensional structural fold.
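
To illustrate how such a table is used, the sketch below scores aligned residue pairs with a toy substitution table; the values are made up for demonstration only, and real analyses use complete PAM (or BLOSUM) matrices, available for example through Biopython's Bio.Align.substitution_matrices module.

# Hypothetical log-odds-style scores: conservative substitutions score high,
# non-conservative ones score low or negative.
TOY_MATRIX = {
    ('L', 'L'): 6, ('L', 'I'): 2, ('L', 'V'): 1,
    ('I', 'I'): 5, ('K', 'K'): 5, ('K', 'D'): -1,
    ('D', 'D'): 6, ('D', 'E'): 3, ('L', 'D'): -4,
}

def pair_score(a, b):
    # Look up a symmetric substitution score; unlisted pairs score 0 here.
    return TOY_MATRIX.get((a, b), TOY_MATRIX.get((b, a), 0))

def similarity_score(aln1, aln2, gap=-5):
    # Sum substitution scores over a gapped alignment.
    total = 0
    for a, b in zip(aln1, aln2):
        total += gap if '-' in (a, b) else pair_score(a, b)
    return total

print(similarity_score("LKD-I", "IKDEL"))   # conservative changes keep the score high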


Wednesday 18 December 2013

How many biology databases....?!!

Do you have any idea how many biological databases are available on the internet...?!!
How many are there for miRNA, PPI, mass spec data, genome data, RNA-seq data, gene ontology, expression analysis, and so on?

Here is a website that lists all of these databases according to their purpose and ranking.
Click here to dive into vistas of databases

It is a very useful starting point that makes data analysis, data mining, and biocuration easier and more focused.

Thursday 5 December 2013

What is a biological model..?!

(Not to be confused with a model organism)

Models represent aspects, a term that denotes a coherent set of properties or phenomena of biological interest. The aspect anchors the model in the real world. We establish a correspondence through an ontology, an explicit formal specification of how to represent the objects, concepts, and other entities assumed to exist in the biological domain being studied and the relationships that hold among them. The model and appropriate elements must then be linked to elements in the ontology.


Assumptions condition or determine the relationship between models and the aspects they represent. Assumptions underpin model construction,
constitute the rationale for the model, and must be precisely documented and connected to the model for it to have meaning beyond the immediate use to which it has been put.
Experimental biologists make observations about phenomena of biological interest. Classically, these observations are used to validate interpretations
derived from models. Commonly, however, models yield interpretations that prompt further observations or, when compared with observations, question the validity of the assumptions. Researchers document the observations in the scientific literature and in data resources associated with the experiments.



Tuesday 3 December 2013

Nature of biological data

Biological data sets are intrinsically complex and are organized in loose hierarchies that reflect our understanding of complex living systems, ranging from genes and proteins, to protein-protein interactions, biochemical pathways and regulatory networks, to cells and tissues, organisms and populations, and finally the ecosystems of the earth. This system spans many orders of magnitude in complexity, population, time, and space, and it poses challenges in informatics, modeling, and simulation equivalent to or beyond those of any other scientific endeavor.

Reflecting the complexity of biological systems, the types of biological data are highly diverse. They range from the plain text of laboratory records and literature publications, nucleic acid and protein sequences, three-dimensional atomic structures of molecules, and biomedical images at different levels of resolution, to various experimental outputs from technologies as diverse as microarray chips, gels, light and electron microscopy, nuclear magnetic resonance (NMR), X-ray crystallography, and mass spectrometry.

Challenges in biological data integration

Biology also presents three challenges for data integration that are common in evolving scientific domains but not typically found elsewhere.

1) The sheer number of available data sources and the inherent heterogeneity of their contents.
The World Wide Web has become the preferred approach for disseminating scientific data among researchers, and as a result, literally hundreds of small data sources have appeared over the past 10 years. These sources are typically a "labor of love" for a small number of people. As a result, they often lack the support and resources to provide detailed documentation and to respond to community requests in a timely manner. Furthermore, if the principal supporter leaves, the site usually becomes completely unsupported. Some of these sources contain data from a single lab or project, whereas others are the definitive repositories for very specific types of information (e.g., for a specific genetic mutation).
Not only do these sources complicate the concept identification issue mentioned previously (because they use highly specialized data semantics), but their number makes it infeasible to incorporate all of them into a consistent repository.

2) The data formats and data access methods (and their associated interfaces) change regularly.
Many data providers extend or update their data formats
approximately every 6 months, and they modify their interfaces with the same frequency. These changes are an attempt to keep up with the scientific evolution occurring in the community at large. However, a change in a data source's representation can have a dramatic impact on systems that integrate that source, causing the integration to fail on the new format or, worse, introducing subtle errors into the systems. As a result, bioinformatics infrastructures need to be more flexible than systems developed for more static domains.


3) The data and related analyses are becoming increasingly complex.
As the nature of genomics research evolves from a predominantly wet-lab activity into knowledge-based analysis, the scientists' need for access to the wide variety of available information increases dramatically. To address this need, information needs to be brought together from various heterogeneous data sources and presented to researchers in ways that allow them to answer their questions. This means providing access not only to the sequence data that is commonly stored in data sources today, but also to multimedia information such as expression data, expression pathway data, and simulation results. Furthermore, this information needs to be available for a large number of organisms under a variety of conditions.

Biological data integration

Data integration issues have stymied computer scientists and geneticists alike for the last 20 years, and yet successfully overcoming them is critical to the success of genomics research as it transitions from a wet-lab activity to an electronic-based activity as data are used to drive the increasingly complicated research performed on computers. This research is motivated by scientists striving to understand not only the data they have generated, but more importantly, the information implicit in these data, such as relationships between individual components. Only through this understanding will scientists be able to successfully model and simulate entire genomes, cells, and ultimately entire organisms.
Whereas the need for a solution is obvious, the underlying data integration issues are not as clear. Many of the problems facing genomics data integration are related to data semantics (the meaning of the data represented in a data source) and to the differences in semantics between sources. These differences can require addressing issues surrounding concept identification, data transformation, and concept overloading.

Unfortunately, the semantics of biological data are usually hard to define precisely because they are not explicitly stated but are implicitly included in the database design. The reason is simple: At a given time, within a single research community, common definitions of various terms are often well understood and have precise meaning. As a result, the semantics of a data source are usually understood by those within that community without needing to be explicitly defined. However, genomics (much less all of biology or life science) is not a single, consistent scientific domain; it is composed of dozens of smaller, focused research communities. This would not be a significant issue if researchers only accessed data from within a single domain, but that is not usually the case. Typically, researchers require integrated access to data from multiple domains, which requires resolving terms that have slightly different meanings across the communities.

Monday 2 December 2013

Are you a biologist scared of computers...?!!

It looks like biologists are colonizing the dictionary with all these bio- words: we have bio-chemistry, bio-metrics, bio-physics, bio-technology, bio-hazards, bio-statistics, and even bio-terrorism. Now what's up with the new entry, bio-informatics?

Bioinformatics is a much simpler subject than you ever thought possible. For
most people new to this field, the main difficulty is finding out the kind of
questions they can ask with these new tools. If you’re a biologist, don’t let the computer scare you; bioinformatics is nothing more than good, sound, regular biology hidden inside a computer.

The magic thing about bioinformatics is that, with a simple Internet connection, you can browse databases that contain the sum of our entire human biological knowledge, and you can do this with the most sophisticated tools ever developed by mankind. And how much is this going to cost you? Nothing!










Bioinformatics vs. Computational biology

Computational biology is a very close relative of bioinformatics.

Computational biology is generally concerned with the development of novel and efficient algorithms that can be proven to work on a difficult problem, such as multiple sequence alignment or genome fragment assembly.

Bioinformatics, on the other hand, focuses more on the development of practical tools for data management and analysis, for example the display of genomic information and sequence analysis, but with less emphasis on efficiency and proven accuracy.







What is bioinformatics?

Earlier, bioinformatics was defined as an interdisciplinary field involving biology, computer science, mathematics, and statistics to analyse biological sequence data and genome content and arrangement, and to predict the function and structure of macromolecules.
With the advent of the genomic era, bioinformatics now plays added roles in biological and medical research and accounts for an increasing number of publications each year.

Bioinformatics is now involved in many fields, organizing biological data relevant to genomes with a view to applying this information in agriculture, pharmacology, and other commercial applications.