Tuesday, 3 December 2013

Biological data integration

Data integration issues have stymied computer scientists and geneticists alike for the last 20 years, yet overcoming them is critical to the success of genomics research as it transitions from a wet-lab activity to an electronic one, in which data drive increasingly complicated research performed on computers. This research is motivated by scientists striving to understand not only the data they have generated but, more importantly, the information implicit in these data, such as relationships between individual components. Only through this understanding will scientists be able to model and simulate entire genomes, cells, and ultimately entire organisms.
Whereas the need for a solution is obvious, the underlying data integration issues are not as clear. Many of the problems facing genomics data integration are related to data semantics (the meaning of the data represented in a data source) and the differences between the semantics within a set of sources. These differences can require addressing issues surrounding concept identification, data transformation, and concept overloading.

Unfortunately, the semantics of biological data are usually hard to define precisely because they are not explicitly stated but are implicitly included in the database design. The reason is simple: At a given time, within a single research community, common definitions of various terms are often well understood and have precise meaning. As a result, the semantics of a data source are usually understood by those within that community without needing to be explicitly defined. However, genomics (much less all of biology or life science) is not a single, consistent scientific domain; it is composed of dozens of smaller, focused research communities. This would not be a significant issue if researchers only accessed data from within a single domain, but that is not usually the case. Typically, researchers require integrated access to data from multiple domains, which requires resolving terms that have slightly different meanings across the communities.
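
To make the problem of overloaded terms concrete, here is a minimal sketch in Python. It assumes a setup that is not described above: each source declares which shared concept it means by each of its local terms. The source names, terms, and concept labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from (source, local term) to a shared concept.
# Every name here is invented for illustration; no real database is implied.
TERM_TO_CONCEPT = {
    ("source_a", "gene"): "protein_coding_region",
    ("source_b", "gene"): "heritable_locus",
    ("source_b", "cds"): "protein_coding_region",
}


@dataclass(frozen=True)
class Record:
    source: str      # which data source the record came from
    term: str        # the label that source uses for the record
    identifier: str  # the record's identifier within that source


def resolve_concept(record: Record) -> Optional[str]:
    """Return the shared concept for a record, or None if the
    source's meaning of the term has not been made explicit."""
    return TERM_TO_CONCEPT.get((record.source, record.term))


def same_concept(a: Record, b: Record) -> bool:
    """Two records are comparable only if both resolve to the same
    shared concept; matching on the term string alone is not enough."""
    concept_a = resolve_concept(a)
    concept_b = resolve_concept(b)
    return concept_a is not None and concept_a == concept_b
```

In this sketch, two records that are both labelled "gene" are not treated as equivalent when their sources mean different things by the word, while records with different labels can still match if they resolve to the same concept. A term from a source with no declared semantics resolves to nothing, which mirrors the case where meaning is only implicit in the database design.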
