Biology also demonstrates three challenges for data integration that are common in evolving scientific domains but not typically found elsewhere:
1) Sheer number of available data sources and the inherent heterogeneity of their contents.
The World Wide Web has become the preferred approach for disseminating scientific data among researchers, and as a result, literally hundreds of small data sources have appeared over the past 10 years. These sources are typically a "labor of love" for a small number of people. As a result, they often lack the support and resources to provide detailed documentation and to respond to community requests in a timely manner. Furthermore, if the principal supporter leaves, the site usually becomes completely unsupported. Some of these sources contain data from a single lab or project, whereas others are the definitive repositories for very specific types of information (e.g., for a specific genetic mutation).
Not only do these sources complicate the concept identification issue previously mentioned (because they use highly specialized data semantics), but their number makes it infeasible to incorporate all of them into a consistent repository.
2) The data formats and data access methods (associated interfaces) change regularly.
Many data providers extend or update their data formats approximately every 6 months, and they modify their interfaces with the same frequency. These changes are an attempt to keep up with the scientific evolution occurring in the community at large. However, a change in a data source representation can have a dramatic impact on systems that integrate that source, causing the integration to fail on the new format or, worse, introducing subtle errors into the systems. As a result of this problem, bioinformatics infrastructures need to be more flexible than systems developed for more static domains.
3) The data and related analysis are becoming increasingly complex.
As the nature of genomics research evolves from a predominantly wet-lab activity into knowledge-based analysis, the scientists' need for access to the wide variety of available information increases dramatically. To address this need, information needs to be brought together from various heterogeneous data sources and presented to researchers in ways that allow them to answer their questions. This means providing access not only to the sequence data that is commonly stored in data sources today, but also to multimedia information such as expression data, expression pathway data, and simulation results. Furthermore, this information needs to be available for a large number of organisms under a variety of conditions.
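To make the second challenge concrete, here is a minimal sketch of one way an integration layer might guard against silent format drift. It assumes a hypothetical tab-delimited source with a header line; the field names (`gene_id`, `organism`, `sequence`), the `FormatChangedError` class, and the `parse_records` function are illustrative and do not correspond to any real data source or library. The idea is simply to fail loudly when the header no longer matches what the parser was written against, rather than quietly mapping columns to the wrong fields.

```python
# Hypothetical field layout the integration layer was written against.
EXPECTED_FIELDS = ("gene_id", "organism", "sequence")


class FormatChangedError(ValueError):
    """Raised when a source's layout no longer matches the expected one."""


def parse_records(text):
    """Parse tab-delimited records, refusing to guess if the layout changed."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = tuple(lines[0].split("\t"))
    if header != EXPECTED_FIELDS:
        # Fail loudly rather than silently mapping columns to the wrong fields.
        raise FormatChangedError(
            f"expected header {EXPECTED_FIELDS}, got {header}"
        )
    records = []
    for line in lines[1:]:
        values = line.split("\t")
        if len(values) != len(EXPECTED_FIELDS):
            raise FormatChangedError(f"unexpected column count in: {line!r}")
        records.append(dict(zip(EXPECTED_FIELDS, values)))
    return records
```

If the provider swaps two columns in a new release, a parser that trusted column positions would load sequences into the organism field without complaint; with the check above, the same input raises `FormatChangedError` so the change is noticed before it contaminates downstream analysis.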