A Database Platform for Bioinformatics

Sandeepan Banerjee, Oracle Corporation
The author discussed the needs of a bioinformatics database and described four Oracle technology areas aimed at meeting them:
Extensibility architecture to store gene sequence data natively and perform high-dimensional structure searches in the database.
Warehousing technologies and data mining on genetic patterns.
Data integration technologies to enable heterogeneous queries across distributed biological sources.
Internet portal technologies that allow life sciences information to be published and managed across intranets and the internet.

Background
Massive storage is needed for the ever-growing volume of genomic and proteomic data.
A high-performance computing platform is needed to search the data, identify similarities and patterns within genomic data, and unify the slices of knowledge developed at distributed sites.
The atlas of the human genome promises to revolutionize medical practice and biological research all over the world for the next millennium.

Steps to genomic projects
1. Divide the chromosomes into smaller fragments that can be isolated.
2. Order these fragments to correspond to their respective locations on the chromosomes.
3. Determine the sequence of bases (A, T, C and G) in each fragment.
4. Annotate the regions of the sequenced chromosomes with their function.
5. Catalogue the differences in sequences, to make a series of descriptive diagrams (maps) of each human chromosome at increasingly finer resolutions.

Computing for cataloguing
Any two individuals differ in about 1/1000 of their genetic material, i.e. about 3 million base pairs. The global population is now about 6 billion, so a full catalogue of all sequence differences would run to 18 × 10^15 entries. That implies a huge computing requirement.
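The estimate can be checked with a few lines of arithmetic. This is only a sketch: the genome size of roughly 3 billion base pairs is an assumption implied by the "1/1000 is about 3 million" figure, and the variable names are illustrative.

```python
# Figures quoted on the slide; GENOME_BASE_PAIRS is an assumed round number
# (about 3 billion base pairs) implied by "1/1000 is about 3 million".
GENOME_BASE_PAIRS = 3_000_000_000
DIFFERENCE_FRACTION_DENOMINATOR = 1000   # any two individuals differ in ~1/1000
WORLD_POPULATION = 6_000_000_000         # about 6 billion people

# Base pairs in which one individual differs from another.
differences_per_person = GENOME_BASE_PAIRS // DIFFERENCE_FRACTION_DENOMINATOR

# One catalogue entry per difference per person.
catalogue_entries = differences_per_person * WORLD_POPULATION

print(f"{differences_per_person:,} differing base pairs per individual")
print(f"{catalogue_entries:,} catalogue entries")
```

The product is 3 million times 6 billion, which matches the 18 × 10^15 figure on the slide.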

Traditional Database
Traditional databases mainly manage simple business data such as numbers, characters, or dates.
Few databases have had a native ability to deal with complex data.
High-dimensional data is hard to handle, for example performing similarity queries on gene sequences, spatial queries on locations, or looks-like queries on images.
Example query on structural similarity: given a particular sequence, what other sequences resembling it exist in the database?

BLAST
Basic Local Alignment Search Tool (not the Blast virus we got a couple of weeks ago).
A set of similarity search programs that detect relationships between sequences and rank them statistically.
Problems when it is used alongside a database:
Hard to handle, because it is too complex, too large, and far too custom-built.
Performance degrades as interactions with the database increase.
Query optimization is not easy, because BLAST servers do not know about textual annotations.
Hard to manage together with the database as a whole system.
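To make "local alignment" concrete, here is a minimal sketch of the Smith-Waterman dynamic programming score. It is not BLAST itself: BLAST uses faster heuristics and adds statistical ranking, whereas this computes only the exact best local score. The scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score between a and b.

    The scoring parameters are illustrative defaults, not BLAST's.
    """
    previous = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a fresh local alignment here
                previous[j - 1] + step, # align a[i-1] with b[j-1]
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


identical = local_alignment_score("GCCATA", "GCCATA")
embedded = local_alignment_score("TTGCCATATT", "AAGCCATAAA")
unrelated = local_alignment_score("AAAA", "TTTT")
print(identical, embedded, unrelated)
```

The two sequences that share the fragment GCCATA score as high as the identical pair, while sequences with nothing in common score zero.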

Four technologies needed
The author argued that it is better to add these functions within the database. There are four:
Extensible database architecture: to store gene sequence data natively and perform high-dimensional structure searches within the database.
Data mining and data warehousing: to mine genetic patterns, given the huge size of the data.
Data integration technologies: to enable heterogeneous queries across distributed biological sources.
Internet portal technologies: to allow life sciences information to be published and managed over the internet.

Extending Databases
User-defined types
User-defined operators
Domain-specific indexing
Optimizer extensibility

User-defined Types
Oracle Type System (OTS)
Object types: the structure is fully known to the database.
Opaque types: the structure is not known to the database.
OTS provides a high-level SQL-based interface for defining types. The two central constructs in OTS are object types and opaque types. For example, an object type can combine numbers and dates with semantic behaviour. Type methods or functions are written in a 3GL such as C. Is there a learning curve? The benefit is that external data models and behaviour already available can be used to store and manipulate the sequences.

User-defined operators
Define domain-specific operators, such as a resembles() operator for comparing sequences.
They can be invoked anywhere built-in operators can be used, for example in a SELECT statement:
SELECT ID FROM DNATABLE WHERE Contains(fragment, 'GCCATA');
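The idea of calling a user-defined operator from SQL can be tried out without Oracle. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in: `create_function` registers a Python function so that the Contains query from the slide runs as written. The table contents and the substring-based `contains` implementation are assumptions for illustration and say nothing about how Oracle implements its operators.

```python
import sqlite3


def contains(fragment: str, pattern: str) -> int:
    """Return 1 when pattern occurs in fragment, else 0 (SQLite has no boolean type)."""
    return int(pattern in fragment)


conn = sqlite3.connect(":memory:")

# Register the Python function as a two-argument SQL function named Contains.
conn.create_function("Contains", 2, contains)

# Illustrative table and rows; the sequences are made up for the demo.
conn.execute("CREATE TABLE DNATABLE (ID INTEGER PRIMARY KEY, fragment TEXT)")
conn.executemany(
    "INSERT INTO DNATABLE (ID, fragment) VALUES (?, ?)",
    [(1, "TTGCCATAGG"), (2, "ACGTACGTAC"), (3, "GCCATA")],
)

# The user-defined function is used in WHERE like a built-in one.
rows = conn.execute(
    "SELECT ID FROM DNATABLE WHERE Contains(fragment, 'GCCATA') ORDER BY ID"
).fetchall()
matching_ids = [row[0] for row in rows]
print(matching_ids)
conn.close()
```

A resembles() operator could be registered the same way, with a similarity test in place of the substring check.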

Extensible Indexing
Cooperative indexing
User-implemented indextype
The framework for developing new index types is based on the concept of cooperative indexing, where user-supplied implementations and the Oracle server cooperate to build and maintain indexes for complex types such as genetic, text, or spatial data. Is it efficient to build the index? You must be a professional programmer first.
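A domain-specific index for sequences could look roughly like the following k-mer inverted index. This is a toy sketch under stated assumptions: the class and method names (`KmerIndex`, `insert`, `candidates`) are hypothetical and are not Oracle's indextype interface. In the cooperative model the server would call user-supplied routines like these when rows are inserted and when an operator is evaluated.

```python
from collections import defaultdict


class KmerIndex:
    """Toy inverted index mapping each k-mer to the IDs of sequences containing it."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.postings = defaultdict(set)   # k-mer -> set of sequence IDs

    def _kmers(self, sequence: str) -> list:
        return [sequence[i:i + self.k] for i in range(len(sequence) - self.k + 1)]

    def insert(self, seq_id: int, sequence: str) -> None:
        """Index maintenance: called whenever a sequence is stored."""
        for kmer in self._kmers(sequence):
            self.postings[kmer].add(seq_id)

    def candidates(self, pattern: str) -> set:
        """IDs of sequences containing every k-mer of pattern.

        This is a superset of the true substring matches, so a caller
        would still verify each candidate against the stored sequence.
        """
        kmers = self._kmers(pattern)
        if not kmers:
            raise ValueError("pattern is shorter than k")
        result = set(self.postings.get(kmers[0], set()))
        for kmer in kmers[1:]:
            result &= self.postings.get(kmer, set())
        return result


index = KmerIndex(k=3)
index.insert(1, "TTGCCATAGG")
index.insert(2, "ACGTACGTAC")
index.insert(3, "GCCATA")
hits = index.candidates("GCCATA")
print(sorted(hits))
```

The index narrows a search to a few candidate rows instead of scanning every sequence, which is the point of letting a domain expert supply the indexing logic.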

Extensible Optimizer
Gives developers control over the three main inputs used by the optimizer: statistics, selectivity, and cost. The user also has to be a database expert in order to implement the optimizer functions for statistics, selectivity, and cost efficiently.

Mining Sequence Data
Oracle Darwin for bioinformatics
Darwin was built to address the terabyte-scale databases found in genomics. It provides classification and regression trees, neural networks, k-nearest neighbours, k-means, Naïve Bayes, and enhanced clustering algorithms. Darwin is based on a distributed-memory SPMD (single program, multiple data) paradigm. The distributed memories communicate with each other through a message-passing library. StarData is a unified data access and manipulation library that provides most of the data access and transformation infrastructure supporting the machine learning modules StarTree, … The author also mentioned that mining of sequence data is still in its infancy.
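As an illustration of one of the listed algorithms, the sketch below applies k-nearest neighbours to sequences by turning each one into a vector of 2-mer counts. It is not Darwin's implementation and it runs in a single process; the feature choice, the distance measure, and the tiny labelled data set are all made up for the example.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 2) -> Counter:
    """Represent a sequence as counts of its overlapping k-mers."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def squared_distance(a: Counter, b: Counter) -> int:
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a[key] - b[key]) ** 2 for key in set(a) | set(b))


def knn_classify(query: str, labelled: list, n_neighbours: int = 3) -> str:
    """Label the query by majority vote among its nearest labelled sequences."""
    query_vector = kmer_counts(query)
    nearest = sorted(
        labelled,
        key=lambda item: squared_distance(query_vector, kmer_counts(item[0])),
    )[:n_neighbours]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: labels describe base composition only.
training = [
    ("ATATATAT", "AT-rich"), ("AATTAATT", "AT-rich"), ("TATAAATA", "AT-rich"),
    ("GCGCGCGC", "GC-rich"), ("GGCCGGCC", "GC-rich"), ("CGCGGCGC", "GC-rich"),
]

first_label = knn_classify("ATATTATA", training)
second_label = knn_classify("GGCGCC", training)
print(first_label, second_label)
```

At terabyte scale the distance computations would be spread across many processors, which is where an SPMD design with message passing comes in.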

Integrating Heterogeneous Data
Sequence data will be distributed across many institutes. Annotations to this data will make it change and grow all the time. Oracle uses ODBC and OLE DB to connect to non-Oracle database systems for queries, searches, inserts, and deletes.

Portal Technologies
‘Soft Goods’ sales
Visualization
Security & access control
Oracle provides dynamic HTML and XML through Java servlets. Portals for genomics and bioinformatics have some further requirements.
Soft goods: portals use soft goods, such as the number and complexity of queries against a sequence database, to measure usage. Oracle provides a wide array of server-based features to enable soft goods transactions over the internet (how?).
Visualization: Oracle enables the publishing of graphical data in formats such as the Vector Markup Language (VML), which can be used to display sequences. VML is generated from XML through XSL.
Security: Oracle provides comprehensive PKI-based (Public Key Infrastructure) security to protect not only the results of queries but also the queries themselves.

Questions?
What is the benefit of Oracle's approach compared with BLAST?
Are there any other technologies required for the bioinformatics database platform?
Is there anything Darwin can't do for data mining in the bioinformatics database?
