Methodological Foundations of Biomedical Informatics (BMSC-GA 4449) Himanshu Grover
Big Data in Biology/Healthcare
3 Vs:
- Volume
- Velocity
- Variety (richness/complexity; structured, semi-structured, unstructured)
Examples:
- Omics technologies (proteomics, genomics/NGS, metabolomics, etc.)
- Clinical data: EMRs (patients, providers, medications, procedures, symptoms, diagnoses, financials)
Need simple and advanced analytics: exploratory analyses and knowledge discovery, complex visualizations, reporting, operations, etc.
Utilization: Persistence + Analytics
File System vs. Databases
File system:
1. Ex. ASCII, semi-structured (XML), binary
2. Overhead in parsing
3. No indexing/search/filtering
4. Too large
Databases:
1. Efficient storage and access
2. Analytics
Example: Proteomics
MS/MS spectrum information (identifier, peaks, peptide info):
{'experimentName' : ' ', 'filename' : 'GPM mgf', 'scan' : 3, 'mz' : , 'expPeaks' : [ { 'mz' : , 'intensity' : 14.0}, { 'mz' : , 'intensity' : 23.0}, { 'mz' : , 'intensity' : 19.0}, { 'mz' : , 'intensity' : 22.0},... ] }...
Relational Databases
Issues: impedance mismatch; difficulty running on clusters
Candidate schemas for the spectrum document from the previous slide (identification, peaks, peptide info):
- v1 (un-normalized): Spectrum table (id/pk, exp name, file name, scan, …, peaks), with all peaks stored in a single column
- v2 (un-normalized): Spectrum table (id/pk, exp name, file name, scan, …, peak1, peak2, …), with one column per peak
- v3 (normalized): Spectrum table (id/pk, exp name, file name, scan, …) plus Peaks table (id/pk, fk, mz, int)
Spectrum Ex. cont'd
Example of a 1-to-many relationship
Un-normalized schema
- Redundancy and disk wastage
- Non-uniformity (e.g., different numbers of peaks per spectrum)
- Queryability varies with blob storage
Normalized schema
- Effective, but requires joins
Other examples (relationship types?): proteins-to-peptides; genes-to-proteins; patients-to-diseases
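The normalized layout and the join it requires can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names, and the experiment/file values are assumptions, not taken from the slides; the two peaks reuse the mz/intensity values from the 'peaks' example on the MongoDB slide.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spectrum (
        id        INTEGER PRIMARY KEY,
        exp_name  TEXT,
        file_name TEXT,
        scan      INTEGER
    );
    CREATE TABLE peak (
        id          INTEGER PRIMARY KEY,
        spectrum_id INTEGER NOT NULL REFERENCES spectrum(id),
        mz          REAL,
        intensity   REAL
    );
""")

# One spectrum row, many peak rows (the 1-to-many relationship).
conn.execute(
    "INSERT INTO spectrum VALUES (1, 'demo-experiment', 'demo.mgf', 3)"
)
conn.executemany(
    "INSERT INTO peak (spectrum_id, mz, intensity) VALUES (?, ?, ?)",
    [(1, 792.6, 14.0), (1, 874.6, 23.0)],
)

# Reassembling a spectrum with its peaks needs a join across both tables.
rows = conn.execute("""
    SELECT s.scan, p.mz, p.intensity
    FROM spectrum AS s
    JOIN peak AS p ON p.spectrum_id = s.id
    WHERE s.id = 1
    ORDER BY p.mz
""").fetchall()
print(rows)
```

Any number of peaks per spectrum fits without schema changes, at the cost of a join on every read of a full spectrum.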
Not Only SQL (NoSQL) / Non-Relational
Key features:
- Aggregate orientation: closely related data that is accessed as a unit (an aggregate) is stored together, which leads to faster read/write operations
- Facility for rich structure
- Easier to program data access (application development productivity)
- Application/context-specific, unlike the generic relational data model (database as an integration point)
Representation:
- Key-value stores (e.g., Riak, Redis)
- Column-family stores (e.g., Cassandra, HBase)
- Document-oriented stores (e.g., MongoDB, CouchDB)
- Graph databases (e.g., Neo4j)
Why MongoDB: Flexible
Collections (≈ tables) of documents (≈ rows)
Documents = sets of key-value pairs (e.g., Python dict, Java HashMap)
doc = { '_id' : , : , : { : , : }, : [ { : }, { : val 42 }, …] }
Field types:
- Simple, e.g. 'experimentName' : ' '
- Embedded/hierarchical
- List (non-uniform), e.g. 'peaks' : [{'mz' : 792.6, 'int' : 14.0}, {'mz' : 874.6, 'int' : 23.0}, …]
Non-uniform and dynamic
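A minimal sketch of the three field types and of the "non-uniform and dynamic" property, using plain Python dicts as stand-ins for documents. The `_id` values, the `source` field, and the experiment/file names are placeholders invented for illustration; only the two peak entries come from the slide.

```python
import json

# Hypothetical documents; names and values are placeholders.
spectrum_a = {
    "_id": 1,
    "experimentName": "demo-experiment",             # simple field
    "source": {"filename": "demo.mgf", "scan": 3},   # embedded document
    "peaks": [                                       # list of sub-documents
        {"mz": 792.6, "int": 14.0},
        {"mz": 874.6, "int": 23.0},
    ],
}

# Non-uniform: a second document in the same collection may carry
# different fields and a different number of peaks.
spectrum_b = {
    "_id": 2,
    "experimentName": "demo-experiment",
    "peaks": [{"mz": 792.6, "int": 14.0}],
}

# Dynamic: a field can be added later without touching other documents.
spectrum_b["rt"] = 0.0

collection = [spectrum_a, spectrum_b]  # a list stands in for a collection
peak_counts = [len(doc["peaks"]) for doc in collection]
print(peak_counts)
print(json.dumps(spectrum_b, sort_keys=True))
```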
Collection: SpectrumArchive
{'_id' : ObjectId('52c ded5b32082bb5'),
 'experimentName' : ' ',
 'filename' : 'GPM mgf',
 'scan' : 1749,
 'mz' : , 'intensity' : 0.0, 'rt' : 0.0,
 'expPeaks' : [ { 'mz' : , 'intensity' : 14.0}, { 'mz' : , 'intensity' : 23.0}, { 'mz' : , 'intensity' : 19.0}, { 'mz' : , 'intensity' : 22.0},... ] }
...
Data Modeling: Design Choices
Document structure (entities/aggregates)
- Data access patterns => what is accessed together must go together
Relationships
- 1-to-few, 1-to-many, many-to-many
- Embedding (de-normalized) vs. referencing (normalized)
- Cardinality of a relationship may be unbounded and/or quite large (in some cases)
Document growth issue (embedded arrays that keep growing enlarge the document; MongoDB caps a single document at 16 MB)
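The embedding vs. referencing choice can be illustrated with plain Python structures. This is a sketch under assumptions: the field name `spectrum_id` and the helper functions are hypothetical, and lists and dicts stand in for collections.

```python
# Embedded (de-normalized): peaks live inside the spectrum document.
embedded = {
    "_id": 1,
    "scan": 3,
    "peaks": [{"mz": 792.6, "int": 14.0}, {"mz": 874.6, "int": 23.0}],
}

# Referenced (normalized): peaks are separate documents that point
# back at their spectrum through a hypothetical 'spectrum_id' field.
spectra = {1: {"_id": 1, "scan": 3}}
peaks = [
    {"_id": 10, "spectrum_id": 1, "mz": 792.6, "int": 14.0},
    {"_id": 11, "spectrum_id": 1, "mz": 874.6, "int": 23.0},
]


def load_embedded(doc):
    """One lookup returns the whole aggregate."""
    return doc["peaks"]


def load_referenced(spectrum_id):
    """A second pass over the peaks is needed (an application-side join)."""
    return [
        {"mz": p["mz"], "int": p["int"]}
        for p in peaks
        if p["spectrum_id"] == spectrum_id
    ]


same = load_embedded(embedded) == load_referenced(1)
print(same)
```

Embedding suits data that is read together and stays bounded in size; referencing keeps each document small when the "many" side can grow without limit.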
Why MongoDB: Distribution
Aggregates are a natural unit of interaction as well as distribution
- No notion of joins
- Scale out storage and processing
Two forms
- Replication
- Sharding
[Diagram: Application Code -> MongoS -> MongoD - 1, MongoD - 2, …, MongoD - 5]
- Automatic data distribution
- Seamless distributed query processing and analytics
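To make the sharding idea concrete, here is a toy sketch of hash-based routing: each document is assigned to a shard by hashing a key, and a lookup goes only to the shard that owns that key. This is not MongoDB's actual placement algorithm; the shard count, the choice of `scan` as the key, and both function names are assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size, for illustration only


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each list stands in for the data held by one storage node.
shards = {i: [] for i in range(NUM_SHARDS)}

# Distribute whole documents (aggregates); no document is split up.
for scan in range(1, 13):
    doc = {"scan": scan, "peaks": []}
    shards[shard_for(doc["scan"])].append(doc)


def find_by_scan(scan):
    """Router role: consult only the shard that owns this key."""
    return [d for d in shards[shard_for(scan)] if d["scan"] == scan]


print({i: len(docs) for i, docs in shards.items()})
print(find_by_scan(7))
```

Because an aggregate is stored whole on one shard, a query by key touches a single node and needs no cross-node join.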
Why MongoDB: Other Features
- Powerful query language and operators, including the ability to look into nested/embedded documents and arrays/lists
- Secondary indexing
- Performant and extensive analytics operators over distributed databases
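What "looking into arrays" means can be sketched in plain Python. The comment shows a MongoDB-style filter using the `$elemMatch`, `$gt`, and `$gte` operators; the function below only imitates that behaviour on in-memory dicts and is not part of any driver. The documents and thresholds are made up for illustration.

```python
# Illustrative documents shaped like the SpectrumArchive example.
docs = [
    {"scan": 1, "expPeaks": [{"mz": 792.6, "intensity": 14.0},
                             {"mz": 874.6, "intensity": 23.0}]},
    {"scan": 2, "expPeaks": [{"mz": 792.6, "intensity": 19.0}]},
]

# MongoDB-style filter that this sketch imitates:
#   {"expPeaks": {"$elemMatch": {"mz": {"$gt": 800},
#                                "intensity": {"$gte": 20}}}}


def has_matching_peak(doc, min_mz, min_intensity):
    """True if one single peak satisfies both conditions at once."""
    return any(
        p["mz"] > min_mz and p["intensity"] >= min_intensity
        for p in doc.get("expPeaks", [])
    )


hits = [d["scan"] for d in docs if has_matching_peak(d, 800, 20)]
print(hits)
```

The key point is that both conditions must hold for the same array element, which is what distinguishes an element match from two independent conditions on the array.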
Demo
Basic CRUD operations:
- C: Create
- R: Read
- U: Update
- D: Delete
Run:
- From the mongo shell
- From the PyMongo driver
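The four operations could look roughly like this with PyMongo. This is a sketch rather than the demo shown in class: the function name `crud_demo`, the database name, and the document contents are assumptions. It uses the `insert_one`, `find_one`, `update_one`, and `delete_one` collection methods and takes the collection as a parameter so that no server connection is made when the function is merely defined.

```python
def crud_demo(collection):
    """Run create, read, update, delete against a PyMongo collection."""
    # Create: insert one spectrum document; the driver assigns an _id.
    inserted_id = collection.insert_one(
        {"scan": 3, "expPeaks": [{"mz": 792.6, "intensity": 14.0}]}
    ).inserted_id

    # Read: fetch the document back by its _id.
    found = collection.find_one({"_id": inserted_id})

    # Update: $push appends one peak to the embedded array.
    collection.update_one(
        {"_id": inserted_id},
        {"$push": {"expPeaks": {"mz": 874.6, "intensity": 23.0}}},
    )
    updated = collection.find_one({"_id": inserted_id})

    # Delete: remove the document and report how many were deleted.
    deleted = collection.delete_one({"_id": inserted_id}).deleted_count
    return found, updated, deleted


# Assumed usage, with PyMongo installed and a server on localhost:
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://localhost:27017")["demo"]["SpectrumArchive"]
#   print(crud_demo(coll))
```

In the mongo shell the corresponding methods are spelled in camelCase (`insertOne`, `findOne`, `updateOne`, `deleteOne`).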
Some Limitations of MongoDB
- No built-in atomic transactions across multiple documents or collections (in releases before MongoDB 4.0, which added multi-document transactions)
- Difficult to model large numbers of many-to-many relationships (consider graph databases)
Some Resources
NoSQL Distilled: broad discussion of the different categories of NoSQL databases
MongoDB-specific:
- MongoDB website: user manual, public talks/presentations
- MongoDB University: "MongoDB for Developers" course, which uses Python
- MongoDB: The Definitive Guide
Take Home
- No one size fits all
- Choice depends on application requirements