PLEXdb Redesign & Implementation
Project: Plex Awesomeness
Course Involved: CS 461/561
Project Members: Jesse Walsh, Brian Nordland, Stephen Mueller, Arun Chander

Introduction to Clients
John Vanhemert
– John is developing new tools for PLEXdb and, as such, works closely with the PLEXdb database. His difficulty understanding the existing database structure and his recognition of its many flaws led him to propose a redesign of the database. John was our primary point of contact, providing us with initial requirements and continuous feedback.
Sudhansu Dash
– Sudhansu is a curator for PLEXdb. He is the expert on the data and how users access it. He helped clarify which data was important and how it was linked together.
Ethalinda Cannon
– Ethy was one of the original creators of PLEXdb. Although she is no longer on the PLEXdb project, she graciously met with us and explained some of the considerations that led to the original design. She was very helpful in explaining how some of the original tables were meant to join together.
Julie Dickerson
– Julie is a PI on the PLEXdb project. She gave the go-ahead to start our pilot project and expressed approval of our ER design considerations.

Plant and Plant Pathogen Gene Expression Database
– Repository containing microarray gene expression data
– MIAME-compliant data submission (MIAME: Minimum Information About a Microarray Experiment)
– Data from > 200 microarray experiments, > 6000 chips
– Experiments from 14 Affymetrix arrays
– 13 species

Requirement Collection
The clients' initial motivations for asking our group to work on their project included:
– Recognition of existing problems, although the extent of the problems had not been assessed.
– A need to store new types of information in PLEXdb, which required updates to the schema.
– Without documentation, knowledge of the database had been lost as its designers moved on. If the database were allowed to keep growing without a clear understanding of its tables, the project would risk introducing problems later on.
– The clients wanted to start fresh with a clearly documented and properly designed schema.

Client Requirements
Expectations from the new database:
– Remove redundancy and normalize the schema
– Provide a better way to store vital information
– Control the overall size of the databases
– Support upcoming technologies in the schema (e.g., nextgen)

Expected Deliverables
– Normalized schema design that can replace the experiment and data portions of the existing schema
– Scripts that can populate the new schema
– Intuitive web-based scripts to edit the organism table
– Views that can read from the new schema and present read-only structures similar to existing tables

ISSUES – Table size
PO – 26
Annotation – 105
Blast – 6
Gramenedata – 40
Interpro – 49
Normalization – 229
Ontology – 14
Plexdb – 36
Submission – 12
Table Overgrowth!

Redundant tables
Creation of new tables that hold the same data
Solution proposed:
– Replace ISAM with InnoDB
– Use joins
– Use indexes to match speed
– Translate table names to attributes
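A minimal sketch of what "translate table names to attributes" could look like, assuming the old layout kept one table per experiment. The table names (`expr_BB1`, `expr_BB2`, `expression`) and columns are hypothetical, and SQLite from Python's standard library stands in for the project's actual database server so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical "before" layout: one table per experiment, so the experiment
# accession is stored only in the table name.
old_tables = {
    "expr_BB1": [("probeset_a", 7.1), ("probeset_b", 5.4)],
    "expr_BB2": [("probeset_a", 6.8)],
}
for table, rows in old_tables.items():
    cur.execute(f"CREATE TABLE {table} (probe_set TEXT, intensity REAL)")
    cur.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

# "After" layout: a single table in which the experiment is an attribute.
# The composite primary key also gives an index for lookups by experiment.
cur.execute(
    "CREATE TABLE expression ("
    " experiment TEXT NOT NULL,"
    " probe_set TEXT NOT NULL,"
    " intensity REAL,"
    " PRIMARY KEY (experiment, probe_set))"
)
for table in old_tables:
    accession = table[len("expr_"):]  # the table name becomes a column value
    cur.execute(
        f"INSERT INTO expression SELECT ?, probe_set, intensity FROM {table}",
        (accession,),
    )
    cur.execute(f"DROP TABLE {table}")

# One query shape now covers every experiment.
rows_bb1 = cur.execute(
    "SELECT probe_set, intensity FROM expression"
    " WHERE experiment = ? ORDER BY probe_set",
    ("BB1",),
).fetchall()
total_rows = cur.execute("SELECT COUNT(*) FROM expression").fetchone()[0]
```

With this layout, adding an experiment means inserting rows rather than creating another table, which is one way the table counts listed earlier could stop growing.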

Improper Storage of Critical Data
Solution proposed: Translate table names to attributes

Other Issues
Improper typing and undefined relations
– Solution proposed: store data using a separate membership table
Redundancy: repeated text blobs
– Solution proposed: minimize the number of places such data is stored by using foreign keys
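The foreign-key idea can be sketched as follows, assuming a long protocol text that used to be repeated on every sample row. The `protocol` table and the sample names are hypothetical; `extract_protocol_id` is taken from the relational schema. SQLite is used only to keep the sketch runnable on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
CREATE TABLE protocol (
    id INTEGER PRIMARY KEY,
    description TEXT NOT NULL UNIQUE
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    sample_name TEXT NOT NULL,
    extract_protocol_id INTEGER NOT NULL REFERENCES protocol(id)
);
""")


def protocol_id(text):
    """Return the id for a protocol text, inserting it only if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO protocol (description) VALUES (?)", (text,)
    )
    row = conn.execute(
        "SELECT id FROM protocol WHERE description = ?", (text,)
    ).fetchone()
    return row[0]


# Hypothetical text blob that three samples share.
blob = "Hypothetical extraction protocol text, identical for every sample."
for name in ("leaf_rep1", "leaf_rep2", "leaf_rep3"):
    conn.execute(
        "INSERT INTO sample (sample_name, extract_protocol_id) VALUES (?, ?)",
        (name, protocol_id(blob)),
    )

# The blob is stored once, however many samples point at it.
stored_copies = conn.execute("SELECT COUNT(*) FROM protocol").fetchone()[0]

# A dangling reference is rejected instead of being silently stored.
try:
    conn.execute(
        "INSERT INTO sample (sample_name, extract_protocol_id) VALUES ('bad', 999)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Declaring the reference also makes the relation explicit in the schema, which speaks to the "undefined relations" issue as well as the redundancy.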

Proposed Improvements
Database Level
– Complete new schema design
– Provide JDBC and SQL scripts for data translation
Weblogic Level
– Complete view of the parent/child relationship for an organism using the nested set model

Technologies Used
– SQL
– Java (version 1.6.0_22)
– PHP

ER DIAGRAM Jesse Walsh

Background
MIAME
– Minimum Information About a Microarray Experiment
– Does not specify a particular format or terminology
PLEXdb claims to be MIAME compliant
– We wanted our design to be MIAME compliant as well
– Unfortunately, we learned about MIAME late in the design process
– We could achieve MIAME compliance with small tweaks

MIAME – 6 critical points
1. The raw data for each hybridisation (e.g., CEL or GPR files)
2. The final processed (normalised) data for the set of hybridisations in the experiment (study) (e.g., the gene expression data matrix used to draw the conclusions from the study)
3. The essential sample annotation, including experimental factors and their values (e.g., compound and dose in a dose response experiment)
4. The experimental design, including sample data relationships (e.g., which raw data file relates to which sample, which hybridisations are technical replicates, which are biological replicates)
5. Sufficient annotation of the array (e.g., gene identifiers, genomic coordinates, probe oligonucleotide sequences or reference commercial array catalog number)
6. The essential laboratory and data processing protocols (e.g., what normalisation method has been used to obtain the final processed data)

Background
Biological data can be complex
Procedures used and data collected can vary widely
– A flexible schema is required to handle this

ER Diagram 16 Entities

ER Diagram

Experiment: an example
[Diagram labels: Experiment; Control, Treatment 1, Treatment 2; Samples; Measurement (measure with microarrays)]

Treatment = Factor + Level
Time
– 10 hrs
– 20 hrs
Temperature
– 30 F
– 50 F
Stress
– Control
– Salinity
– Drought
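The "Treatment = Factor + Level" idea can be sketched with the Factor, Factor_level and Applied_treatment relations from the relational schema. The columns are trimmed to the ones needed here, the sample names are made up, and SQLite is an assumption used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE factor (id INTEGER PRIMARY KEY, factor_name TEXT);
CREATE TABLE factor_level (
    id INTEGER PRIMARY KEY,
    factor_id INTEGER REFERENCES factor(id),
    factor_level TEXT
);
CREATE TABLE sample (id INTEGER PRIMARY KEY, sample_name TEXT);
CREATE TABLE applied_treatment (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER REFERENCES sample(id),
    factor_level_id INTEGER REFERENCES factor_level(id)
);

INSERT INTO factor VALUES (1, 'Time'), (2, 'Stress');
INSERT INTO factor_level VALUES
    (1, 1, '10 hrs'), (2, 1, '20 hrs'),
    (3, 2, 'Control'), (4, 2, 'Drought');
INSERT INTO sample VALUES (1, 'sample_A'), (2, 'sample_B');

-- sample_A: Time = 10 hrs, Stress = Control
-- sample_B: Time = 20 hrs, Stress = Drought
INSERT INTO applied_treatment (sample_id, factor_level_id) VALUES
    (1, 1), (1, 3), (2, 2), (2, 4);
""")


def treatment(sample_name):
    """Return the treatment of a sample as a {factor: level} mapping."""
    rows = conn.execute(
        """
        SELECT f.factor_name, fl.factor_level
        FROM sample s
        JOIN applied_treatment a ON a.sample_id = s.id
        JOIN factor_level fl ON fl.id = a.factor_level_id
        JOIN factor f ON f.id = fl.factor_id
        WHERE s.sample_name = ?
        """,
        (sample_name,),
    ).fetchall()
    return dict(rows)


treatment_b = treatment("sample_B")
```

A new factor (say, Temperature) is then just more rows in `factor` and `factor_level`; no table needs a new column, which is the kind of flexibility the background slides ask for.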

ER Diagram

What is a MicroArray?

Take home message
– Microarrays measure genes
– The smallest things measured are probes
– Probes are grouped and summarized into probe sets
– Roughly, probe set = gene
– A microarray experiment is called a hybridization
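As a rough illustration of "probes are grouped and summarized into probe sets", the sketch below uses a parent/child table in the spirit of Expression_units_hierarchy from the relational schema. The `probe_intensity` table, all names and all values are hypothetical, and the plain average is only a stand-in: the slides do not say which summarization method PLEXdb uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE expression_units (id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE expression_units_hierarchy (parent_id INTEGER, child_id INTEGER);
CREATE TABLE probe_intensity (unit_id INTEGER, intensity REAL);

INSERT INTO expression_units VALUES
    (1, 'probeset_A', 'probe_set'),
    (2, 'probeset_B', 'probe_set'),
    (11, 'A_probe1', 'probe'),
    (12, 'A_probe2', 'probe'),
    (13, 'A_probe3', 'probe'),
    (21, 'B_probe1', 'probe'),
    (22, 'B_probe2', 'probe');

-- Each probe (child) belongs to one probe set (parent).
INSERT INTO expression_units_hierarchy VALUES
    (1, 11), (1, 12), (1, 13), (2, 21), (2, 22);

INSERT INTO probe_intensity VALUES
    (11, 10.0), (12, 12.0), (13, 14.0), (21, 3.0), (22, 5.0);
""")

# Summarize probe-level values into one value per probe set.
# AVG is a placeholder for whatever summarization method is really applied.
summary = dict(
    conn.execute(
        """
        SELECT ps.name, AVG(p.intensity)
        FROM expression_units_hierarchy h
        JOIN expression_units ps ON ps.id = h.parent_id
        JOIN probe_intensity p ON p.unit_id = h.child_id
        GROUP BY ps.name
        """
    ).fetchall()
)
```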

ER Diagram

DATABASE DESIGN Arun Chander

Relational Schema
Factor(ID, factor_name, factor_order)
Factor_level(ID, factor_id, factor_level, factor_level_order)
Provider(ID, provider, provider_institution, provider_head_of_lab, provider_ , provider_telephone, provider_url)
Users(login_id, first, middle, last, head_of_lab_name, lab, institution, street, state_province, city, country, zip_code, telephone, fax, , url, password, activated, created_time, last_upd_time, lastaccess, job_title)
Groups(name, description, creator, owner, created_date, upd_date)
Experiment(ID, accession_no, experiment_name, experiment_description, login_id, array_name, quality_control, quality_control_description, visibility, public_release, curator_visible, reviewer_visible, reviewer_access_code, geo_submit, geo_series, import, atlas, finalized, normalized, mark_delete, sandbox, create, lastmod)

Organism(ID, organism, leftPointer, rightPointer)
Sample(ID, exp_id, sample_accession_no, sample_name, sample_picture, sampling_date, sample_preparation_date, hybridization_date, sample_description, organism, germplasm_name, germplasm_description, ecotype, mutant_description, transgenic_description, organism_part, cell_type, development_stage, extracted_molecule, growth_media, age, growth_temperature, growth_description, environmental_conditions, separation_technique, extract_protocol_id, labeling_protocol_id, hybridization_protocol_id, scanning_protocol_id, washing_procedure_id, create, lastmod, providerid)
Applied_treatment(ID, sample_id, factor_level_id)
Hybridization_alignment(ID, hybridization_accession_no, login_id, experiment_accession_no, sample_id, filename, array_name, CDF_file_name)
Expression_units_type(ID, typename)
Expression_units(ID, name, xvalue, yvalue, sd, pixels, type_id)
Expression_units_hierarchy(ID, parent_id, child_id)
Manufacturer(ID, design_provider)

Platforms(ID, array_name, array_name_full, plex_name, geo_platform, data_file_extn, number_x, number_y, chip_description, CDF_name, CDF_file_name, CDF_file_version, CDF_url, number_units, max_units, num_QC_units, design_provider_id, info_url, download_url, prefix, default_accession_no, blastdb_name, mpt_support, exp_support, disable, create, lastmod)
Memberships(login_id, name)
Normalization_methods(ID, method_name, method_description, citation_id, script_file_name, notes)
Applicable_norm_methods(ID, methodid, array_design_id)
Platform_exprunits(ID, exprid, array_design_id)
Platform_experiment(ID, experiment_id, array_design_id)
Platform_organism(ID, organism_id, array_design_id)
Data_table(ID, expr_id, normmethodid, hybridization_id, intensity)
Statistic(ID, statistic_name, statistic_value double, data_id)

Normalization

DATA MIGRATION Stephen Mueller

Data migration
– Access to VM is slow
– Inconsistencies
  – File names
  – Users that don’t exist
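One way to surface the "users that don't exist" kind of inconsistency before migrating is an anti-join between the referencing table and the users table. The sketch below assumes that experiments reference users through `login_id`, as in the relational schema; the rows are invented, and SQLite stands in for the real source database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (login_id TEXT PRIMARY KEY);
CREATE TABLE experiment (accession_no TEXT PRIMARY KEY, login_id TEXT);

INSERT INTO users VALUES ('alice'), ('bob');
-- 'carol' owns an experiment but has no row in users.
INSERT INTO experiment VALUES
    ('EX1', 'alice'), ('EX2', 'bob'), ('EX3', 'carol');
""")

# LEFT JOIN keeps every experiment; rows with no matching user come back
# with NULL on the users side, which is what the WHERE clause filters on.
orphans = conn.execute(
    """
    SELECT e.accession_no, e.login_id
    FROM experiment e
    LEFT JOIN users u ON u.login_id = e.login_id
    WHERE u.login_id IS NULL
    ORDER BY e.accession_no
    """
).fetchall()
```

Running a check like this first gives a list of rows to fix or exclude, rather than discovering the problem when a foreign key constraint in the new schema rejects the insert.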

State of Release of project
– ER diagram and schema complete

Role of views
– Updating the entire database will take place over time
– Views keep the website working
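A minimal sketch of how a view can keep existing code working while the data moves to the new schema. It assumes a legacy table, here called `sample_info` (a hypothetical name), that stored the organism name directly with each sample; in the new layout the organism has its own table and `organism_id` is an assumed foreign key column. SQLite is used only so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- New, normalized tables (heavily simplified).
CREATE TABLE organism (id INTEGER PRIMARY KEY, organism TEXT);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    sample_name TEXT,
    organism_id INTEGER REFERENCES organism(id)
);
INSERT INTO organism VALUES (1, 'Hordeum vulgare');
INSERT INTO sample VALUES (1, 'leaf_rep1', 1), (2, 'leaf_rep2', 1);

-- The legacy table name is recreated as a view over the new tables.
CREATE VIEW sample_info AS
SELECT s.id AS sample_id, s.sample_name, o.organism
FROM sample s
JOIN organism o ON o.id = s.organism_id;
""")

# An unchanged legacy query still works against the view.
legacy_rows = conn.execute(
    "SELECT sample_name, organism FROM sample_info ORDER BY sample_id"
).fetchall()

# Without INSTEAD OF triggers, SQLite views cannot be written to, which
# matches the read-only structures named in the deliverables.
try:
    conn.execute("INSERT INTO sample_info VALUES (3, 'x', 'y')")
    writable = True
except sqlite3.Error:
    writable = False
```

Other database servers treat view updatability differently, so the read-only behaviour shown here should be checked against the server actually in use.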

Issues Faced & how they were tackled
– Continuous learning
– Continuous requirements gathering
– Complex data
– Data inconsistencies

Issues Faced & how they were tackled
Getting the data we needed
– Sometimes didn’t know who to ask
Virtual Machine
– Installing software
– Accessing for data migration

WEB DEVELOPMENT Brian Nordland

Organism Editor
– Previously the organism was stored with the experiment sample
– No sense of hierarchy
– A hierarchy adds the future ability to store more meaningful info

Organism Editor
– Uses a nested set model for hierarchies
– Makes selecting a portion of the tree easy:
SELECT * FROM tree WHERE lft BETWEEN 2 AND 11
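To make the nested set query concrete, here is a small sketch with a made-up tree whose numbering happens to give the "Monocots" node the bounds 2 and 11 used on the slide. The table and column names `tree`, `lft` and `rgt` follow the slide's query (the project's Organism relation calls the bounds leftPointer and rightPointer); the organisms and numbers are illustrative, not PLEXdb data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tree (name TEXT, lft INTEGER, rgt INTEGER);

-- Each node's (lft, rgt) interval encloses the intervals of its descendants.
INSERT INTO tree VALUES
    ('Plants',   1, 14),
    ('Monocots', 2, 11),
    ('Barley',   3, 4),
    ('Maize',    5, 6),
    ('Rice',     7, 8),
    ('Wheat',    9, 10),
    ('Dicots',  12, 13);
""")

# The query from the slide: the node with bounds (2, 11) and its descendants.
slide_query = [
    row[0]
    for row in conn.execute(
        "SELECT * FROM tree WHERE lft BETWEEN 2 AND 11 ORDER BY lft"
    )
]


def subtree(name):
    """Names of a node and all of its descendants, in tree order."""
    return [
        row[0]
        for row in conn.execute(
            """
            SELECT n.name
            FROM tree n
            JOIN tree p ON n.lft BETWEEN p.lft AND p.rgt
            WHERE p.name = ?
            ORDER BY n.lft
            """,
            (name,),
        )
    ]


def ancestors(name):
    """Names of all ancestors of a node, root first."""
    return [
        row[0]
        for row in conn.execute(
            """
            SELECT p.name
            FROM tree n
            JOIN tree p ON n.lft BETWEEN p.lft AND p.rgt
            WHERE n.name = ? AND p.name <> n.name
            ORDER BY p.lft
            """,
            (name,),
        )
    ]
```

Both lookups are a single query with no recursion, which is the retrieval advantage the slides point to.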

Organism Editor
– The nested set model makes retrieval easy
– Changes are more complicated: “re-indexing” is required
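The "re-indexing" cost can be seen in a sketch of inserting one node: every bound at or to the right of the insertion point has to shift by two to make room. This is the common textbook way to insert into a nested set, not necessarily how the organism editor implements it; the table layout and names are the same illustrative ones as in the slide's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tree (name TEXT, lft INTEGER, rgt INTEGER);
INSERT INTO tree VALUES
    ('Plants',   1, 6),
    ('Monocots', 2, 3),
    ('Dicots',   4, 5);
""")


def add_child(parent_name, child_name):
    """Insert child_name as the last child of parent_name."""
    with conn:  # run the three statements as one transaction
        (parent_rgt,) = conn.execute(
            "SELECT rgt FROM tree WHERE name = ?", (parent_name,)
        ).fetchone()
        # Re-index: open a gap of width 2 at the parent's right bound.
        conn.execute(
            "UPDATE tree SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,)
        )
        conn.execute(
            "UPDATE tree SET lft = lft + 2 WHERE lft > ?", (parent_rgt,)
        )
        # The new leaf occupies the gap.
        conn.execute(
            "INSERT INTO tree VALUES (?, ?, ?)",
            (child_name, parent_rgt, parent_rgt + 1),
        )


add_child("Monocots", "Barley")
after_insert = conn.execute(
    "SELECT name, lft, rgt FROM tree ORDER BY lft"
).fetchall()
```

Adding a single leaf changed the bounds of three existing rows in a four-row tree, which is why writes are the expensive side of this model while reads stay cheap.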

Future Expansion
Organism Editor
– Ability to move portions of the tree
– Login ability for the editor / integration with PLEXdb
Make PLEXdb use our data
– Two-phase process creating views
– Change PLEXdb code to use the data directly
Implement data partitioning

Group Member Roles Every member was involved in each aspect of the project, but each member also focused their efforts on coordinating certain tasks

Group Member Roles
Project Manager: Jesse Walsh
– Responsible for understanding biology concepts
– Focused on ER design
Web Developer: Brian Nordland
– Focused on organism editor
Java Developer: Stephen Mueller
– Focused on data migration
DBA: Arun Chander
– Focused on creation of tables

Questions???