MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Bioinformatics Data Management Lecture 3 Structured.

Slides:



Advertisements
Similar presentations
What is a Database By: Cristian Dubon.
Advertisements

C6 Databases.
1 Basic DB Terms Data: Meaningful facts, text, graphics, images, sound, video segments –A collection of individual responses from a marketing research.
CSE 190: Internet E-Commerce Lecture 10: Data Tier.
Using Relational Databases and SQL Steven Emory Department of Computer Science California State University, Los Angeles Lecture 1: Introduction to Relational.
3-1 Chapter 3 Data and Knowledge Management
A Guide to SQL, Seventh Edition. Objectives Understand the concepts and terminology associated with relational databases Create and run SQL commands in.
Chapter 4 Relational Databases Copyright © 2012 Pearson Education, Inc. publishing as Prentice Hall 4-1.
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
Concepts of Database Management Sixth Edition
1ISM - © 2010 Houman Younessi Lecture 3 Convener: Houman Younessi Information Systems Spring 2011.
Chapter 7 Managing Data Sources. ASP.NET 2.0, Third Edition2.
Data Base Management System
Chapter 4 Relational Databases Copyright © 2012 Pearson Education 4-1.
Chapter 4 Relational Databases and Enterprise Systems
Introduction To Databases IDIA 618 Fall 2014 Bridget M. Blodgett.
IST Databases and DBMSs Todd S. Bacastow January 2005.
Dale Roberts 1 Department of Computer and Information Science, School of Science, IUPUI Dale Roberts, Lecturer Computer Science, IUPUI
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Session 5: Working with MySQL iNET Academy Open Source Web Development.
Database Lecture # 1 By Ubaid Ullah.
1 Introduction to databases concepts CCIS – IS department Level 4.
ASP.NET Programming with C# and SQL Server First Edition
Database Technical Session By: Prof. Adarsh Patel.
CIS 103 — Applied Computer Technology Last Edited: September 17, 2010 by C.Herbert Using Database Management Systems.
Concepts and Terminology Introduction to Database.
I Copyright © Oracle Corporation, All rights reserved. Introduction.
Normalization (Codd, 1972) Practical Information For Real World Database Design.
10/17/2012ISC471/HCI571 Isabelle Bichindaritz 1 Technologies Databases.
 2004 Prentice Hall, Inc. All rights reserved. 1 Segment – 6 Web Server & database.
Lecture2: Database Environment Prepared by L. Nouf Almujally & Aisha AlArfaj 1 Ref. Chapter2 College of Computer and Information Sciences - Information.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Concepts of Database Management Seventh Edition
1.1 CAS CS 460/660 Relational Model. 1.2 Review E/R Model: Entities, relationships, attributes Cardinalities: 1:1, 1:n, m:1, m:n Keys: superkeys, candidate.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
Lecture2: Database Environment Prepared by L. Nouf Almujally 1 Ref. Chapter2 Lecture2.
Lecture # 3 & 4 Chapter # 2 Database System Concepts and Architecture Muhammad Emran Database Systems 1.
DataBase Management System What is DBMS Purpose of DBMS Data Abstraction Data Definition Language Data Manipulation Language Data Models Data Keys Relationships.
Prepared By Prepared By : VINAY ALEXANDER ( विनय अलेक्सजेंड़र ) PGT(CS),KV JHAGRAKHAND.
Prepared by The Smartpath Information Systems
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Database Fundamental & Design by A.Surasit Samaisut Copyrights : All Rights Reserved.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Essential Computing for Bioinformatics Lecture.
Database Systems Lecture 1. In this Lecture Course Information Databases and Database Systems Some History The Relational Model.
Relational Databases: Basic Concepts BCHB Lecture 21 By Edwards & Li Slides:
Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning with Python 1 Introduction to Python programming for Bioinformatics.
Alex Ropelewski Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing Bienvenido Vélez
MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Sravanthi Lakkimsety Mar 14,2016.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
LECTURE TWO Introduction to Databases: Data models Relational database concepts Introduction to DDL & DML.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Using Molecular Biology to Teach Computer Science High-level Programming with Python Finding Patterns.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Essential BioPython Manipulating Sequences with Seq 1.
High-level Programming with Python Expressions and Variables Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
Introduction to Database Programming with Python Gary Stewart
Data Resource Management Data Concepts Database Management Types of Databases Chapter 5 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies,
1 A Short Introduction to Analyzing Biological Data Using Relational Databases Part II: Creating a Relational Database to Model Biological Data Alex Ropelewski.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Fundamental of Database Systems
A Short Introduction to Analyzing Biological Data Using Relational Databases Part III: Writing Simple (Single Table) Queries to Access Relational Data.
Bioinformatics Data Management
Chapter 4 Relational Databases
Tools for Memory: Database Management Systems
Database Basics An Overview.
The Relational Model Textbook /7/2018.
Contents Preface I Introduction Lesson Objectives I-2
INTRODUCTION A Database system is basically a computer based record keeping system. The collection of data, usually referred to as the database, contains.
Presentation transcript:

MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Bioinformatics Data Management Lecture 3 Structured Databases

The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include: Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University. Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus. Dr. Alade Tokuta, North Carolina Central University. Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez. Dr. Satish Bhalla, Johnson C. Smith University. Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes. Most recent versions of these presentations can be found at Bioinformatics Data Management

Structured Databases: Outline Structured Databases at a Glance - Characteristics Advantages of Structured Databases Data Independence Disadvantages of Structured Databases Examples of Structured Databases –Hierarchical Databases –Networked Databases –Relational Databases –XML Databases

Structured Databases at a Glance All information organized in same way (Data Model) Language available to –describe (create) the database –insert data –manipulate data –update Language establishes an abstract data model: Data Independence Programs using language can work across systems Facilitates communication and sharing data These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 4

Data Independence These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 5 Application Data Model Shield apps from changes in “physical” platform specific layers OS and File System Tables, relations Files, directories Objective:

Disadvantages of Structured Databases These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 6 Model may not fit data needs Example: How to represent proteins in table format Approach #1 – Store one residue per column AccNumNameAA1AA2AA3…AA517 AAA16331G-gamma globinMGH AAA51693Aldehyde Dehydrogenase MLR…S Wasted space Must make number of columns = max length of any possible sequence What is this number?

Disadvantages of Structured Databases These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 7 Model may not fit data needs Approach #2 – Store one residue per rows AccNumAAPosAAName AAA163311MG-gamma globin AAA163312MG-gamma globin ………… AAA516931MAldehyde Dehydrogenase ………… Difficult to recover sequence in string form using DML

Disadvantages of Structured Databases These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 8 Model may not fit data needs Approach #3 – Put aminoacid sequence in one attribute AccNumNameSequence AAA16331G-gamma globin“MGHFTEEDKA….” AAA51693Aldehyde Dehydrogenase“MLRAAARFPGP….” …… Must analyze sequence using program outside SQL Loose some benefits of Data Model

A Simple Relational Example Intuitively model consists of tables –Rows are objects or “entities” –Columns are “attributes” of entities –Attributes cross reference other tables These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 9 Accessio nNum NameDiscBySeqBy P20123 Q30321 R323NULL NumNameLastCountry 1JohnDoeEngland 2JaneDoeEngland 3HughNicholasUSA ProteinsScientists Intuitively:  Protein P201 discovered by Jane Doe  Hugh discovered R2 and sequenced P201 Primary key

Structured Databases: Other examples XML Databases Query language = XPATH/XQUERY These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 10

Relational Databases: Outline Introduction and Examples Relational Database Design by Example entities and relational diagrams normal forms SQL (Sequel) Language SQL Data Manipulation –Select –Joins –Updates and deletes –Inserts

Relational Databases: Timeline Originally proposed by E.F. Codd in 1970 First research prototypes in early 80’s: UC Berkeley –System IBM Today the market exceeds $20B annually These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 12 Edgar F. Codd

Relational Databases Products Commercial –Oracle –MS SQL Server –IBM DB2 Open Source –MySQL –Postgres –SQLite These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 13

Example Relational Database Design These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 14 Goal: Store results from multiple sequence search attempts Leverage SQL to analyze large result set Entities to be stored: Matching sequences with scores for each search matrix Acc#DefinitionSourceMatrixeValueSearchDate P14555Group IIA Phospholipase A2 HumanPam E-327/21/07 P81479Phospholipase A2 isozyme IV Indian Green Tree Viper Pam E527/21/07 P14555Group IIA Phospholipase A2 HumanBlosom E-337/20/07 P81479Phospholipase A2 isozyme IV Indian Green Tree Viper Blosom E-547/20/07 Problems: Lots of redundant information

Dealing with Redundancy These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 15 AccNumDefinitionSource P14555Group IIA Phospholipase A2Human P81479Phospholipase A2 isozyme IVIndian Green Tree Viper Acc#DateMatrixeValue P145557/21/07Pam E-32 P814797/21/07Pam E52 P145557/20/07Blosom E-33 P814797/20/07Blosom E-54 Sequences Matches Foreign key Normalization Still redundant

Dealing with Redundancy These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 16 Run#MatrixDate 17/21/07Pam70 27/20/07Blosom80 Sequences Runs Matches Normalization Acc#DefinitionSource P14555Group IIA Phospholipase A2Human P81479Phospholipase A2 isozyme IVIndian Green Tree Viper Acc#Run#eValue P E-32 P E52 P E-33 P E-54

Basic Normal Forms First Normal Form –Table must be flat; no multi-valued attributes Second Normal Form –All non-key attributes determined by whole primary key Third Normal Form –All non-key attributes can ONLY depend on whole primary key directly Others… These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 17

Entity Relationship Diagrams These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 18 sequences Runs N-to-N relations model as table n n Acc#DefinitionSource Sequences Run#MatrixDate Runs Matches Acc#Run#eValue

Entity Relationship Diagrams These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 19 Organism Sequence one-to-many relationships 1 n IDNameDescription 1Cod wormPseudoterranova decipiens 2HumanHomo sapiens Acc#SrcIDName CAA777431Hemoglobin AB Hemoglobin No table necessary for one-to-many rel’s Sequence Organism

Simple Query Language (SQL) Organization –Data definition/schema creation –Data manipulation Insertion Manipulation Updates Removals – A standard (ISO) since 1987 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 20

SQL Data Definition Language (DDL) The CREATE statement These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 21 Sequences Runs Matches Acc#Run#eValue CREATE TABLE Sequences( AccNum int, Definition varchar (255), Source varchar(255) ) CREATE TABLE Runs( RunNum int, Matrix varchar(255), Date date) CREATE TABLE Matches( AccNum int, RunNum int, eValue int) Acc#DefinitionSource Run#MatrixDate

SQL Data Manipulation Language (DML) Foundation is relational algebra An algebra on relations with three basic operations: These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 22 OperationDescription ProjectionSelect subset of attributes of a table SelectionSelect subset of entities in a table JoinMerge two tables into one All three operations yield a new table as the result

Relational Algebra - Projection These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 23 Selects a subset of attributes or columns Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus ABG47031HemoglobinHomo Sapiens Acc#Source CAM22514Mus Musculus CAO91797Mus Musculus ABG47031Homo Sapiens Intuitively: Find organism that each sequence belongs to π Acc#, Source ( Sequences ) Sequences

Relational Algebra - Selection These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 24 Selects a subset of entities or rows Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus ABG47031HemoglobinHomo Sapiens Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus No reason to keep organism is result Sequences σ Source=Mus Musculus (Sequences) Intuitively: Find all sequences from Mus Musculus

Relational Algebra: Projection and Selection These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 25 σ Source= Mus Musculus (Sequences) π Acc#, Definition ( Sequences ) Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus Acc#Definition CAM22514Hemoglobin alpha CAO91797Hemoglobin X π Acc#, Definition (σ Source= Mus Musculus (Sequences) ) Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus ABG47031HemoglobinHomo Sapiens Sequences

Relational Algebra - Natural Join These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 26 Requires attributes for join to have same name Sequences SourceIDName 1Mus Musculus 2Homo Sapiens Organisms Sequences Organisms AccNumDefinitionSourceIDName CAM22514Hemoglobin alpha1Mus Musculus CAO91797Hemoglobin X1Mus Musculus ABG47031Hemoglobin2Homo Sapiens AccNumDefinitionSourceID CAM22514Hemoglobin alpha1 CAO91797Hemoglobin X1 ABG47031Hemoglobin2

Relational Algebra - Inner Join These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 27 AccNumDefinitionSrcID CAM22514Hemoglobin alpha1 ABG47031Hemoglobin2 NPO34535Hemoglobin zeta3 Rows without matching attributes excluded Sequences SrcIDName 1Mus Musculus 2Homo Sapiens Organisms Can name attribute explicitly sequences.SrcID = organisms.SrcID Sequences Organisms AccNumNameSrcIDName CAM22514Hemoglobin alpha1Mus Musculus ABG47031Hemoglobin2Homo Sapiens

Relational Algebra - Left Join These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 28 Tuples (rows) on left table without matches have NULL values on attributes of right table AccNumNameSrcIDName CAM22514Hemoglobin alpha1Mus Musculus ABG47031Hemoglobin2Homo Sapiens NPO34535Hemoglobin zeta3NULL AccNumDefinitionSrcID CAM22514Hemoglobin alpha1 ABG47031Hemoglobin2 NPO34535Hemoglobin zeta3 Sequences SrcIDName 1Mus Musculus 2Homo Sapiens Organisms sequences.SrcID = organisms.SrcID Sequences Organisms

SQL Select - Projection These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 29 Selects a subset of attributes or columns Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus ABG47031HemoglobinHomo Sapiens Acc#Source CAM22514Mus Musculus CAO91797Mus Musculus ABG47031Homo Sapiens π Acc#, Source ( Sequences ) Sequences SELECT Acc#, Source FROM Sequences

SQL Select - Selection These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 30 Selects a subset of entities or rows Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus ABG47031HemoglobinHomo Sapiens Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus No reason to keep organism is result Sequences σ Source=Mus Musculus (Sequences) SELECT * FROM Sequences WHERE Source= ‘Mus Musculus’ “*” means “all” attributes

SQL Select: Projection and Selection These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 31 σ Source= Mus Musculus (Sequences) π Acc#, Definition ( Sequences ) Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus Acc#Definition CAM22514Hemoglobin alpha CAO91797Hemoglobin X π Acc#, Definition (σ Source= Mus Musculus (Sequences) ) Acc#DefinitionSource CAM22514Hemoglobin alphaMus Musculus CAO91797Hemoglobin XMus Musculus ABG47031HemoglobinHomo Sapiens Sequences SELECT SeqNum, Org FROM Sequences WHERE Source= Mus Musculus

SQL Select - Natural Join These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 32 Sequences SourceIDName 1Mus Musculus 1 2Homo Sapiens Organisms Sequences Organisms AccNumDefinitionSourceIDName CAM22514Hemoglobin alpha1Mus Musculus CAO91797Hemoglobin X1Mus Musculus ABG47031Hemoglobin2Homo Sapiens AccNumDefinitionSourceID CAM22514Hemoglobin alpha1 CAO91797Hemoglobin X1 ABG47031Hemoglobin2 SELECT * FROM Sequences NATURAL JOIN Organisms

SQL Select - Inner Join These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 33 AccNumDefinitionSrcID CAM22514Hemoglobin alpha1 ABG47031Hemoglobin2 NPO34535Hemoglobin zeta3 Rows without matching attributes excluded Sequences SrcIDName 1Mus Musculus 2Homo Sapiens Organisms Can name attribute explicitly sequences.SrcID = organisms.SrcID Sequences Organisms AccNumNameSrcIDName CAM22514Hemoglobin alpha1Mus Musculus ABG47031Hemoglobin2Homo Sapiens SELECT * FROM Sequences INNER JOIN Organisms on sequences.SrcID = organisms.SrcID

SQL Select - Left Join These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 34 Tuples (rows) on left table without matches have NULL values on attributes of right table AccNumNameSrcIDName CAM22514Hemoglobin alpha1Mus Musculus ABG47031Hemoglobin2Homo Sapiens NPO34535Hemoglobin zeta3NULL AccNumDefinitionSrcID CAM22514Hemoglobin alpha1 ABG47031Hemoglobin2 NPO34535Hemoglobin zeta3 Sequences SrcIDName 1Mus Musculus 2Homo Sapiens Organisms sequences.SrcID = organisms.SrcID Sequences Organisms SELECT * FROM Sequences LEFT OUTER JOIN Organisms ON Sequences.SrcID = Organisms.SrcID;

SQL UPDATE Statement These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 35 Acc#DateMatrixeValue P145557/21/07Pam E-32 P814797/21/07Pam E52 P145557/20/07Blosom E-33 P814797/20/07Blosom E-54 Acc#DateMatrixeValue P145557/22/07Pam E-32 P814797/22/07Pam E52 P145557/20/07Blosom E-33 P814797/20/07Blosom E-54 UPDATE Matches SET Date = ‘7/22/07’ WHERE Date = ‘7/21/07’ Matches

SQL DELETE Statement These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 36 Acc#DateMatrixeValue P145557/21/07Pam E-32 P814797/21/07Pam E52 P145557/20/07Blosom E-33 P814797/20/07Blosom E-54 Acc#DateMatrixeValue P145557/20/07Blosom E-33 P814797/20/07Blosom E-54 DELETE FROM Matches WHERE Date = ‘7/21/07’ Matches ADVICE: BE CAREFUL WITH DELETE. THERE IS NO EASY UNDO

SQL Select with SORT BY These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 37 Acc#DateMatrixeValue P145557/21/07Pam E-32 P814797/21/07Pam E52 P145557/20/07Blosom E-33 P814797/20/07Blosom E-54 SELECT * FROM Matches ORDER BY eValue ASC Matches Acc#DateMatrixeValue P814797/20/07Blosom E-54 P814797/22/07Pam E52 P145557/20/07Blosom E-33 P145557/22/07Pam E-32

SQL Data Manipulation Grouping Results and Aggregates These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 38

Filling-Up the database - Insert These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 39 Good when you know the attributes of all entities c-priori SrcIDName 1Mus Musculus 2Homo Sapiens Organisms INSERT INTO Organisms Values (3, ‘C. Elegans’) SrcIDName 1Mus Musculus 2Homo Sapiens 3C. Elegans Organisms

Filling-Up the database Approach #2: Input from file (CSV) Under construction Tricky to set data types handled correctly These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 40

Case Study: Analyzing Results of Multiple BLAST Runs with alternative search matrices Steps at a Glance –Install and configure tools Python environment including BioPython Database system (SQLite) –Design the relational database schema –Write Python functions to insert results into database –Analyze data using SQL queries These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 41

Case Study (Step 1): Install Tools Install Python interpreter –Download installer from –Run installer and verify installation Install BioPython –Download installer from –Run installer and verify installation Install SQLiteMan –Download SQLiteMan query browser from –Run installer and verify installation These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 42 Windows version

Case Study (Step 1): Install Tools UNDER CONSTRUCTION These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 43 Mac OS version

Case Study (Step 2):Design Database Schema These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 44 SourceAccession Num DefinitionMatrixeValueSearchDate HumanP14555Group IIA Phospholipase A2 Pam E-327/21/07 Indian Green Tree Viper P81479Phospholipase A2 isozyme IV Pam E527/21/07 HumanP14555Group IIA Phospholipase A2 Blosom E-337/20/07 Indian Green Tree Viper P81479Phospholipase A2 isozyme IV Blosom E-547/20/07

Case Study Step 3: Python Programming Write Python functions to: –Run a BLAST search for the query sequence and a specific matrix –Insert results into Database Run SQL insert queries to populate the database Create a CSV file with all results from BLAST searches and import to DB –Run all functions together running one Blast for each one of a set of matrices These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 45

Case Study Step 3: Python Programming Run a BLAST search for the query sequence and a specific matrix These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 46 From Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML from Bio import Fasta from sys import * import sqlite3 def searchAndStoreBlastSearch(query, matrix_id, filename): # Creates handle to store the results sent by NCBI website results_handle = NCBIWWW.qblast("blastp", "swissprot", query, expect=10, descriptions=2000, alignments=2000, hitlist_size=2000, matrix_name=matrix_id) # Reads results into memory blast_results = results_handle.read() # Creates and opens a file in filesystem for writing save_file = open(filename, 'w') # Store results from memory into the filesystem save_file.write(blast_results) #Close the file handler save_file.close()

Case Study Step 3: Python Programming Insert results into Database: Run multiple SQL insert queries These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 47 def xmlToDatabase(filename, matrix_id, dbCursor, dbConn): blast_fileptr=open(filename) record=NCBIXML.parse(blast_fileptr).next() for alignment in record.alignments: hsp=alignment.hsps[0] storeIntoDatabase(hsp, alignment, matrix_id,dbCursor,dbConn) blast_fileptr.close()

Case Study Step 3: Python Programming Insert results into Database: Run multiple SQL insert queries These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 48 def storeIntoDatabase(hsp, alignment, matrix_id, dbCursor, dbConn): # Insert a row of data into the table (securely) t = (hsp.expect, hsp.score, matrix_id, alignment.accession, alignment.title,) dbCursor.execute("insert into sequences values (?,?,?,?,?)",t) # Save (commit) the changes dbConn.commit()

Case Study Step 3: Python Programming Insert results into Database: Create a CSV file with all results from BLAST searches These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 49 def xmlToCSV(xmlFilename, csvFilename, matrix_id): # Create file handler blast_fileptr=open(xmlFilename) # Parse xml file into biopython object record=NCBIXML.parse(blast_fileptr).next() # Create csv file writer csvFileWriter = csv.writer(open(csvFilename,'wb')) # Iterate over record alignments for alignment in record.alignments: # Extract high score pairwise alignment object hsp=alignment.hsps[0] # Create record vector record = [hsp.expect, hsp.score, matrix_id, alignment.accession, alignment.title,] # Store record vector into csv file csvFileWriter.writerow( record ) # Close file handler blast_fileptr.close()

Case Study Step 3: Python Programming Use all functions to run one Blast for each of a sequence of search matrices These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 50 def doIt(): print "Import fasta file" query_file = open('P00624.fasta') fasta = Fasta.Iterator(query_file) query = fasta.next() query_file.close() print "Creating database interface" dbConnection = sqlite3.connect(dbFileName) dbCursor = dbConnection.cursor() # Create tables dbCursor.execute("DROP TABLE IF EXISTS sequences") dbCursor.execute("CREATE TABLE IF NOT EXISTS sequences (expect real, score int, matrix text, \ accession text, description text)") print "Iterate over matrixes" for i in range(0,len(MatrixName)): print "Sending blast search" searchAndStoreBlastSearch(query, MatrixName[i], FileName[i]) print "Processing blase search" xmlToDatabase(FileName[i], MatrixName[i], dbCursor, dbConnection) print "Program done" Matrixname=['PAM70','BLOSUM80'] FileName=['TestPAM70.xml','TestBLOSUM80.xml'] dbFileName='data.db3'

Case Study Step 3: Python Programming Run all functions together running one Blast for each one of a set of matrices Import CSV version These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 51 MatrixName=['PAM70','BLOSUM80'] TableName=['PAM70','BLOSUM80'] FileName=['TestPAM70.xml','TestBLOSUM80.xml'] CsvFileName=['TestPAM70.csv','TestBLOSUM80.csv'] def doIt(): query_file = open('P00624.fasta') fasta = Fasta.Iterator(query_file) query = fasta.next() query_file.close() for i in range(0,len(MatrixName)): searchAndStoreBlastSearch(query, MatrixName[i], FileName[i]) xmlToCSV(FileName[i], CsvFileName[i], MatrixName[i])

Case Study (Step 3): Import CSV File UNDER CONSTRUCTION These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 52

Case Study (Step 4): Analyze Data Analyze data using SQL Example 1: Finding top matches –Display sequences sorted by average score/run and number of runs where they appeared These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 53 select *, COUNT(matrix) as total from sequences group by accession order by total desc

Case Study (Step 4): Analyze Data These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 54 select *, COUNT(matrix) as total from sequences group by accession order by total desc Display sequences sorted by average score/run and number of runs where they appeared

Case Study (Step 4): Analyze Data Analyze data using SQL Example 2: Finding discriminating matrices –Display sequences that only showed up in one matrix run These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 55 select *, COUNT(matrix) as total from sequences group by accession having total = 1

Case Study (Step 4): Analyze Data These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 56 select *, COUNT(matrix) as total from sequences group by accession having total = 1 Display sequences that only showed up in one matrix run

Case Study: Step 4 Generate Graphic Reports using Excel These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 57