Alex Ropelewski Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing Bienvenido Vélez

1 A Short Introduction to Analyzing Biological Data Using Relational Databases
Part IV: Using SQL to Summarize Data Across Multiple Tables
Alex Ropelewski (ropelews@psc.edu), Pittsburgh Supercomputing Center, National Resource for Biomedical Supercomputing
Bienvenido Vélez (Bienvenido.Velez@upr.edu), University of Puerto Rico at Mayagüez, Department of Electrical and Computer Engineering

2 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. The courses have been developed as part of the NIH-funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include:
– Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University
– Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus
– Dr. Alade Tokuta, North Carolina Central University
– Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez
– Dr. Satish Bhalla, Johnson C. Smith University
Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted to use, modify, and reproduce these materials for teaching purposes. The most recent versions of these presentations can be found at http://marc.psc.edu/

3 Learning Objectives
– Using the JOIN clause to join tables together
– Using aggregate functions to analyze data
– Using the GROUP BY clause to group and summarize data
– Using the HAVING clause to select which groups to show in a result

4 SQL
The language of relational databases
– Data definition/schema creation
– Implements relational algebra operations
– Data manipulation
    Insertion
    Manipulation
    Updates
    Removals
– A standard (ISO) since 1987

5 Tables

Runs
RunNum  Date     Matrix
1       7/21/07  Pam70
2       7/20/07  Blosum80

Sequences
Accession  Description                  Species
P14555     Group IIA Phospholipase A2   Human
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake

Matches
Accession  RunNum  eValue
P14555     1       4.18 E-32
P81479     2       2.68 E-52
P14555     2       3.47 E-33
P81479     1       1.20 E-54
P00623     2       1.21 E-08

6 Desired Result

Accession  Description                  Species                          Matrix    eValue     Date
P14555     Group IIA Phospholipase A2   Human                            Pam70     4.18 E-32  7/21/07
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper          Blosum80  2.68 E-52  7/20/07
P14555     Group IIA Phospholipase A2   Human                            Blosum80  3.47 E-33  7/20/07
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper          Pam70     1.20 E-54  7/21/07
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake  Blosum80  1.21 E-08  7/20/07

7 SQL: Joining Tables
Tables can be joined together based on common attributes:

    SELECT Matches.Accession, Description, Species, Matrix, eValue, Date
    FROM Matches
    INNER JOIN Runs ON Matches.RunNum=Runs.RunNum
    INNER JOIN Sequences ON Sequences.Accession=Matches.Accession
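The three-table join above can be tried out with a small script. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database. The column types and the ORDER BY clause are assumptions added for illustration and are not part of the slides.

```python
import sqlite3

# In-memory database holding the Runs, Sequences, and Matches rows from the slides.
# Column types are assumptions; the slides do not show the schema definition.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Runs (RunNum INTEGER PRIMARY KEY, Date TEXT, Matrix TEXT);
CREATE TABLE Sequences (Accession TEXT PRIMARY KEY, Description TEXT, Species TEXT);
CREATE TABLE Matches (Accession TEXT, RunNum INTEGER, eValue REAL);
INSERT INTO Runs VALUES (1, '7/21/07', 'Pam70'), (2, '7/20/07', 'Blosum80');
INSERT INTO Sequences VALUES
  ('P14555', 'Group IIA Phospholipase A2', 'Human'),
  ('P81479', 'Phospholipase A2 isozyme IV', 'Indian Green Tree Viper'),
  ('P00623', 'Phospholipase A2', 'Eastern Diamondback Rattlesnake');
INSERT INTO Matches VALUES
  ('P14555', 1, 4.18e-32), ('P81479', 2, 2.68e-52), ('P14555', 2, 3.47e-33),
  ('P81479', 1, 1.20e-54), ('P00623', 2, 1.21e-08);
""")

# Same join as on the slide; ORDER BY is added only to make the row order predictable.
join_sql = """
SELECT Matches.Accession, Description, Species, Matrix, eValue, Date
FROM Matches
INNER JOIN Runs ON Matches.RunNum = Runs.RunNum
INNER JOIN Sequences ON Sequences.Accession = Matches.Accession
ORDER BY Matches.Accession, Matches.RunNum
"""
join_rows = conn.execute(join_sql).fetchall()
for row in join_rows:
    print(row)
```

Each row of Matches appears once in the result, extended with the columns of the run and the sequence it refers to.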

8 SQL: JOIN Clause
Used to merge two tables together
Basic types of joins:
– INNER: returns tuples where the value of the joined attribute exists in both tables
– OUTER: does not require the value of the joined attribute to exist in both tables
    LEFT OUTER: returns all tuples from the table listed first, even if there is no match in the second table
    RIGHT OUTER: returns all tuples from the table listed second, even if there is no match in the first table
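The difference between INNER and LEFT OUTER joins shows up only when a row has no partner in the other table. The sketch below, again assuming Python's sqlite3 module, adds a hypothetical sequence X00001 that has no matches. That row is invented for illustration and is not in the slides. RIGHT OUTER JOIN is left out because older SQLite versions do not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequences (Accession TEXT PRIMARY KEY, Description TEXT);
CREATE TABLE Matches (Accession TEXT, RunNum INTEGER);
INSERT INTO Sequences VALUES
  ('P14555', 'Group IIA Phospholipase A2'),
  ('P81479', 'Phospholipase A2 isozyme IV'),
  ('P00623', 'Phospholipase A2'),
  ('X00001', 'Hypothetical unmatched sequence');  -- invented row, not in the slides
INSERT INTO Matches VALUES
  ('P14555', 1), ('P81479', 2), ('P14555', 2), ('P81479', 1), ('P00623', 2);
""")

# INNER JOIN: only sequences that appear in Matches are returned.
inner_rows = conn.execute("""
SELECT Sequences.Accession, Matches.RunNum
FROM Sequences
INNER JOIN Matches ON Sequences.Accession = Matches.Accession
ORDER BY Sequences.Accession, Matches.RunNum
""").fetchall()

# LEFT OUTER JOIN: every sequence is returned; unmatched ones get NULL (None) for RunNum.
left_rows = conn.execute("""
SELECT Sequences.Accession, Matches.RunNum
FROM Sequences
LEFT OUTER JOIN Matches ON Sequences.Accession = Matches.Accession
ORDER BY Sequences.Accession, Matches.RunNum
""").fetchall()

print(len(inner_rows), "rows from INNER JOIN")
print(len(left_rows), "rows from LEFT OUTER JOIN")
print(left_rows[-1])
```

The unmatched sequence is missing from the inner join but appears in the left outer join with an empty RunNum.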

9 SQL: MIN aggregate function
Used to collapse attribute values, reporting the minimum value

Matches
Accession  RunNum  eValue
P14555     1       4.18 E-32
P81479     2       2.68 E-52
P14555     2       3.47 E-33
P81479     1       1.20 E-54
P00623     2       1.21 E-08

Select
    SELECT MIN(eValue) FROM Matches

Result
MIN(eValue)
1.20 E-54

10 SQL: COUNT aggregate function
Used to collapse attribute values, reporting the number of (non-NULL) values

Matches
Accession  RunNum  eValue
P14555     1       4.18 E-32
P81479     2       2.68 E-52
P14555     2       3.47 E-33
P81479     1       1.20 E-54
P00623     2       1.21 E-08

Select
    SELECT COUNT(eValue) FROM Matches

Result
COUNT(eValue)
5

11 SQL: Aggregate Functions
Used to collapse attribute values:
– COUNT(attribute)
– MIN(attribute)
– MAX(attribute)
– AVG(attribute)
– FIRST(attribute)
– LAST(attribute)
– SUM(attribute)
(FIRST and LAST are not available in every SQL dialect.)
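Several of the aggregate functions listed above can be computed in a single query. This is a minimal sketch, assuming Python's sqlite3 module and the Matches rows from the earlier slides. FIRST and LAST are omitted because SQLite does not provide them, and the Python variable names are chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Matches (Accession TEXT, RunNum INTEGER, eValue REAL)")
conn.executemany(
    "INSERT INTO Matches VALUES (?, ?, ?)",
    [
        ("P14555", 1, 4.18e-32),
        ("P81479", 2, 2.68e-52),
        ("P14555", 2, 3.47e-33),
        ("P81479", 1, 1.20e-54),
        ("P00623", 2, 1.21e-08),
    ],
)

# Each aggregate collapses the whole eValue column into one value,
# so the query returns exactly one row.
count, lowest, highest, mean, total = conn.execute(
    "SELECT COUNT(eValue), MIN(eValue), MAX(eValue), AVG(eValue), SUM(eValue) FROM Matches"
).fetchone()

print("count:", count)
print("min:  ", lowest)
print("max:  ", highest)
print("avg:  ", mean)
print("sum:  ", total)
```

For e-values, MIN is usually the most informative of these, since a smaller e-value indicates a more significant match.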

12 Analyzing Bioinformatics Data
How many times was a sequence found from the database searches?

    SELECT Accession, COUNT(Accession)
    FROM Matches
    GROUP BY Accession

Results
Accession  COUNT(Accession)
P00623     1
P14555     2
P81479     2

13 Analyzing Bioinformatics Data
How many times was a sequence found from the database searches? Report accessions, sequence descriptions, and number of times found.

    SELECT Matches.Accession, Sequences.Description, COUNT(Matches.Accession)
    FROM Matches
    INNER JOIN Sequences ON Sequences.Accession=Matches.Accession
    GROUP BY Matches.Accession

Results
Accession  Description                  COUNT(Accession)
P00623     Phospholipase A2             1
P14555     Group IIA Phospholipase A2   2
P81479     Phospholipase A2 isozyme IV  2
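The grouped count can be reproduced with a short script. This is a minimal sketch, assuming Python's sqlite3 module. Two details differ from the slide and are assumptions made for portability: Sequences.Description is added to the GROUP BY clause, because some database systems reject a selected column that is neither grouped nor aggregated, and an ORDER BY clause fixes the row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequences (Accession TEXT PRIMARY KEY, Description TEXT, Species TEXT);
CREATE TABLE Matches (Accession TEXT, RunNum INTEGER, eValue REAL);
INSERT INTO Sequences VALUES
  ('P14555', 'Group IIA Phospholipase A2', 'Human'),
  ('P81479', 'Phospholipase A2 isozyme IV', 'Indian Green Tree Viper'),
  ('P00623', 'Phospholipase A2', 'Eastern Diamondback Rattlesnake');
INSERT INTO Matches VALUES
  ('P14555', 1, 4.18e-32), ('P81479', 2, 2.68e-52), ('P14555', 2, 3.47e-33),
  ('P81479', 1, 1.20e-54), ('P00623', 2, 1.21e-08);
""")

# One output row per accession; COUNT reports how many Matches rows fell into each group.
counts = conn.execute("""
SELECT Matches.Accession, Sequences.Description, COUNT(Matches.Accession) AS total
FROM Matches
INNER JOIN Sequences ON Sequences.Accession = Matches.Accession
GROUP BY Matches.Accession, Sequences.Description
ORDER BY Matches.Accession
""").fetchall()

for accession, description, total in counts:
    print(accession, description, total)
```

Because each accession has exactly one description, grouping by both columns produces the same groups as grouping by accession alone.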

14 Analyzing Bioinformatics Data
What sequences were found in only one database search?

    SELECT Accession, COUNT(Accession) AS total
    FROM Matches
    GROUP BY Accession
    HAVING total=1

Results
Accession  total
P00623     1

15 Analyzing Bioinformatics Data
What sequences were found in only one database search? Report accessions, sequence descriptions, and number of times found.

    SELECT Matches.Accession, Sequences.Description, COUNT(Matches.Accession) AS total
    FROM Matches
    INNER JOIN Sequences ON Sequences.Accession=Matches.Accession
    GROUP BY Matches.Accession
    HAVING total=1

Results
Accession  Description       total
P00623     Phospholipase A2  1

16 Analyzing Bioinformatics Data
What sequences were found in only one database search? Report accessions, sequence descriptions, and matrix.

    SELECT Matches.Accession, Sequences.Description, Runs.Matrix
    FROM Matches
    INNER JOIN Sequences ON Sequences.Accession=Matches.Accession
    JOIN Runs ON Matches.RunNum=Runs.RunNum
    GROUP BY Matches.Accession
    HAVING COUNT(Matches.Accession)=1

Results
Accession  Description       Matrix
P00623     Phospholipase A2  Blosum80
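The HAVING filter over all three tables can be checked the same way. This is a minimal sketch, assuming Python's sqlite3 module. As a portability assumption that is not in the slides, Runs.Matrix is wrapped in MIN() and Sequences.Description is added to GROUP BY. Since HAVING keeps only groups with a single row, MIN() returns that row's matrix unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Runs (RunNum INTEGER PRIMARY KEY, Date TEXT, Matrix TEXT);
CREATE TABLE Sequences (Accession TEXT PRIMARY KEY, Description TEXT, Species TEXT);
CREATE TABLE Matches (Accession TEXT, RunNum INTEGER, eValue REAL);
INSERT INTO Runs VALUES (1, '7/21/07', 'Pam70'), (2, '7/20/07', 'Blosum80');
INSERT INTO Sequences VALUES
  ('P14555', 'Group IIA Phospholipase A2', 'Human'),
  ('P81479', 'Phospholipase A2 isozyme IV', 'Indian Green Tree Viper'),
  ('P00623', 'Phospholipase A2', 'Eastern Diamondback Rattlesnake');
INSERT INTO Matches VALUES
  ('P14555', 1, 4.18e-32), ('P81479', 2, 2.68e-52), ('P14555', 2, 3.47e-33),
  ('P81479', 1, 1.20e-54), ('P00623', 2, 1.21e-08);
""")

# GROUP BY forms one group per accession; HAVING then discards every group
# whose row count is not exactly 1, leaving sequences found in a single search.
found_once = conn.execute("""
SELECT Matches.Accession, Sequences.Description, MIN(Runs.Matrix) AS Matrix
FROM Matches
INNER JOIN Sequences ON Sequences.Accession = Matches.Accession
INNER JOIN Runs ON Matches.RunNum = Runs.RunNum
GROUP BY Matches.Accession, Sequences.Description
HAVING COUNT(Matches.Accession) = 1
""").fetchall()

print(found_once)
```

WHERE cannot express this condition, because it filters individual rows before grouping, whereas HAVING filters groups after the counts are known.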

17 Key Concepts
– The JOIN clause can be used to combine two or more tables into a single result table
– Aggregate functions can be used to collapse attribute values, including values drawn from joined tables, and conduct data analysis
– The GROUP BY clause can be used to form groups of rows and compute attributes for those groups
– The HAVING clause can be used to select which groups to show in the result

