Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University.


1 Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University

2 Data collection queries
Scientific protocol
– Must be able to reproduce the process
Involve multiple resources
– Data sources
– Applications

3 Expressing scientific protocols
Scientific protocols mix design and implementation
Design
– What the protocol does (tasks)
– Scientific objects involved
Implementation
– How the protocol is executed
– Data sources and applications

4 Expressing scientific protocols
Scientific protocols are driven by their implementation
– Scientists use the resources they know: data (quality), access to data, format, limits, etc.
– Scientists may not exploit better resources because they do not know them
Queries should be driven by the design; the implementation should meet the design needs

5 Example* - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
The alternative splicing pipeline will provide a complete characterization of variations in proteins due to splice variation or SNPs evident in repositories of contiguous genome sequence data and expressed sequence tags (ESTs). The pipeline applies secondary structure, tertiary structure, domain motif detection and sequence comparison tools to proteins encoded by genes with alternatively spliced forms or SNPs.
*Courtesy of Dr. Marta Janer, Institute for Systems Biology

6 Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
From GenBank, dbEST and the RIKEN Clone Collection, collect all EST and full-length cDNA sequences from the target organisms of interest (in this case, human and mouse) that match the query proteins (mouse DNA binding proteins) using tblastn. Map the query protein to the target DNA sequences, keeping track of which query amino acids correspond to which nucleotides.

7 Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
Same text as slide 6, with the callout "Data sources".

8 Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
Same text as slide 6, with the callout "tools".

9 Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
Same text as slide 6, with the callout "tasks".

10 Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
Same text as slide 6, with the callout "Scientific objects".
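Step 2 asks to keep track of which query amino acids correspond to which nucleotides. The slides do not say how this bookkeeping is done; the following is a minimal sketch under simplifying assumptions: the alignment is ungapped, coordinates are 1-based, and `hit_start` is the nucleotide position where the first aligned codon begins (on the reverse strand, the highest coordinate of that codon). The function name and parameters are hypothetical, not part of tblastn.

```python
def codon_coordinates(aa_index, hit_start, strand=1):
    """Return the (lowest, highest) 1-based nucleotide positions of the codon
    aligned to the query amino acid at 0-based position aa_index.

    Assumptions (not from the slides): ungapped alignment, one codon per
    amino acid, hit_start marks where the first aligned codon begins.
    """
    if aa_index < 0:
        raise ValueError("aa_index must be non-negative")
    if strand == 1:
        # Forward strand: each amino acid advances three nucleotides.
        first = hit_start + 3 * aa_index
        return (first, first + 2)
    if strand == -1:
        # Reverse strand: coordinates decrease as the protein advances.
        last = hit_start - 3 * aa_index
        return (last - 2, last)
    raise ValueError("strand must be 1 or -1")


# Third amino acid of a hit whose first codon starts at nucleotide 101.
print(codon_coordinates(2, 101))
```

A real tblastn alignment can contain gaps and frame changes, so a production mapping would have to walk the alignment column by column rather than use a fixed stride of three.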

11 Pipeline Selecting Target Proteins*
[Diagram labels: SMART, Swiss-Prot, BIND, DIP, CEY2H, sigpep, blast x D.mel]
Step 1 = retrieve all proteins from SMART and Swiss-Prot by textual search for the keyword “apoptosis”
Step 2 = retrieve all proteins from Swiss-Prot with a signal peptide feature and the keyword “apoptosis”
Step 3 = retrieve their binding partners from DIP, BIND and the C. elegans dataset
Step 4 = run through a signal peptide prediction program such as SigPep to check for the presence of signal peptides in each of the sequences
Step 5 = homology search, using BLAST, of the retrieved sequences against proteins predicted from the Drosophila melanogaster genome, which might yield additional candidates
Output = final set of signal peptide proteins involved in apoptosis
*Courtesy of Dr. Terry Gaasterland, The Rockefeller University

12 Design and implementation
Step | Task | Implementation
Input | Relevant keyword for which the proteins are required |
Step 1 | All proteins with the keyword and with a signal peptide feature must be retrieved | SMART, Swiss-Prot
Step 2 | Binding partners of all of these proteins are retrieved | DIP, BIND
Step 3 | The integrated final set is run through a signal peptide prediction program | SigPep
Step 4 | Homology search of the retrieved sequences against proteins predicted from the specific genome yields additional candidates | BLAST
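The separation between task (design) and resource (implementation) in the table above can be sketched as a small data structure. This is an illustrative sketch only, not the representation used by the author: the `Step` class, the `rebind` helper and the resource name `OtherPredictor` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    name: str          # step label from the table
    task: str          # design: what the step does
    resources: tuple   # implementation: which resources execute it


PIPELINE = [
    Step("Step 1", "Retrieve proteins with the keyword and a signal peptide feature",
         ("SMART", "Swiss-Prot")),
    Step("Step 2", "Retrieve binding partners of these proteins", ("DIP", "BIND")),
    Step("Step 3", "Run the integrated set through signal peptide prediction",
         ("SigPep",)),
    Step("Step 4", "Homology search against proteins predicted from a genome",
         ("BLAST",)),
]


def resources_used(pipeline):
    """List every resource the implementation touches, in first-use order."""
    seen = []
    for step in pipeline:
        for resource in step.resources:
            if resource not in seen:
                seen.append(resource)
    return seen


def rebind(pipeline, step_name, new_resources):
    """Return a new pipeline with one step's resources swapped; tasks are unchanged."""
    return [replace(s, resources=tuple(new_resources)) if s.name == step_name else s
            for s in pipeline]


print(resources_used(PIPELINE))
alternative = rebind(PIPELINE, "Step 3", ["OtherPredictor"])  # hypothetical resource
print(alternative[2].task == PIPELINE[2].task)
```

Keeping the task text fixed while the resources vary mirrors the point of the earlier slides: the design stays stable and the implementation is chosen to meet it.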

13 Expressing scientific pipelines with BioNavigation
Queries are expressed at a conceptual level (design)
[Figure: conceptual level showing the scientific classes DNA Seq., Disease, Gene, Citation, Protein Seq.]

14 Conceptual graph
Labeled edges
– Scientifically meaningful edges
[Figure: nodes Gene, Nucleotide Sequence, DNA, RNA, mRNA, Protein; edge labels isA, transcribesTo, isTranscribedFrom, isTranslatedFrom, translatesTo]

15 Conceptual graph
[Figure: same nodes and edge labels as slide 14, plus the edge label IsRelatedTo]
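A conceptual graph with labeled edges can be sketched as a list of (source, label, target) triples. The node and label names below come from the slides, but the transcript does not preserve which nodes each edge connects, so every triple here is an assumption made for illustration.

```python
# ASSUMED edges: the figure's actual connections are not recoverable
# from the transcript; only node and label names are from the slides.
EDGES = [
    ("DNA", "isA", "Nucleotide Sequence"),
    ("RNA", "isA", "Nucleotide Sequence"),
    ("mRNA", "isA", "RNA"),
    ("Gene", "transcribesTo", "mRNA"),
    ("mRNA", "isTranscribedFrom", "Gene"),
    ("mRNA", "translatesTo", "Protein"),
    ("Protein", "isTranslatedFrom", "mRNA"),
]


def targets(source, label):
    """Nodes reachable from source over one edge carrying the given label."""
    return [t for s, l, t in EDGES if s == source and l == label]


def labeled_paths(start, goal, max_len=4):
    """All cycle-free paths from start to goal, as lists of (label, node) hops."""
    paths = []

    def walk(node, path, visited):
        if node == goal and path:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return
        for s, label, t in EDGES:
            if s == node and t not in visited:
                path.append((label, t))
                walk(t, path, visited | {t})
                path.pop()

    walk(start, [], {start})
    return paths


print(targets("mRNA", "translatesTo"))
print(labeled_paths("Gene", "Protein"))
```

With labels on the edges, a query can distinguish a path that follows transcribesTo and translatesTo from one that merely follows a generic relationship between the same classes.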

16 Mapping to physical resources
[Figure: conceptual level (scientific classes: DNA Seq., Disease, Gene, Citation, Protein Seq.) mapped to physical level (data sources: OMIM, GenBank, PubMed, HUGO, NCBI Protein)]

17 Mapping to physical resources
[Same figure labels as slide 16]

18 Exploring biological metadata
“Return all citations that are related to some disease or condition”
Diabetes: 11   Aging: 71   Cancer: 391
[Diagram labels: OMIM, NUCLEOTIDE, PROTEIN, PUBMED, with paths marked (P1), (P2), (P3)]
Link: Entrez provides an index with the Links in the display option from each entry
Parse: parsing each entry to retrieve its related entries
All: Entrez provides an index with the Links in the display option, which allows looking at a set of entries at a time

19 Selecting biological resources
3 resources that look the same
– Are they the same?
3 paths that will retrieve PubMed entries related to citations
– Do they have the same semantics?

20 Results for the disease conditions diabetes, aging and cancer
Condition  Method  P1      P2      P3
Diabetes   Link    43,890  42,969  59,959
Diabetes   Parse   43,747  43,090  51,906
Diabetes   All     44,037  43,581  49,719
Aging      Link    48,393  51,712  60,129
Aging      Parse   48,398  51,855  61,260
Aging      All     48,393  51,474  60,938
Cancer     Link    56,315  54,487  62,686
Cancer     Parse   56,315  54,607  63,367
Cancer     All     56,532  52,488  60,033

21 Overlap results for the disease condition diabetes
Method  Path  P1      P2      P3
Link    P1    100%    25.82%  21.95%
Link    P2    25.28%  100%    70.00%
Link    P3    29.98%  97.68%  100%
Parse   P1    100%    23.93%  22.87%
Parse   P2    29.18%  100%    81.20%
Parse   P3    33.60%  97.81%  100%
All     P1    100%    24.75%  24.29%
All     P2    24.64%  100%    79.49%
All     P3    27.42%  90.68%  100%
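The slide does not state how the overlap percentages are computed. One plausible reading, offered here only as an assumption, is a directional overlap: the share of one path's result set that also appears in another path's result set. That definition is asymmetric, which would be consistent with the table not being symmetric. The identifiers below are toy values, not real PubMed entries.

```python
def overlap_percent(a, b):
    """Percentage of the entries in a that also occur in b.

    ASSUMED definition (|a ∩ b| / |a|); the slides do not give the formula.
    """
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return round(100.0 * len(a & b) / len(a), 2)


# Toy result sets standing in for the entries retrieved by two paths.
p1_results = {"pm1", "pm2", "pm3", "pm4"}
p2_results = {"pm3", "pm4", "pm5"}

print(overlap_percent(p1_results, p2_results))  # share of P1 found in P2
print(overlap_percent(p2_results, p1_results))  # share of P2 found in P1
```

Whatever the exact formula, a low off-diagonal value means two paths that target the same data source return largely different entries.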

22 Evaluating resources
Similar applications
– Different outputs
Similar data sources
– Different output
Number of resources
– Different output
Order of resources
– Different output

23 Exploiting semantics of resources
Number of entries
Characterization of entries (number of attributes)
Time

24 Exploiting the semantics of links

25 BioNavigation (joint work with Louiqa Raschid and Maria-Esther Vidal)
Conceptual graph
– No labeled links
Queries
– Regular expressions of concepts
ESearch
– Path cardinality: number of path instances in the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) where an entry e1 of S1 is linked to an entry e2 of S2.
– Target object cardinality: number of distinct objects retrieved from the final data source.
– Evaluation cost: cost of the evaluation plan, which involves both the local processing cost and remote network access delays.
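The two cardinality metrics defined above can be illustrated on toy link tables. This is a minimal sketch, not ESearch itself: each hop is represented as a set of (source entry, target entry) pairs, and all entry identifiers are made up for the example.

```python
def path_instances(hops):
    """Enumerate path instances through a chain of link tables.

    hops is a list of sets of (source_entry, target_entry) pairs,
    one set per link of the path, in order.
    """
    if not hops:
        return []
    paths = sorted(hops[0])
    for links in hops[1:]:
        # Extend each partial path with every link leaving its last entry.
        paths = [p + (t,) for p in paths for (s, t) in sorted(links) if s == p[-1]]
    return paths


def path_cardinality(hops):
    """Number of path instances in the result."""
    return len(path_instances(hops))


def target_object_cardinality(hops):
    """Number of distinct entries reached in the final data source."""
    return len({p[-1] for p in path_instances(hops)})


# Toy data: a length-1 path and a length-2 path ending in the same source.
omim_to_pubmed = {("omim1", "pm1"), ("omim1", "pm2"), ("omim2", "pm2")}
omim_to_nuc = {("omim1", "n1"), ("omim2", "n1"), ("omim2", "n2")}
nuc_to_pubmed = {("n1", "pm1"), ("n2", "pm1"), ("n2", "pm3")}

print(path_cardinality([omim_to_pubmed]), target_object_cardinality([omim_to_pubmed]))
print(path_cardinality([omim_to_nuc, nuc_to_pubmed]),
      target_object_cardinality([omim_to_nuc, nuc_to_pubmed]))
```

In the toy data the longer path has more path instances than the direct one yet reaches the same number of distinct targets, which is why the two metrics are reported separately.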

26 Work in progress
Conceptual graph
– Labeled links
Queries
– Complex dataflows
Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications

27 Representing the conceptual graph in Protégé

28 Visualization Limitations in Protégé
Using the GraphViz plugin
– Shows only IsA hierarchy

29 TgiViz plugin

30 Conclusion
Scientists need support to select resources to express their protocols
Semantics of resources may be exploited to enhance the data collection process
Need for a repository of biological metadata (BioMetaDatabase)

