Source Page Understanding for Heterogeneous Molecular Biological Data

Presentation transcript:

Source Page Understanding for Heterogeneous Molecular Biological Data
Cui Tao
Supported by NSF

Introduction
- Online biological data:
  - Highly diverse in both granularity and variety
  - Available in various formats
  - Uses different terminologies, ID systems, units, …
- Automatically understanding heterogeneous source pages is a challenge
- Extraction-ontology-based source page understanding

Extraction Ontology (Partial)


Extraction Ontology Construction
- Knowledge sources:
  - Gene Ontology: thousands of terms
  - All Species Toolkit: total of 1,231,935 names
  - Protein databases: thousands of protein names
- Regular expressions, keywords (Molecular Function, Biological Process, Cellular Component)

Source Page Understanding


Source Page Understanding
- Three steps:
  1. Recognize attributes and values
  2. Find attribute-value pairs
  3. Map attribute-value pairs to target concepts
- Two techniques:
  - Sibling page comparison
  - Seed ontology recognition
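The first two steps can be sketched with sibling page comparison. This is a minimal illustration, not the author's implementation: it assumes that text appearing in every sibling page (pages generated by the same site template) is an attribute label, and that text which varies between pages is a value. The function names and the sample pages are hypothetical.

```python
from collections import Counter


def find_attributes(sibling_pages):
    """Return the cell texts that occur in every sibling page.

    Assumption: constant text across sibling pages is template text,
    i.e. an attribute label; anything else is treated as a value.
    """
    occurrences = Counter()
    for cells in sibling_pages:
        occurrences.update(set(cells))  # count each text once per page
    return {text for text, n in occurrences.items() if n == len(sibling_pages)}


def pair_attributes_with_values(cells, attributes):
    """Pair each attribute with the value cells that follow it in reading order."""
    pairs = {}
    current = None
    for cell in cells:
        if cell in attributes:
            current = cell
            pairs[current] = []
        elif current is not None:
            pairs[current].append(cell)
    return pairs


# Hypothetical sibling pages, already flattened into cell texts.
page_a = ["Organism", "Homo sapiens", "Chromosome", "8",
          "Function", "zinc ion binding"]
page_b = ["Organism", "Mus musculus", "Chromosome", "3",
          "Function", "nucleic acid binding"]

attributes = find_attributes([page_a, page_b])
pairs_a = pair_attributes_with_values(page_a, attributes)
```

A real page would need HTML parsing and tolerance for values that happen to repeat across pages; the sketch ignores both.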

Sibling Page Comparison


Seed Ontology Recognition
- What is a seed ontology?
- Why do we use a seed ontology?

Homo sapiens; human; zinc ion binding; nucleus; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; Homo sapiens; human; GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG; FLJ14299 msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : “37,?612,?680”; “37,?610,?585”;


Homo sapiens; human; nucleus; zinc ion binding; nucleic acid binding; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; zinc ion binding; nucleic acid binding; NP_079345; nucleus; linear; NP_079345; FLJ14299 GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG; msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : hypothetical protein FLJ14299; “37,?612,?680”; “37,?610,?585”;
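Seed values like those listed above can drive the third step, mapping attribute-value pairs to target concepts. The sketch below is an assumption about how such matching could work, not the method from the slides: each concept holds literal seed values plus regular expressions, and an attribute is mapped to the concept that most of its values match. The literals and patterns are copied from the seed data; the grouping under concept names such as `Chromosome Location` and the function names are hypothetical.

```python
import re

# Hypothetical seed ontology fragment. Concept names are illustrative;
# values and patterns come from the seed data listed above.
SEED = {
    "Organism": {"values": {"homo sapiens", "human"}, "patterns": []},
    "Molecular Function": {
        "values": {"zinc ion binding", "nucleic acid binding"},
        "patterns": [],
    },
    "Cellular Component": {"values": {"nucleus"}, "patterns": []},
    "Chromosome Location": {
        "values": set(),
        "patterns": [r"8:?p\s?12", r"8:?p11.2", r"8:?p11.23"],
    },
}


def concepts_for(value):
    """Return every seed concept whose literals or patterns match the value."""
    text = value.strip()
    hits = []
    for concept, seed in SEED.items():
        literal_hit = text.lower() in seed["values"]
        pattern_hit = any(re.fullmatch(p, text) for p in seed["patterns"])
        if literal_hit or pattern_hit:
            hits.append(concept)
    return hits


def map_attribute(values):
    """Map an attribute to the concept matched by most of its values, or None."""
    votes = {}
    for value in values:
        for concept in concepts_for(value):
            votes[concept] = votes.get(concept, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

For example, `map_attribute(["zinc ion binding", "some unseen term"])` yields `"Molecular Function"`, because one of the two values matches a seed literal and no other concept receives a vote.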

Evaluation
- “Training”:
  - Determine thresholds
  - Set up rules for recognizing attribute-value patterns
  - Determine rules for combining different pair-wise comparisons
  - Refine seed ontologies
- Test:
  - Structure recognition: test at the column/row level; measure precision/recall values
  - Mapping recognition: test at the concept level
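Precision and recall for the test phase can be computed as below. This is a generic sketch of the standard definitions, assuming the recognized items (columns, rows, or concept mappings) and the hand-labeled expected items are both available as sets; the function name and sample labels are hypothetical.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for predicted items against expected items.

    precision = correct / number predicted
    recall    = correct / number expected
    An empty denominator yields 0.0 rather than raising an error.
    """
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical column-level result: four columns recognized as attributes,
# three columns labeled as attributes by hand, two in common.
precision, recall = precision_recall(
    {"col1", "col2", "col3", "col4"},
    {"col1", "col2", "col5"},
)
```

Here two of the four predicted columns are correct, so precision is 2/4, and two of the three expected columns were found, so recall is 2/3.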

Contribution
- Will contribute to both information extraction technology and bioinformatics
- Can automatically understand both the structure and the semantics of source pages in the molecular biology domain