Toward Making Online Biological Data Machine Understandable Cui Tao.

Toward Making Online Biological Data Machine Understandable Cui Tao Data Extraction Research Group Department of Computer Science, Brigham Young University,
Presentation transcript:

Toward Making Online Biological Data Machine Understandable Cui Tao

6/21/2015
Motivation
- Huge, evolving number of bio-databases
  - The molecular biology database collection
    - 2004: 548 total, 162 more than in 2003
    - 2005: 719 total, 171 more than in 2004
- Different access capabilities
  - From web-service-level interfaces to basic HTTP form interfaces
  - From simple lists and keyword queries to full-featured Boolean queries
  - Different query languages
- Syntactic heterogeneity
  - Flat files with or without format definitions
  - Relational databases
  - Structured/unstructured HTML files
- Semantic heterogeneity
  - Different identifiers
  - Different perspectives
  - Different terminologies
  - Different units
- Sometimes the information a user needs spans multiple sources
- Making online biological data machine understandable is important and challenging

Motivation
- To help biologists:
  - Perform background research
  - Gain insight into relationships and interactions among different research discoveries
  - Build up research strategies inspired by others’ hypotheses

System Overview
[Diagram. Process labels: Locate Sources, Obtain Pages, Understand Pages (Extract), Cache Pages, Enrich Ontology, Execute Query. Data labels: Source URLs, Located Sources, Retrieved Pages, Understood Pages, Semantic Web Pages, Indexes, Gene Extraction Ontology, Seed Ontologies.]

Research Issues
- Source page understanding
  - Attribute-value pair discovery?
  - Aligning with an ontology?
- Source location through semantic indexing
  - Metadata vs. instance data indexing?
  - Use of indexes in query processing?
- Ontology evolution
  - Adjustments to ISA and Part-Of hierarchies?
  - Addition of attributes?

Thesis Statement
Build a proof-of-concept prototype that resolves the research issues:
- Automatically understands the structure of source pages
- Automatically converts source pages into semantic web pages
- Semantically indexes biological resources
- Semi-automatically updates the ontology

Outline
- Extraction ontology
- Source page understanding
- Source location through semantic indexing
- Ontology enrichment

Extraction Ontology (Partial)

Extraction Ontology Construction
- Knowledge sources
  - Gene Ontology
    - Thousands of terms
  - All Species Toolkit
    - Total of 1,231,935 names
  - Protein databases
    - Thousands of protein names
- Regular expressions, keywords (Molecular Function, Biological Process, Cellular Component)

Source Page Understanding

Source Page Understanding
- Three steps:
  - Recognize attributes and values
  - Find attribute-value pairs
  - Map attribute-value pairs to target concepts
- Two techniques:
  - Sibling page comparison
  - Seed ontology recognition

Sibling Page Comparison

Sibling Page Comparison (continued)
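
The slides name sibling page comparison but do not spell out a procedure, and the figures did not survive the transcript. The following is a minimal sketch of one plausible reading, not the author's method: it assumes that text cells appearing on every sibling page are attribute labels and that cells differing between pages are values. The function names, the all-pages threshold, and the second sample page are hypothetical.

```python
from collections import Counter


def split_attributes_and_values(sibling_pages):
    """Classify text cells by how many sibling pages they occur in.

    sibling_pages: list of pages, each a list of text cells (strings).
    Assumption: a cell present on every page is an attribute label;
    every other cell is a value.
    """
    page_count = len(sibling_pages)
    occurrences = Counter()
    for cells in sibling_pages:
        occurrences.update(set(cells))  # count each cell once per page
    attributes = {cell for cell, n in occurrences.items() if n == page_count}
    values_per_page = [
        [cell for cell in cells if cell not in attributes]
        for cells in sibling_pages
    ]
    return attributes, values_per_page


def pair_attributes_with_values(cells, attributes):
    """Pair each attribute with the value cells that follow it in page order."""
    pairs = {}
    current = None
    for cell in cells:
        if cell in attributes:
            current = cell
            pairs[current] = []
        elif current is not None:
            pairs[current].append(cell)
    return pairs


# page_a reuses the marker example from the Seed Ontology Recognition slide;
# page_b is an invented sibling page.
page_a = ["Marker Name", "ABP1", "Forward Primer", "CTTATGCTGCGAGTGCAGTC"]
page_b = ["Marker Name", "XYZ9", "Forward Primer", "AAAACCCCGGGGTTTT"]

attributes, _ = split_attributes_and_values([page_a, page_b])
pairs = pair_attributes_with_values(page_a, attributes)
```

A real system would need a softer threshold than "every page", since some attributes are optional and some values repeat across pages.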

Seed Ontology Recognition
- What is a seed ontology?
  - A seed ontology contains as much information as we can collect for one object in a specified application domain with respect to the extraction ontology.
- Why do we use a seed ontology?

Seed Ontology Recognition
Marker Name: ABP1
Forward Primer: CTTATGCTGCGAGTGCAGTC
Reverse Primer: AGCAATGGAGAAGTTCCTACC

nucleus; zinc ion binding; nucleic acid binding; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; NP_079345; Homo sapiens; human; GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG; FLJ14299; msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : “37,?612,?680”; “37,?610,?585”;
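
To illustrate how seed values like those above might be used, here is a minimal sketch that matches values found on a source page against a small seed ontology fragment and reports the concept each one belongs to. The patterns are copied from the seed values on the slide; the concept names, the `SEED` structure, and the function name are illustrative assumptions, not the prototype's actual design.

```python
import re

# Fragment of a seed ontology: concept name -> regular expressions for the
# known values of one object. Concept names here are hypothetical labels.
SEED = {
    "Protein Name": [r"hypothetical protein FLJ14299"],
    "Accession Number": [r"NP_079345"],
    "Organism": [r"Homo sapiens", r"human"],
    "Chromosome Location": [r"8:?p\s?12", r"8:?p11.2", r"8:?p11.23"],
}


def recognize_concepts(values):
    """Map each page value to the first seed concept whose pattern matches it.

    Matching is case-insensitive and must cover the whole value, so that
    "8p11.23" is not mistaken for a match of the shorter "8:?p11.2" pattern.
    Values that match no seed pattern are left out of the result.
    """
    recognized = {}
    for value in values:
        text = value.strip()
        for concept, patterns in SEED.items():
            if any(re.fullmatch(p, text, re.IGNORECASE) for p in patterns):
                recognized[value] = concept
                break
    return recognized


page_values = [
    "Hypothetical protein FLJ14299",
    "NP_079345",
    "8p11.23",
    "unrelated text",
]
matches = recognize_concepts(page_values)
```

Once a value on the page is tied to a concept this way, the attribute label next to it on the page can be aligned with that concept for the other pages of the same source.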

Seed Ontology Recognition

Source Location through Semantic Indexing
- Motivation:
  - Hundreds of available biological repositories
  - Time-consuming to browse all of them
  - An index leads quickly to the sources needed for a query
- Solution: semantic indexing
  - Meta-data
  - Data

Source Location through Semantic Indexing - Meta-Data
ProtoNet: Source Organism; Accession Number; Protein Name; Length in Amino Acid; Molecular Weight in Da

Source Location through Semantic Indexing - Meta-Data
Query: Protein Name = “Hypothetical protein FLJ14299”; Length in Amino Acid = ?
Matching source: ProtoNet (Length in Amino Acid, Protein Name)
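
The meta-data lookup on this slide can be sketched as an inverted index from attribute names to the sources that expose them. This is a simplified illustration under stated assumptions: the function names are hypothetical, "OtherDB" is an invented second source added only so the lookup has something to exclude, and only a subset of the ProtoNet attributes from the slides is used.

```python
from collections import defaultdict


def build_metadata_index(source_attributes):
    """Invert source -> attributes into attribute -> set of sources."""
    index = defaultdict(set)
    for source, attributes in source_attributes.items():
        for attribute in attributes:
            index[attribute].add(source)
    return index


def locate_sources(index, given, wanted):
    """Return the sources that expose every given and every wanted attribute."""
    needed = list(given) + list(wanted)
    if not needed:
        return set()
    # index.get avoids creating empty entries for unknown attributes.
    candidate_sets = [index.get(attribute, set()) for attribute in needed]
    return set.intersection(*candidate_sets)


sources = {
    "ProtoNet": [
        "Accession Number",
        "Protein Name",
        "Length in Amino Acid",
        "Molecular Weight in Da",
    ],
    "OtherDB": ["Accession Number", "Protein Name"],  # hypothetical source
}
index = build_metadata_index(sources)

# Query from the slide: Protein Name is given, Length in Amino Acid is wanted.
result = locate_sources(index, ["Protein Name"], ["Length in Amino Acid"])
```

Here only ProtoNet is returned, because the invented second source has no length attribute. Instance-data indexing would additionally need to check that the given value itself occurs in the source.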

Source Location through Semantic Indexing - Data
[Diagram labels: Semantic Web, Semantic indexing, Query]

Ontology Enrichment
- Likely to have “imperfect” ontologies
  - Incomplete ISA and Part-Of hierarchies
  - Incomplete lexicons
  - Incomplete with respect to concepts
- Can enrich semi-automatically
- Two possibilities:
  - Data frame enrichment
  - Object set and relationship set enrichment

Ontology - Data Frame Enrichment

Ontology - Object Set and Relationship Set Enrichment
[Diagram labels: Source, Target]

Ontology - Object Set and Relationship Set Enrichment
[Diagram labels: Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da]
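
One simple form of semi-automatic enrichment is to propose source attributes that the target ontology lacks and let a person decide which to add. The sketch below shows that idea only; it is an assumption about how "simple attribute additions" might work, not the prototype's procedure. The function names, the starting ontology contents, and the reviewer callback are hypothetical.

```python
def propose_attribute_additions(source_attributes, ontology_attributes):
    """Return source attributes missing from the ontology, sorted for review.

    Comparison is case-insensitive; anything smarter (synonyms, units)
    is out of scope for this sketch.
    """
    known = {attribute.lower() for attribute in ontology_attributes}
    return sorted(a for a in source_attributes if a.lower() not in known)


def enrich(ontology_attributes, candidates, approve):
    """Add only the candidates the reviewer approves; return the new set."""
    accepted = {candidate for candidate in candidates if approve(candidate)}
    return set(ontology_attributes) | accepted


# Attributes taken from the ProtoNet labels on the slides.
source_attributes = [
    "Accession Number",
    "Protein Name",
    "Length in Amino Acid",
    "Molecular Weight in Da",
]
# Hypothetical current contents of the target ontology.
ontology_attributes = {"Accession Number", "Protein Name"}

candidates = propose_attribute_additions(source_attributes, ontology_attributes)

# Stand-in for a human reviewer who accepts only the length attribute.
enriched = enrich(
    ontology_attributes,
    candidates,
    approve=lambda attribute: attribute == "Length in Amino Acid",
)
```

Keeping the approval step as a separate callback reflects the "semi-automatic" framing: the system proposes, and a person confirms.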

Research Plan
- Build and test the system step by step
- Provide experimental evidence that issues have been resolved
  - Source page understanding
  - Source location from semantic indexing
  - Ontology enrichment

Research Plan - Source Page Understanding
- Training set
  - Choose thresholds
  - Set up rules
  - Combine results from different techniques
  - Refine the seed ontologies
- Test set
  - Detect attributes and values
  - Form attribute-value pairs
  - Recognize mappings from source attribute-value pairs to target concepts

Research Plan - Others
- Source location through semantic indexing
- Ontology enrichment
  - Data frame enrichment
  - Concept and relationship set enrichment

Delimitations
- Extraction ontology: will not cover all the concepts, relationships, and values in the molecular biology domain
- Source page understanding: only deals with structured/semi-structured source pages
- Data frame enrichment: will not do automatic regular expression enrichment
- Object set and relationship set enrichment: will be limited to enriching ISA and Part-Of hierarchies and simple attribute additions
- Prototype system: will use an available front-end query interface; will not do further integration beyond synchronization with the target gene extraction ontology

Contributions
- Will contribute to both information extraction technology and bioinformatics
- Can find appropriate sources, retrieve needed information, understand a source page, and extract useful information automatically
- Can convert understood source pages into semantic web pages automatically
- Can enrich ontologies semi-automatically
- Can likely be extended to other domains