Information Discovery on Vertical Domains Vagelis Hristidis Assistant Professor School of Computing and Information Sciences Florida International University.

Presentation transcript:

Information Discovery on Vertical Domains Vagelis Hristidis Assistant Professor School of Computing and Information Sciences Florida International University (FIU), Miami

Need for Information Discovery
The amount of available data keeps increasing
Needle-in-a-haystack problem
Some applications:
◦ Web
◦ Desktop search
◦ Data warehousing
◦ Bibliographic databases
◦ Home and car search, e.g., realtor.com, autotrader.com
◦ Scientific domains, e.g.,
  ▪ genes, proteins, publications in biology
  ▪ elements and interactions of components in chemistry
  ▪ patient hospitalizations, physician info, procedure outcomes in hospitals
Vagelis Hristidis - FIU - Information Discovery on Vertical Domains

Strengths and Limitations of Current Approaches
Web Search
+ Scalability
+ Handles free text
+ Exploits content and link structure to achieve ranking
+ Simple keyword queries
- Limited query expressive power
- Generic, domain-independent ranking algorithms
- Returns pages, not answers
Database Querying
+ Efficient
+ Handles structured data
+ Well-defined theory and answers
- Must learn a query language, e.g., SQL
- No automatic ranking of results
Keyword Search in Databases
+ Simple keyword queries
+ Exploits links (e.g., primary-foreign keys)
- Generic ranking, typically by size of result
- No domain semantics

Research Objective
Allow effective and efficient information discovery on vertical domains
Strategy:
◦ Exploit associations between entities
◦ Model domain semantics, e.g., the patient entity is critical for a medical practitioner but not for a biologist
◦ Model users of a domain
◦ Use knowledge of domain experts and existing knowledge structures (e.g., domain ontologies)
◦ Exploit user feedback
◦ Go beyond plain keyword search: explore the best search interface for each domain, e.g., faceted search

Specific Domains Studied (or being studied)
Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents

Products Marketplace
Project started while visiting Microsoft Research in Redmond, Summer 2003
SQL returns unordered sets of results
This overwhelms users of information discovery applications
How can ranking be introduced, given that ALL results satisfy the query?

Products Marketplace (cont’d)
Example: Realtor Database
House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too many results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable

Products Marketplace (cont’d)
Rank According to Unspecified Attributes [VLDB’04, TODS’06]
The score of a result tuple t depends on:
Global Score: global importance of unspecified attribute values
◦ E.g., newer houses are generally preferred
Conditional Score: correlations between specified and unspecified attribute values
◦ E.g., Waterfront ⇒ BoatDock; Many Bedrooms ⇒ Good School District

Products Marketplace (cont’d)
Key Problems
Given a query Q, how to combine the global and conditional scores into a ranking function?
◦ Use Probabilistic Information Retrieval (PIR).
How to calculate the global and conditional scores?
◦ Use the query workload and the data.
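
To make the two scores concrete, here is a minimal Python sketch. It is not the ranking function from the cited papers: the estimators below (workload frequency for the global score, co-occurrence in the data for the conditional score, add-one smoothing, a sum of logs to combine them) and all function names are assumptions chosen for illustration.

```python
import math

def global_score(attr, value, workload):
    """Assumed estimator: smoothed fraction of past queries asking for attr == value."""
    hits = sum(1 for q in workload if q.get(attr) == value)
    return (hits + 1) / (len(workload) + 2)

def conditional_score(attr, value, query, rows):
    """Assumed estimator: for each specified condition, the smoothed fraction of
    rows satisfying that condition that also have attr == value."""
    score = 1.0
    for q_attr, q_value in query.items():
        matching = [r for r in rows if r[q_attr] == q_value]
        both = sum(1 for r in matching if r[attr] == value)
        score *= (both + 1) / (len(matching) + 2)
    return score

def rank(rows, query, workload, skip=("id",)):
    """Return the rows satisfying the query, ordered by their unspecified attributes."""
    results = [r for r in rows if all(r[a] == v for a, v in query.items())]

    def score(row):
        total = 0.0
        for attr, value in row.items():
            if attr in query or attr in skip:
                continue  # only unspecified attributes contribute
            total += math.log(global_score(attr, value, workload))
            total += math.log(conditional_score(attr, value, query, rows))
        return total

    return sorted(results, key=score, reverse=True)
```

With a workload in which users often ask for BoatDock = TRUE, waterfront houses in Seattle that have a boat dock would be listed before those that do not, even though every returned house satisfies the query.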

Products Marketplace (cont’d)
Other Projects
Select the best attributes to output: the attribute ordering problem [SIGMOD’06]
◦ E.g., color is important for sports cars but much less so for family cars
Product advertising: select the best attributes to display for a product to maximize its visibility among its competitors [ICDE’08, TKDE’09]
◦ Use the past query workload
◦ Maximize the number of past queries for which the product is returned
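
The product advertising objective can be illustrated with a small brute-force sketch. The model here is an assumption, not taken from the slides: attributes are Boolean, a past query is the set of attributes it requires, and a product is returned for a query only if every required attribute is among the displayed ones. The function name is hypothetical.

```python
from itertools import combinations

def best_attributes(product_attrs, workload, m):
    """Try every way of displaying at most m of the product's attributes and keep
    the choice that satisfies the most past queries (exhaustive search, so only
    practical for small attribute sets)."""
    best, best_hits = frozenset(), -1
    size = min(m, len(product_attrs))
    for subset in combinations(sorted(product_attrs), size):
        shown = frozenset(subset)
        # Assumed retrieval model: a query matches if all its attributes are shown.
        hits = sum(1 for q in workload if q <= shown)
        if hits > best_hits:
            best, best_hits = shown, hits
    return best, best_hits
```

For example, with past queries {turbo}, {turbo, awd}, {sunroof}, {leather, sunroof}, and {awd}, displaying {awd, turbo} satisfies three queries, more than any other pair.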

Biological Databases [EDBT’09]
With University of Maryland
Intuitive but powerful query language, based on soft (ranking) and hard (pruning) filters
Goal: improve the search experience of PubMed users
Exploit associations between entities (genes, proteins, publications)
Example query: find the most important publications on “cancer” that are related to the “TNF” gene through a protein.

Results Navigation in PubMed with BioNav [ICDE’09, TKDE’10]
With SUNY Buffalo.
Most publications in PubMed are annotated with Medical Subject Headings (MeSH) terms.
Present results in the MeSH tree.
Propose a navigation model and smart expansion techniques that may skip tree levels.

BioNav: Exploring PubMed Results
Figure: static navigation tree for query “prothymosin”. Nodes shown, with citation counts: MESH (313); Amino Acids, Peptides, and Proteins (310); Proteins (307); Nucleoproteins (40); Biological Phenomena, … (217); Cell Physiology (161); Cell Growth Processes (99); Genetic Processes (193); Gene Expression (92); Transcription, Genetic (25); Histones (15). Collapsed branches are marked “N more nodes”.
Query keyword: prothymosin
Navigation tree stats: 3941 nodes, depth 10
Big tree with many duplicates!
Vagelis Hristidis, Searching and Exploring Biomedical Data

BioNav: Exploring PubMed Results
Reveal to the user a selected set of descendant concepts that:
(a) Collectively contain all results
(b) Minimize the expected user navigation cost
Unlike static navigation, not all children of the root are necessarily revealed.

BioNav Evaluation

XOntoRank: Use Ontologies to Search Electronic Medical Records [ICDE’09]
With Miami Children’s Hospital, Indiana University School of Medicine, and IBM Almaden.
Latest EMR format: HL7 CDA (XML-based)
Algorithm to enhance keyword search using ontological knowledge (e.g., SNOMED)

Sample CDA Fragment

XOntoRank: Example 1
q = {“bronchitis”, “albuterol”}
result = [figure]

XOntoRank: Example 2
q = {“asthma”, “albuterol”}
result = ???

XOntoRank
A CDA node may be associated with a query keyword w through the ontology.
XOntoRank first assigns scores to ontological concepts
◦ OntoScore OS(): semantic relevance of a concept c in the ontology to a query keyword w.
Then, given these scores, it assigns Node Scores NS() to document nodes
Other aggregation functions are possible.

Computing OntoScore of Concept Given Query Keyword
Three ways to view the ontology graph:
◦ As an unlabeled, undirected graph.
◦ As a taxonomy.
◦ As a complete set of relationships.
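
As a rough illustration of the first view (an unlabeled, undirected graph), the Python sketch below spreads an OntoScore outward from the concepts that match a keyword. The per-hop decay factor, the cutoff threshold, the choice of max as the node-score aggregation, and the function names are all assumptions for illustration; the slides do not give the actual formulas.

```python
from collections import deque

def onto_scores(edges, seeds, decay=0.5, threshold=0.1):
    """Breadth-first spread from the seed concepts (those matching the keyword).
    A concept's score is decay ** (hops to the nearest seed); expansion stops
    once the score would drop below the threshold."""
    adjacency = {}
    for a, b in edges:  # treat every ontology relationship as undirected
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    scores = {c: 1.0 for c in seeds}
    queue = deque(seeds)
    while queue:
        concept = queue.popleft()
        next_score = scores[concept] * decay
        if next_score < threshold:
            continue
        for neighbor in adjacency.get(concept, ()):
            if neighbor not in scores:  # BFS reaches each concept by a shortest path first
                scores[neighbor] = next_score
                queue.append(neighbor)
    return scores

def node_score(text_score, concept, scores):
    """Assumed aggregation: the better of the node's own text match and the
    OntoScore of the concept it references."""
    return max(text_score, scores.get(concept, 0.0))
```

On a toy ontology with edges asthma ↔ respiratory disorder ↔ bronchitis (a made-up fragment, not SNOMED), a document node that references bronchitis would receive a nonzero score for the keyword “asthma” even though the word never appears in its text.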

Authority Flow Ranking in EMRs
A subset of the electronic health record dataset.
Work under submission.
Query: “pericardial effusion”

ObjectRank on EMRs: Authority Flow Ranking
Schema of the EMR dataset

User Study

Explaining Subgraph

User Study Results
Charts: mean sensitivity and mean specificity
BM25: traditional information retrieval ranking function
CO: Clinical ObjectRank (authority flow)

Other Challenges of Searching EMRs [NSF Symposium on Next Generation of Data Mining ’07]
Entity and association semantics
Negative statements
Personalization
Treatment of time and location attributes
Free text embedded in CDA documents

Syntax vs. Semantics in Schema
Example: query “Asthma Theophylline”
More details in [Hristidis et al., NSF Symposium on Next Generation of Data Mining ’07]

Bibliographic Databases
Work started while at UCSD
Exploit the citation link structure to create query-specific ranking [VLDB’04, TODS’08]
A demo is available for the database literature
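
Query-specific ranking over the citation graph can be sketched as a random walk that restarts only at the papers matching the query (the base set). The Python sketch below is a simplified illustration under assumptions: a single edge type with uniform splitting of authority, a damping factor of 0.85, and a fixed iteration count. The function and parameter names are hypothetical.

```python
def authority_flow(edges, base_set, damping=0.85, iterations=50):
    """Power iteration in which random jumps land only on base-set nodes, so
    authority originates at query matches and flows along edges
    (citing paper -> cited paper)."""
    nodes = sorted({n for edge in edges for n in edge} | set(base_set))
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    jump = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * jump[n] for n in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank  # authority at nodes without out-links is simply dropped
    return rank
```

A paper that does not contain the query keywords can still rank highly if many keyword-matching papers cite it, which is the effect a query-specific ranking is meant to capture.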

Bibliographic Databases (cont’d)
Query Reformulation
Work with U of Maryland [ICDE’08]
Based on user-selected results
Perform query expansion: add query keywords or change their weights
Adjust authority flow weights
Currently working on applying these ideas to queries on PubMed.

Explaining Query Results: Explaining Subgraph
Target object: the “Modeling Multidimensional databases” paper.
Explaining subgraph creation:
1. BFS in reverse direction from the target object.
2. BFS in forward direction from the base set objects (authority sources).
3. The subgraph contains all nodes/edges traversed in the forward direction.
4. Compute the explaining authority flow along each edge by eliminating the authority leaving the subgraph (iterative procedure).
5. Structure-based reformulation: high-flow edges in the explaining subgraph receive a weight boost.
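
Steps 1 to 3 of the subgraph construction can be sketched in Python as follows. One detail is an assumption: the forward BFS is restricted to nodes found by the reverse BFS, so that only paths leading to the target are kept. Steps 4 and 5 (flow adjustment and weight boosting) are not covered, and the function names are hypothetical.

```python
from collections import deque

def _reachable(start, neighbors, allowed=None):
    """BFS from the start nodes, optionally restricted to an allowed node set."""
    seen = {s for s in start if allowed is None or s in allowed}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen and (allowed is None or nxt in allowed):
                seen.add(nxt)
                queue.append(nxt)
    return seen

def explaining_subgraph(edges, base_set, target):
    forward, reverse = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        reverse.setdefault(dst, []).append(src)
    # Step 1: reverse BFS finds every node that can send authority to the target.
    can_reach_target = _reachable([target], reverse)
    # Step 2: forward BFS from the base set, kept inside that node set (assumption).
    nodes = _reachable(base_set, forward, allowed=can_reach_target)
    # Step 3: keep the edges whose endpoints both lie on such paths.
    sub_edges = [(s, d) for s, d in edges if s in nodes and d in nodes]
    return nodes, sub_edges
```

The result contains only nodes that lie on some path from an authority source to the target object, which is the part of the graph that can explain the target's score.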

Search Patents
Special characteristics of patents:
Patents are organized into classes and subclasses.
Patents have links to external publications and to other patents.
Patents are organized into various sections (abstract, claims, description, and images).
Patents use specific legal wording in the claims section. Further, claims reference other claims, so the claims can be viewed as a graph.
Demo at PatentsSearcher.com

End - Thank You
For more information, please go to:
Supported by
◦ NSF CAREER
◦ NSF grant IIS : III-CXT-Small: Information Discovery on Domain Data Graphs
◦ DHS grant 2009-ST : Information Delivery and Knowledge Discovery for Hurricane Disaster Management

Extra Slides

CDA Document: Tree View