Presented By – Yogesh A. Vaidya

Introduction: What are Structured Web Community Portals?
Advantages of SWCPs: powerful capabilities for searching, querying, and monitoring community information
Disadvantages of current approaches: no standardization of process; domain-specific development
Current approaches in the research community
The Top-Down, Compositional, and Incremental Approach
The CIMPLE Workbench toolset for development
DBLife case study

What are Community Portals? These are portals that collect and integrate relevant data obtained from various sources on the web. These portals enable members to discover, search, query, and track interesting community activities.

What are Structured Web Community Portals? They extract and integrate information from raw Web pages to present a unified view of the entities and relationships in the community. These portals can provide users with powerful capabilities for searching, querying, aggregating, browsing, and monitoring community information. They can be valuable for communities in a wide variety of domains, ranging from scientific data management to government agencies, and are an important aspect of Community Information Management.

Example

Current Approaches toward SWCPs in the Research Community These approaches maximize coverage by looking at the entire Web to discover all possibly relevant Web sources. They then apply sophisticated extraction and integration techniques to all data sources. They expand the portal by periodically re-running the source discovery step. These solutions achieve good coverage and incur relatively little human effort. Examples: Citeseer, Cora, Rexa, Deadliner

Disadvantages of Current Approaches These techniques are difficult to develop, understand, and debug; that is, they require builders to be well versed in complex machine learning techniques. Due to the monolithic nature of the employed techniques, it is difficult to optimize the run time and accuracy of these solutions.

Top-Down, Compositional, and Incremental Approach First, select a small set of important community sources. Next, create plans that extract and integrate data from these sources to generate entities and relationships. These plans can act on different sources and use different extraction and integration operators, and are hence not monolithic. Executing these plans yields an initial structured portal. Then expand the portal by monitoring certain sources for mentions of new sources.

Step 1: Selecting Initial Data Sources The basis for selecting a small set of initial data sources is a phenomenon that applies to Web community data sources: 20% of the sources often cover 80% of the interesting community activities. Thus the sources used should be highly relevant to the community. Example: for the database research community, the portal builder can select the homepages of top conferences (SIGMOD, PODS, VLDB, ICDE) and of the most active researchers (e.g., PC members of top conferences, or those with many citations).

Selection of Sources contd… To assist the portal builder B in selecting sources, a tool called RankSource provides B with the most relevant data sources. B first collects as many community sources as possible (using methods such as focused crawling or querying a search engine). B then applies RankSource to rank these sources in decreasing order of relevance to the community. Finally, B examines the ranked list, starting from the top, and selects the truly relevant data sources.

RankSource principles The RankSource tool uses three relevance ranking strategies: PageRank only; PageRank + Virtual Links; and PageRank + Virtual Links + TF-IDF.

PageRank only This version exploits the intuition that community sources often link to highly relevant sources, and that sources linked to/by relevant sources are also highly relevant. The formula used for ranking is P(u) = (1 − d) + d · Σ_{i=1..n} P(v_i)/c(v_i), where v_1, …, v_n are the sources linking to u, c(v_i) is the number of outgoing links of v_i, and d is a damping factor. The PageRank-only version achieves limited accuracy because some highly relevant sources are often not linked to.
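
To make the ranking step concrete, here is a minimal sketch of the "PageRank only" strategy in Python, assuming the candidate sources are given as a simple adjacency dictionary. The damping factor, iteration count, and the toy sources in the example are illustrative assumptions, not values from the presentation.

```python
# Minimal PageRank-only sketch over candidate community sources.
# Assumes `links` maps each source to the list of sources it links to.
def pagerank(links, d=0.85, iterations=50):
    """Iterate P(u) = (1 - d) + d * sum(P(v) / c(v)) over the sources linking to u."""
    sources = set(links) | {v for outs in links.values() for v in outs}
    score = {s: 1.0 for s in sources}
    out_degree = {s: len(links.get(s, [])) or 1 for s in sources}  # c(v); avoid div by zero
    in_links = {s: [] for s in sources}
    for u, outs in links.items():               # reverse map: who links *to* each source
        for v in outs:
            in_links[v].append(u)
    for _ in range(iterations):
        score = {u: (1 - d) + d * sum(score[v] / out_degree[v] for v in in_links[u])
                 for u in sources}
    return sorted(sources, key=score.get, reverse=True)  # most relevant first

# Toy usage over hypothetical sources:
ranked = pagerank({
    "sigmod-home": ["researcher-a", "researcher-b"],
    "researcher-a": ["sigmod-home"],
    "researcher-b": ["sigmod-home", "researcher-a"],
})
```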

PageRank + Virtual Links This is based on the assumption that if a highly relevant source discusses an entity, then other sources that discuss the same entity may also be highly relevant. Here, virtual links are first created between sources that mention overlapping entities, and PageRank is then run on the augmented graph. In the reported results, however, accuracy actually decreases with PageRank + Virtual Links. The reason is that virtual links are also created to sources for which a particular entity is not the main focus.

PageRank + Virtual Links + TF-IDF Building on the previous case, we want to ensure that a virtual link is added only if both sources are relevant to the entity concerned. Hence the TF-IDF metric is used to measure the relevance of a source to an entity. TF (term frequency) is the number of times the entity occurs in a source divided by the total number of entity mentions (including duplicates) in that source. IDF (inverse document frequency) is the logarithm of the total number of sources divided by the number of sources in which the entity occurs.

Contd… The TF-IDF score is TF × IDF. Next, for each source, filter out all entities whose TF-IDF score falls below a threshold θ. After this filtering, apply PageRank + Virtual Links to the results. This approach gives better accuracy than both of the previously discussed techniques.
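
As a concrete illustration of the TF-IDF filtering step, here is a minimal sketch in Python. It assumes each source has already been reduced to the list of entity mentions found in it; the threshold value and the data layout are illustrative assumptions.

```python
import math

# Minimal sketch of the TF-IDF filter: per source, keep only entities whose
# TF-IDF score reaches the threshold theta.
def tfidf_filter(source_entities, theta):
    num_sources = len(source_entities)
    doc_freq = {}                                  # number of sources containing each entity
    for mentions in source_entities.values():
        for e in set(mentions):
            doc_freq[e] = doc_freq.get(e, 0) + 1
    filtered = {}
    for src, mentions in source_entities.items():
        total = len(mentions) or 1                 # all mentions, duplicates included
        kept = set()
        for e in set(mentions):
            tf = mentions.count(e) / total                    # term frequency
            idf = math.log(num_sources / doc_freq[e])         # inverse document frequency
            if tf * idf >= theta:
                kept.add(e)
        filtered[src] = kept
    return filtered    # virtual links + PageRank are then applied to this filtered view
```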

Step 2: Constructing the E-R Graph After selecting the sources, B first defines an E-R schema G that captures the entities and relationships of interest to the community. This schema consists of multiple types of entities (e.g., paper, person) and relationships (e.g., write-paper, co-author). B then crawls the sources daily (the actual frequency depends on the domain and the developer) to create a snapshot W of all the relevant Web pages. B then applies a plan P_day, which is composed of other plans, to extract entities and relations from this snapshot and create a daily E-R graph. Finally, B merges the daily graphs (using a plan P_global) to create a global E-R graph, over which services such as querying and aggregation are offered.

Create the Daily Plan P_day The daily plan P_day takes the daily snapshot W (of the Web pages to work on) and the E-R schema G as input and produces a daily E-R graph D as output. For each entity type e in G, B creates a plan P_e that discovers all entities of type e from W. Next, for each relation type r, B creates a plan P_r that discovers all instances of relation r connecting the entities discovered in the first step.
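
A minimal sketch of how P_day might be composed from per-entity and per-relation plans is shown below. The run(...) interface and the plan objects are hypothetical, introduced only for illustration; the presentation describes this composition informally.

```python
# Hypothetical composition of P_day from entity plans P_e and relation plans P_r.
def run_daily_plan(snapshot, entity_plans, relation_plans):
    """Run every entity plan, then every relation plan, over one snapshot W."""
    entities = {}
    for entity_type, plan in entity_plans.items():          # e.g. "person", "paper"
        entities[entity_type] = plan.run(snapshot)
    relations = {}
    for relation_type, plan in relation_plans.items():      # e.g. "write-paper"
        relations[relation_type] = plan.run(snapshot, entities)
    return entities, relations                              # together: the daily E-R graph D
```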

Workflow of P_day in the database domain:

Plans to Discover Entities Two plans, the Default Plan and the Source-Aware Plan, are used for discovering entities. The Default Plan (P_default) uses three operators: ExtractM, to find all mentions of type e in the Web pages in W; MatchM, to find matching mentions, that is, those referring to the same real-world entity, thus forming groups g_1, g_2, …, g_k; and CreateE, to create an entity for each group of mentions. The problem of extracting and matching mentions, as encapsulated by ExtractM and MatchM, is known to be difficult, and complex implementations are available.

Default Plan contd… But in our case, since we are looking at community-specific, highly relevant sources, a simple dictionary-based solution that matches a collection of entity names N against the pages in W to find mentions gives accurate results in most cases. The assumption behind this is that, within a community, entity names are often designed to be as distinct as possible.
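
A minimal sketch of this dictionary-based pipeline (ExtractM, a trivial MatchM, and CreateE) is given below, assuming the pages in W are plain strings keyed by URL and N is a collection of known entity names; this data layout is an illustrative assumption.

```python
import re

# Dictionary-based default plan sketch: match known names N against pages in W.
def default_plan(entity_names, pages):
    groups = {}                                   # one mention group per matched name
    for name in entity_names:                     # ExtractM: dictionary lookup
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        for url, text in pages.items():
            for match in pattern.finditer(text):
                # MatchM is trivial here: identical names fall into the same group.
                groups.setdefault(name, []).append((url, match.start()))
    # CreateE: one entity per group of mentions.
    return [{"name": n, "mentions": ms} for n, ms in groups.items()]
```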

Source-Aware Plan Simple dictionary-based extraction and matching may not be appropriate in all cases, especially where ambiguous sources are present. In DBLife, DBLP is an ambiguous source, since it contains information pertaining to other communities too. To match entities in DBLP sources, a source-aware plan is used, which is a stricter version of the default plan described earlier.

Source-Aware Plan contd… First, apply the simple default plan P_name (only the Extract and Match operators) to all unambiguous sources to get a result set R, which consists of groups of related mentions. Add to R the mentions extracted from DBLP to form a result set U. To match mentions in U, not just names are used, but also their context in the form of related persons: two mentions are matched only if they have similar names and share at least one related person.
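
The stricter matching rule can be sketched as follows, assuming each entity or mention carries a name and the set of related persons found in its context. The similar_names test is a placeholder for whatever name-similarity measure the builder chooses; both names are illustrative assumptions.

```python
# Source-aware matching sketch: require similar names AND a shared related person.
def similar_names(a, b):
    return a.strip().lower() == b.strip().lower()      # placeholder similarity test

def source_aware_match(entity, dblp_mention):
    names_agree = similar_names(entity["name"], dblp_mention["name"])
    shares_person = bool(set(entity["related_persons"]) &
                         set(dblp_mention["related_persons"]))
    return names_agree and shares_person               # match only if both conditions hold
```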

Plans for extracting and matching mentions:

Plans to Find Relations Plans for finding relations are generally domain-specific and vary from relation to relation. Hence we identify types of relations that commonly occur in community portals and create plan templates for each one. The first of these relation types is co-occurrence relations. For co-occurrence relations, we compute a CoStrength score between the entities concerned and register the relation if the score exceeds a certain threshold. CoStrength is calculated using the ComputeCoStrength operator, whose input is an entity pair (e, f) and whose output is a number that quantifies how often and how closely mentions of e and f co-occur. Examples of co-occurrence relations: write-paper(person, paper), co-author(person, person).
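
A minimal sketch of a ComputeCoStrength-style score is shown below, assuming mentions are given as (page, character-position) pairs. The inverse-distance weighting and the window size are illustrative choices, not a formula prescribed by the presentation.

```python
# CoStrength sketch: reward mention pairs of e and f that occur often and close together.
def co_strength(mentions_e, mentions_f, window=200):
    score = 0.0
    for page_e, pos_e in mentions_e:
        for page_f, pos_f in mentions_f:
            if page_e == page_f and abs(pos_e - pos_f) <= window:
                score += 1.0 / (1 + abs(pos_e - pos_f))   # closer pairs count more
    return score

# Register write-paper(person, paper) only when the score clears a chosen threshold:
# if co_strength(person_mentions, paper_mentions) > threshold: add the relation
```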

Label Relations The second type of relation we focus on is label relations. Example: served(person, committee) is a label relation. We use the ExtractLabel operator to find an instance of a label immediately adjacent to a mention of an entity (e.g., a person). B can thus use this operator to build plans that find label relations.

Plans for finding relations:

Neighborhood Relations The third type of relation we focus on is neighborhood relations. Example: talk(person, organization) is a neighborhood relation. These relations are similar to label relations, except that a window of words around the mention is considered instead of only the immediately adjacent text.
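
A minimal sketch covering both label and neighborhood relations is given below, assuming a page is tokenized into words and entity mentions are token offsets (an illustrative representation). With window=1 it mimics the ExtractLabel behavior of looking at the immediately adjacent text; a larger window gives the neighborhood variant.

```python
# Windowed relation sketch: pair a mention with target labels found within `window` tokens.
def find_windowed_relations(tokens, mention_positions, target_labels, window=1):
    relations = []
    for pos in mention_positions:
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for i in range(lo, hi):
            if i != pos and tokens[i] in target_labels:
                relations.append((pos, tokens[i]))
    return relations

# e.g. served(person, committee): target_labels = {"chair", "PC", "committee"}
```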

Decomposition of the Daily Plan For a real-world community portal, the E-R schema is very large, so it would be overwhelming to cover all entities and relations in a single plan. Decomposition allows tasks to be split across people and also lets the schema evolve smoothly. Decomposition is followed by a merge phase in which the E-R fragments produced by the individual plans are merged into a complete E-R graph. The MatchE and EnrichE operators are used to find matching entities across graph fragments and merge them into a single daily E-R graph.
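
The merge phase can be sketched as below, assuming each fragment is a list of entity dicts carrying a name and a mention list. Matching purely on the normalized name stands in for MatchE, and pooling the mentions stands in for EnrichE; both are illustrative simplifications, not the actual operator implementations.

```python
# Merge E-R fragments: MatchE pairs up entities, EnrichE pools their evidence.
def merge_fragments(fragments):
    merged = {}
    for fragment in fragments:
        for entity in fragment:
            key = entity["name"].strip().lower()         # MatchE: same normalized name
            if key not in merged:
                merged[key] = {"name": entity["name"],
                               "mentions": list(entity["mentions"])}
            else:                                         # EnrichE: combine the mentions
                merged[key]["mentions"].extend(entity["mentions"])
    return list(merged.values())                          # one entity per matched group
```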

Decomposition and Merging of Plans:

Creating the Global Plan Generating the global plan is similar to constructing a daily E-R graph from individual E-R fragments. Here too, the MatchE and EnrichE operators are used.

Step 3: Maintaining and Expanding The top-down, compositional, and incremental approach makes maintaining the portal relatively easy. The main assumptions are that the relevance of data sources changes very slowly over time and that new data sources are mentioned within the community at certain known sources. Expanding the portal means adding new relevant data sources; the approach used for finding them is to look only at certain locations (e.g., conference announcements on DBWorld).
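
A minimal sketch of this expansion step: scan a monitored page (for example, a DBWorld-style announcement list) for URLs that are not yet in the portal's source set. The regex-based URL extraction and the plain-text input are illustrative assumptions.

```python
import re

# Expansion sketch: find candidate new sources mentioned in monitored announcements.
def discover_new_sources(announcement_text, known_sources):
    urls = set(re.findall(r"https?://[^\s\"'>]+", announcement_text))
    return [u for u in urls if u not in known_sources]   # candidates for B to review
```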

CIMPLE Workbench The workbench consists of an initial portal “shell” (with many built-in administrative controls), a methodology for populating the shell, a set of operator implementations, and a set of plan optimizers. The workbench facilitates quick and easy development, since a portal builder can make use of these built-in facilities. It provides implementations of the operators and also allows builders to plug in their own implementations.

Case Study: DBLife DBLife is a structured community portal for the database community. It was developed as a proof of concept for building community portals using the top-down approach and the CIMPLE workbench. The portal demonstrates that an SWCP can be developed to achieve high accuracy with the current set of relatively simple operators.

DBLife lifecycle:

Conclusion & Future Work Experience with developing DBLife suggests that this approach can effectively exploit common characteristics of Web communities to build portals quickly and accurately using simple extraction and integration operators, and to evolve them over time efficiently with little human effort. Still, many research problems remain to be studied: How can a better compositional framework be developed? How can such a framework be made as declarative as possible? What technologies (XML, relational, etc.) should be used to store portal data?