Query Processing in Data Integration + a (corny) ending


4/30
Project 3 due today
Demos (to the TA) as scheduled
FHW + presentation due 5/8
Agenda today:
  3:15-3:30: Soft Joins
  3:30-4:00: Query processing in data integration
  4:00-4:30: End review

May 8th, 2:40-4:30pm
Each student gives a 5 min presentation (19 x 5 = 95 min)
Also bring a hard copy of the review with you
15 min buffer + wrap-up
I’ll get refreshments; you keep us all awake.

What is the problem that the paper is addressing?
Why is the problem interesting?
What is the solution that the authors propose?
What is your criticism of the solution presented?
How is it related to what we learned in the class?

REVIEW
Course Outcomes
What did you think these were going to be??
After this course, you should be able to answer:
  How do search engines work, and why are some better than others?
  Can the web be seen as a collection of (semi)structured databases?
    If so, can we adapt database technology to the Web?
  Can useful patterns be mined from the pages/data of the web?

Main Topics
Approximately three halves plus a bit:
  Information retrieval
  Information integration/aggregation
  Information mining
  Other topics as permitted by time

Adapting old disciplines for the Web age
Information (text) retrieval
  Scale of the web
  Hypertext / link structure
  Authority/hub computations
Databases
  Multiple databases
    Heterogeneous, access limited, partially overlapping
  Network (un)reliability
Data mining [Machine Learning/Statistics/Databases]
  Learning patterns from large-scale data

Topics Covered
  Introduction
  Text retrieval; vector-space ranking
  Indexing/Retrieval issues
  Correlation analysis & Latent Semantic Indexing
  Search engine technology
  Social Networks
  Anatomy of Google etc
  Clustering
  Text Classification (m)
  Filtering/Personalization
  Web & Databases: Why do we even care?
  XML and handling semi-structured data
  Semantic web and its standards (RDF/RDF-S/OWL...)
  Information Extraction
  Data/Information Integration/aggregation
  Query Processing in Data Integration: Gathering and Using Source Statistics

Topics Covered (number of lectures in parentheses)
  Introduction (1)
  Text retrieval; vector-space ranking (3)
  Correlation analysis & Latent Semantic Indexing (2)
  Indexing; Crawling; Exploiting tags in web pages (2)
  Social Network Analysis (2)
  Link Analysis in Web Search (A/H; PageRank) (3+)
  Clustering (2)
  Text Classification (1)
  Filtering/Recommender Systems (2)
  Why do we even care about databases in the context of the web (1)
  XML and handling semi-structured data + Semantic Web standards (3)
  Information Extraction (2)
  Information/data Integration (2+)
  Discussion Classes: ~3+

Big Idea 1: Finding “Sweet Spots” in computer-mediated cooperative work
It is possible to get by with techniques blithely ignorant of semantics when you have humans in the loop.
  All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
  ...and the human very gratefully does the in-depth analysis on those few potential solutions
Examples:
  The incredible success of the “Bag of Words” model!
    Bag of letters would be a disaster ;-)
    Bag of sentences and/or NLP would be good
      ...but only to your discriminating and irascible searchers ;-)
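To make the "Bag of Words" sweet spot concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the course's own code: the function names (`bag_of_words`, `cosine`, `rank`) and the raw term-count weighting are hypothetical choices. The computer throws away word order, scores documents by cosine similarity, and leaves the in-depth judgment of the top few results to the human.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences, discarding word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted best-first for a human to inspect."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Note that "dog bites man" and "man bites dog" get identical vectors here, which is exactly the semantic ignorance the slide describes; the human in the loop is what makes that tolerable.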

Big Idea 2: Collaborative Computing, AKA Brain Cycle Stealing, AKA Computizing Eyeballs
A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks.
  It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers, if there ever was one ;-)
    Remember the mice in The Hitchhiker’s Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
  Collaborative knowledge compilation (Wikipedia!)
  Collaborative curation
  Collaborative tagging
  Paid collaboration/contracting
Many big open issues:
  How do you pose the problem such that it can be solved using collaborative computing?
  How do you “incentivize” people into letting you steal their brain cycles?

Big Idea 3: Tapping into the Collective Unconscious, AKA “Wisdom of the Crowds”
Another thread of exciting research is driven by the realization that the Web is not random at all!
  It is written by humans
  ...so analyzing its structure and content allows us to tap into the collective unconscious
  Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
Examples:
  Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
  Analyzing the link structure of the web graph to discover communities
    DoD and NSA are very much into this as a way of breaking terrorist cells
  Analyzing the transaction patterns of customers (collaborative filtering)
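As a rough illustration of how meaning can emerge from co-occurrence, the sketch below scores term pairs by pointwise mutual information (PMI) over document-level co-occurrence. This is an assumed, simplified stand-in and not the method of the paper discussed in class; the function name `cooccurrence_pmi` and the whitespace tokenization are hypothetical choices.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(documents):
    """Return {(term_a, term_b): PMI} using document-level co-occurrence.

    Keys are alphabetically ordered pairs; only pairs that co-occur
    in at least one document are included.
    """
    n_docs = len(documents)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    pmi = {}
    for (a, b), joint in pair_df.items():
        p_joint = joint / n_docs
        p_a = term_df[a] / n_docs
        p_b = term_df[b] / n_docs
        pmi[(a, b)] = math.log2(p_joint / (p_a * p_b))
    return pmi
```

A positive score means two terms appear together more often than chance would predict, a purely syntactic signal that often tracks semantic relatedness when the corpus is large and written in good faith.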

Big Idea 4: If you don’t take the Autonomous/Adversarial Nature of the Web into account, then it is gonna getcha..
Most “first-generation” ideas of the web make too generous an assumption about the “good intentions” of the source/page/email creators. The reasonableness of this assumption is increasingly going to be called into question as the Web evolves in an uncontrolled manner...
Controlling creation rights removes the very essence of the scalability of the web. Instead we have to factor in its adversarial nature:
  Links can be manipulated to change page importance
    So we need “trust rank”
  Fake annotations can be added to pages and images
    So we need ESP-game-like self-correcting annotations
  Fake/spam mails can be sent (and the nature of the spam mails can be altered to defeat simple spam classifiers...)
    So we need adversarial classification techniques
  Fake pages (in large numbers) can be created and put on the web (although, as of now, I don’t yet see the economic motive for this)
    So we cannot see the web as the collective unconscious, and co-occurrence may not imply semantic proximity.
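The claim that links can be manipulated to change page importance can be sketched with a small power-iteration PageRank. Everything here is an illustrative assumption: the toy graph, the page names, the damping factor of 0.85, and the ten-page link farm are made up, and the sketch shows only the attack, not a TrustRank-style defense.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nobody links to "spam".
honest_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "spam": ["a"],
}

# Hypothetical link farm: ten throwaway pages that all point at "spam".
farmed_web = dict(honest_web)
for i in range(10):
    farmed_web[f"farm{i}"] = ["spam"]

before = pagerank(honest_web)["spam"]
after = pagerank(farmed_web)["spam"]
print(f"spam rank before farm: {before:.4f}, after farm: {after:.4f}")
```

Creating the farm pages costs the spammer almost nothing, yet the score of "spam" rises even though no honest page changed its links, which is why purely link-based importance needs some notion of trust.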

Anatomy may be likened to a harvest-field. First come the reapers, who, entering upon untrodden ground, cut down great store of corn from all sides of them. These are the early anatomists of Europe. Then come the gleaners, who gather up ears enough from the bare ridges to make a few loaves of bread. Such were the anatomists of the last century. Last of all come the geese, who still contrive to pick up a few grains scattered here and there among the stubble, and waddle home in the evening, poor things, cackling with joy because of their success. Gentlemen, we are the geese.
--John Barclay, Scottish anatomist

Information Integration on the Web is still rife with uncut corn
Unlike the anatomy of Barclay’s day, the Web is still young. We are just figuring out how to tap its potential.
...You have great stores of uncut corn in front of you.
...go cut some of your own!