Interactive Review + a (corny) ending 12/05
–Project due today (with extension)
–Homework 4 due Friday
–Demos (to the TA) as scheduled


Course Outcomes (REVIEW)
After this course, you should be able to answer:
–How do search engines work, and why are some better than others?
–Can the web be seen as a collection of (semi)structured databases? If so, can we adapt database technology to the Web?
–Can useful patterns be mined from the pages/data of the web?
What did you think these were going to be??

Main Topics (REVIEW)
Approximately three halves plus a bit:
–Information retrieval
–Information integration/Aggregation
–Information mining
–other topics as permitted by time

Adapting old disciplines for Web-age (REVIEW)
Information (text) retrieval
–Scale of the web
–Hypertext/Link structure
–Authority/hub computations
Databases
–Multiple databases
  –Heterogeneous, access limited, partially overlapping
–Network (un)reliability
Datamining [Machine Learning/Statistics/Databases]
–Learning patterns from large scale data
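
The "authority/hub computations" bullet can be made concrete with a small sketch. The code below is one plausible rendering of the mutual-reinforcement idea (a page is a good authority if good hubs point to it, and a good hub if it points to good authorities), run as a fixed number of power iterations. The function name `hits`, the iteration count, and the toy link graph are assumptions made for illustration, not taken from the course material.

```python
import math

def hits(links, iterations=50):
    """Hub/authority scores for a link graph given as {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy graph: pages "a" and "b" both point to "c", and "c" points to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hub, auth = hits(toy_links)
print(sorted(auth.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy graph the page with two in-links ends up as the top authority, and the two pages pointing at it end up as equally good hubs. On a real web graph the same iteration would normally be run on a query-focused subgraph with sparse matrices rather than dictionaries.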

Topics Covered
1. Introduction (8/22)
2. Text retrieval; vector space ranking
3. Indexing/Retrieval issues
4. Correlation analysis & Latent Semantic Indexing
5. Search engine technology
6. Anatomy of Google etc
7. Clustering
8. Text Classification (m)
9. Filtering/Personalization
10. Web & Databases: Why do we even care?
11. XML and handling semi-structured data
12. Semantic web and its standards (RDF/RDF-S/OWL...)
13. Information Extraction
14. Data/Information Integration/aggregation
15. Query Processing in Data Integration: Gathering and Using Source Statistics
16. Bridging Information Retrieval and Databases
17. Social Networks

Finding “Sweet Spots” in computer-mediated cooperative work
It is possible to get by with techniques blithely ignorant of semantics when you have humans in the loop
–All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
–…and the human very gratefully does the in-depth analysis on those few potential solutions
Examples:
–The incredible success of the “Bag of Words” model!
  –Bag of letters would be a disaster ;-)
  –Bag of sentences and/or NLP would be good
  –..but only to your discriminating and irascible searchers ;-)
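
To illustrate why the bag-of-words model counts as "blithely ignorant of semantics", here is a minimal sketch that turns text into term counts, discards word order, and ranks documents against a query by cosine similarity. The helper names (`bag_of_words`, `cosine`, `rank`), the tokenizer, and the three toy documents are assumptions for illustration; no term weighting such as tf-idf is applied.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its word tokens; word order is thrown away."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return (score, document) pairs, best match first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical toy collection.
docs = [
    "search engines index the web",
    "cooking pasta at home",
    "web search ranks web pages",
]
for score, doc in rank("web search", docs):
    print(f"{score:.3f}  {doc}")

# The model cannot tell these two sentences apart.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

The last line is the point of the slide: the representation loses who bit whom, yet the ranked list is usually good enough for a human searcher to finish the job.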

Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs
A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
–It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there is ever one ;-))
  –Remember the mice in the Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
–Collaborative knowledge compilation (Wikipedia!)
–Collaborative curation
–Collaborative tagging
–Paid collaboration/contracting
Many big open issues
–How do you pose the problem such that it can be solved using collaborative computing?
–How do you “incentivize” people into letting you steal their brain cycles?
  –Pay them! (Amazon mturk.com)

Tapping into the Collective Unconscious
Another thread of exciting research is driven by the realization that the Web is not random at all!
–It is written by humans
–…so analyzing its structure and content allows us to tap into the collective unconscious..
  –Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
Examples:
–Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
–Analyzing the link structure of the web graph to discover communities
  –DoD and NSA are very much into this as a way of breaking terrorist cells
–Analyzing the transaction patterns of customers (collaborative filtering)
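
As a rough sketch of how meaning can emerge from co-occurrences, the code below counts how often pairs of terms appear in the same document and scores each pair with pointwise mutual information (PMI), one common association measure. This is not claimed to be the method of the paper discussed in class; the function name `cooccurrence_pmi` and the four-document toy corpus are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(docs):
    """PMI for every pair of terms that share at least one document."""
    n = len(docs)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))  # pairs come out in alphabetical order
    pmi = {}
    for (x, y), count in pair_df.items():
        p_xy = count / n
        p_x = term_df[x] / n
        p_y = term_df[y] / n
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi

# Hypothetical toy corpus in which "jaguar" is ambiguous.
toy_docs = [
    "jaguar car engine",
    "jaguar car dealer",
    "jaguar cat jungle",
    "cat jungle safari",
]
pmi = cooccurrence_pmi(toy_docs)
for pair in [("cat", "jungle"), ("car", "jaguar"), ("cat", "jaguar")]:
    print(pair, round(pmi[pair], 3))
```

Pairs that occur together more often than chance would predict get a positive score, and pairs that occur together less often get a negative one. At web scale the same counting is done over far larger corpora and usually within a fixed word window rather than whole documents.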

Adieu with an Oskar Schindler Routine..

Rao: I could've taught more...I could've taught more, if I'd just...I could've taught more...
T&U: Rao, there are twenty people who are mad at you because you taught too much. Look at them.
Rao: If I'd made more time...I wasted so much time, you have no idea. If I'd just...
T&U: There will be generations (of bitter people) because of what you did.
Rao: I didn't do enough.
T&U: You did so much.
Rao: This slide. We could’ve removed this slide. Why did I keep the slide? Two minutes, right there. Two minutes, two more minutes.. This music, a bit on p2p. This review. Two points on custom portals. I could easily have made two for it. At least one. I could’ve gotten one more point across. One more. One more point. A point, Sree. For this. I could've gotten one more point across and I didn't.

Schindler: I could've got more...I could've got more, if I'd just...I could've got more...
Stern: Oskar, there are eleven hundred people who are alive because of you. Look at them.
Schindler: If I'd made more money...I threw away so much money, you have no idea. If I'd just...
Stern: There will be generations because of what you did.
Schindler: I didn't do enough.
Stern: You did so much.
Schindler: This car. Goeth would've bought this car. Why did I keep the car? Ten people, right there. Ten people, ten more people...(He rips the swastika pin from his lapel) This pin, two people. This is gold. Two more people. He would've given me two for it. At least one. He would've given me one. One more. One more person. A person, Stern. For this. I could've gotten one more person and I didn't.

Top few things I would have done if I had more time
–Information extraction; Automated annotation
–Record/Ontology/Schema matching issues
–Customized portal generation
–P2P mediation
–Services, and service standards
–Security issues...
–Be less demanding more often (or even once…)

“It is not what you have covered, but rather what you have uncovered”

Mr. Andersen, may I be excused? My brain is full.

A Far Side treasury…
494 students

Okay, folks, Google can be improved with LSI. We need data integration, clustering, which Google doesn’t do much, we need db/IR integration..

Blah blah Google blah blah blah blah. Blah blah blah blah blah blah, blah blah blah Google blah blah blah blah blah blah blah blah blah..

Interactive Review (Format)
Each of you will get about 4min to hold forth on any of the following:
–topics covered in the course that particularly caught your fancy (and why)
–intriguing connections *between* the various topics covered in the course that struck you
–where you think this area ought to go
–what topics should have been covered more
–what topics (if any) got overplayed

Anatomy may be likened to a harvest-field. First come the reapers, who, entering upon untrodden ground, cut down great store of corn from all sides of them. These are the early anatomists of Europe. Then come the gleaners, who gather up ears enough from the bare ridges to make a few loaves of bread. Such were the anatomists of last century. Last of all come the geese, who still contrive to pick up a few grains scattered here and there among the stubble, and waddle home in the evening, poor things, cackling with joy because of their success. Gentlemen, we are the geese.
--John Barclay, Scottish anatomist

Information Integration on the Web is still rife with uncut corn
Unlike the anatomy of Barclay’s day, the Web is still young. We are just figuring out how to tap its potential.
…You have great stores of uncut corn in front of you.