1/21/10 Viewing the Course in terms of IR, DB, Social Networks, and ML adapted to the web. Start of IR.



Yippee!

Course Overview (take 2)

Web as a collection of information
- The Web viewed as a large collection of __________: text, structured data, semi-structured data
  - (connected) (dynamically changing) (user-generated) content
  - (multimedia / updates / transactions etc. ignored for now)
- So what do we want to do with it? Search, directed browsing, aggregation, integration, pattern finding
- How do we do it? Depends on your model (text / structured / semi-structured)

Structure
- A generic web page containing text [queried in English]
- An employee record [queried in SQL]
- A movie review: semi-structured [queried in XML]
- How will search and querying on these three types of data differ?
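To make the three points on this spectrum concrete, here is a small sketch (Python literals only) of the same hypothetical movie review as free text, as a structured relational-style record, and as semi-structured XML; the titles, fields, and values are all invented for illustration:

    # 1. Free text: no schema at all; only keyword search is possible.
    text = "I watched The Matrix last night. Great action, a bit long. 4 stars."

    # 2. Structured: a fixed schema, so precise (SQL-style) questions can be asked.
    row = {"title": "The Matrix", "rating": 4, "comment": "Great action, a bit long."}

    # 3. Semi-structured: a partial, flexible schema; elements may nest or be missing.
    xml = """<review>
      <movie>The Matrix</movie>
      <rating scale="5">4</rating>
      <comment>Great action, a bit long.</comment>
    </review>"""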

Structure helps querying: expressive queries
- [Keywords] Give me all pages that have the keywords "Get Rich Quick"
- [SQL] Give me the social security numbers of all employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary
- [XML] Give me all mails from people at ASU written this year that are relevant to the keyword "get rich quick"
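As a concrete illustration of the second query, here is a minimal sketch using Python's built-in sqlite3 module; the employees table, its columns, and the toy rows are hypothetical, invented just for this example (and the standard deviation is computed in Python because SQLite has no built-in STDDEV aggregate):

    import sqlite3
    import statistics
    from datetime import date

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (ssn TEXT, hire_date TEXT, salary REAL)")
    # Toy rows, invented for illustration.
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
        ("111-22-3333", "2001-05-01", 52000.0),
        ("222-33-4444", "2008-02-15", 61000.0),
        ("333-44-5555", "1998-09-30", 175000.0),
    ])

    salaries = [row[0] for row in conn.execute("SELECT salary FROM employees")]
    mean, sd = statistics.mean(salaries), statistics.pstdev(salaries)

    # Employees hired more than 5 years ago whose salary is more than
    # 3 standard deviations away from the average salary.
    five_years_ago = date(date.today().year - 5, date.today().month, 1).isoformat()
    rows = conn.execute(
        "SELECT ssn FROM employees WHERE hire_date <= ? AND ABS(salary - ?) > 3 * ?",
        (five_years_ago, mean, sd),
    ).fetchall()
    print(rows)  # [] for this tiny table (no 3-sigma outlier); the query shape is the point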

How to get structure?
- When the underlying data is already structured, do unwrapping
  - The Web already has a lot of structured data! The invisible web… which disguises itself as text
- ..else extract structure
  - Go from text to structured data (using quasi-NLP techniques)
- ..or annotate metadata to add structure
  - The Semantic Web idea..
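As a minimal sketch of the "else extract structure" branch, here is a hand-written wrapper that pulls (title, year) pairs out of a hypothetical, regularly formatted HTML listing; real wrapper induction and NLP-based extraction are far more robust, and the page snippet and field names are invented:

    import re

    # A hypothetical, regularly formatted page fragment (invented for illustration).
    page = """
    <li><b>The Matrix</b> (1999)</li>
    <li><b>Memento</b> (2000)</li>
    """

    # A hand-written "wrapper": one regex that turns the repeated layout
    # into structured (title, year) records.
    pattern = re.compile(r"<b>(?P<title>[^<]+)</b>\s*\((?P<year>\d{4})\)")
    records = [m.groupdict() for m in pattern.finditer(page)]
    print(records)  # [{'title': 'The Matrix', 'year': '1999'}, {'title': 'Memento', 'year': '2000'}]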

Adapting old disciplines for the Web age
- Information (text) retrieval
  - Scale of the web
  - Hypertext / link structure (authority/hub computations)
- Social network analysis
  - Ease of tracking / centrally representing social networks
- Databases
  - Multiple databases: heterogeneous, access-limited, partially overlapping
  - Network (un)reliability
- Data mining [machine learning / statistics / databases]
  - Learning patterns from large-scale data
[Figure: the Web at the center, connected to IR, social networks, databases, and data mining]

Information Retrieval
- Traditional model
  - Given: a set of documents, and a query expressed as a set of keywords
  - Return: a ranked set of documents most relevant to the query
  - Evaluation: precision (fraction of returned documents that are relevant), recall (fraction of relevant documents that are returned), efficiency
- Web-induced headaches
  - Scale (billions of documents)
  - Hypertext (inter-document connections)
  - Decentralization (lack of quality guarantees)
  - Bozo users
- ..and simplifications
  - Easier to please "lay" users
- Consequently
  - Emphasis on precision over recall
  - Focus on "trustworthiness" in addition to "relevance"
  - Indexing and retrieval algorithms that are ultra fast
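Since precision and recall recur throughout the course, here is a minimal sketch of the two measures, assuming we are given the set of documents a system returned and the set judged relevant (the document IDs below are made up):

    def precision_recall(returned, relevant):
        """Precision: fraction of returned documents that are relevant.
        Recall: fraction of relevant documents that are returned."""
        returned, relevant = set(returned), set(relevant)
        hits = len(returned & relevant)
        precision = hits / len(returned) if returned else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Toy run: the engine returned d1..d4; the judges marked d2, d4, d7 relevant.
    print(precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))
    # -> (0.5, 0.666...)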

Social Networks (friends vs. soulmates)
- Traditional model
  - Given: a set of entities (humans) and their relations (the network)
  - Return: measures of centrality and importance; propagation of trust (paths through the network)
  - Many uses: spread of diseases, spread of rumours, popularity of people, friend circles
- Web-induced headaches
  - Scale (billions of entities)
  - Implicit vs. explicit links
  - Hypertext (inter-entity connections easier to track)
  - Interest-based links
- ..and simplifications
  - A global view of the social network is possible…
- Consequently
  - Ranking that takes link structure into account (authority/hub)
  - Recommendations (collaborative filtering; trust propagation)
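One concrete instance of "ranking that takes link structure into account" is PageRank-style scoring; below is a minimal power-iteration sketch over a tiny hypothetical link graph (the graph, the damping factor 0.85, and the iteration count are illustrative choices, not anything prescribed by the slides):

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each node to the list of nodes it points to."""
        nodes = list(links)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for n, outs in links.items():
                if not outs:  # dangling node: spread its mass evenly
                    for m in nodes:
                        new[m] += damping * rank[n] / len(nodes)
                else:
                    for m in outs:
                        new[m] += damping * rank[n] / len(outs)
            rank = new
        return rank

    # A tiny hypothetical web graph (invented for illustration).
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(graph))  # C, which both A and B point to, ends up with the highest score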

Information Integration (database-style retrieval)
- Traditional model (relational)
  - Given: a single relational database (schema and instances), and a relational (SQL) query
  - Return: all tuples satisfying the query
  - Evaluation: soundness/completeness, efficiency
- Web-induced headaches
  - Many databases, with differing schemas; all are partially complete, overlapping, heterogeneous, and access-limited
  - Network (un)reliability
- Consequently
  - Newer models of databases
  - Newer notions of completeness
  - Newer approaches for query planning
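A minimal sketch of the "many partially complete, overlapping sources" situation: a toy mediator that sends the same selection to several hypothetical book sources, unions the answers, and tolerates a source that fails (the sources, fields, and data are all invented):

    # Two hypothetical, partially overlapping "sources" of book data.
    SOURCE_A = [{"title": "Dune", "year": 1965}, {"title": "Foundation", "year": 1951}]
    SOURCE_B = [{"title": "Dune", "year": 1965}, {"title": "Hyperion", "year": 1989}]

    def ask_source(source, pred):
        """Stand-in for a remote call; a real source might be slow, access-limited, or down."""
        return [row for row in source if pred(row)]

    def mediator(sources, pred):
        seen, answers = set(), []
        for src in sources:
            try:
                rows = ask_source(src, pred)
            except Exception:      # network (un)reliability: skip a failed source
                continue
            for row in rows:
                key = (row["title"], row["year"])
                if key not in seen:  # union, removing duplicates across overlapping sources
                    seen.add(key)
                    answers.append(row)
        return answers

    print(mediator([SOURCE_A, SOURCE_B], lambda r: r["year"] > 1960))
    # -> Dune (once) and Hyperion; the answer is only as complete as the sources are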

Learning Patterns (from the web and from users)
- Traditional classification learning (supervised)
  - Given: a set of structured instances of a pattern (concept)
  - Induce: the description of the pattern
  - Evaluation: accuracy of classification on the test data (and efficiency of learning)
- Mining headaches
  - Training data is not obvious (e.g., relevance)
  - Training data is massive
  - Training instances are noisy and incomplete
- Consequently
  - Primary emphasis on fast classification, even at the expense of accuracy
  - Also on getting by with a little labeled data plus a lot more unlabeled data
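To make "supervised classification plus accuracy on test data" concrete, here is a minimal Naive Bayes sketch over a handful of invented spam/ham snippets (the corpus, the labels, and the add-one smoothing are illustrative choices):

    import math
    from collections import Counter, defaultdict

    # A handful of invented training snippets.
    train = [("get rich quick click now", "spam"),
             ("rich prize click here", "spam"),
             ("meeting notes for the project", "ham"),
             ("project schedule and notes", "ham")]

    counts, docs, vocab = defaultdict(Counter), Counter(), set()
    for text, label in train:
        docs[label] += 1
        for w in text.split():
            counts[label][w] += 1
            vocab.add(w)

    def classify(text):
        # Multinomial Naive Bayes with add-one (Laplace) smoothing, in log space.
        best, best_score = None, float("-inf")
        for label in docs:
            score = math.log(docs[label] / sum(docs.values()))
            total = sum(counts[label].values())
            for w in text.split():
                score += math.log((counts[label][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

    test = [("click for a rich prize", "spam"), ("notes from the meeting", "ham")]
    accuracy = sum(classify(t) == y for t, y in test) / len(test)
    print(accuracy)  # 1.0 on this toy test set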

Finding "Sweet Spots" in computer-mediated cooperative work
Big Idea 1
- It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop
- All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents "potential solutions"
- …and the human very gratefully does the in-depth analysis on those few potential solutions
Examples:
- The incredible success of the "bag of words" model!
  - A bag of letters would be a disaster ;-)
  - A bag of sentences and/or full NLP would be good ..but only to your discriminating and irascible searchers ;-)
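A minimal sketch of the bag-of-words representation the slide celebrates: a document is reduced to term counts, throwing away word order and syntax entirely (the sentence is made up):

    from collections import Counter

    def bag_of_words(text):
        # Order and syntax are thrown away; only term frequencies survive.
        return Counter(text.lower().split())

    print(bag_of_words("the quick brown fox jumps over the lazy dog"))
    # Counter({'the': 2, 'quick': 1, 'brown': 1, ...}); note that "dog bites man"
    # and "man bites dog" would get the identical representation.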

Collaborative Computing (AKA brain-cycle stealing, AKA computizing eyeballs)
Big Idea 2
- A lot of exciting research related to the web currently involves "co-opting" the masses to help with large-scale tasks
- It is like "cycle stealing", except we are stealing "human brain cycles" (the most idle of computers, if there ever was one ;-)
- Remember the mice in The Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
Examples:
- Collaborative knowledge compilation (Wikipedia!)
- Collaborative curation
- Collaborative tagging
- Paid collaboration/contracting
Many big open issues
- How do you pose the problem such that it can be solved using collaborative computing?
- How do you "incentivize" people into letting you steal their brain cycles?
  - Pay them! (Amazon mturk.com)
  - Make it fun (the ESP game)

Tapping into the Collective Unconscious
Big Idea 3
- Another thread of exciting research is driven by the realization that the Web is not random at all! It is written by humans
- …so analyzing its structure and content allows us to tap into the collective unconscious..
- Meaning can emerge from syntactic notions such as "co-occurrence" and "connectedness"
Examples:
- Analyzing term co-occurrences in web-scale corpora to capture semantic information (today's paper)
- Analyzing the link structure of the web graph to discover communities
  - (DoD and NSA are very much into this as a way of breaking terrorist cells)
- Analyzing the transaction patterns of customers (collaborative filtering)
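A minimal sketch of the co-occurrence idea: count which terms appear near each other within a sliding window over a tiny invented corpus; at web scale such counts start to carry real semantic signal (the window size and the sentences are arbitrary choices):

    from collections import Counter

    # A tiny invented corpus.
    corpus = ["the stock market fell as investors sold shares",
              "investors watched the stock market rally",
              "the chef seasoned the soup with fresh basil"]

    WINDOW = 3  # terms within 3 positions of each other count as co-occurring
    cooc = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + WINDOW]:
                if w != v:
                    cooc[tuple(sorted((w, v)))] += 1

    # Pairs like ('market', 'stock') show up repeatedly, while the cooking terms
    # never co-occur with the finance terms.
    print(cooc.most_common(5))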

[Image slides: 2007 / 2008 (Water's getting aggressive) / 2010]

It's a Jungle Out There (the adversarial Web and the arms race)
Big Idea 4
- The Web is an authority-free zone! Anyone can put up any information and get indexed..
- Everyone is trying to trip you up…
- Need to keep the "adversarial" aspect constantly in view
  - Adversarial IR (focus on trust in addition to relevance)
  - Adversarial mining (the class is being changed even as you are learning it)
- Classic example: spam mail