Office Hours: 1-2pm T/Th 8/23

Slides:

Advertisements

Similar presentations

Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.

Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:

Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.

A (corny) ending. 2 Course Outcomes After this course, you should be able to answer: –How search engines work and why are some better than others –Can.

Given two randomly chosen web-pages p 1 and p 2, what is the Probability that you can click your way from p 1 to p 2 ? 30%?. >50%?, ~100%? (answer at the.

Information Integration + a (corny) ending 5/4 An unexamined life is not worth living.. --Socrates  Mandatory blog qns  Final on next Tuesday 9:50—11:40.

Search Engines and Information Retrieval

Interactive Review + a (corny) ending 12/05  Project due today (with extension)  Homework 4 due Friday  Demos (to the TA) as scheduled.

The Last Lecture Agenda –1:40-2:00pm Integrating XML and Search Engines—Niagara way –2:00-2:10pm My concluding remarks (if any) –2:10-2:45pm Interactive.

Information Retrieval in Practice

1 5/4: Final Agenda… 3:15—3:20 Raspberry bars »In lieu of Google IPO shares.. Homework 3 returned; Questions on Final? 3:15--3:40 Demos of student projects.

Query Processing in Data Integration + a (corny) ending

FACT: A Learning Based Web Query Processing System Hongjun Lu, Yanlei Diao Hong Kong U. of Science & Technology Songting Chen, Zengping Tian Fudan University.

Overview of Web Data Mining and Applications Part I

CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.

Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.

Search Engines and Information Retrieval Chapter 1.

C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )

Mining the Semantic Web: Requirements for Machine Learning Fabio Ciravegna, Sam Chapman Presented by Steve Hookway 10/20/05.

Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.

WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.

Data Mining By Dave Maung.

Collaborative Information Retrieval - Collaborative Filtering systems - Recommender systems - Information Filtering Why do we need CIR? - IR system augmentation.

McGraw-Hill/Irwin © 2008 The McGraw-Hill Companies, All Rights Reserved Chapter 7 Storing Organizational Information - Databases.

Search Engine Architecture

Given two randomly chosen web-pages p 1 and p 2, what is the Probability that you can click your way from p 1 to p 2 ? 30%?. >50%?, ~100%? (answer at the.

CS315-Web Search & Data Mining. A Semester in 50 minutes or less The Web History Key technologies and developments Its future Information Retrieval (IR)

WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.

Conceptual structures in modern information retrieval Claudio Carpineto Fondazione Ugo Bordoni

A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.

Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.

Information Retrieval in Practice

NoSQL: Graph Databases

How do Web Applications Work?

Recommender Systems & Collaborative Filtering

Information Organization: Overview

Information Retrieval (in Practice)

MIS2502: Data Analytics Advanced Analytics - Introduction

Machine Learning overview Chapter 18, 21

Machine Learning overview Chapter 18, 21

Let’s Go Surfing: Internet Safety

Information Retrieval and Web Search

Search Engine Architecture

Introduction to Web Mining

Personalized Social Image Recommendation

Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.

Information Retrieval and Web Search

Information Retrieval and Web Search

Given two randomly chosen web-pages p1 and p2, what is the

Social Knowledge Mining

Course Outcomes After this course, you should be able to answer:

Overview of Machine Learning

CSE 635 Multimedia Information Retrieval

1/21/10 Viewing the Coure in terms of IR, DB, Soc Net, ML adapted to web Start of IR.

Web Mining Department of Computer Science and Engg.

Introduction to Information Retrieval

Planning and Storyboarding a Web Site

CS246: Information Retrieval

Search Engine Architecture

CS246: Web Information Systems -- Introduction

CS 345A Data Mining Lecture 1

Web Mining Research: A Survey

CS 345A Data Mining Lecture 1

Information Organization: Overview

Rachit Saluja 03/20/2019 Relation Extraction with Matrix Factorization and Universal Schemas Sebastian Riedel, Limin Yao, Andrew.

Introduction to Web Mining

Information Retrieval and Web Search

Machine Learning overview Chapter 18, 21

CS 345A Data Mining Lecture 1

CSE591: Data Mining by H. Liu

Presentation transcript:

Office Hours: 1-2pm T/Th 8/23 Viewing the Coure in terms of IR, DB, Soc Net, ML adapted to web Start of IR

Course Overview (take 2) Life can only be understood backwards; but it must be lived forwards. --Soren Kierkegaard 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Web as a collection of information Web viewed as a large collection of__________ Text, Structured Data, Semi-structured data (connected) (dynamically changing) (user generated) content (multi-media/Updates/Transactions etc. ignored for now) So what do we want to do with it? Search, directed browsing, aggregation, integration, pattern finding How do we do it? Depends on your model (text/Structured/semi-structured) 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Copyright © 2001 S. Kambhampati Structure A generic web page containing text An employee record [English] [SQL] [XML] A movie review How will search and querying on these three types of data differ? Semi-Structured 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Structure helps querying Expressive queries Give me all pages that have key words “Get Rich Quick” Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary Give me all mails from people from ASU written this year, which are relevant to “get rich quick” keyword SQL XML 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Copyright © 2001 S. Kambhampati How to get Structure? When the underlying data is already strctured, do unwrapping Web already has a lot of structured data! Invisible web…that disguises itself ..else extract structure Go from text to structured data (using quasi NLP techniques) ..or annotate metadata to add structure Semantic web idea.. Pandora employees adding features to music.. Structure is so important that we are willing to pay people to add structure or hope that people will be disciplined enough to annotate their pages with structure. Pandora’s employees make the features… Magellan went around the world three times. On one of those trips he died. On which trip did he die? 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Adapting old disciplines for Web-age Information (text) retrieval Scale of the web Hyper text/ Link structure Authority/hub computations Social Network Analysis Ease of tracking/centrally representing social networks Databases Multiple databases Heterogeneous, access limited, partially overlapping Network (un)reliability Datamining [Machine Learning/Statistics/Databases] Learning patterns from large scale data IR Social Networks Web Databases Datamining 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Information Retrieval Web-induced headaches Scale (billions of documents) Hypertext (inter-document connections) Bozo users Decentralization (lack of quality guarantees) Hard for users to figure out quality Godfather & Eggplants & simplifications Easier to please “lay” users Consequently Emphasis of precision over recall Focus on “trustworthiness” in addition to “relevance” Indexing and Retrieval algorithms that are ultra fast Traditional Model Given a set of documents A query expressed as a set of keywords Return A ranked set of documents most relevant to the query Evaluation: Precision: Fraction of returned documents that are relevant Recall: Fraction of relevant documents that are returned Efficiency Relevance is like correctness—a local property; while quality is like optimality—a global property 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Copyright © 2001 S. Kambhampati 8/25 "You can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future," 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Copyright © 2001 S. Kambhampati Social Networks Friends vs. Soulmates Web-induced headaches Scale (billions of entities) Implicit vs. Explicit links Hypertext (inter-entity connections easier to track) Interest-based links & Simplifications Global view of social network possible… Consequently Ranking that takes link structure into account Authority/Hub Recommendations (collaborative filtering; trust propagation) Traditional Model Given a set of entities (humans) And their relations (network) Return Measures of centrality and importance Propagation of trust (Paths through networks) Many uses Spread of diseases Spread of rumours Popularity of people Friends circle of people 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Information Integration Database Style Retrieval Traditional Model (relational) Given: A single relational database Schema Instances A relational (sql) query Return: All tuples satisfying the query Evaluation Soundness/Completeness efficiency Web-induced headaches Many databases With differing Schemas all are partially complete overlapping heterogeneous schemas access limitations Network (un)reliability Consequently Newer models of DB Newer notions of completeness Newer approaches for query planning 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Learning Patterns (from web and users) Traditional classification learning (supervised) Given a set of structured instances of a pattern (concept) Induce the description of the pattern Evaluation: Accuracy of classification on the test data (efficiency of learning) Mining headaches Training data is not obvious (relevance) Training data is massive But much of it unlabeled Training instances are noisy and incomplete Consequently Primary emphasis on fast classification Even at the expense of accuracy Also on getting by with a little labeled data + a lot more unlabeled data [Dantzig Story] 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Finding“Sweet Spots” in computer-mediated cooperative work Big Idea 1 It is possible to get by with techniques blythely ignorant of semantics, when you have humans in the loop All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions” …and the human very gratefully does the in-depth analysis on those few potential solutions Examples: The incredible success of “Bag of Words” model! Bag of letters would be a disaster ;-) Bag of sentences and/or NLP would be good ..but only to your discriminating and irascible searchers ;-) Don’t overthink the solution…. The surprising thing is not why they don’t work better, but why they work at all? 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Big Ideas and Cross Cutting Themes 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Copyright © 2001 S. Kambhampati Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs Big Idea 2 A lot of exciting research related to web currently involves “co-opting” the masses to help with large-scale tasks It is like “cycle stealing”—except we are stealing “human brain cycles” (the most idle of the computers if there is ever one ;-) Remember the mice in the Hitch Hikers Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..) Collaborative knowledge compilation (wikipedia!) Collaborative Curation Collaborative tagging Paid collaboration/contracting Many big open issues How do you pose the problem such that it can be solved using collaborative computing? How do you “incentivize” people into letting you steal their brain cycles? Pay them! (Amazon mturk.com ) Make it fun (ESP game) 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

Tapping into the Collective Unconscious Big Idea 3 Another thread of exciting research is driven by the realization that WEB is not random at all! It is written by humans …so analyzing its structure and content allows us to tap into the collective unconscious .. Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness” Examples: Analyzing term co-occurrences in the web-scale corpora to capture semantic information (gmail) Statistical machine translation with massive corpora Analyzing the link-structure of the web graph to discover communities DoD and NSA are very much into this as a way of breaking terrorist cells Analyzing the transaction patterns of customers (collaborative filtering) 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

2007

2008 Water’s getting aggressive

2010

It’s a Jungle out there (adversarial Web & Arms Race) Big Idea 4 Web is authority-free zone! Anyone can put up any information and get indexed.. Everyone is trying to trip you up… (snopes.com) Need to keep “adversarial” aspect constantly in view Adversarial IR (focus on Trust in addition to Relevance) Adversarial mining (the class is being changed even as you are learning) Classic example: Spam mail Mother may be making those sites too.. 11/17/2018 10:02 AM Copyright © 2001 S. Kambhampati

2011