CS246: Web Information Systems -- Introduction

Slides:



Advertisements
Similar presentations
CS246: Web Information Systems
Advertisements

Intro to CIT 594
Welcome to CS 395/495 Measurement and Analysis of Online Social Networks.
Computer Network Fundamentals CNT4007C
1 INE 1020 Introduction to Internet Engineering Tutorial 3 Discussion on Homework 1.
CS523 INFORMATION RETRIEVAL COURSE INTRODUCTION YÜCEL SAYGIN SABANCI UNIVERSITY.
Computer Networks CEN 5501C Spring, 2008 Ye Xia (Pronounced as “Yeh Siah”)
Course Introduction Software Engineering
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Overviews of ITCS 6161/8161: Advanced Topics on Database Systems Dr. Jianping Fan Department of Computer Science UNC-Charlotte
Course grading Project: 75% Broken into several incremental deliverables Paper appraisal/evaluation/project tool evaluation in earlier May: 25%
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
Search Engine and SEO Presented by Yanni Li. Various Components of Search Engine.
CS 541 Lecture Slides Sunil Prabhakar CS541 Database Systems.
Information Retrieval and Web Search Course overview Instructor: Rada Mihalcea.
ITCS 6265 Details on Project & Paper Presentation.
Department of Computer Science, Florida State University CGS 3066: Web Programming and Design Spring
Lecture 1 Page 1 CS 236 Online Introduction CS 236 On-Line MS Program Networks and Systems Security Peter Reiher.
Computer Networks CNT5106C
Welcome to EECS 395/495 Online Advertising: A Systems Approach.
Data mining in web applications
Information Retrieval in Practice
Welcome to EECS 395/495 IoT Networks Seminar
Welcome to EECS 395/495 Networking Problems in Cloud Computing
Computer Network Fundamentals CNT4007C
Course Overview - Database Systems
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
CS6501 Advanced Topics in Information Retrieval Course Policy
Microgrid Concepts and Distributed Generation Technologies
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
CMSC 471 Principles of Artificial Intelligence Course Overview
Course Introduction 공학대학원 데이타베이스
Computer Networks CNT5106C
CNT 4704 Computer Communication Networking (not “analysis”)
Purpose of Class To prepare students for research and advanced work in security topics To familiarize students working in other networking areas with important.
ITCS 6157/8157: Visual Database
CNT 4704 Computer Communication Networking (not “analysis”)
CMSC 471 Introduction to Artificial Intelligence section 1 Course Overview Spring 2017.
Search Engine Architecture
CNT 4704 Computer Communication Networking (not “analysis”)
September 27 – Course introductions; Adts; Stacks and Queues
Autonomous Cyber-Physical Systems: Course Introduction
Department of Computer Science, Florida State University
CS246: Web Information Systems
Computer Networks CNT5106C
Course Overview - Database Systems
Systems Programming Intro
CNT 4704 Analysis of Computer Communication Networks
CNT 4704 Analysis of Computer Communication Networks
Intro to CIT 594
EE422C Software Design and Implementation II
Unit# 5: Internet and Worldwide Web
Course Summary ChengXiang “Cheng” Zhai Department of Computer Science
CS4501: Information Retrieval Course Policy
CS246: Information Retrieval
Search Engine Architecture
Course Overview CS 4640 Programming Languages for Web Applications
Intro to CIT 594
Intro to CIT 594
Sampath Jayarathna Cal Poly Pomona
Computer Networks CNT5106C
Welcome! Knowledge Discovery and Data Mining
Lecture 1a- Introduction
Lecture 1: Overview of CSCI 485 Notes: I presented parts of this lecture as a keynote at Educator’s Symposium of OOPSLA Shahram Ghandeharizadeh Director.
Lecture 1: Overview of CSCI 485 Notes: I presented parts of this lecture as a keynote at Educator’s Symposium of OOPSLA Shahram Ghandeharizadeh Associate.
Course Overview CS 4640 Programming Languages for Web Applications
CSC 241: Introduction to Computer Science I
CS249 Advanced Seminar: Learning From Text
CS2013 LECTURE 1 John Hurley Cal State LA.
CGS 3066: Web Programming and Design Fall 2019
Presentation transcript:

CS246: Web Information Systems -- Introduction Professor Junghoo “John” Cho

Today’s Topics Introduction Course overview Course logistics Paper reading assignments Class project

Instructor TA Instructor: Junghoo “John” Cho Office: 3531H Boelter Hall Email: cho@cs.ucla.edu Research area: search engine, text analysis and mining TA

Tell us about YOU Name Department & Program Before coming to UCLA Brief history at UCLA Technical/research interests Expectation from the class

Overview: Information Galore Biblio sever Legacy database Plain text files CS246 by John Cho

Central Problem How to manage/access information on the Internet? Three major approaches Central indexing E.g., Web search engine Dynamic integration E.g., comparison shopping services Data extraction E.g., spamming companies

Topic: Search Engine (Central Indexing) CS246 by John Cho

Topic: Search Engine (Central Indexing) Web: collection of passive HTML pages Find Web pages relevant to a query Traditional Information Retrieval: Web = collection of HTML pages HTML page = a bag of words More than that? Links, structure of the Web User access patterns HTML tags (markups) CS246 by John Cho

Topic: Dynamic Integration Amazon.com Cars.com Truecar.com Apartments.com CS246 by John Cho

Topic: Dynamic Integration Mediator Wrapper Wrapper Wrapper The second approach that I mentioned is dynamic integration. In this approach, instead of considering the Web as a collection of static and passive pages, we consider the Web as a collection of active and intelligent services. For example, if we think about the Amazon Web site, it is not just a set of static Web pages. It provides some search functionality over its contents, so that the users can submit some queries and the site returns a list of relevant results. So in this approach, the main idea is that instead of considering them as passive pages, we will exploit these rich functionalities that the individual sources provide. So there is a central program, called mediator, with which the users interact, and based on the user’s submitted queries the mediator figures out which source to access in what order, and contact the right sources to get relevant results. However, because the sources or Web sites always have some differences, there is a small layer, called wrappers, on top of each source, which translates the users’ queries to the native queries that individual sources understand. Source 1 Source 2 Source n CS246 by John Cho

Topic: Data Extraction Structured data WWW Beatles $10 Madonna $20 NSync $20 How can we extract “structured data” from unstructured/free text automatically? CS246 by John Cho

Focus of CS246 Central Indexing a.k.a. “Search Engine” Information Retrieval (IR) System Systems backed by 50-year research Challenge: How can we support intuitive “natural-language based” search on Web documents? How can machines “discover” Web pages? How can machines “understand” text and query? How can machines identify “relevant” document? How can machines return documents quickly? Possibly other approaches as well depending on student interest…

Course Workload Lectures: 3-4 hours per week Paper reading assignments Submit short summary for each paper About one paper per week Project Split into four parts Use ElasticSearch, Mallet, and Java to index and analyze Wikipedia documents Students must know (or is willing to learn) basic Java and Unix One midterm

Course Objective Learn theory (lectures) Information retrieval (IR) models, data structures and algorithms Search engine architecture and algorithms Learn practical tools (projects) Theory without practice is an “intellectual pastime”! Learn how to “implement” learned theory using real tools Learn how to read paper (paper reading assignments) “Reading papers” is what you do as a software engineer or research scientist Reading papers is hard! Papers are written with “different” objectives than textbooks It takes time and practice to learn ”reading papers”

Prerequisite Introductory database, e.g., CS143 e.g.: query? SQL? Basic algorithms and data structures Basic probability and statistics P(A|C), Bayes rule, … Programming experience Project is done using basic Java

Paper Reading Why: What: How: Something that you will do all the time as a software engineer or researcher Learn to be critical and communicate well Acquire knowledge to conduct research and/or project What: “Classic papers” in information retrieval, search engines, etc Roughly one paper per week How: Read papers and write a short summary All papers are available on our class Web site http://oak.cs.ucla.edu/classes/cs246/papers.html Username: cs246, password: papers CS246 by John Cho

How to Read Papers Key: Understand the “big picture” first Details are often VERY DIFFICULT to get Try to understand the high-level idea first and go into the details only if needed Understanding “big picture” What is the problem? Why is it important? Why is it difficult? What has this paper done? What others have done?

Paper Summary Summary is required so that everyone reads assigned papers Without this, no one will read them… Required components: at most 3 paragraph Summary (1 paragraph): your own words This paper discusses how to optimize queries with... Comments/criticisms (1-2 paragraphs): the good & the bad It addresses a real problem and the solution is interesting … But I feel the experiments are not realistic because... Optional: questions, as many as you want Why the authors assume that queries are independent? Summaries are graded as “Excellent”, “Great”, or “Poor” Most reviews will get “Great” unless it is written extremely poorly/well (say, less than 10% will get either “Excellent” or “Poor”)

Class Project Why: How: What: Learn how to apply learned theory to real system and dataset using practical tools How: Must be completed individually Must be done inside a “virtual machine” provided by us Ubuntu 16.04, Java JDK 8, Gradle 4.1, ElasticSearch 5.6.2, Mallet 2.0.8 No installation, set up, and configuration problems Students must feel comfortable using Unix/Linux commands and programming in Java What: Index Wikipedia dataset using ElasticSearch Customize ranking function of ElasticSearch Implement a simple spell checker Perform SVD and topic analysis on real dataset CS246 by John Cho

Exam Midterm: 9th week Monday during class

Grading Paper reviews: 10% Project: 50% Midterm: 40%

Textbook No required textbook Optional reference Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze It is useful if you need to read more details on a certain topic.

Communication All course materials are available on our Web site http://oak.cs.ucla.edu/classes/cs246 Piazza will be used for Online discussion and general question Personal communication between a student and course staff Please send a “private message” on Piazza for personal communications CCLE will be used for assignment submissions Paper summary Project deliverable Course staff has regular office hours

Announcements First paper summary is due this Sunday “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page (yes, they are them!) First project will be released this Sunday Project overview is already online Please take a look at our class Web site Please sign up for Piazza if you have not CS246 by John Cho