CS246: Web Information Systems

Junghoo “John” Cho Spring 2016 CS246 by John Cho

Course Information
Web page: http://oak.cs.ucla.edu/cs246/
Topic: Web information management
Time: MW 10: :50 am
Instructor: Junghoo “John” Cho
  Office: 3531H Boelter Hall
  When emailing, please use subject “CS246: …”
  Office hours: Tue 2:30-3:30 pm

Who is this class for?
Strong interest in research
Interest in Web information systems
Time commitment:
  Around 2-3 papers every week
    Typically one full day of paper reading
  One independent project
    Similar to paper writing
    In fact, we read papers from past student projects!
    Or an interesting application implementation

Today’s Topics
Overview of the course topics
Course logistics
  Paper reading assignments
  Class project

Prerequisite
Introductory database, e.g., CS143
  E.g.: query? SQL?
Basic algorithms and data structures
Basic probability and statistics
  P(A|C), Bayes rule, …
Design and implementation experience
  Basic C++
Quick test:
  Grab a sample paper
  See if you can read, understand, and build it

Tell Us About You
Name
Department & Program
Before coming to UCLA
Brief history at UCLA
Technical/research interests
Expectation from the class

Information Galore
Diagram labels: Biblio server, Legacy database, Plain text files

The advent of powerful personal computers and the World-Wide Web has enormously empowered the average person. Previously we had to go through many intermediaries to do what is, by today’s standards, a very simple task; now average people can do a lot of things by themselves. As a simple example, think about flight reservations. If we want to purchase a plane ticket, we can go directly to a travel site, like Travelocity or Orbitz, to get all the relevant information and purchase tickets. These Web sites are amazing because they show us all the choices that are currently available and let us decide. Ten years ago, when we had to go through a travel agent, the agents had essentially total control, and they selectively gave out information based on their own interests. Now we can decide what we want to do based on the available information.

Obviously this is not limited to flight reservations. If we want to buy books, videos, electronics, or whatever, we have access to an amazing amount of information, and we can make an informed decision by going over it, if we put in enough effort.

The examples I have given so far are on the consumer side of information, but an even more important change has occurred on the producer side. Ten years ago, who could have imagined that I could assemble this professional-looking presentation in just a few hours, without any help from professional designers? Also, if I want to torture other people with my amazing voice, I can simply record my singing on my computer and burn a CD or create an MP3 file. The same is true for videos, photos, or any documents that I want to share with other people: I can simply create them on my computer and post them on the Web so that anyone can access them. So this is amazing. There is virtually nothing to prevent us from publishing our ideas on the Internet, and we can access anyone’s ideas with a click of the mouse.

Central Problem
How to manage/access information on the Web?
Three major approaches
  Central indexing
    E.g., Web search engine
  Dynamic integration
    E.g., comparison shopping services
  Data extraction
    E.g., spamming companies

Topic: Web Search (Central Indexing)
So what is central caching and indexing? In this approach, we treat the information on the Web as static, passive Web pages. Essentially, we collect all the Web pages at a central location, analyze them, and build an index on top of them. Then, when users issue queries to find certain information, we use this index to identify the relevant Web pages and direct the users to them. Most Web search engines, like Google or AltaVista, follow this approach.
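
The collect, analyze, index, and query steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the page texts and URLs are made up, `build_index` and `search` are hypothetical names, and an in-memory dictionary stands in for the crawl.

```python
from collections import defaultdict

# Hypothetical stand-in for crawled pages: URL -> page text.
PAGES = {
    "http://example.com/a": "cheap flight tickets to Los Angeles",
    "http://example.com/b": "flight delays and weather",
    "http://example.com/c": "concert tickets for sale",
}


def build_index(pages):
    """Build an inverted index: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs that contain every word of the query (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(PAGES)
print(sorted(search(index, "flight tickets")))
```

With these sample pages, the query "flight tickets" matches only page `a`, since it is the only page containing both words. A real engine would also rank the matches; this sketch only finds them.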

Topic: Web Search (Central Indexing)
Web: collection of passive HTML pages
Find Web pages relevant to a query
Traditional Information Retrieval:
  Web = collection of HTML pages
  HTML page = a bag of words
More than that?
  Links, structure of the Web
  User access patterns
  HTML tags (markups)
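
To make the contrast on this slide concrete, here is a small sketch of the "bag of words" view of a page next to one piece of information that view throws away, namely the links between pages. The three pages, their link lists, and the function names are all hypothetical, and counting in-links is just one simple way to use link structure, not a method the slides prescribe.

```python
from collections import Counter

# Hypothetical pages: name -> (text, list of pages it links to).
PAGES = {
    "A": ("web search web index", ["B", "C"]),
    "B": ("search engine", ["C"]),
    "C": ("web graph", []),
}


def bag_of_words(text):
    """Word counts only; word order and markup are discarded."""
    return Counter(text.lower().split())


def in_link_counts(pages):
    """Count how many pages link to each page."""
    counts = Counter({name: 0 for name in pages})
    for _, links in pages.values():
        counts.update(links)
    return counts


bags = {name: bag_of_words(text) for name, (text, _) in PAGES.items()}
links = in_link_counts(PAGES)
print(bags["A"]["web"], links["C"])
```

The bag for page A records that "web" occurs twice but not where, while the in-link count shows that page C is pointed to by both other pages, which is information no bag of words contains.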

Topic: Dynamic Integration
Amazon.com, Cars.com, 401carfinder.com, Apartments.com

Topic: Dynamic Integration
Diagram labels: Mediator; Wrapper, Wrapper, Wrapper; Source 1, Source 2, Source n

The second approach that I mentioned is dynamic integration. In this approach, instead of considering the Web as a collection of static and passive pages, we consider it a collection of active and intelligent services. For example, the Amazon Web site is not just a set of static Web pages. It provides search functionality over its contents, so that users can submit queries and the site returns a list of relevant results. The main idea is to exploit the rich functionality that the individual sources provide, rather than treating them as passive pages. There is a central program, called the mediator, with which the users interact. Based on the user’s query, the mediator figures out which sources to access and in what order, and contacts them to get the relevant results. However, because the sources (Web sites) always differ from one another, there is a thin layer on top of each source, called a wrapper, which translates the user’s query into the native query that the individual source understands.
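
The mediator and wrapper roles described above can be sketched as follows. Everything here is hypothetical: the two sources, their "native" interfaces (`find_by_title`, `lookup`), and the catalog data are invented for illustration. As a simplification, this mediator sends every query to all sources instead of deciding which sources to contact and in what order.

```python
# Hypothetical sources, each with its own "native" query interface.
class BookStore:
    def find_by_title(self, title_part):
        catalog = [("Web Data Management", 45.0), ("Cooking Basics", 20.0)]
        return [(t, p) for t, p in catalog if title_part.lower() in t.lower()]


class UsedBooks:
    def lookup(self, keyword_list):
        stock = [{"name": "Web Search Basics", "cents": 1500}]
        return [b for b in stock
                if all(k.lower() in b["name"].lower() for k in keyword_list)]


# Wrappers translate the mediator's common query (a keyword string)
# into each source's native call, and normalize the results.
class BookStoreWrapper:
    def __init__(self, source):
        self.source = source

    def query(self, keywords):
        return [{"title": t, "price": p, "source": "BookStore"}
                for t, p in self.source.find_by_title(keywords)]


class UsedBooksWrapper:
    def __init__(self, source):
        self.source = source

    def query(self, keywords):
        return [{"title": b["name"], "price": b["cents"] / 100, "source": "UsedBooks"}
                for b in self.source.lookup(keywords.split())]


class Mediator:
    """Sends the user's query to every wrapper and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def search(self, keywords):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(keywords))
        return sorted(results, key=lambda r: r["price"])


mediator = Mediator([BookStoreWrapper(BookStore()), UsedBooksWrapper(UsedBooks())])
for item in mediator.search("web"):
    print(item["source"], item["title"], item["price"])
```

The user only ever talks to `Mediator.search`; the differences between the sources (method names, argument types, price units) are hidden inside the wrappers, which is the point of the thin layer mentioned in the notes.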

Topic: Data Extraction
Diagram labels: WWW, Structured data
  Beatles   $10
  Madonna   $20
  NSync     $20
How can we extract “structured data” from free text automatically?
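
As a very rough illustration of the task, the sketch below pulls (name, price) records out of a sentence with a hand-written regular expression. The sample sentence and the pattern are assumptions made up for this example. A hand-written pattern is the simplest possible baseline; it is not the automatic extraction the slide asks about, which would have to work without a person writing rules for each site.

```python
import re

# Hypothetical free text, loosely modeled on the slide's example rows.
TEXT = "Today only: Beatles $10, Madonna $20, and NSync for $20 while supplies last."

# Assumed pattern: a capitalized name, an optional "for", then a dollar amount.
PATTERN = re.compile(r"\b([A-Z][A-Za-z]+)\s+(?:for\s+)?\$(\d+)")


def extract_prices(text):
    """Return (name, price) records found in free text."""
    return [(name, int(price)) for name, price in PATTERN.findall(text)]


print(extract_prices(TEXT))
```

On the sample sentence this recovers the three rows shown on the slide. It would break as soon as the wording changes (multi-word names, prices written as "20 dollars"), which is exactly why the extraction problem is hard.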

Main Course Workload
Paper reading
  Paper reading assignments
  Class discussion
  We mainly focus on “central indexing”
Independent projects

High-Level Goal
Learn core ideas and techniques
  Some of the techniques can be useful for other fields
Learn how to read papers
Hopefully learn what it is like to do research
  Sometimes very frustrating but often very rewarding

Paper Reading
Why:
  Something that you will do all the time as a researcher
  Learn to be critical and communicate well
  Acquire knowledge to conduct research/project
About 20 papers from
  Conferences: SIGMOD, VLDB, WWW, and …
Before the class:
  Everyone: read and review the paper
During the class:
  Instructor: present his own understanding and lead class discussion
  Everyone: participate!!!

How to Get Papers
From the class homepage
Some of the materials are password protected
  User name: cs246
  Password: papers
Let me know if there is any problem

How to Read Papers
Understand the “Big Picture”
  What is the problem?
  Why is it important?
  Why is it difficult?
  What has this paper done?
  What have others done?

Paper Reviews (1)
Due by the preceding Sunday
Submit through our Web submission interface on the class Web page
Required components: at most 3 paragraphs
  Summary (1 paragraph): in your own words
    This paper discusses how to optimize queries with...
  Comments/criticisms (1-2 paragraphs): the good & the bad
    It addresses a real problem and the solution is interesting …
    But I feel the experiments are not realistic because...
Optional: questions, as many as you want
  Why do the authors assume that queries are independent?

Paper Reviews (2)
Please volunteer for “grading reviews”
  Important learning experience
  1 extra point
All reviews will get full score unless they are written extremely poorly
  If the grader finds an extremely poorly written review, it will be graded as “poor” (less than 10%)
Sign up for grading at:

Class Project
Why:
  Work on a specific problem and learn to find a solution
40% of the class
Team of up to 3
Topic: any problem related to the general problem
Open style
  Rigorous study of a research problem, or
  Any interesting system implementation

Class Project Schedule
Important Milestones
  Group formation: 4/06 (2nd week Wed)
  Project proposal: 4/17 (3rd week Sun)
  Project progress: 5/04 (6th week Wed)
  Final report: 5/22 (8th week Sun)
  Project presentation: 9th and 10th weeks
You are responsible for staying on track
  Make appointments with the instructor as needed

Project: Please Remember
Aim high, but be realistic
  Expect to read at least 4-5 papers along the way
Start early
  Don’t do it right before the deadline
  There are always unexpected obstacles
  Some students could not finish in previous quarters
  Please, please start early
You are responsible for staying on track

Grading
Midterm: 40%
Paper reviews: 20%
Project: 40%

Announcements
First review due Sunday 4/03
  Two papers for class 3 and 4
    Graph structure in the Web
    The Anatomy of a Large-Scale Hypertextual …
