Information Retrieval and Web Search

Information Retrieval and Web Search Vasile Rus, PhD vrus@memphis.edu www.cs.memphis.edu/~vrus/teaching/ir-websearch/

Outline
  Administrivia
  Why Information Retrieval?
  Information Overload

General Information
  Web site: http://www.cs.memphis.edu/~vrus/teaching/ir-websearch/
  Instructor: Vasile Rus, PhD
    Office: 323 Dunn Hall
    Office hours: 323 Dunn Hall; T-R 10:00-11:00AM
    Phone: x5259
    E-mail: vrus@memphis.edu
  TA: Shanshan Gao
    Office hours: TBD

Why Attend This Class?
  It will help you cope with the information overload problem
  It will allow you to design and implement solutions for handling large collections of information
  It is FUN! (hopefully)

Syllabus
  Week 1: Introduction to IR and Web Search
  Week 2: Introduction to PERL
  Week 3: Classic IR: Boolean and Vectorial Models
  Week 4: More IR Models
  Week 5: Evaluation in IR
  Week 6: Query Operations and Languages
  Week 7: Text Properties, Text Operations
  Week 8: NO CLASS – FALL BREAK, Indexing and Searching, Review
  Week 9: MIDTERM, WWW and Web Search Intro

Syllabus (cont’d)
  Week 10: Web Search
  Week 11: Text Categorization
  Week 12: Text Clustering
  Week 13: Question Answering
  Week 14: Advanced IR Models, THANKSGIVING HOLIDAY
  Week 15: Project Presentations, Review
  Week 16: Final Exam

To be successful you need to
  Read the syllabus
  Understand the structure of the course
  Read the general policies
  Attend classes and participate by asking questions and/or contributing related remarks
  Explore the course website

To be successful you need to
  Try to enjoy the programming assignments
  Not limit yourself to what is asked in class

Grading
  Project (30%)
  Assignments: 6-8 (or more)
  2 Exams
    Midterm (15%)
    Final (15%)
  Active Participation, Presentations (5%)

Grading
  Grade     Letter Grade
  90-100+   A
  80-89     B
  70-79     C
  60-69     D
  0-59      F
  Scoring within 2.5 points of a cut-off earns you a + or – on your letter grade. For example, 89 has a letter equivalent of B+.
  Exception: 90-91 will give you A-, 92 to 96 will give you A, and anything above 97 means A+.

Other Issues
  Attendance can help you when you are on a borderline
  PhD students need to make a class presentation (besides the project presentation)
  General announcements are posted on the web site frequently! Please check it as often as possible
  If you notice any inconsistencies on the website (broken links, misspellings, etc.) please notify me
  Thank you!

Bibliography
  REQUIRED:
    Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval
  RECOMMENDED (!):
    Frakes & Baeza-Yates, Information Retrieval: Data Structures and Algorithms
    C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval

Office Hours and Extra Help
  I'll be available in my office during the following times:
    TR: 10:00AM - 11:00AM
    By appointment
  You must send me an email to set up an appointment
  If you just knock on my door without notice, chances are that I'll be busy
  TA’s office hours can be found on the website
  Please use the office hours!

Assignment Submission
  You will have on average one to two weeks from the date the work is assigned
  Late submissions are not accepted
  In exceptional cases you may have a 48-hour grace period at the cost of 50% of the grade (you should ask for it before the due date)

Programming Assignments
  Programming submissions are electronic (using a form or email) AND on paper
  Submissions should:
    contain your name and the assignment number as part of the file name, e.g.: vasileRus.Assignment01.sh (the code)
    be well indented and contain lots of comments
    follow the Recommended code-style guidelines on the website
  Each file should contain a header as given in the next slide
  If multiple files are submitted, pack them using gzip, tar, etc.

File Header
/*************************************
 * Name: FileName, Package name if necessary
 * Assignment: assignment ID
 * Description: a text describing the assignment
 * Author: Your Name
 * Date: put here the due date
 * Comments: any comments you think are necessary
 *************************************/
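
The packing step mentioned on the Programming Assignments slide could be sketched as follows. This is only an illustration: the helper name `pack_submission` and the archive naming pattern (author name, assignment number, `.tar.gz`) are assumptions extrapolated from the `vasileRus.Assignment01.sh` example, not a format the course prescribes.

```python
import tarfile
from pathlib import Path


def pack_submission(files, author, assignment_no, out_dir="."):
    """Pack several source files into one gzip-compressed tar archive.

    Hypothetical helper: the archive name mirrors the file-name example
    from the slides (author name plus two-digit assignment number).
    """
    archive = Path(out_dir) / f"{author}.Assignment{assignment_no:02d}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for f in files:
            # Store each file under its bare name so the archive
            # unpacks flat, without the sender's directory layout.
            tar.add(str(f), arcname=Path(f).name)
    return archive
```

Calling `pack_submission(["main.sh", "utils.sh"], "vasileRus", 1)` would write `vasileRus.Assignment01.tar.gz` in the current directory, assuming both files exist there.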

Plagiarism
  Plagiarism is not tolerated. If caught, you'll be given grade 0 (zero) and disciplinary actions will be taken
  It's OK to help friends who may have problems; this is actually a good learning tool, but it is not OK to share code or answers
  If they need help, discuss the problem with them, but never show them your code
  I may (and I will) ask you to demonstrate and explain your programs

Exams
  During exams you should sit as far from each other as possible
  As a rule of thumb, leave at least one chair between you and any other student
  Usually, all exams are closed book
  Exams are normally made of:
    true-false questions
    multiple-choice questions
    “open” questions (programming or not)
  There are no make-up exams

Questions

Information Overload “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

Information Overload

Coping With It! “reserve large blocks of time on your calendar, don’t answer the phone, and return calls in short bursts once or twice a day” (Drucker, 1967)

Coping With It!
  Some combination of focusing, filtering, and forgetting
  It requires a tremendous amount of self-discipline, and we can’t do it alone: in our teams and across the whole organization, we need to establish a set of norms that support a more productive way of working.
  “Multitasking is not heroic; it’s counterproductive”
  http://www.mckinsey.com/insights/organization/recovering_from_information_overload

Coping With It! We have to admit, for example, that we do feel satisfied when we can respond quickly to requests and that doing so somewhat validates our desire to feel so necessary to the business that we rarely switch off. There’s nothing wrong with these feelings, but we need to consider them alongside their measurable cost to our long-term effectiveness. No one would argue that burning up all of a company’s resources is a good strategy for long-term success, and that is equally true of its leaders and their mental resources.

What kinds of information are there?
  Text: books, periodicals, WWW, memos, ads (published/refereed)
  Film
  Photos, other images
  Broadcast TV, radio
  Telephone conversations
  Databases

How much information is there?
  (Estimates courtesy of Hal Varian and Peter Lyman)
  Original: http://www.sims.berkeley.edu/emc
  Newer: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

How Much Information?
  Stored: Print, Film, Optical, Magnetic
  Communicated: Internet, Broadcast, Phone, Mail

Print
  Annual Production
    Books: 968,735 = 8 Terabytes (compressed image)
    Newspapers: 22,643 = 25 Terabytes
    Journals: 40,000 = 2 Terabytes
    Magazines: 80,000 = 10 Terabytes
    Office Documents: 12x10^9 pages = 312 Terabytes
    TOTAL: 357 Terabytes

Print
  Library of Congress
    Printed book collection: about 18 million books, about 130 terabytes (compressed image)
    For all of LC we should also assume:
      13M photographs, 5MB each = 65 TB
      4M maps, say 200 TB
      500K files, 1GB each = 500 TB
      3.5M sound recordings, ~2000 TB
    Grand total: 3 petabytes (~3000 terabytes)
  Books in Print (which you can buy TODAY)
    3.2 million titles
    About 26 terabytes
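
The Library of Congress estimate is simple arithmetic, and a short sketch makes the unit conversions explicit. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which is what the round numbers on the slide suggest; the variable names are illustrative only.

```python
# Decimal storage units (an assumption; the slide does not say which it uses).
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

# Per-collection sizes in terabytes. Figures without a formula are
# taken directly from the slide; the others are recomputed.
lc_terabytes = {
    "printed books": 130,                     # stated: about 130 TB
    "photographs": 13_000_000 * 5 * MB / TB,  # 13M items at 5 MB each
    "maps": 200,                              # stated: "say 200 TB"
    "files": 500_000 * 1 * GB / TB,           # 500K items at 1 GB each
    "sound recordings": 2000,                 # stated: ~2000 TB
}

total_tb = sum(lc_terabytes.values())
total_pb = total_tb * TB / PB

for name, size in lc_terabytes.items():
    print(f"{name:18s} {size:8.0f} TB")
print(f"{'total':18s} {total_tb:8.0f} TB (about {total_pb:.1f} PB)")
```

The components add up to 2,895 TB, which the slide rounds to 3 petabytes.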

Film and Image
  Photographs = 410 Petabytes per year
  Movies = 16 Terabytes (commercial production of about 4000 films)
  X-Rays = 12 Petabytes

Optical Media
  CD-Music: 90,000 items = 58 TB
  CD-ROM: 3,000 items = 3 TB
  DVD-Video: 5,000 items = 22 TB
  Total: 83 TB

Magnetic Media
  Audio Tape: 184,200,000 = 184.2 Petabytes
  Video Tape: 355,000,000 = 1420
  Floppy disks = 0.07
  Removable disks = 1.69
  Hard Disks = 500

Totals Stored Per Year
Medium    Type of content    TB/Year (Upper Bound)    TB/Year (Lower Bound)
Paper     Books              8                        7
          Newspapers         25                       20
          Periodicals        12                       12
          Office documents   312                      312
          SUBTOTAL           357                      351
Film      Photographs        410,000                  100,000
          Cinema             16                       16
          X-Rays             12,000                   12,000
          SUBTOTAL           422,000                  112,016
Optical   Music CDs          58                       40
          Data CDs           3                        3
          DVDs               22                       22
          SUBTOTAL           83                       65
Magnetic  Camcorder          300,000                  300,000
          Disk drives        2,555,000                1,000,200
          SUBTOTAL           2,855,000                1,300,200
TOTAL                        3,277,440                1,412,632
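
The subtotals and totals in this table can be re-derived with a few sums. The sketch below copies the per-item figures from the slide; the one assumption is the lower-bound disk drive figure, taken here as 1,000,200 TB because that is the value implied by the stated magnetic subtotal of 1,300,200.

```python
# Terabytes per year, copied from the slide, grouped by medium.
upper = {
    "Paper": [8, 25, 12, 312],
    "Film": [410_000, 16, 12_000],
    "Optical": [58, 3, 22],
    "Magnetic": [300_000, 2_555_000],
}
lower = {
    "Paper": [7, 20, 12, 312],
    "Film": [100_000, 16, 12_000],
    "Optical": [40, 3, 22],
    # Disk drives assumed to be 1,000,200 (implied by the stated subtotal).
    "Magnetic": [300_000, 1_000_200],
}


def subtotals(bounds):
    """Sum the per-item figures for each medium."""
    return {medium: sum(values) for medium, values in bounds.items()}


upper_sub = subtotals(upper)
lower_sub = subtotals(lower)
upper_total = sum(upper_sub.values())
lower_total = sum(lower_sub.values())

for medium in upper:
    print(f"{medium:9s} {upper_sub[medium]:>10,} {lower_sub[medium]:>10,}")
print(f"{'TOTAL':9s} {upper_total:>10,} {lower_total:>10,}")
```

The lower-bound column reproduces the slide exactly (1,412,632). The upper-bound film subtotal comes out as 422,016 where the slide shows the rounded 422,000, so the recomputed upper total is 3,277,456 against the slide's 3,277,440.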

Human Memory
  Landauer 86: Human brain holds 200MB
    looked at rate of information intake and rate of forgetting, and amount of information adults need for normal tasks
  6B people on earth implies total memory of all people alive is about 1,200 petabytes
  Another way: estimate that people take in a byte/sec
    a lifetime is about 25,000 days, or 2B sec
    result is 2 GB (doesn’t count synthesizing new info)
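
Both back-of-the-envelope estimates on this slide can be checked in a few lines. The sketch assumes decimal units (1 MB = 10^6 bytes) and uses illustrative variable names; the inputs are the figures quoted on the slide.

```python
MB = 10**6
GB = 10**9
PB = 10**15

# Estimate 1: per-person capacity times world population.
per_person_bytes = 200 * MB   # Landauer's 200 MB figure
population = 6 * 10**9        # 6B people
total_memory_pb = per_person_bytes * population / PB

# Estimate 2: one byte per second over a lifetime of 2 billion seconds.
lifetime_seconds = 2 * 10**9
bytes_per_second = 1
intake_gb = lifetime_seconds * bytes_per_second / GB

# Sanity check on the lifetime: convert seconds back to days and years.
lifetime_days = lifetime_seconds / 86_400
lifetime_years = lifetime_days / 365.25

print(f"All people alive: {total_memory_pb:,.0f} PB")
print(f"Lifetime intake:  {intake_gb:.0f} GB")
print(f"2B seconds is about {lifetime_days:,.0f} days ({lifetime_years:.0f} years)")
```

Two billion seconds works out to roughly 23,000 days, or about 63 years, so the two estimates (200 MB retained, 2 GB taken in) differ by about a factor of ten.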

Summary
  Administrivia
  Why Information Retrieval

Next
  Introduction to Information Retrieval