CSCI 572: Information Retrieval and Search Engines: Summer 2011 Prof. Chris A. Mattmann
May-18-11CS572-Summer2011CAM-2 The Class Will give you a complete treatment of the area of search engines and information retrieval –The fundamental building blocks of the web and search engines The Search Engine Architecture proposed by Brin/Page Understanding algorithms for ranking pages Understanding technologies for characterizing, downloading, parsing, indexing, searching and disseminating web content –Advanced topics in search engines such as BigData and distributed computation Will equip you with the necessary skills to design complex, real- world search engines
May-18-11CS572-Summer2011CAM-3 General class information Lecture, but… –You can participate –You should participate –You will participate, that is, if you want to do well :) Breakdown of points –20% participation –40% research paper presentation –40% course project
May-18-11CS572-Summer2011CAM-4 General class information Syllabus/website: –Visit it often, as the schedule may change! –This is where all of your course project info and presentation info will be posted –This site will point you to required reading (research papers), and to lectures that you can download before class
May-18-11CS572-Summer2011CAM-5 What we’ll cover Theory –Understanding of basic information retrieval –Search engine querying –Search engine ranking –Architecture of search engines and technologies –Design Patterns Practice –Modern search engine technologies from Apache
May-18-11CS572-Summer2011CAM-6 Course Presentation Each week, we’ll read a few research papers on search engines For the first part of the course (5 weeks), I’ll lecture on the general topics that the research papers cover –The search engine architecture: fetching, parsing, indexing, querying, distributed computation, etc. For the last part of the course (~5 weeks), each one of you will present on one of the research papers we covered in the first 5 weeks
May-18-11CS572-Summer2011CAM-7 Course Presentation What I’m looking for (~20 minutes of presentation, with ~5 mins questions at the end) –You understood the paper –Discussion of related work and background –Discussion of why should I care about the topic And more importantly why your fellow classmates should care –Relation of your paper to the lecture slides I gave on the topic –Simple summarization and description of the algorithm and/or technology introduced in the paper –What were the results/contributions/conclusions of the paper –Your evaluation of Pros of the paper –Your evaluation of Cons of the paper
May-18-11CS572-Summer2011CAM-8 Course Presentation What I’m NOT looking for –Plagiarism –Repetition –Cutting/Pasting out of the paper –Regurgitation –You to follow the EXACT set of bullets that I gave on the prior slide You should be looking to be innovative – show the class and me that you really understood what was in the paper –Treat it like a conference presentation
May-18-11CS572-Summer2011CAM-9 Course Project You will get to leverage one or a combination of several Apache software technologies –OODT, Nutch, Tika, Lucene, Solr, Hadoop, HBase, Hive, Cassandra, etc. You will make a significant contribution to one or more of the above communities Deliverables –A 2 page project proposal –A 2 page mid-term project report –Source code and final demonstration to me at end of class
May-18-11CS572-Summer2011CAM-10 Course Project Deliverables –Your project proposal should include: Demonstration that you’ve researched your particular idea with pointers to issue trackers and mailing lists Objectives section Approach section Identification of deliverables section Timeline/Schedule –Your mid term report should include: Current status Blockers to completion Planned mitigation to blockers
May-18-11CS572-Summer2011CAM-11 Me Graduated with my Ph.D. in Computer Science from USC in 2007 –Advisor: Dr. Nenad Medvidovic Was a student at USC from –B.S., Computer Science 2001 –M.S., Computer Science 2003 My research interests –The intersection of software architectures, and large-scale data dissemination –Software connector selection –Bayesian decision theory –Reinforcement learning –Search Engines
May-18-11CS572-Summer2011CAM-12 So…today Quick lecture on characterizing the web Read the papers linked from the syllabus Be ready for Tuesday +1 week (not next Tuesday) as this is a 10-week course and we are going to dive in