Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)

Presentation transcript:

Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)
CSE 494/598 Information Retrieval, Mining and Integration on the Internet

6/27/ :00 PM Copyright © 2001 S. Kambhampati
Contact Info
Instructor: Subbarao Kambhampati (Rao)
– URL: rakaposhi.eas.asu.edu/rao.html
– Course URL: rakaposhi.eas.asu.edu/cse494
– Class: T/Th 3:15-4:30 (BY 210)
– Office hours: T/Th 4:30-5:30 (BY 560)
TA: tbd

Course Outcomes
After this course, you should be able to answer:
– How do search engines work, and why are some better than others?
– Can the web be seen as a collection of (semi)structured databases? If so, can we adapt database technology to the Web?
– Can useful patterns be mined from the pages/data of the web?
What did you think these were going to be??

Main Topics
Approximately three halves plus a bit:
– Information retrieval
– Information integration/aggregation
– Information mining
– Other topics as permitted by time

Books (or lack thereof)
There are no required textbooks
– Primary source is a set of readings that I will provide (see the “readings” button on the homepage)
– Relative importance of readings is signified by their level of indentation
There are some good reference books (which should be available in the bookstore)
– * Modeling the Internet and the Web (Baldi, Frasconi and Smyth)
– Modern Information Retrieval (Baeza-Yates et al.)
– Mining the Web (Soumen Chakrabarti)
– Data on the Web (Abiteboul et al.)

Pre-reqs
Useful course background
– CSE 310 Data structures (also 4xx course on Algorithms)
– CSE 412 Databases
– CSE 471 Intro to AI
+ some of that math you thought you would never use..
– MAT 342 Linear Algebra: matrices; eigenvalues; eigenvectors; singular value decomposition
  Useful for information retrieval and link analysis (PageRank/authorities-hubs)
– ECE 389 Probability and Statistics for Engg. Prob solving: discrete probabilities; Bayes rule…
  Useful for data mining stuff (e.g. naïve Bayes classifier)
You are primarily responsible for refreshing your memory...
Homework Ready…

What this course is not (intended to be)
This course is not intended to
– Teach you how to be a web master
– Expose you to all the latest x-buzzwords in technology: XML/XSL/XPOINTER/XPATH (okay, maybe a little)
– Teach you web/javascript/java/jdbc etc. programming
“[...] there is a difference between training and education. If computer science is a fundamental discipline, then university education in this field should emphasize enduring fundamental principles rather than transient current technology.” -Peter Wegner, Three Computing Cultures

Neither is this course allowed to teach you how to really make money on the web

Personal Motivation
My research group is schizophrenic
– Plan-yochan: Planning, Scheduling, CSP, a bit of learning etc.
– Db-yochan: Information integration, retrieval, mining etc. (rakaposhi.eas.asu.edu/i3)
Involved in the ET-I3 initiative (enabling technologies for intelligent information integration)
Did a fair amount of publications, tutorials and workshop organization..

Grading etc.
– Projects/Homeworks (~45%)
– Midterm / final (~40%)
– Participation (~15%)
  Reading (papers, web; no single text)
  Class interaction (***VERY VERY IMPORTANT***): will be evaluated by attendance, attentiveness, and occasional quizzes
Subject to (minor) changes
494 and 598 students are treated as separate clusters while awarding final letter grades (no other differentiation)

Projects (tentative)
One big project + maybe one or two mini ones
– Big one: extending and experimenting with a mini search engine
– Project description available online (tentative)
Expected background
– Competence in Java programming (Gosling level is fine; fledgling level probably not..). We will not be teaching you Java

Occupational Hazards..
Caveat: Life on the bleeding edge
– 494 is midway between a 4xx class & 591 seminars
It is a “SEMI-STRUCTURED” class.
– No required textbook (recommended books, papers)
– Need a sense of adventure.. and you are assumed to have it, considering that you signed up voluntarily
Only being offered for the third time..
– Expect online and interactive debugging of the class..
– Did I mention that bit about sense of adventure? I assume you have it, since you are taking a course that is not on the core :-)
Silver Lining?

Life with a homepage..
I will not be giving any handouts
– All class-related material will be accessible from the web page
Homeworks may be specified incrementally
– (one problem at a time)
The slides used in the lecture will be available on the class page
– The slides will be “loosely” based on the ones I used in f02 (these are available on the homepage)
– However, I reserve the right to modify them until the last minute (and sometimes beyond it).
– When printing slides, avoid printing the hidden slides

Course Overview

Web as a collection of information
Web viewed as a large collection of __________
– Text, structured data, semi-structured data
– (multi-media/updates/transactions etc. ignored for now)
So what do we want to do with it?
– Search, directed browsing, aggregation, integration, pattern finding
How do we do it?
– Depends on your model (text/structured/semi-structured)

Structure
How will search and querying on these three types of data differ?
– A generic web page containing text [English]
– A movie review, semi-structured [XML]
– An employee record [SQL]

Structure helps querying
Expressive queries
– Give me all pages that have key words “Get Rich Quick”
– Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary
– Give me all mails from people from ASU written this year, which are relevant to “get rich quick”
Efficient searching
– Equality vs. “similarity”
– Range-limited search
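
The contrast between the first two queries above can be sketched in Python. This is only an illustration under stated assumptions: the page texts, the `employee` table, its column names, and every value in it are made up, and "three standard deviations away" is read as "at least three population standard deviations from the mean".

```python
import sqlite3
import statistics

# Unstructured side: keyword matching is about all that plain text supports.
pages = {
    "p1": "Get Rich Quick with this one simple trick",
    "p2": "Course notes on information retrieval",
}
keyword_hits = [pid for pid, text in pages.items() if "get rich quick" in text.lower()]

# Structured side: a hypothetical employee table (all values are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT, years INTEGER, salary REAL)")
rows = [(f"000-00-{i:04d}", 6, 50000.0) for i in range(10)]
rows.append(("000-00-9999", 8, 500000.0))  # long-tenured salary outlier
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)

# SQLite has no built-in standard deviation, so compute the statistics in Python.
salaries = [row[0] for row in conn.execute("SELECT salary FROM employee")]
mean = statistics.mean(salaries)
std = statistics.pstdev(salaries)

# A range-limited query over typed fields, which keyword search cannot express.
outliers = conn.execute(
    "SELECT ssn FROM employee WHERE years > 5 AND ABS(salary - ?) >= 3 * ?",
    (mean, std),
).fetchall()

print(keyword_hits)
print(outliers)
```

The keyword query can only test whether words occur; the second query relies on the schema knowing that `years` and `salary` are numbers.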

Does Web have Structured data?
Isn’t web all text?
– The invisible web
  Most web servers have back-end database servers
  They dynamically convert (wrap) the structured data into readable English
  => The capital of India is New Delhi.
  So, if we can “unwrap” the text, we have structured data! ((un)wrappers, learning wrappers etc…)
  Note also that such dynamic pages cannot be crawled...
– The (coming) semi-structured web
  Most pages are at least “semi”-structured
  XML standard is expected to ease the presentation/on-the-wire transfer of such pages. (BUT…..)
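
A hand-written "unwrapper" for the example sentence might look like the sketch below. The sentence template, the field names `country` and `capital`, and the second sentence about France are assumptions made for illustration; real wrappers are typically learned or written per site.

```python
import re

# Hypothetical template a server might use to render a (country, capital)
# record as English text.
PATTERN = re.compile(r"The capital of (?P<country>.+?) is (?P<capital>.+?)\.")

def unwrap(text):
    """Recover (country, capital) tuples from templated sentences."""
    return [(m.group("country"), m.group("capital")) for m in PATTERN.finditer(text)]

page = "The capital of India is New Delhi. The capital of France is Paris."
print(unwrap(page))
```

Once the tuples are recovered, they can be loaded into a table and queried like any other structured data.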

Adapting old disciplines for Web-age
Information (text) retrieval
– Scale of the web
– Hypertext/link structure
– Authority/hub computations
Databases
– Multiple databases: heterogeneous, access limited, partially overlapping
– Network (un)reliability
Data mining [Machine Learning/Statistics/Databases]
– Learning patterns from large-scale data

Information Retrieval
Traditional model
– Given: a set of documents, and a query expressed as a set of keywords
– Return: a ranked set of documents most relevant to the query
– Evaluation:
  Precision: fraction of returned documents that are relevant
  Recall: fraction of relevant documents that are returned
  Efficiency
Web-induced headaches
– Scale (billions of documents)
– Hypertext (inter-document connections)
Consequently
– Ranking that takes link structure into account (authority/hub)
– Indexing and retrieval algorithms that are ultra fast
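
The two evaluation measures above translate directly into code. The sketch below is a minimal illustration; the function name `precision_recall` and the document ids are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for collections of document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # documents that are both returned and relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents returned, 5 relevant overall, 3 in common.
p, r = precision_recall(["d1", "d2", "d3", "d7"], ["d1", "d2", "d3", "d4", "d5"])
print(p, r)
```

Returning every document drives recall to 1 while precision collapses, which is why the two are reported together.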

Information Integration: Database-Style Retrieval
Traditional model (relational)
– Given: a single relational database (schema, instances) and a relational (SQL) query
– Return: all tuples satisfying the query
Evaluation
– Soundness/completeness
– Efficiency
Web-induced headaches
– Many databases, all partially complete and overlapping
– Heterogeneous schemas
– Access limitations
– Network (un)reliability
Consequently
– Newer models of DB
– Newer notions of completeness
– Newer approaches for query planning


Further headaches brought on by semi-structured retrieval
If everyone puts their pages in XML
– Introducing similarity-based retrieval into traditional databases
– Standardizing on shared ontologies...

Learning Patterns (Web/DB mining)
Traditional classification learning (supervised)
– Given a set of structured instances of a pattern (concept)
– Induce the description of the pattern
Evaluation:
– Accuracy of classification on the test data
– (Efficiency of learning)
Mining headaches
– Training data is not obvious
– Training data is massive
– Training instances are noisy and incomplete
Consequently
– Primary emphasis on fast classification, even at the expense of accuracy
– 80% of the work is “data cleaning”
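
To make "induce a pattern, then measure accuracy on test data" concrete, here is a minimal sketch using a naïve Bayes text classifier with add-one smoothing. The choice of classifier, the labels `spam` and `course`, and all training and test examples are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (words, label). Returns per-label counts and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict(model, words):
    """Pick the label with the highest log-probability under naive Bayes."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, test):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(predict(model, words) == label for words, label in test)
    return correct / len(test)

# Tiny made-up data set.
train_set = [
    ("get rich quick".split(), "spam"),
    ("quick money now".split(), "spam"),
    ("lecture notes retrieval".split(), "course"),
    ("homework on retrieval".split(), "course"),
]
test_set = [
    ("rich money".split(), "spam"),
    ("retrieval homework".split(), "course"),
]
model = train(train_set)
print(accuracy(model, test_set))
```

Training here is a single counting pass over the data, which fits the slide's emphasis on speed when the training set is massive.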

Now for a look at the course overview

Readings for next week
The chapter on Text Retrieval, available in the readings list
– (alternate/optional reading) Chapter 2 of Information Retrieval (Models of text)

Web as a bow-tie
Shares shown in the figure: 39%, 21%, 19%, 14%, 7%
Probability that two pages are connected: (0.39 + 0.21) × (0.39 + 0.19) = 0.348
Reference: Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, Eli Upfal: The Web as a Graph. PODS 2000: 1-10
Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)
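
The slide's figure of .348 is consistent with the following back-of-the-envelope calculation. The transcript lost the figure labels, so the assignment of 39% to the strongly connected core, 21% to pages that can reach the core (IN), and 19% to pages reachable from it (OUT) is an assumption; the product is the same if the 21% and 19% shares are swapped. The estimate also ignores paths that stay entirely inside IN, OUT, or the remaining pieces.

```python
# Shares read off the slide; the component labels are assumptions.
SCC = 0.39  # assumed: strongly connected core
IN = 0.21   # assumed: pages that can reach the core
OUT = 0.19  # assumed: pages reachable from the core

def reach_probability(scc, in_share, out_share):
    """P(p1 can reach p2) if p1 must lie in IN or SCC and p2 in SCC or OUT."""
    return (in_share + scc) * (scc + out_share)

prob = reach_probability(SCC, IN, OUT)
print(round(prob, 3))
```

So under these assumptions the answer to the opening question is closest to the "30%" option.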


The Internet (Big and Getting Bigger)
Moore’s Law
– Semiconductor density doubling
Internet equivalent
– 1996: 40M people connected
– 1997: 100M
– 1998: traffic volume doubles in 100 days
First moments after big bang
– and a small crash...

History
1945  First electronic digital computer ENIAC
1960  Ted Nelson proposes Xanadu
1961  Len Kleinrock paper on packet switching
1965  Gordon Moore proposes law
1965  First network experiment
1966  Design of ARPAnet
1968  Doug Engelbart: mouse, windows, videoconf
1968  ARPAnet contract to BBN
1969  First ARPAnet message UCLA -> SRI

History
1970  ARPAnet spans country, has 5 nodes
1971  ARPAnet has 15 nodes
1972  First programs, FTP spec
1973  Ethernet operation at Xerox PARC
1974  Intel launches 8080; TCP design
1975  Gates/Allen write Basic for Altair
      Apple Computer formed by Jobs/Wozniak
      hosts on ARPAnet
1978  TCP split into TCP and IP
1979  Visicalc

History
1981  Microsoft has 40 employees; IBM PC
1982  Sun formed
1983  ARPAnet uses TCP/IP -> birth of internet
1983  Design of DNS
1984  Launch of Macintosh; 1000 hosts on ARPAnet
1985  Symbolics.com first registered domain name
      ,000 hosts on Internet
1990  Cisco Systems goes public ($288 M)
      Tim Berners-Lee creates WWW at CERN

History
1992  Boucher’s amendment allows ecommerce
1993  Mosaic developed at UIUC
      Web grows by 341,000% in a year
1994  Netscape, Amazon, Architext formed
1995  Netscape IPO, Windows
      Amazon IPO

Connecting on the WWW
[Diagram: Web Browser on Client OS, connected through the Internet to Web Server on Server OS]

Server-Side View
[Diagram: Clients, Internet, server]
– Database-driven content
– Lots of users
– Scalability
– Load balancing
– Often implemented with cluster of PCs
– 24x7 reliability
– Transparent upgrades

Network View
[Diagram: Internet]

Client-Side View
[Diagram: Web Sites, Internet, client]
– Content rendering engine: tags, positioning, movement
– Scripting language interpreter: document object model, events, programming language itself
– Link to custom Java VM
– Security access mechanisms
– Plugin architecture + plugins

Client-Side… Impact
Many different browsers
– {Netscape, IE, Lynx, …} × Version × OS
Each supports different tags, DOM, languages…
Strategies:
– Page branching
– Internal branching (javascript control in each page)
– Designing for the common denominator
Custom APIs with javascript libraries

Input: Student Info
Name
Background
– 444?
– 451?
– 461?
– Other 4xx?
Languages
– Javascript?
– Java?
– Others: