Introduction to Web Science Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-

Slides:



Advertisements
Similar presentations
The Structure of the Web Mark Levene (Follow the links to learn more!)
Advertisements

Scale Free Networks.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Analysis and Modeling of Social Networks Foudalis Ilias.
Web as Network: A Case Study Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.
Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Web Graph Characteristics Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!)
CS 345A Data Mining Lecture 1
CS 345A Data Mining Lecture 1 Introduction to Web Mining.
CS 728 Lecture 4 It’s a Small World on the Web. Small World Networks It is a ‘small world’ after all –Billions of people on Earth, yet every pair separated.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
1 Our Web Part 0: Overview COMP630L Topics in DB Systems: Managing Web Data Fall, 2007 Dr Wilfred Ng.
CS 345 Data Mining Lecture 1 Introduction to Web Mining.
How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Ch. 13 Structure of the Web Padmini Srinivasan Computer Science Department Department of Management Sciences
Searching the Web Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike.
Search Engines and their Public Interfaces: Which APIs are the Most Synchronized? Frank McCown and Michael L. Nelson Department of Computer Science, Old.
PRESENTED BY ASHISH CHAWLA AND VINIT ASHER The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page and Sergey Brin, Stanford University.
Peer-to-Peer and Social Networks Random Graphs. Random graphs E RDÖS -R ENYI MODEL One of several models … Presents a theory of how social webs are formed.
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Introductions Search Engine Development COMP 475 Spring 2009 Dr. Frank McCown.
Wasim Rangoonwala ID# CS-460 Computer Security “Privacy is the claim of individuals, groups or institutions to determine for themselves when,
Web Characterization: What Does the Web Look Like?
Lecturer: Ghadah Aldehim
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
Web Archiving Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
CS246 Web Characteristics. Junghoo "John" Cho (UCLA Computer Science)2 Web Characteristics What is the Web like? Any questions on some of the characteristics.
MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Searching the Web Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
Accessing the Deep Web Bin He IBM Almaden Research Center in San Jose, CA Mitesh Patel Microsoft Corporation Zhen Zhang computer science at the University.
Web Search Module 6 INST 734 Doug Oard. Agenda The Web  Crawling Web search.
Data Structures & Algorithms and The Internet: A different way of thinking.
Crawling Slides adapted from
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Week 3 LBSC 690 Information Technology Web Characterization Web Design.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
Web Searching. How does a search engine work? It does NOT search the Web (when you make a query) It contains a database with info on numerous Web sites.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State.
استاد : مهندس حسین پور ارائه دهنده : احسان جوانمرد Google Architecture.
Optimal Link Bombs are Uncoordinated Sibel Adali Tina Liu Malik Magdon-Ismail Rensselaer Polytechnic Institute.
Mathematics of Networks (Cont)
1 UNIT 13 The World Wide Web Lecturer: Kholood Baselm.
Meet the web: First impressions How big is the web and how do you measure it? How many people use the web? How many use search engines? What is the shape.
The Structure of the Web. Getting to knowing the Web How big is the web and how do you measure it? How many people use the web? How many use search engines?
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval (9) Prof. Dragomir R. Radev
Brass: A Queueing Manager for Warrick Frank McCown, Amine Benjelloun, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk,
1 Crawling Slides adapted from – Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
Netlogo demo. Complexity and Networks Melanie Mitchell Portland State University and Santa Fe Institute.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
CS 115: COMPUTING FOR THE SOCIO-TECHNO WEB FINDING INFORMATION WITH SEARCH ENGINES.
Data mining in web applications
22C:145 Artificial Intelligence
Dr. Frank McCown Comp 250 – Web Development Harding University
Introduction to Web Mining
CS246 Web Characteristics.
Network Science: A Short Introduction i3 Workshop
Data Mining Chapter 6 Search Engines
CS246: Web Characteristics
CS 345A Data Mining Lecture 1
CS 345A Data Mining Lecture 1
Introduction to Web Mining
CS 345A Data Mining Lecture 1
Presentation transcript:

Introduction to Web Science Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 3.0 Unported LicenseCreative Commons Attribution-NonCommercial- ShareAlike 3.0 Unported License

What is Web Science? Web Science is the interdisciplinary study of the Web as an entity and phenomenon. It includes studies of the Web’s properties, protocols, algorithms, and societal effects.

Background Web Science initiative launched in Nov 2006 by University of Southampton and MIT Nigel ShadboltTim Berners-LeeWendy HallJames HendlerDaniel Weitzer Images from

Degrees in Web Science Undergrad BS in Information Technology and Web Science at Rensselaer Polytechnic Institute MS degree in Web Science at – University of Southampton – University of San Francisco – Aristotle University of Thessaloniki – RPI (Web Science emphasis) Not many, but remember, it’s a new science!

Communications of the ACM July 2008

“Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology, and economics. We invite computer scientists to expand the discipline by addressing the challenges following from the widespread adoption of the Web and its profound influence on social structures, political systems, commercial organizations, and educational institutions.”

Web Science is Interdisciplinary O'Hara and Hall, Web Science, ALT Online Newsletter, May 6, 2008

Some questions of study: How is the Web structured? What is its size? How can unstructured data mined from the Web be combined in meaningful ways? How does information/misinformation spread on the Web? How can we discover its origin? How can the Web used to effectively harness the collective intelligence of its users? How can trust be measured on the Web? How can privacy be maintained on the Web? What do events gathered from online social networks tell us about the human condition? Has the Web changed how humans think? Why is this important? Huge implications for web search!

How is the Web structured? Graph Theory: Pages are nodes & links are directed edges link

Web Graph Num of In-links Total Web pages Normal/Gaussian Distribution Random Graph Typical Web Graph Power-law Distribution Num of In-links Total Web pages

Small World Network Six degrees of separation Most pages are not neighbors but most pages can be reached from others by a small number of hops Many hubs- pages with many inlinks Robust for random node deletions Other examples: road maps, networks of brain neurons, voter networks, and social networks

Bow-Tie Structure of the Web Broder et. al (Graph Structure of the Web, 2000) Examined a large web graph (200M pages, 1.5B links) 17 Million nodes

Bow-Tie Structure 75% of pages do not have a direct path from one page to another Ave distance is 16 clicks when path exists and 7 clicks when undirected path exists Diameter of SCC is at least 28 (max shortest distance between any two nodes) Diameter of entire Web is at least 500 (most distant node in IN to OUT) Broder et al., Graph Structure of the Web, 2000

Web Structure’s Implications If we want to discover every web page on the Web, it’s impossible since there are many pages that aren’t linked to Finding popular pages is easy, but finding pages with few in-links (the long tail) is more difficult How do we know when new pages are added to the Web or removed? Incoming links could tell us something about the “importance” of a page when searching the Web for information (e.g., PageRank) Link structure of the Web can be artificially manipulated

How large is the Web? 1 trillion unique URLs

How did Google discover all these URLs? By crawling the web

Web Crawler Web crawlers are used to fetch a page, place all the page’s links in a queue, and continue the process for each URL in the queue Figure: McCown, Lazy Preservation: Reconstructing Websites from the Web Infrastructure, Dissertation, 2007

Problems with Web Crawling Slow because crawlers limit how frequently they make requests to the same server (politeness policy) Many pages are disconnected from the SCC, password- protected, or protected by robots.txt There are an infinite number of pages (e.g., calendar) so crawlers limit how deeply they crawl Web pages are continually being added and removed Deep web: Many pages are only accessible behind a web form (e.g., US patent database). Deep web is magnitudes larger than surface web, and 2006 study 1 shows only 1/3 is indexed by big three search engines 1 He et al., Accessing the deep web, CACM 2007

What Counts? Many duplicate pages (30% of web pages are duplicates or near-duplicates 1 ) – How do we efficiently compare across a large corpus? Some pages change every time they are requested – How can we automatically determine what is an insignificant difference? Many spammy pages (14% of web pages 2 ) – How can we detect these? 1 Fetterly et al., On the evolution of clusters of near-duplicate web pages, J of Web Eng, Ntoulas et al., Detecting spam web pages through content analysis, WWW 2006

Some Observations Crawling a significant amount of the Web is hard Different search engines have different pages indexed, but they don’t share these differences with each other (company secret) So if we wanted to estimate the Web’s size but don’t want to try to crawl the Web ourselves, could we use the search engines themselves to estimate the Web’s size?

Capture-Recapture Method Statistical method used to estimate population size (originally fish and wildlife populations) Example: How many fish are in the lake? – Catch S 1 fish from the lake, tag them, and return them to the lake – Then catch and put back S 2 fish, noting which are tagged (S 1,2 ) – S 1 /N = S 1,2 /S 2 so population N = S 1 × S 2 /S 1,2 N S1S1 S2S2 S 1,2

Estimate Web Population Lawrence and Giles 1 used capture-recapture method to estimate web page population – Submitted 575 queries to sets of 2 search engines – S1 = All pages returned by SE1 – S2 = All pages returned by SE2 – S1,2 = All pages returned by both SE1 and SE2 – Size of indexable Web (N) = S 1 × S 2 /S 1,2 Estimated size of indexable Web in 1998 = 320 million pages Recent measurements using similar methods find lower bound of 21 billion pages 2 1 Lawrence & Giles, Searching the World Wide Web, Science,

This is just a sample of Web Science that we will be examining from a computing perspective.

More Resources Video: Nigel Shadbolt on Web Science (2008) Slides: “What is Web Science?” by Carr, Pope, Hall, Shadbolt (2008) web-science web-science