Scalable Web Crawling and Basic Transactions Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems October 6, 2015.


2 Administrivia
• ed list of project partners due Friday
  ◦ For those without 4-person groups, I will try to assign / merge groups over the weekend
  ◦ This might result in breaking up some 3-person groups

3 Mercator: Scalable Web Crawler
• Expands a “URL frontier”
• Avoids re-crawling same URLs
• Also considers whether a document has been seen before
• Every document has signature/checksum info computed as it’s crawled

4 Mercator Web Crawler
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it’s new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
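
The eight steps above can be sketched as a single loop. This is a minimal illustration, not Mercator's code: the in-memory `PAGES` dictionary stands in for real HTTP fetches, the regular-expression link extractor is a simplification, and SHA-256 is swapped in for whatever fingerprint function the real crawler uses. All names here are hypothetical.

```python
import hashlib
import re
from collections import deque

# Hypothetical page body that is served under two different URLs.
PAGE_X = '<a href="http://a.example/">home</a> <a href="mailto:me@a.example">mail</a>'

# Hypothetical in-memory "web": URL -> HTML body (stands in for HTTP fetches).
PAGES = {
    "http://a.example/": '<a href="http://a.example/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": PAGE_X,
    "http://b.example/": '<a href="http://a.example/x">x</a> <a href="http://b.example/copy">c</a>',
    "http://b.example/copy": PAGE_X,
}

LINK_RE = re.compile(r'href="([^"]+)"')


def fingerprint(text):
    """Fixed-size digest used for both the URL-seen and content-seen tests."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def crawl(seed):
    frontier = deque([seed])
    seen_urls = {fingerprint(seed)}
    seen_docs = set()
    stored = []
    while frontier:
        url = frontier.popleft()               # 1. dequeue frontier URL
        body = PAGES.get(url)                  # 2. fetch document
        if body is None:
            continue
        # 3. a real crawler buffers the bytes so they can be re-read; a str suffices here
        doc_fp = fingerprint(body)
        if doc_fp in seen_docs:                # 4. content already seen under another URL
            continue
        seen_docs.add(doc_fp)
        stored.append(url)
        for link in LINK_RE.findall(body):     # 5. extract hyperlinks
            if not link.startswith("http"):    # 6. filter unwanted links (e.g. mailto:)
                continue
            url_fp = fingerprint(link)
            if url_fp in seen_urls:            # 7. URL already enqueued once
                continue
            seen_urls.add(url_fp)
            frontier.append(link)              # 8. enqueue URL
    return stored
```

Calling `crawl("http://a.example/")` visits three pages and skips `http://b.example/copy`, whose bytes match a page that was already stored.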

5 Mercator’s Polite Frontier Queues
• Tries to go beyond breadth-first approach – want to have only one crawler thread per server
• Distributed URL frontier queue:
  ◦ One subqueue per worker thread
  ◦ The worker thread is determined by hashing the hostname of the URL
  ◦ Thus, only one outstanding request per web server
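
A rough sketch of the hostname-hashing idea follows. The class and method names are hypothetical, and SHA-256 is just one convenient stable hash (Python's built-in `hash()` is randomized per process, so it is avoided here). The politeness guarantee also assumes each worker thread issues only one request at a time.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL frontier split into one FIFO subqueue per worker thread."""

    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(num_workers)]

    def worker_for(self, url):
        # Hash only the hostname, so every URL on a host maps to the same worker.
        host = urlsplit(url).hostname or ""
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.queues)

    def enqueue(self, url):
        self.queues[self.worker_for(url)].append(url)

    def dequeue(self, worker):
        """Return the next URL for this worker, or None if its subqueue is empty."""
        queue = self.queues[worker]
        return queue.popleft() if queue else None
```

Because the worker index depends only on the hostname, two URLs on the same server can never be fetched by different threads at once.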

6 Mercator’s HTTP Fetcher
• First, needs to ensure robots.txt is followed
  ◦ Caches the contents of robots.txt for various web sites as it crawls them
• Designed to be extensible to other protocols
• Had to write own HTTP requestor in Java – their Java version didn’t have timeouts
  ◦ Today, can use setSoTimeout()
• Can also use Java non-blocking I/O if you wish
  ◦ But they use multiple threads and synchronous I/O
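
The robots.txt caching described above might look roughly like this in Python, using the standard library's `urllib.robotparser`. The `RobotsCache` class and the `fake_fetch` helper are hypothetical; a real crawler would download `/robots.txt` over HTTP and would probably expire cache entries after some time.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Parses each host's robots.txt once and reuses the parsed rules."""

    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt   # callable: host -> robots.txt text
        self._parsers = {}               # host -> parsed rules
        self.fetch_count = 0             # how many times robots.txt was fetched

    def allowed(self, user_agent, url):
        host = urlsplit(url).netloc
        parser = self._parsers.get(host)
        if parser is None:               # first URL seen for this host
            self.fetch_count += 1
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(user_agent, url)


def fake_fetch(host):
    """Stand-in for an HTTP GET of http://<host>/robots.txt."""
    return "User-agent: *\nDisallow: /private/\n"
```

With `cache = RobotsCache(fake_fetch)`, repeated calls to `cache.allowed(...)` for the same host consult the cached rules and trigger only one fetch.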

7 Other Caveats
• Infinitely long URL names (good way to get a buffer overflow!)
• Aliased host names
• Alternative paths to the same host
• Can catch most of these with signatures of document data (e.g., MD5)
• Crawler traps (e.g., CGI scripts that link to themselves using a different name)
• May need to have a way for human to override certain URL paths – see Section 5 of paper
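
Two of these defenses are easy to sketch: a cap on URL length and an MD5 signature over the document bytes to spot aliases. The 2048-character limit is an arbitrary assumption (the slides give no number), and the names below are hypothetical.

```python
import hashlib

MAX_URL_LENGTH = 2048  # assumed cap, chosen only for illustration


def acceptable_url(url):
    """Reject absurdly long URLs before they reach the rest of the crawler."""
    return len(url) <= MAX_URL_LENGTH


class ContentSignatures:
    """Remembers which URL first produced each document signature."""

    def __init__(self):
        self._first_url = {}   # MD5 digest -> first URL seen with those bytes

    def duplicate_of(self, url, body):
        """Return the earlier URL with identical bytes, or None if the body is new."""
        signature = hashlib.md5(body).digest()
        first = self._first_url.get(signature)
        if first is None:
            self._first_url[signature] = url
        return first
```

If `www.example.com` and `example.com` are aliases serving the same bytes, the second call to `duplicate_of` returns the first URL, so the crawler can skip the copy. Pages that differ by even one byte (a timestamp, say) would not be caught this way.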

8 Mercator Document Statistics
    PAGE TYPE     PERCENT
    text/html     69.2%
    image/gif     17.9%
    image/jpeg     8.1%
    text/plain     1.5%
    pdf            0.9%
    audio          0.4%
    zip            0.4%
    postscript     0.3%
    other          1.4%
[Figure: histogram of document sizes (60M pages)]

9 Further Considerations
• May want to prioritize certain pages as being most worth crawling
  ◦ Focused crawling tries to prioritize based on relevance
• May need to refresh certain pages more often
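
One simple way to prioritize is to replace the FIFO frontier with a priority queue. The sketch below assumes some external scoring function has already assigned each URL a relevance score; how to compute that score is the hard part of focused crawling and is not shown. The class name is hypothetical.

```python
import heapq
import itertools


class PriorityFrontier:
    """Frontier that always hands out the highest-scoring URL next."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

The same structure could serve refresh scheduling by pushing already-crawled pages back with a score based on how often they change.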

10 Web Search Summarized
• Two important factors:
  ◦ Indexing and ranking scheme that allows the most relevant documents to be prioritized highest
  ◦ Crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale
• We’ll be using Pastry to distribute the work of crawling and to distribute the data (what Google calls “barrels”)

11 We Need More Than Synchronization
What needs to happen when you…
• Click on “purchase” on Amazon? Suppose you purchased by credit card?
• Use online bill-paying services from your bank?
• Place a bid in an eBay-like auction system?
• Order music from iTunes? What if your connection drops in the middle of downloading?
Is this more than a case of making a simple Web Service (-like) call?

12 Transactions Are a Means of Handling Failures
• There are many applications (especially financial ones) where we want to create atomic operations that either commit or roll back
• This is one of the most basic services provided by database management systems, but we want to do it in a broader sense
• Part of “ACID” semantics…

13 ACID Semantics
• Atomicity: operations are atomic, either committing or aborting as a single entity
• Consistency: the state of the data is internally consistent
• Isolation: all operations act as if they were run by themselves
• Durability: all writes stay persistent!
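
Atomicity is the easiest of the four properties to see in action. The sketch below uses Python's built-in `sqlite3` module, whose connection object commits when a `with` block finishes and rolls back if an exception escapes it. The table layout and the `fail_midway` flag are assumptions made for the demonstration, and an in-memory database obviously shows nothing about durability.

```python
import sqlite3


def transfer(conn, buyer, seller, amount, fail_midway=False):
    """Move money between two accounts as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET bal = bal - ? WHERE name = ?", (amount, buyer))
            if fail_midway:
                raise RuntimeError("simulated failure between the two writes")
            conn.execute(
                "UPDATE accounts SET bal = bal + ? WHERE name = ?", (amount, seller))
        return True
    except RuntimeError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, bal FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, bal INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("buyer", 500), ("seller", 0)])
conn.commit()
```

A call with `fail_midway=True` leaves both balances exactly as they were, even though the first `UPDATE` had already run; a normal call moves the money in full.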

14 A Problem Confronted by eBay
• eBay wants to sell an item to:
  ◦ The highest bidder, once the auction is over, or
  ◦ The person who’s first to click “Buy It Now!”
• But:
  ◦ What if the bidder doesn’t have the cash?
• A solution:
  ◦ Record the item as sold
  ◦ Validate the PayPal or credit card info with a 3rd party
  ◦ If not valid, discard this bidder and resume in prior state

15 “No Payment” Isn’t the Only Source of Failure
• Suppose we start to transfer the money, but a server goes down…
    Purchase:
        sb = Seller.bal
        bb = Buyer.bal
        Write Buyer.bal = bb - $100
        Write Item.sellTo = Buyer
        Write Seller.bal = sb + $100
    CRASH!

16 Providing Atomicity and Consistency
• Database systems provide transactions with the ability to abort a transaction upon some failure condition
  ◦ Based on transaction logging – record all operations and undo them as necessary
• Database systems also use the log to perform recovery from crashes
  ◦ Undo all of the steps in a partially-complete transaction
  ◦ Then redo them in their entirety
  ◦ This is part of a protocol called ARIES
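
The undo half of transaction logging can be sketched in a few lines. This is a toy, far simpler than ARIES: the log lives in memory rather than on disk, the "crash" is just an exception, and every data item is assumed to exist before it is written. The class name and the `crash_after_writes` parameter are hypothetical; the three writes mirror the Purchase steps on the previous slide.

```python
class UndoLogTransaction:
    """Records each item's old value before overwriting it."""

    def __init__(self, store):
        self.store = store   # dict: data item -> current value
        self.log = []        # (item, old value) records, in write order

    def write(self, item, value):
        self.log.append((item, self.store[item]))  # log first, then change
        self.store[item] = value

    def abort(self):
        # Undo in reverse order so the oldest logged value is restored last.
        for item, old_value in reversed(self.log):
            self.store[item] = old_value
        self.log.clear()

    def commit(self):
        self.log.clear()


def purchase(store, crash_after_writes=None):
    """Run the Purchase steps; optionally fail after that many writes."""
    txn = UndoLogTransaction(store)
    sb, bb = store["Seller.bal"], store["Buyer.bal"]
    steps = [("Buyer.bal", bb - 100), ("Item.sellTo", "Buyer"), ("Seller.bal", sb + 100)]
    try:
        for done, (item, value) in enumerate(steps):
            if crash_after_writes == done:
                raise RuntimeError("simulated crash")
            txn.write(item, value)
        txn.commit()
        return True
    except RuntimeError:
        txn.abort()
        return False
```

With `crash_after_writes=2` the buyer has been debited and the item marked sold when the failure hits, yet after `abort()` the store is back to its starting state.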

17 The Need for Isolation
• Suppose eBay seller S has a bank account that we’re depositing money into, as people buy:
    S = Accounts.Get(1234)
    Write S.bal = S.bal + $50
• What if two purchases occur simultaneously, from two different servers on different continents?

18 Concurrent Deposits
• This update code is represented as a sequence of read and write operations on “data items” (which for now should be thought of as individual accounts):
    Deposit 1                 Deposit 2
    Read(S.bal)               Read(S.bal)
    S.bal := S.bal + $50      S.bal := S.bal + €10
    Write(S.bal)              Write(S.bal)
  where S is the data item representing the seller’s account #1234

19 A “Bad” Concurrent Execution
Only one action (e.g. a read or a write) can actually happen at a time for a given database, and we can interleave deposit operations in many ways:
    Deposit 1                 Deposit 2
    Read(S.bal)
                              Read(S.bal)
    S.bal := S.bal + $50
                              S.bal := S.bal + €10
    Write(S.bal)
                              Write(S.bal)
    (time runs downward)
BAD!
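
To see why an interleaving can be bad, it helps to replay schedules step by step. In the sketch below each deposit keeps a private copy of the balance between its read and its write. The particular interleaving, the starting balance of 1000, and the treatment of both amounts as plain numbers are assumptions made for illustration.

```python
def run_schedule(schedule, amounts, start_balance):
    """Execute (transaction, operation) steps in the given order."""
    db = {"S.bal": start_balance}
    local = {}  # each transaction's private copy of S.bal
    for txn, op in schedule:
        if op == "read":
            local[txn] = db["S.bal"]
        elif op == "add":
            local[txn] += amounts[txn]
        elif op == "write":
            db["S.bal"] = local[txn]
    return db["S.bal"]


AMOUNTS = {"D1": 50, "D2": 10}

# Both deposits read the old balance before either one writes.
INTERLEAVED = [("D1", "read"), ("D2", "read"), ("D1", "add"),
               ("D2", "add"), ("D1", "write"), ("D2", "write")]

# Deposit 1 runs to completion, then Deposit 2.
SERIAL = [("D1", "read"), ("D1", "add"), ("D1", "write"),
          ("D2", "read"), ("D2", "add"), ("D2", "write")]
```

Starting from 1000, the serial schedule ends at 1060, while the interleaved one ends at 1010: Deposit 2 overwrites Deposit 1's result, and the 50 is lost.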

20 A “Good” Execution
• Previous execution would have been fine if the accounts were different (i.e. one were S and one were T), i.e., transactions were independent
• The following execution is a serial execution, and executes one transaction after the other:
    Deposit 1                 Deposit 2
    Read(S.bal)
    S.bal := S.bal + $50
    Write(S.bal)
                              Read(S.bal)
                              S.bal := S.bal + $10
                              Write(S.bal)
    (time runs downward)
GOOD!

21 Good Executions
• An execution is “good” if it is serial (transactions are executed atomically and consecutively) or serializable (i.e. equivalent to some serial execution)
• Equivalent to executing Deposit 1 then 3, or vice versa
• Why would we want to do this instead?
    Deposit 1                 Deposit 3
    read(S.bal)
                              read(T.bal)
    S.bal := S.bal + $50
                              T.bal := T.bal + €10
    write(S.bal)
                              write(T.bal)
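
One standard way to check for a serializable schedule mechanically is the precedence-graph test for conflict serializability, which the slides do not name; it is a sufficient condition, not a necessary one. Two operations conflict if they come from different transactions, touch the same item, and at least one is a write. The schedule is conflict-serializable if the resulting "must come before" graph has no cycle. The function name and schedule encoding below are hypothetical.

```python
def conflict_serializable(schedule):
    """schedule: list of (transaction, op, item) with op being "r" or "w"."""
    # Edge (a, b) means some operation of a conflicts with a later operation of b.
    edges = set()
    for i, (t1, op1, item1) in enumerate(schedule):
        for t2, op2, item2 in schedule[i + 1:]:
            if t1 != t2 and item1 == item2 and "w" in (op1, op2):
                edges.add((t1, t2))
    # Cycle check: repeatedly remove transactions with no incoming edges.
    remaining = {txn for txn, _, _ in schedule}
    while remaining:
        free = {n for n in remaining
                if not any(b == n and a in remaining for a, b in edges)}
        if not free:
            return False  # every remaining transaction waits on another: a cycle
        remaining -= free
    return True


# Deposits 1 and 3 interleaved, but on different items S and T.
INDEPENDENT = [("D1", "r", "S"), ("D3", "r", "T"), ("D1", "w", "S"), ("D3", "w", "T")]

# Deposits 1 and 2 interleaved on the same item S.
LOST_UPDATE = [("D1", "r", "S"), ("D2", "r", "S"), ("D1", "w", "S"), ("D2", "w", "S")]
```

The first schedule has no conflicting pairs at all, so it is equivalent to either serial order. The second yields edges in both directions between D1 and D2, so no serial order matches it.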

22 Concurrency Control
• A means of ensuring that transactions are serializable
• There are many methods, of which we’ll see one
  ◦ Lock-based concurrency control (2-phase locking)
  ◦ Optimistic concurrency control (no locks – based on timestamps)
  ◦ Multiversion CC
  ◦ …
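
As a taste of the lock-based approach, here is a minimal sketch of two-phase locking with Python threads. It uses only exclusive locks (a real DBMS also has shared read locks), and it acquires locks in sorted item order to avoid deadlock, which is an extra assumption and not part of 2PL itself. All names are hypothetical.

```python
import threading


class LockManager:
    """Hands out one exclusive lock per data item."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def lock_for(self, item):
        with self._guard:
            return self._locks.setdefault(item, threading.Lock())


def transfer(db, locks, src, dst, amount):
    held = []
    try:
        # Growing phase: acquire every lock before touching any data.
        for item in sorted({src, dst}):
            lock = locks.lock_for(item)
            lock.acquire()
            held.append(lock)
        db[src] -= amount
        db[dst] += amount
    finally:
        # Shrinking phase: release only after all reads and writes are done.
        for lock in reversed(held):
            lock.release()


def run_demo():
    db = {"S": 1000, "T": 1000}
    locks = LockManager()
    threads = [threading.Thread(target=transfer, args=(db, locks, "S", "T", 1))
               for _ in range(50)]
    threads += [threading.Thread(target=transfer, args=(db, locks, "T", "S", 2))
                for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db
```

Because no transfer releases a lock before it has acquired all the locks it needs, the 100 concurrent transfers behave as if they ran one after another: S ends at 1050 and T at 950 regardless of thread scheduling.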