
Managing Distributed Collections: Evaluating Web Page Change, Movement, and Replacement Richard Furuta and Frank Shipman Center for the Study of Digital.


1 Managing Distributed Collections: Evaluating Web Page Change, Movement, and Replacement Richard Furuta and Frank Shipman Center for the Study of Digital Libraries and the Department of Computer Science Texas A&M University

2 Distributed Collections
The Web is continuously changing
– .gov and .edu pages change less frequently than .com pages (1999)
Collection managers cannot control changes
– Bookmark lists
– Yahoo! directories
– Web portals (NSDL)
– Walden’s Paths

3 Changes to Items in Collections
Items in collections
– Play specific roles
– Are semantically related
    – To each other
    – To the collection
Change to an item may
– Change its relationship to the collection
    – Less coherent with other items (default assumption)
    – More coherent, or no change in relationship
– Affect the role it plays in the collection
    – Less suitable (default assumption)
    – More suitable, or no effect on the role

4 Research Contributions
Develop techniques to help collection managers cope with changes
– Change, migration, disappearance
Path Manager – A tool that helps collection managers cope with changes
– Quantity of change
– Nature of change
– Relevance to the collection
Dealing with missing pages
– Find exact matches
– Suggest similar pages

5 Management of Distributed Collections
Detection of change is easy
Determination of
– Quantity of change is relatively easy
– Relevance of change is less easy
– Meaning of change is difficult
Approaches
– Human validation (Yahoo! surfers)
– Automatic detection of change (Path Manager)

6 Path Manager – The tool
Collection-level overview
Page-level overview
Page details
Types of change
– Content changes (what)
– Presentation changes (how)
– Structural changes (linking)
– Behavioral changes (scripting – not addressed)

7 Collection-level Overview

8 Page-level Overview
– Little change
– Server unreachable
– 404 error
– No change
– Drastic change

9 Page Details
– Page information
– Modification details

10 Context-based Change Detection
Context consists of
– Content from other pages in the path
– Annotations created by the author
– Additional metadata provided by the author
Distinguishes between edited and replaced pages

11 Evaluation
20 paths, pages selected from Yahoo! Directories
Each path consisted of 10 to 12 pages
Pages were randomly selected
– No Flash presentations or images
A page in each path was randomly selected for replacement
Each selected page was replaced by 3 pages
– CNN Financials (large change)
– Elephants (large change)
– A page from the same Yahoo! Directory (small change)

12 Results – Distribution of Context-based changes

                                                   More than -4   -4 to 2      More than 2
Replacement by a member of the Yahoo! Directory    1 (5%)         10 (50%)     9 (45%)
Replacement by non-member                          25 (62.5%)     15 (37.5%)   0 (0%)

Replacements resulting in moving towards and away from the context vector
Experimental thresholds
Distinction between similar and different pages
Managers can now focus on divergent pages
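The three columns of the results table can be read as buckets over the context-based change in degrees. The sketch below is only an illustration of that bucketing: the function name and the bucket labels are hypothetical, and it assumes "More than -4" means a change more negative than -4 degrees (movement away from the context vector).

```python
def bucket_change(change_deg, low=-4.0, high=2.0):
    """Place a context-based change (in degrees) into one of three buckets.

    The thresholds -4 and 2 come from the slide; the labels are hypothetical,
    and reading "More than -4" as "below -4" is an assumption.
    """
    if change_deg < low:
        return "moved away"      # candidate for the manager's attention
    if change_deg > high:
        return "moved towards"
    return "little movement"


# Average changes reported on the context-based metrics slide
print(bucket_change(-7.8))  # elephants replacement
print(bucket_change(1.9))   # same-directory replacement
```

Under this reading, a manager would review only the pages that land in the first bucket.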

13 For more information on Walden’s Paths
http://www.csdl.tamu.edu/walden/
walden@csdl.tamu.edu
Principal Investigators:
– Richard Furuta (furuta@csdl.tamu.edu)
– Frank Shipman (shipman@csdl.tamu.edu)

14 Approach
Context Generation Phase
– Create weighted page term vectors
    – W = log(tf) + constant scaling factor
    – Known nouns are allocated higher weights
– Create weighted context term vectors
    – Exclude the page for which context vector is being generated
Change Detection Phase
– Calculate page term vector for changed page
– Calculate the angle between new page term vector and context term vector
– Difference between initial and current angle is a measure of the change
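The two phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Path Manager implementation: tokenization is plain whitespace splitting, the scaling constant and the noun boost factor are hypothetical values, and the sign convention (negative means the page moved away from its context) is assumed from the reported results.

```python
import math
from collections import Counter


def term_vector(text, scale=1.0, known_nouns=frozenset(), noun_boost=2.0):
    """Weighted page term vector: W = log(tf) + constant scaling factor.

    `scale` and `noun_boost` are hypothetical values; the slides do not give them.
    """
    counts = Counter(text.lower().split())
    vector = {}
    for term, tf in counts.items():
        weight = math.log(tf) + scale
        if term in known_nouns:
            weight *= noun_boost  # known nouns get higher weights
        vector[term] = weight
    return vector


def context_vector(other_page_vectors):
    """Sum the vectors of the other pages in the path (the page itself is excluded)."""
    context = {}
    for vector in other_page_vectors:
        for term, weight in vector.items():
            context[term] = context.get(term, 0.0) + weight
    return context


def angle_degrees(a, b):
    """Angle between two sparse term vectors, in degrees."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 90.0  # treat an empty vector as orthogonal
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def context_change(old_text, new_text, other_texts):
    """Initial angle minus current angle; negative means moved away (assumed sign)."""
    context = context_vector([term_vector(t) for t in other_texts])
    initial = angle_degrees(term_vector(old_text), context)
    current = angle_degrees(term_vector(new_text), context)
    return initial - current


others = ["elephant herd savanna", "elephant trunk ivory"]
change = context_change("elephant savanna", "stock market bonds", others)
print(round(change, 1))
```

In the toy run, the replacement page shares no terms with the rest of the path, so its angle to the context is 90 degrees and the change is strongly negative.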

15 Content-based Metrics

Replaced with         Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average               78.1                   81.9                  75.1
Range                 30.8 to 88.1           77.0 to 87.7          35.1 to 84.5
Standard deviation    15.65                  2.89                  10.76

Angle between original and replacing pages (in degrees)
High angle of change for all cases

16 Context-based Metrics

Replaced with         Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average               -7.8                   -9.1                  1.9
Range                 -23.2 to 1.6           -45.0 to 0.9          -15.2 to 14.3
Standard deviation    6.95                   10.57                 6.80

Difference in angle to Yahoo directory between original and replacing pages (in degrees)
Results agree with the intuitive expectation

17 Dealing with Missing Pages
Pages may disappear due to a variety of reasons
– Reconfiguration of Web sites
– Change or expiration of domain names
Temporary or permanent
Threaten integrity of paths
– Continuity of narrative structure
– Completeness of collection
Strategies
– Find new locations for pages that have moved (exact replacements)
– Find acceptable replacements for pages that have vanished (similar pages)

18 Approach – Information Extraction Phase
Extract keyphrases from page
– Extends the “Robust Hyperlinks” approach
– Tag text in pages with part-of-speech tagger
– Extract 1, 2 or 3 word keyphrases
– n, n-n, a-n, n-n-n, a-n-n
– Use HTML formatting for additional guidance
    – Only or tags may separate terms in a phrase
Store separate lists of keyphrases to use for locating exact replacements and similar pages
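The part-of-speech patterns listed above (n, n-n, a-n, n-n-n, a-n-n) can be matched with a sliding window over already-tagged tokens. The sketch below assumes the tagger has run and reduced its tags to "n" (noun), "a" (adjective), or anything else; the function name and input format are hypothetical, and the HTML-formatting guidance is not modeled.

```python
# Tag sequences accepted as keyphrases, taken from the slide.
PATTERNS = {
    ("n",),
    ("n", "n"),
    ("a", "n"),
    ("n", "n", "n"),
    ("a", "n", "n"),
}


def extract_keyphrases(tagged_tokens):
    """Return every 1-, 2- or 3-word phrase whose tag sequence is in PATTERNS.

    `tagged_tokens` is a list of (word, tag) pairs; this input shape is an
    assumption, since the slides do not name a specific tagger.
    """
    phrases = []
    for start in range(len(tagged_tokens)):
        for length in (1, 2, 3):
            window = tagged_tokens[start:start + length]
            if len(window) < length:
                break  # ran past the end of the text
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                phrases.append(" ".join(word for word, _ in window))
    return phrases


tagged = [("digital", "a"), ("library", "n"), ("collection", "n"), ("changes", "v")]
print(extract_keyphrases(tagged))
```

Overlapping phrases are all kept here; pruning and ranking would happen later, when the separate keyphrase lists are built.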

19 Approach – Locating Exact Replacements
Keyphrases help discriminate this page from others on the Web
TF-IDF-based measure
Spelling mistakes and unusually uncommon words are most valuable
Order keyphrase list by decreasing value of TF-IDF measure
While locating pages
– Begin with a (user-specified) keyphrase set
– Search for pages that match these terms
– Add a term and retry until the result set is as desired
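The search loop above can be sketched as follows. The slides only say "TF-IDF-based measure", so the smoothed formula here is an assumption, and `toy_search` with its in-memory corpus is a hypothetical stand-in for a real Web search backend.

```python
import math


def tfidf(tf, doc_freq, n_docs):
    """A common smoothed TF-IDF form (assumed; the exact measure is not given)."""
    return tf * math.log((1 + n_docs) / (1 + doc_freq))


def rank_keyphrases(page_tf, doc_freq, n_docs):
    """Order keyphrases by decreasing TF-IDF value."""
    return sorted(
        page_tf,
        key=lambda p: tfidf(page_tf[p], doc_freq.get(p, 0), n_docs),
        reverse=True,
    )


def locate_exact(ranked, search, initial_size=2, max_results=1):
    """Start with a small keyphrase set and add one term per retry
    until the result set is small enough or the phrases run out."""
    size = max(1, min(initial_size, len(ranked)))
    results = search(ranked[:size])
    while len(results) > max_results and size < len(ranked):
        size += 1
        results = search(ranked[:size])
    return ranked[:size], results


# Hypothetical in-memory corpus standing in for the Web.
TOY_CORPUS = {
    "u1": "walden paths furuta shipman",
    "u2": "walden paths tutorial",
    "u3": "walden pond thoreau",
}


def toy_search(phrases):
    """Return ids of documents that contain every phrase."""
    return [doc for doc, text in TOY_CORPUS.items() if all(p in text for p in phrases)]


query, hits = locate_exact(["walden", "paths", "furuta"], toy_search, initial_size=1)
print(query, hits)
```

Each added term narrows the result set, which is why the rarest, highest-ranked phrases go first when looking for the same page at a new address.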

20 Approach – Finding Similar Pages
Rare phrases hinder search for similar pages
Weed out phrases that have occurred less frequently than a certain threshold value
Remaining phrases are then ordered by decreasing value of their TF-IDF measure
While locating pages
– Start with the most restrictive set of phrases
– Reduce one phrase at a time until the desired result size is achieved
Similarity is contextual
Varies from person to person
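The similar-page search runs in the opposite direction from the exact-replacement search: it starts with many phrases and relaxes the query. The sketch below assumes the phrases arrive already ordered by decreasing TF-IDF and that the lowest-ranked phrase is dropped first; the slides do not say which phrase is removed, and the threshold values, corpus, and `toy_search` helper are hypothetical.

```python
def find_similar(ranked, doc_freq, search, min_doc_freq=5, min_results=2):
    """Drop rare phrases, then relax the query until enough pages match.

    `ranked` is assumed to be ordered by decreasing TF-IDF. Dropping the
    last phrase first is an assumption, not something the slides specify.
    """
    phrases = [p for p in ranked if doc_freq.get(p, 0) >= min_doc_freq]
    results = search(phrases)
    while len(results) < min_results and len(phrases) > 1:
        phrases = phrases[:-1]  # remove one phrase and retry
        results = search(phrases)
    return phrases, results


# Hypothetical in-memory corpus standing in for the Web.
TOY_CORPUS = {
    "u1": "elephant habitat africa conservation",
    "u2": "elephant habitat asia",
    "u3": "elephant zoo",
    "u4": "stock market",
}


def toy_search(phrases):
    """Return ids of documents that contain every phrase."""
    return [doc for doc, text in TOY_CORPUS.items() if all(p in text for p in phrases)]


ranked = ["elephnat", "elephant", "habitat", "conservation"]
doc_freq = {"elephnat": 1, "elephant": 300, "habitat": 200, "conservation": 150}
phrases, similar = find_similar(ranked, doc_freq, toy_search)
print(phrases, similar)
```

The misspelled phrase would be the most valuable one for finding the exact page, but here it is filtered out because no other page is likely to share it.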

