A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm 2002.09.04 Dongwon Lee Database Systems Lab.


1 A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm 2002.09.04 Dongwon Lee Database Systems Lab.

2 Contents
Introduction
Crawler
Problem of Web Crawler
URL Ordering
Focused Crawler
Strategy for URL Ordering
Ordering Model
WIE*
Conclusion

3 Introduction (1)
Search engine
–Gathers web resources
Shortcomings of search engines
–Large answer set
–Low precision
–Destroys the hypertext structure of matching hyperdocuments
–General concept queries
Topic-driven web crawling

4 Introduction (2)
Topic-driven crawling reduces the search space, which improves the efficiency of a web search engine.
It can be applied to special-purpose search engines.
–Ex) Medical information retrieval, travel information retrieval, biology information retrieval

5 Web Crawler (1)
A program that automatically traverses the Web via hyperlinks embedded in hypertext, newsgroup listings, directory structures, or database schemas.
Gathers resources from the Web
Keeps the index as up to date as possible
Aims for the broadest possible coverage of the Web

6 Web Crawler (2)
Components: Retrieving Module, Processing Module, Formatting Module, URL Listing Module
The order of traversal
–Breadth-first
–Depth-first
–Better pages first
How frequently the index is updated
–“Stars in the sky” view
[Diagram: World Wide Web, Database, Retrieving Module, Processing Module, Formatting Module, URL Listing Module]
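The three traversal orders listed on this slide can be compared on a toy example. The sketch below is only an illustration: the link graph `LINKS`, the scores in `SCORE`, and the function name `crawl_order` are assumptions made for the example and do not come from the slides. "Better pages first" is modelled as a priority queue keyed on a page score.

```python
from collections import deque
import heapq

# Hypothetical toy link graph: page -> outgoing links (not from the slides).
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

# Hypothetical page scores used only by the "best" strategy.
SCORE = {"A": 0, "B": 1, "C": 2, "D": 5, "E": 3}


def crawl_order(start, strategy):
    """Return the visit order for 'bfs', 'dfs', or 'best' (highest score first)."""
    visited = []
    seen = {start}
    if strategy == "best":
        frontier = [(-SCORE[start], start)]  # min-heap, so negate the score
    else:
        frontier = deque([start])
    while frontier:
        if strategy == "bfs":
            page = frontier.popleft()        # oldest discovered URL first
        elif strategy == "dfs":
            page = frontier.pop()            # newest discovered URL first
        else:
            _, page = heapq.heappop(frontier)  # best-scored URL first
        visited.append(page)
        for nxt in LINKS.get(page, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if strategy == "best":
                heapq.heappush(frontier, (-SCORE[nxt], nxt))
            else:
                frontier.append(nxt)
    return visited


if __name__ == "__main__":
    for name in ("bfs", "dfs", "best"):
        print(name, crawl_order("A", name))
```

The only difference between the strategies is which URL leaves the frontier next, which is why URL ordering can be studied separately from fetching and parsing.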

7 Problem of Web Crawler
How frequently the index is updated
–How old is the index entry for a Web page? It varies widely: from one day to two months
“Stars in the sky” view
–Percentage of invalid links: 2-9%
Cannot visit every possible page
–Must periodically revisit pages
–Storage capacity
–Network bandwidth
Shortcomings of information retrieval systems
–Too large an answer set
–Low precision
–Etc.

8 URL Ordering
BFS, DFS
URL ordering
–Document similarity measure
–PageRank: inlinks/outlinks
–URL measure: “.com”, “home”, “www”
J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. 7th Intl. World Wide Web Conference, Brisbane, Australia, 1998.
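As an illustration of a link-based ordering measure, the following is a minimal PageRank sketch using plain power iteration. It is not the implementation from the cited paper; the damping factor of 0.85, the iteration count, and the three-page graph are assumptions chosen for the example. Every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links.

    Assumes every link target is also a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            # A page without outlinks spreads its rank over all pages.
            targets = outs if outs else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph (not from the slides).
    graph = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
    ranks = pagerank(graph)
    # Pages a crawler would prefer, highest rank first.
    print(sorted(ranks, key=ranks.get, reverse=True))
```

A crawler using this measure would pop the frontier URL with the highest current rank estimate; in practice the estimate is computed only over the pages crawled so far.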

9 Focused Crawler
Text categorization method
–Naïve Bayes algorithm
Distiller
–Authority
–Hub
Soumen Chakrabarti, Martin van den Berg, Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery. 8th World Wide Web Conference.
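To make the text categorization step concrete, here is a small multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It is a sketch of the general technique, not the classifier used in the cited focused crawler; the function names `train` and `classify` and the training sentences are assumptions made for the example.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (label, text) pairs. Returns (label_counts, word_counts, vocab)."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_docs)           # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue                               # ignore unseen words
            # Add-one (Laplace) smoothed log likelihood.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical training data (not from the slides).
    model = train([
        ("medical", "hospital doctor drug treatment"),
        ("medical", "patient drug clinical trial"),
        ("travel", "hotel flight beach tour"),
        ("travel", "cheap flight hotel booking"),
    ])
    print(classify(model, "drug treatment for patient"))
```

In a focused crawler the classifier's output for a fetched page decides whether its outgoing links are worth following.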

10 Strategy for URL Ordering
A page is better if it has many outgoing links
A page is better if its content is more relevant to a given topic
The importance of the pages already traversed influences the importance of the next page
A Web Crawler using Hyperlink Structure and Hypertext Categorization Method, The 17th KIPS Spring Conference
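The three criteria above could be combined into a single score in many ways. The sketch below shows one possibility, a weighted sum; it is not the ordering model from the presentation, and the weights, the `max_links` cap, and the name `page_importance` are all assumptions chosen for illustration.

```python
def page_importance(out_links, relevance, parent_importance,
                    w_links=0.2, w_rel=0.5, w_parent=0.3, max_links=100):
    """Hypothetical importance score in [0, 1].

    out_links:          number of outgoing links on the page
    relevance:          topic relevance of the page content, in [0, 1]
    parent_importance:  importance of the page that linked here, in [0, 1]
    The weights and max_links are illustrative assumptions.
    """
    # Normalise the link count so that all three terms share one scale.
    link_score = min(out_links, max_links) / max_links
    return (w_links * link_score
            + w_rel * relevance
            + w_parent * parent_importance)


if __name__ == "__main__":
    # A relevant page reached from an important page scores higher
    # than an off-topic page reached from the same parent.
    print(page_importance(20, 0.9, 0.8))
    print(page_importance(20, 0.1, 0.8))
```

Passing the parent's score into the child's score is what lets importance propagate along the crawl path, as the third criterion describes.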

11 Ordering Model (1)

12 Ordering Model (2)

13 Image Crawler
Filtering useless images
–Bullets, background images, …
Finding semantics for indexing
–Use the document that contains the images
Generating metadata automatically
–MPEG-7

14 WIE*
Automatic web page traversal and content extraction
HPS (hyperlink prioritization search)
Mechanism
–Identify a URL
–Traverse the Web from the provided URL
–Extract and collect information pieces by paragraph

15 WIE*-HPS (1)
Notation
–D = {d | d is a web page}
–P = {p | p is a paragraph in d}
–W = {w | w is a word in p}
–l(d,p,w) is a hyperlink with a (d,p,w) status
–L(D,P,W) is the hyperlink category to which l(d,p,w) belongs

16 WIE*-HPS (2)
Notation
–D1 = {d | d is an element of D and contains search keyword(s)}
–D0 = {d | d is an element of D and not an element of D1}
–P1 = {p | p is an element of P and contains search keyword(s)}
–P0 = {p | p is an element of P and not an element of P1}
–W1 = {w | w is an element of W and contains search keyword(s)}
–W0 = {w | w is an element of W and not an element of W1}

17 WIE*-HPS (3)
Link category priority: LC7 > LC6 > LC4 > LC0

Link category   Page   Paragraph   Link
LC0             D0     P0          W0
LC1             D0     P0          W1
LC2             D0     P1          W0
LC3             D0     P1          W1
LC4             D1     P0          W0
LC5             D1     P0          W1
LC6             D1     P1          W0
LC7             D1     P1          W1
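The table reads as a three-bit code: the page, paragraph, and link flags give the category number 4d + 2p + w. The sketch below computes it under that reading. Treating the link column as the words of the link text, matching keywords by whole lowercase words, and the function names are all assumptions made for this example.

```python
def contains_keyword(text, keywords):
    """True if any keyword appears as a whole word in text (case-insensitive)."""
    words = text.lower().split()
    return any(k.lower() in words for k in keywords)


def link_category(page_text, paragraph_text, link_text, keywords):
    """Return the LC number (0-7) for a hyperlink, following the table's bit layout."""
    d = int(contains_keyword(page_text, keywords))       # D1 vs D0
    p = int(contains_keyword(paragraph_text, keywords))  # P1 vs P0
    w = int(contains_keyword(link_text, keywords))       # W1 vs W0
    return 4 * d + 2 * p + w


def prioritize(links, keywords):
    """Order (page, paragraph, link_text, url) tuples by descending link category."""
    return sorted(
        links,
        key=lambda item: link_category(item[0], item[1], item[2], keywords),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical links (not from the slides).
    links = [
        ("news about sports", "weather today", "home", "http://example.com/a"),
        ("crawler overview", "a focused crawler design", "crawler paper",
         "http://example.com/b"),
    ]
    for item in prioritize(links, ["crawler"]):
        print(link_category(item[0], item[1], item[2], ["crawler"]), item[3])
```

With this encoding, sorting by the category number reproduces the priority order LC7 > LC6 > LC4 > LC0 given on the slide.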

18 WIE*-HPS (4)
Trend
–Global trend T_G = [formula not preserved in the transcript]
–Recent trend T_R = [formula not preserved in the transcript]

19 WIE*-HPS (5)
Termination scheme
–Comparing the global trend and the recent trend
–No more hyperlinks
–By the user’s decision
Information extraction
–Extract paragraphs containing the keyword
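The extraction step can be sketched on plain text. The example below assumes paragraphs are separated by blank lines and that a case-insensitive substring match is enough; the slides do not specify either, and the function name `extract_paragraphs` is made up for the example.

```python
import re


def extract_paragraphs(text, keywords):
    """Return the paragraphs of `text` that mention at least one keyword.

    Paragraphs are assumed to be separated by one or more blank lines.
    Matching is a case-insensitive substring test.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lowered = [k.lower() for k in keywords]
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]


if __name__ == "__main__":
    # Hypothetical page text (not from the slides).
    page = (
        "A focused crawler visits only relevant pages.\n\n"
        "Contact us for more details.\n\n"
        "The Crawler keeps a prioritized frontier."
    )
    for paragraph in extract_paragraphs(page, ["crawler"]):
        print(paragraph)
```

For real HTML pages, the paragraph boundaries would come from the markup rather than from blank lines.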

20 Further Work & Discussion
Intelligent WIE*
–Learned WIE*: fast learning mechanism, reinforcement learning, hypertext categorization
–Advanced information extraction technique
–Personalization technique
Using link information
–Hub and authority
Client-side light application (?)
–Ex) Web browser plug-in

21 Information Retrieval System

