Web Page Cleaning for Web Mining


1 Web Page Cleaning for Web Mining
Bing Liu Department of Computer Science University of Illinois at Chicago Joint work with my student at NUS: Lan Yi Based on two papers: IJCAI-2003, KDD-2003 2/5/2019 Web Page Cleaning

2 Outline
Introduction
Categories of Web Noise
Technique 1: cleaning through feature weighting
  DOM tree & Presentation Style
  Compressed Structure Tree
  Weighting Policy
Technique 2: detect and eliminate noisy blocks
Experimental results
Conclusions

3 Web Page Cleaning
Identify and eliminate irrelevant or noisy information
Improve Web mining, Web information retrieval, and Web search
Data collection: online HTML/XML Web pages, offline HTML/XML Web pages, Web search results

4 Categories of Web noise
Global noise: redundant objects larger than an individual page, e.g., mirror sites, duplicated Web pages
Local (intra-page) noise: irrelevant items within a Web page, e.g., banner ads, navigational guides
We focus on dealing with local noise

5 (figure slide)

6 (figure slide)

7 Web Page Noise Categorization
Fixed Description Noise: site logos, decoration images and text, copyright notices, privacy statements, etc.
Service Noise: irrelevant services such as weather reports, stock/market indices, etc.
Navigational Guidance: directory guidance (global, hierarchical, etc.)
Advertisements

8 Some Related Work
Informative Content Blocks Discovery in Web Pages [Lin and Ho, 2002]
Template Detection via Data Mining [Bar-Yossef and Rajagopalan, 2002]
Advertisement Detection [Kushmerick, 1999]
Informative Structure Detection in Web Sites [Kao et al., 2002]

9 Cleaning via feature weighting
Objective: cleaning pages for Web page classification and clustering
Intuitive ideas
DOM trees and presentation style
Compressed Structure Tree (CST)
Weighting policy
Analysis

10 Intuitive Ideas
Semi-structures within Web pages:
  logical segmentation of a Web page
  presentation styles of blocks
  location of items
In a given Web site:
  noisy blocks share common contents or presentation styles
  meaningful (main) blocks are diverse in contents and presentation styles
Weighting features makes cleaning automatic (nothing is eliminated)

11 DOM Trees
<BODY bgcolor=WHITE>
  <TABLE width=800 height=200> </TABLE>
  <IMG src="image.gif" width=800>
  <TABLE bgcolor=RED> </TABLE>
</BODY>
Corresponding DOM tree: root → BODY (bc=white), whose children are TABLE (width=800, height=200), IMG (width=800), and TABLE (bc=red)
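The DOM tree above can be built with standard tooling; here is a minimal sketch using Python's built-in html.parser (the (tag, attrs, children) tuple representation and the class name are illustrative choices, not part of the paper):

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Builds a simple DOM tree of (tag, attrs, children) nodes,
    mirroring the tree sketched on this slide."""
    def __init__(self):
        super().__init__()
        self.root = ("root", {}, [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, dict(attrs), [])
        self.stack[-1][2].append(node)   # attach to current parent
        if tag not in ("img", "br", "hr"):  # void tags have no close tag
            self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()

builder = DomBuilder()
builder.feed('<BODY bgcolor=WHITE><TABLE width=800 height=200></TABLE>'
             '<IMG src="image.gif" width=800><TABLE bgcolor=RED></TABLE></BODY>')
body = builder.root[2][0]
# html.parser lowercases tag names; attribute values are kept as written
assert body[0] == "body" and body[1]["bgcolor"] == "WHITE"
assert [c[0] for c in body[2]] == ["table", "img", "table"]
```

The child sequence of BODY (table, img, table) with its display attributes is exactly what the next slide treats as the node's presentation style.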

12 Presentation Style
The presentation style of an element is its tag, its display attributes, and the sequence of its children's tags and attributes. For the BODY node above:
<BODY, {bc=white}>
<(TABLE, {width=800}), (IMG, {}), (TABLE, {bc=red})>

13 Compressed Structure Tree
Two DOM trees from the same site:
d1: BODY (bc=white) → TABLE (width=800), SPAN, TABLE (bc=red)
d2: BODY (bc=white) → TABLE (width=800), TABLE (bc=red)
The CST merges them; each node records how many tag nodes it compresses and the set of presentation styles those tag nodes use:
root (2)
  BODY (2), STYLEs = {<(TABLE, {width=800}), (SPAN, {}), (TABLE, {bc=red})>, <(TABLE, {width=800}), (TABLE, {bc=red})>}
    TABLE (2, width=800)   SPAN (1)   TABLE (2, bc=red)
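The merge step can be sketched as follows. This is a simplified illustration: it merges child element nodes by tag name alone, whereas the real CST construction aligns full child sequences (so the two TABLE children collapse into one element node here, unlike in the figure); the dict-based node shape is our assumption:

```python
def merge(cst, dom):
    """Fold one DOM node (tag, attrs, children) into a CST node.
    A CST node is a dict with 'tag', 'count' (|E.TAGs|), 'styles'
    (distinct child tag/attribute sequences seen) and 'childs'."""
    tag, attrs, children = dom
    cst["count"] += 1
    cst["styles"].add(tuple((t, tuple(sorted(a.items()))) for t, a, _ in children))
    for child in children:
        node = cst["childs"].setdefault(
            child[0], {"tag": child[0], "count": 0, "styles": set(), "childs": {}})
        merge(node, child)

d1 = ("BODY", {"bc": "white"},
      [("TABLE", {"width": "800"}, []), ("SPAN", {}, []), ("TABLE", {"bc": "red"}, [])])
d2 = ("BODY", {"bc": "white"},
      [("TABLE", {"width": "800"}, []), ("TABLE", {"bc": "red"}, [])])
root = {"tag": "BODY", "count": 0, "styles": set(), "childs": {}}
merge(root, d1)
merge(root, d2)
assert root["count"] == 2        # BODY appears in both pages
assert len(root["styles"]) == 2  # two distinct presentation styles
assert root["childs"]["SPAN"]["count"] == 1
```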

14 Element Node E = (Tag, Attr, TAGs, STYLEs, CHILDs)
Tag — tag name, e.g., TABLE, IMG
Attr — display attributes of Tag
TAGs — the actual tag nodes compressed into E
STYLEs — the presentation styles of those tag nodes
CHILDs — pointers to child element nodes
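The element-node tuple maps directly onto a record type; a minimal sketch in Python, with field names following the slide (the concrete container types are an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class ElementNode:
    """CST element node E = (Tag, Attr, TAGs, STYLEs, CHILDs)."""
    tag: str                                    # Tag: tag name, e.g. TABLE, IMG
    attr: dict                                  # Attr: display attributes of Tag
    tags: list = field(default_factory=list)    # TAGs: actual tag nodes compressed here
    styles: list = field(default_factory=list)  # STYLEs: presentation styles
    childs: list = field(default_factory=list)  # CHILDs: child element nodes

e = ElementNode("TABLE", {"width": "800"})
assert e.tag == "TABLE" and e.childs == []
```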

15 Weighting Policy: Inner Node Importance (1)
l = |E.STYLEs|, m = |E.TAGs|
p_i — percentage of tag nodes (in E.TAGs) using the i-th presentation style
NodeImp(E) = -Σ_{i=1}^{l} p_i · log_m p_i   (for m > 1; NodeImp(E) = 1 when m = 1)
NodeImp(E) measures the diversity of presentation styles of E
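A minimal sketch of this measure in Python. The base-m logarithm is inferred from the worked example on the next slide, and treating the degenerate m = 1 case as importance 1 is an assumption:

```python
import math

def inner_node_imp(style_counts):
    """Importance of an internal CST element node E.

    style_counts: one tag-node count per distinct presentation style
    in E.STYLEs; their sum is m = |E.TAGs|.
    Returns -sum(p_i * log_m(p_i)), with the m == 1 case taken as 1.
    """
    m = sum(style_counts)
    if m == 1:
        return 1.0
    return -sum((c / m) * math.log(c / m, m) for c in style_counts)

# Worked example from the slides: the CST root compresses 2 tag nodes
# sharing one style; BODY compresses 2 tag nodes with two distinct styles.
assert inner_node_imp([2]) == 0.0                  # -1 * log_2(1) = 0
assert abs(inner_node_imp([1, 1]) - 1.0) < 1e-9    # -(2 * 0.5 log_2 0.5) = 1
```

A node whose tag nodes all share one style (a template-like, probably noisy region) scores 0; maximally diverse styles score 1.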

16 Example (from Page 12)
Using the CST built earlier (the root compresses 2 tag nodes sharing one style; BODY compresses 2 tag nodes with two distinct styles):
NodeImp(root) = -(1 · log_2 1) = 0
NodeImp(BODY) = -(0.5 · log_2 0.5 + 0.5 · log_2 0.5) = 1

17 Weighting Policy (2): Leaf Node Importance
N — number of features in E
a_i — a feature of the content in E
(1 - H_E(a_i)) — information contained in a_i
NodeImp(E) = (Σ_{i=1}^{N} (1 - H_E(a_i))) / N — the content diversity of E

18 Weighting Policy (3): Information Entropy of Features
m = |E.TAGs|
p_ij — probability that a_i appears in T_j ∈ E.TAGs
H_E(a_i) = -Σ_{j=1}^{m} p_ij · log_m p_ij — the information entropy of a_i
The higher H_E(a_i) is, the less important a_i is

19 Example
An element node E compresses m = 3 tag nodes:
t1: PCMag, samsung
t2: PCMag, epson
t3: PCMag, canon
m = |E.TAGs| = 3
N = |{PCMag, samsung, epson, canon}| = 4
H_E(PCMag) = -3 · ((1/3) · log_3 (1/3)) = 1
H_E(samsung) = H_E(epson) = H_E(canon) = -(0 + 0 + 1 · log_3 1) = 0
NodeImp(E) = ((1 - 1) + 3 · (1 - 0)) / 4 = 0.75
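The leaf-node computation above can be reproduced in a few lines of Python (a sketch; the list-of-feature-lists input shape is our choice, and p_ij is taken as the feature's frequency in T_j divided by its total frequency, which matches the example):

```python
import math

def feature_entropy(freqs):
    """H_E(a_i): entropy over the m tag nodes, log base m.
    freqs[j] is how often feature a_i occurs in tag node T_j;
    p_ij = freqs[j] / sum(freqs); 0 * log 0 is taken as 0."""
    m = len(freqs)
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total, m) for f in freqs if f)

def leaf_node_imp(tag_features):
    """NodeImp(E) = sum_i (1 - H_E(a_i)) / N over the N distinct features.
    tag_features: one feature list per tag node in E.TAGs."""
    vocab = {a for feats in tag_features for a in feats}
    return sum(1 - feature_entropy([feats.count(a) for feats in tag_features])
               for a in vocab) / len(vocab)

# The slide's example: PCMag appears everywhere (entropy 1, no information);
# samsung/epson/canon each appear in one tag node (entropy 0).
tags = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]
assert abs(leaf_node_imp(tags) - 0.75) < 1e-9
```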

20 Weighting Policy
Node importance alone is not enough: it focuses only on local information.
Cumulative importance also considers the location of features.
Path importance cumulates importance along the path from the root to E.
If E1 is an ancestor of E2, then 0 ≤ PathImp(E2) ≤ PathImp(E1) ≤ 1.

21 Weighting Policy: Path Importance (4) and Weight of Features (5)
E — leaf element node containing a_i
T_j ∈ E.TAGs
f_ij — frequency of a_i under tag T_j

22 Summary of the technique
DOM trees  CST Presentation styles Actual content Information Based Evaluation Node Importance Path Importance Automatic weighting 2/5/2019 Web Page Cleaning

23 Web page cleaning via block elimination
Weighting features is only one approach; we can also identify and eliminate noisy content blocks in a page.
A similar technique builds an SST (site style tree):
identify a number of styles (templates) from a site;
compute an importance value for each block, and use a specified threshold t to decide whether it is noisy;
given a new page, match its blocks to the noisy and non-noisy blocks in the tree.
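The elimination step, once block importances have been computed for the site, reduces to a thresholded lookup. A toy sketch (the style keys, importance values, and default t below are all illustrative assumptions, not values from the paper):

```python
def clean_page(blocks, block_importance, t=0.3):
    """Keep only the blocks of a page whose matching site-tree block
    scores above threshold t; blocks with styles the site tree has not
    seen are kept, since nothing marks them as noise."""
    return [text for style, text in blocks
            if block_importance.get(style, 1.0) > t]

# Hypothetical importances learned from a site's pages.
importance = {"nav-bar": 0.05, "ad-strip": 0.10, "main": 0.92}
page = [("nav-bar", "Home | Products | About"),
        ("main", "Canon S400 review: image quality ..."),
        ("ad-strip", "Buy now!!!")]
assert clean_page(page, importance) == ["Canon S400 review: image quality ..."]
```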

24 Noise Detection and Elimination
(figure: DOM tree of a page, with noisy subtrees marked for elimination)

25 After Simplification
(figure: the simplified DOM tree after noisy blocks are removed)

26 Experiments
Data
Pages: camera, laptop, mobile, printer, TV
Sites: Amazon, CNet, J&R, PCMag, ZDnet
Measures
precision, recall, F = 2p·r / (p + r)
Evaluate using Web page classification and clustering
For classification we use SVM; for clustering we use k-means.
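The F score used throughout the result tables is the harmonic mean of precision and recall:

```python
def f_score(p, r):
    """F = 2pr / (p + r), the harmonic mean of precision p and recall r.
    Returns 0 when both are 0 (the limit, to avoid division by zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0

assert f_score(1.0, 1.0) == 1.0
assert abs(f_score(0.5, 1.0) - 2 / 3) < 1e-9
```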

27 Different Experiment settings
Configuration 1: training set from the same Web site; test set from other sites
Configuration 2: training sets from different sites, e.g., A and B; test set does not include any pages from A or B
Configuration 3: training sets from different sites, e.g., A and B; test set also includes some pages from A and B

28 Configuration 1 (training sets from the same site)
F score             Noise   Template  SST     Weight
camera--mobile      0.9829  0.9681    0.9568  0.9839
camera--notebook    0.9939  0.9364    0.9936  0.9872
camera--printer     0.9847  0.9457    0.9727  0.9916
camera--tv          0.9920  0.9652    0.9708  0.9974
mobile--notebook    0.9421  0.8367    0.7978  0.9041
mobile--printer     0.8240  0.7912    0.7377  0.8705
mobile--tv          0.8086  0.6671    0.6186  0.8012
notebook--printer   0.9787  0.9809    0.9508  0.9631
notebook--tv        0.9960  0.7943    0.9334  0.9753
printer--tv         0.9736  0.9361    0.9922  0.9996
Average (F score)   0.9477  0.8822    0.8924  0.9474

29 Configuration 2 (training sets: different sites, A and B; test set: no pages from A or B)

F score             Noise   Template  SST     Weight
camera--mobile      0.8448  0.8970    0.9334  0.9600
camera--notebook    0.7685  0.8035    0.9514  0.9697
camera--printer     0.7166  0.8664    0.9428  0.9777
camera--tv          0.7798  0.8565    0.9694  0.9911
mobile--notebook    0.5046  0.5451    0.7092  0.7705
mobile--printer     0.5175  0.6422    0.7185  0.7914
mobile--tv          0.5856  0.6664    0.7788  0.8739
notebook--printer   0.7374  0.7107    0.9520  0.9522
notebook--tv        0.7754  0.6537    0.9632  0.9666
printer--tv         0.7352  0.8410    0.9716  0.9779
Average (F score)   0.6965  0.7483    0.8890  0.9231

30 Configuration 3 (training sets: different sites, A and B; test set: with pages from A and B)

F score             Noise   Template  SST     Weight
camera--mobile      0.7527  0.8632    0.9312  0.9589
camera--notebook    0.6157  0.7144    0.9424  0.9684
camera--printer     0.5896  0.7642    0.9356  0.9836
camera--tv          0.6233  0.8082    0.9629  0.9822
mobile--notebook    0.3565  0.4686    0.6783  0.7701
mobile--printer     0.3985  0.5710    0.7205  0.8254
mobile--tv          0.5025  0.6845    0.7820  0.8427
notebook--printer   0.6023  0.6305    0.9305  0.9432
notebook--tv        0.5984  0.5820    0.9585  0.9466
printer--tv         0.5371  0.7123    0.9622  0.9748
Average (F score)   0.5577  0.6799    0.8804  0.9196

31 Clustering: E-product pages
(figure: F-score distribution in clustering of E-product pages)

32 Clustering: E-product pages
Ave(F) and F-score distribution (bins 0.1–0.9):
Noise     0.506 | 2 85 294 314 101 3 1
Template  0.631 | 5 80 213 316 124 62
SST       0.751 | 26 55 94 427 104
Weight    0.794 | 13 24 50 419 107 187

33 Conclusion
Web Page Cleaning (WPC)
WPC through feature weighting:
  Compressed Structure Tree
  information-based measures
  presentation style, actual content, and locations of features
Evaluation shows the effectiveness of the technique.
Future work:
  identify different types of noise
  use visual cues for Web page cleaning

34 Definition of Pagelet
An HTML element in the parse tree of a page is a pagelet if (1) none of its children contains at least k hyperlinks; and (2) none of its ancestor elements is a pagelet.
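The two conditions translate directly into a recursive check. A sketch over a toy (tag, own_links, children) tree; the tree shape, the helper names, and k = 3 are illustrative assumptions:

```python
def count_links(node):
    """Total hyperlinks in the subtree rooted at node; node is
    (tag, own_links, children), own_links counting <a> elements
    directly under that tag."""
    tag, links, children = node
    return links + sum(count_links(c) for c in children)

def pagelets(node, k=3, ancestor_is_pagelet=False):
    """Elements satisfying the definition: no child subtree contains
    >= k hyperlinks (condition 1) and no ancestor is already a
    pagelet (condition 2)."""
    tag, links, children = node
    found = []
    if not ancestor_is_pagelet and all(count_links(c) < k for c in children):
        found.append(tag)
        ancestor_is_pagelet = True   # descendants are excluded by (2)
    for c in children:
        found += pagelets(c, k, ancestor_is_pagelet)
    return found

# root fails (1) because the nav subtree holds 4 >= 3 links; nav and
# footer qualify; nav's children are then excluded by (2).
tree = ("root", 0, [("nav", 0, [("ul1", 2, []), ("ul2", 2, [])]),
                    ("footer", 1, [])])
assert pagelets(tree) == ["nav", "footer"]
```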

