Advanced Techniques for Automatic Web Filtering

1 Advanced Techniques for Automatic Web Filtering
James Z. Wang, PNC Tech. Career Dev. Professor, Penn State University
Joint work:
  Jia Li, Assist. Prof., Penn State Statistics
  Gio Wiederhold, Prof., Stanford Computer Science
11/16/2018 J. Z. Wang, Penn State University

2 Outline
The problem
Related approaches
Filtering based on image content
  Goals and methods
  The WIPE system
  Experimental results
Website classification by image content
Conclusions and future work

3 The Size and Content of the Web
02/99: ~16 million total web servers
  Estimated total number of pages on the web: ~800 million
  15 terabytes of text (comparable to the text of the Library of Congress)
Year 2001: 3 to 5 billion pages
Source: Lawrence, Giles, Nature, 1999.

5 Pornography-free Websites
E.g., Yahoo!Kids, disney.com
Useful for protecting children too young to know how to use a Web browser
It is difficult to control access to other sites

6 Text-based Filtering
E.g., NetNanny, Cyber Patrol, CyberSitter
Methods:
  Store more than 10,000 IPs
  Block based on keywords
  Block all image access
Problems:
  The Internet is dynamic
  Keywords are not enough (e.g., text incorporated in images)
  Images are needed by all net users

7 Classification of Web Community
Flake, Lawrence, Giles, ACM KDD, 2000
Graph clustering based on max flow – min cut analysis of Web connectedness
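
As a rough illustration of the max flow – min cut idea on this slide, the sketch below runs a breadth-first augmenting-path max flow (Edmonds-Karp) on a tiny link graph and returns the pages left on the source side of the minimum cut. The page names, the links, and the unit capacities are all invented for illustration; they are not taken from Flake, Lawrence, and Giles, and their actual procedure may differ.

```python
from collections import deque


def min_cut_source_side(capacity, source, sink):
    """Return (max_flow, nodes still reachable from source once no augmenting path is left)."""
    # Residual graph: every edge also gets a reverse entry (zero capacity if absent).
    residual = {}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        if sink not in parents:
            # No path left: the reachable nodes are the source side of a minimum cut.
            return flow, set(parents)

        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical Web graph: two tightly linked groups of pages joined by one link (c-x).
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
         ("x", "y"), ("x", "z"), ("y", "z")]
capacity = {}
for u, v in links:
    capacity.setdefault(u, {})[v] = 1  # treat each link as undirected, capacity 1
    capacity.setdefault(v, {})[u] = 1

flow, community = min_cut_source_side(capacity, source="a", sink="z")
print(flow, sorted(community))
```

In this toy graph the single link between the two groups is the cheapest place to cut, so the pages on the source side form the "community" around the seed page.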

9 Goals and Methods
The problem comes from images, so we deal with the images themselves
Goals: use machine learning and image retrieval to classify Web images and Websites
Requirements: high accuracy and high speed
Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition…

10 The WIPE System
Inspired by UC Berkeley’s FNP System
  Detailed analysis of images
  Skin filter and human figure grouper
  Speed: 6 minutes of CPU time per image
  Accuracy: 52% sensitivity and 96% specificity
Stanford WIPE System
  Wavelet-based feature extraction + image classification + integrated region matching + machine learning
  Speed: < 1 second of CPU time per image
  Accuracy: 96% sensitivity and 91% specificity

11 System Flow
[Flow diagram] Original Web Image → Feature Extraction (color, texture, shape) → Type Classification → (photograph) → Photo Classification → Result: REJECT or PASS
The diagram also contains a Training Features box.

12 Wavelet Principle
[Figure: wavelet principle]
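
Since this slide survives only as a figure, here is a minimal sketch of what one level of a 2D wavelet transform does to an image. It uses the Haar wavelet purely because it is the simplest one; the slides do not say which wavelet WIPE uses, and the 4x4 image is made up for illustration.

```python
def haar_1d(values):
    """One Haar level: pairwise averages followed by pairwise half-differences.

    Assumes len(values) is even.
    """
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages + details


def haar_2d(image):
    """One level of the 2D transform: apply haar_1d to every row, then to every column."""
    rows = [haar_1d(row) for row in image]
    columns = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]  # transpose back to row-major order


# Hypothetical 4x4 grayscale image made of four flat 2x2 blocks.
image = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [50, 50, 90, 90],
    [50, 50, 90, 90],
]
coeffs = haar_2d(image)
for row in coeffs:
    print(row)
```

The top-left quadrant of `coeffs` is a half-resolution copy of the image, and the other three quadrants hold horizontal, vertical, and diagonal detail. Here every 2x2 block is flat, so all detail coefficients are zero; edges and texture would show up as non-zero detail values, which is what makes such coefficients usable as image features.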

13 Type Classification
Graphs: Manually-generated images with smooth tones.

14 Type Classification
Photographs: Images with continuous tones.

15 Photo Classification
Content-based image retrieval + statistical classification

16 Experimental Results
Tested on a set of over 10,000 photographic images
Speed: less than one second of response time on a Pentium III PC
Accuracy:

Type of Images    Test + (Rejected)    Test – (Passed)
Objectionable     96%                  4%
Benign            9%                   91%
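
The accuracy figures on this slide map onto the sensitivity and specificity quoted for WIPE earlier in the deck. A small worked example, assuming 100 images of each type purely so that the counts match the reported percentages (the slide gives only percentages, not counts):

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = share of objectionable images rejected;
    specificity = share of benign images passed."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts per 100 images of each type, chosen to match the slide.
sens, spec = sensitivity_specificity(
    tp=96,  # objectionable, rejected
    fn=4,   # objectionable, passed
    fp=9,   # benign, rejected
    tn=91,  # benign, passed
)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```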

17 Comment on Accuracy
The algorithm can be adjusted to trade off specificity for higher sensitivity
In a real-world filtering application system, both the sensitivity and the specificity are expected to be higher
  Icons and graphs can be classified with almost 100% accuracy → higher specificity
  Combine text and image classification → higher sensitivity and higher speed
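
One common way such a trade-off arises is through a decision threshold on a classifier score. The sketch below assumes that mechanism; the slides do not say how WIPE is actually adjusted, and the scores are invented for illustration.

```python
def rates_at_threshold(objectionable_scores, benign_scores, threshold):
    """Reject an image when its score is >= threshold; return (sensitivity, specificity)."""
    rejected_objectionable = sum(s >= threshold for s in objectionable_scores)
    passed_benign = sum(s < threshold for s in benign_scores)
    return (rejected_objectionable / len(objectionable_scores),
            passed_benign / len(benign_scores))


# Hypothetical classifier scores (higher = more likely objectionable).
objectionable_scores = [0.95, 0.90, 0.80, 0.70, 0.55, 0.35]
benign_scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60]

for threshold in (0.7, 0.5, 0.3):
    sens, spec = rates_at_threshold(objectionable_scores, benign_scores, threshold)
    print(f"threshold={threshold:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Lowering the threshold rejects more images overall, so sensitivity rises while specificity falls.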

18 False Classifications: Benign Images
Partially obscured human
Areas with similar features
Painting, fine art
Partially undressed human
Animals (w/o clothes)

19 False Classifications: Objectionable Images
Partially dressed
Undressed area too small
Dressed but objectionable
Frame and text noise
Dark, low contrast

20 Website Classification by Image Content
An objectionable site will have many objectionable images
For a given objectionable Website, let p denote the probability that an image on the site is objectionable
  p is the percentage of objectionable images among all images provided by the site
We assume some distribution of p over all Websites (e.g., Gaussian, shifted Gaussian)
Classification levels could be provided as a service to filtering software producers

21 Flow in Website classification
[Figure: flow diagram of Website classification]

22 Website Classification
Based on statistical analysis (see paper), we can expect higher than 97% accuracy on Website classification if
  We download images for each site
  We classify a Website as objectionable if 20-25% of downloaded images are objectionable
Using text and IP addresses as criteria, the accuracy can be further improved
  Skip IPs for museums, dog shows, beach towns, sports events
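
The site-level rule can be explored with a simple binomial model. This is a sketch under stated assumptions, not the analysis from the paper: image classifications are treated as independent, the per-image rates are the 96% sensitivity and 91% specificity reported in the experimental results, and the number of downloaded images (30) and the example value p = 0.5 are hypothetical because the slides do not give them.

```python
from math import ceil, comb


def image_flag_probability(p, sensitivity=0.96, specificity=0.91):
    """Chance that one randomly chosen image from a site is rejected by the image classifier.

    p is the fraction of the site's images that are objectionable.
    """
    return p * sensitivity + (1 - p) * (1 - specificity)


def site_flag_probability(p, n_images, site_threshold):
    """Chance that at least site_threshold * n_images downloaded images are rejected."""
    q = image_flag_probability(p)
    k_min = ceil(site_threshold * n_images)
    # Upper tail of a Binomial(n_images, q) distribution.
    return sum(comb(n_images, k) * q**k * (1 - q) ** (n_images - k)
               for k in range(k_min, n_images + 1))


N_IMAGES = 30  # hypothetical download size per site

# A benign site (no objectionable images) wrongly classified as objectionable.
benign_site = site_flag_probability(p=0.0, n_images=N_IMAGES, site_threshold=0.25)
# A site where half the images are objectionable, correctly classified.
objectionable_site = site_flag_probability(p=0.5, n_images=N_IMAGES, site_threshold=0.25)

print(f"benign site flagged:        {benign_site:.4f}")
print(f"objectionable site flagged: {objectionable_site:.4f}")
```

Under these assumptions the image classifier's 9% false-positive rate sits well below the 25% site threshold, so a benign site is rarely flagged, while a site with many objectionable images almost always is.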

24 Conclusions and Future Work
Perfect filtering is never possible
Effective filtering based on image content is feasible with current technology
Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed
Objectionable Websites are automatically identifiable: a service for the community?
The technology can still be improved through further research

25 References
(papers)
... /cgi-bin/zwang/wipe2_show.cgi (demo)
... /pub/gio/inprogress.html#COPA (testimony)
(James Wang)
(Gio Wiederhold)
(Michel Bilello)

