Advanced Techniques for Automatic Web Filtering


Advanced Techniques for Automatic Web Filtering
James Z. Wang, PNC Tech. Career Dev. Professor, Penn State University
Joint work:
- Jia Li, Assist. Prof., Penn State Statistics
- Gio Wiederhold, Prof., Stanford Computer Science
http://wang.ist.psu.edu
11/16/2018 J. Z. Wang, Penn State University

Outline
- The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Website classification by image content
- Conclusions and future work

The Size and Content of the Web
- 02/99: ~16 million total web servers
- Estimated total number of pages on the web: ~800 million
- 15 terabytes of text (comparable to the text of the Library of Congress)
- Year 2001: 3 to 5 billion pages
Lawrence, Giles, Nature, 1999.

Pornography-free Websites
- E.g. Yahoo!Kids, disney.com
- Useful in protecting those children too young to know how to use the Web browser
- It is difficult to control access to other sites

Text-based Filtering
- E.g. NetNanny, Cyber Patrol, CyberSitter
- Methods:
  - Store more than 10,000 IPs
  - Blocking based on keywords
  - Block all image access
- Problems:
  - Internet is dynamic
  - Keywords are not enough (e.g. text incorporated in images)
  - Images are needed for all net users

Classification of Web Community
- Flake, Lawrence, Giles, ACM KDD, 2000
- Graph clustering based on max-flow/min-cut analysis of Web connectedness
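
The min-cut idea named on this slide can be illustrated with a small sketch. The slides do not describe how the cited work builds its graph or picks its source and sink, so everything below is an assumption made for illustration: the toy link graph, the unit capacities, the hypothetical helper names `undirected` and `min_cut_source_side`, and the choice of Edmonds-Karp as the max-flow routine. After the flow is saturated, the pages still reachable from the source form its side of a minimum cut, which can be read as a "community".

```python
from collections import deque


def undirected(links):
    """Build a capacity map with unit capacity in both directions per link."""
    capacity = {}
    for u, v in links:
        capacity.setdefault(u, {})[v] = 1
        capacity.setdefault(v, {})[u] = 1
    return capacity


def min_cut_source_side(capacity, source, sink):
    """Return the nodes on the source side of a minimum s-t cut (Edmonds-Karp)."""
    # residual[u][v] is the remaining capacity on the edge u -> v.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    def reachable_with_parents():
        # Breadth-first search over edges that still have spare capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = reachable_with_parents()
        if sink not in parents:
            # No augmenting path is left: the reachable set is the source side.
            return set(parents)
        # Find the bottleneck capacity along the path found by the search.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            u = parents[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        # Push that much flow along the path, updating reverse edges.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u


# Toy graph: two tightly linked triangles joined by a single link c-x.
links = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "x"),
         ("x", "y"), ("x", "z"), ("y", "z")]
community = min_cut_source_side(undirected(links), "a", "z")
```

In this toy graph the single bridge link is the cheapest cut, so the source's side is the first triangle.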

Goals and Methods
- The problem comes from images, so we deal with images
- Goals: use machine learning and image retrieval to classify Web images and Websites
- Requirements: high accuracy and high speed
- Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition

The WIPE System
- Inspired by UC Berkeley's FNP System
  - Detailed analysis of images
  - Skin filter and human figure grouper
  - Speed: 6 mins CPU time per image
  - Accuracy: 52% sensitivity and 96% specificity
- Stanford WIPE System
  - Wavelet-based feature extraction + image classification + integrated region matching + machine learning
  - Speed: < 1 second CPU time per image
  - Accuracy: 96% sensitivity and 91% specificity

System Flow
[Flow diagram. Labels: Original Web Image; Feature Extraction (color, texture, shape); Type Classification; photograph; Photo Classification; Result: REJECT or PASS; Training Features]
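
One way to read the diagram labels is as a three-stage pipeline. The sketch below is only that reading, not the WIPE implementation: the function name `classify_web_image` and the three stage callables are hypothetical, the stage order is inferred from the order of the labels, and passing every non-photograph is an assumption the slides do not state.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def classify_web_image(
    image: object,
    extract_features: Callable[[object], Features],
    is_photograph: Callable[[Features], bool],
    is_objectionable_photo: Callable[[Features], bool],
) -> str:
    """Run the stages in the order the diagram labels suggest."""
    # Stage 1: feature extraction (color, texture, shape in the diagram).
    features = extract_features(image)
    # Stage 2: type classification. Assumption: non-photographs are passed.
    if not is_photograph(features):
        return "PASS"
    # Stage 3: photo classification decides REJECT or PASS.
    return "REJECT" if is_objectionable_photo(features) else "PASS"


# Toy stand-ins for the three stages, purely for demonstration.
result_icon = classify_web_image(
    "icon.gif",
    extract_features=lambda image: [0.0],
    is_photograph=lambda features: False,
    is_objectionable_photo=lambda features: True,
)
result_photo = classify_web_image(
    "photo.jpg",
    extract_features=lambda image: [0.9],
    is_photograph=lambda features: True,
    is_objectionable_photo=lambda features: features[0] > 0.5,
)
```

Passing the stages in as callables keeps the control flow visible without committing to any particular feature set or classifier.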

Wavelet Principle
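
The slide's figure is not part of the transcript, so the following is a generic illustration of a wavelet decomposition, not the transform used in WIPE. It assumes the Haar wavelet (picked here only because it is the simplest), an averaging normalization, and image sides of even length. The names `haar_step` and `haar_2d` are hypothetical.

```python
def haar_step(signal):
    """One Haar step: pairwise averages (low band) and half-differences (high band)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_columns(matrix):
    """Apply haar_step to every column and return (low, high) as row-major matrices."""
    columns = [list(column) for column in zip(*matrix)]
    low_cols, high_cols = zip(*(haar_step(column) for column in columns))
    low = [list(row) for row in zip(*low_cols)]
    high = [list(row) for row in zip(*high_cols)]
    return low, high


def haar_2d(image):
    """One level of a 2D Haar transform on a grayscale image (list of rows).

    Band names give the row filter first and the column filter second.
    """
    low_rows, high_rows = [], []
    for row in image:
        low, high = haar_step(row)
        low_rows.append(low)
        high_rows.append(high)
    low_low, low_high = haar_columns(low_rows)
    high_low, high_high = haar_columns(high_rows)
    return {"low_low": low_low, "low_high": low_high,
            "high_low": high_low, "high_high": high_high}


flat = [[5, 5, 5, 5] for _ in range(4)]      # uniform region
stripes = [[0, 9, 0, 9] for _ in range(4)]   # vertical stripes
flat_bands = haar_2d(flat)
stripe_bands = haar_2d(stripes)
```

A uniform region leaves all detail bands at zero, while the striped image puts its energy into the band that is high-pass along rows. This kind of contrast is what makes wavelet coefficients usable as texture features.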

Type Classification
- Graphs: manually generated images with smooth tones.

Type Classification
- Photographs: images with continuous tones.
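
The slides do not say which features separate graphs from photographs. As a purely hypothetical heuristic, one could assume that manually generated graphics contain long runs of identical pixel values, while photographs vary slightly from pixel to pixel. The function names and the 0.5 threshold below are illustrative assumptions, not part of WIPE.

```python
def flat_neighbor_fraction(gray_rows):
    """Fraction of horizontally adjacent pixel pairs with identical intensity."""
    same = 0
    total = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            total += 1
            if left == right:
                same += 1
    return same / total if total else 0.0


def guess_image_type(gray_rows, flat_threshold=0.5):
    """Label an image 'graph' when most neighboring pixels are identical."""
    if flat_neighbor_fraction(gray_rows) >= flat_threshold:
        return "graph"
    return "photograph"


# Toy inputs: a two-tone icon and a patch with gradually changing tones.
icon = [[0, 0, 0, 255, 255, 255] for _ in range(3)]
patch = [[10, 12, 15, 19, 24, 30],
         [11, 14, 18, 20, 26, 33]]
icon_type = guess_image_type(icon)
patch_type = guess_image_type(patch)
```

A real system would need features that survive compression artifacts and anti-aliasing, which this pixel-equality test does not.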

Photo Classification
- Content-based image retrieval + statistical classification

Experimental Results
- Tested on a set of over 10,000 photographic images
- Speed: less than one second of response time on a Pentium III PC
- Accuracy:
  Type of Images   Test + (Rejected)   Test - (Passed)
  Objectionable    96%                 4%
  Benign           9%                  91%
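
The table maps onto the sensitivity and specificity figures quoted for WIPE: sensitivity is the share of objectionable images that are rejected, and specificity is the share of benign images that are passed. The sketch below spells that out. The counts are hypothetical, chosen only so that they reproduce the table's percentages; the slides give the test set size as "over 10,000" without a breakdown by class.

```python
def sensitivity_specificity(rejected_objectionable, passed_objectionable,
                            rejected_benign, passed_benign):
    """Compute (sensitivity, specificity) from the four cells of the table."""
    sensitivity = rejected_objectionable / (rejected_objectionable + passed_objectionable)
    specificity = passed_benign / (rejected_benign + passed_benign)
    return sensitivity, specificity


# Hypothetical counts: 1,000 images per class, matching 96%/4% and 9%/91%.
sensitivity, specificity = sensitivity_specificity(
    rejected_objectionable=960, passed_objectionable=40,
    rejected_benign=90, passed_benign=910,
)
```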

Comment on Accuracy
- The algorithm can be adjusted to trade off specificity for higher sensitivity
- In a real-world filtering application system, both the sensitivity and the specificity are expected to be higher
- Icons and graphs can be classified with almost 100% accuracy -> higher specificity
- Combine text and image classification -> higher sensitivity and higher speed
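
The trade-off in the first bullet can be pictured as moving a decision threshold. The sketch assumes the classifier produces a score per image and rejects images at or above a threshold; the slides do not say how WIPE is adjusted, and the scores, labels, and the name `rates_at_threshold` are made up for illustration.

```python
def rates_at_threshold(scored_images, threshold):
    """Return (sensitivity, specificity) when images scoring >= threshold are rejected.

    scored_images is a list of (score, is_objectionable) pairs.
    """
    true_pos = sum(1 for score, bad in scored_images if bad and score >= threshold)
    false_neg = sum(1 for score, bad in scored_images if bad and score < threshold)
    true_neg = sum(1 for score, bad in scored_images if not bad and score < threshold)
    false_pos = sum(1 for score, bad in scored_images if not bad and score >= threshold)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical scores: four objectionable and four benign images.
samples = [(0.95, True), (0.80, True), (0.55, True), (0.40, True),
           (0.60, False), (0.35, False), (0.20, False), (0.05, False)]
strict = rates_at_threshold(samples, 0.7)    # rejects only high-scoring images
lenient = rates_at_threshold(samples, 0.3)   # rejects anything mildly suspicious
```

With these made-up scores, lowering the threshold raises sensitivity and lowers specificity, which is the direction of adjustment the slide describes.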

False Classifications: Benign Images
- Partially obscured human
- Areas with similar features
- Painting, fine-art
- Partially undressed human
- Animals (w/o clothes)

False Classifications: Objectionable Images
- Partially dressed
- Undressed area too small
- Dressed but objectionable
- Frame and text noise
- Dark, low contrast

Website Classification by Image Content
- An objectionable site will have many objectionable images
- For a given objectionable Website, we denote by p the chance that an image on the Website is objectionable
- p is the percentage of objectionable images over all images provided by the site
- We assume some distribution of p over all Websites (e.g., Gaussian, shifted Gaussian)
- Classification levels could be provided as a service to filtering software producers
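
Because the image classifier is imperfect, the share of images it rejects on a site is not p itself. Under the assumption that each image is classified independently with the per-image sensitivity (96%) and specificity (91%) reported earlier, the expected rejected share is p * sensitivity + (1 - p) * (1 - specificity), and this can be inverted to estimate p. The inversion is standard algebra added here for illustration; it is not taken from the slides, and the function names are hypothetical.

```python
def expected_flag_rate(p, sensitivity=0.96, specificity=0.91):
    """Expected share of a site's images that the classifier rejects."""
    return p * sensitivity + (1 - p) * (1 - specificity)


def estimate_p(flag_rate, sensitivity=0.96, specificity=0.91):
    """Invert expected_flag_rate to estimate p, clamped to the range [0, 1]."""
    false_positive_rate = 1 - specificity
    estimate = (flag_rate - false_positive_rate) / (sensitivity - false_positive_rate)
    return min(1.0, max(0.0, estimate))


# A site where 40% of images are objectionable is expected to have
# slightly more than 40% of its images rejected, due to false positives.
observed = expected_flag_rate(0.4)
recovered = estimate_p(observed)
```

The clamp matters for benign sites: a rejected share below the false-positive rate would otherwise yield a negative estimate.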

Flow in Website classification

Website Classification
- Based on statistical analysis (see paper), we can expect higher than 97% accuracy on Website classification if:
  - We download 20-35 images for each site
  - We classify a Website as objectionable if 20-25% of downloaded images are objectionable
- Using text and IP addresses as criteria, the accuracy can be further improved
  - Skip IPs for museums, dog shows, beach towns, sport events
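
The statistical analysis itself is in the paper and is not reproduced in the slides. A rough way to see why a few dozen images can suffice is a binomial sketch under simplifying assumptions that are mine, not the authors': images are classified independently, a benign site has p = 0, an objectionable site has p = 0.5, and the per-image sensitivity and specificity are the 96% and 91% reported earlier. The sample size of 30 and the 25% cutoff are picked from within the ranges on the slide.

```python
from math import ceil, comb


def prob_site_flagged(n_images, min_fraction, flag_rate):
    """Probability that at least min_fraction of n_images are rejected,

    when each image is rejected independently with probability flag_rate.
    """
    k_min = ceil(min_fraction * n_images)
    return sum(
        comb(n_images, k) * flag_rate ** k * (1 - flag_rate) ** (n_images - k)
        for k in range(k_min, n_images + 1)
    )


SENSITIVITY, SPECIFICITY = 0.96, 0.91

# Benign site (assumed p = 0): only false positives trigger rejections.
benign_flag_rate = 1 - SPECIFICITY
# Objectionable site (assumed p = 0.5): mix of true and false positives.
objectionable_flag_rate = 0.5 * SENSITIVITY + 0.5 * (1 - SPECIFICITY)

false_alarm = prob_site_flagged(30, 0.25, benign_flag_rate)
detection = prob_site_flagged(30, 0.25, objectionable_flag_rate)
```

Under these assumptions both error rates fall below 3%, which is in line with the slide's claim. Real sites violate the independence assumption, since images on one site tend to resemble each other, so the paper's own analysis should be preferred.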

Conclusions and Future Work
- Perfect filtering is never possible
- Effective filtering based on image content is feasible with the current technology
- Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed
- Objectionable websites are automatically identifiable; a service for the community?
- The technology can still be improved through further research

References
- http://WWW-DB.Stanford.EDU/IMAGE (papers)
- http://wang.ist.psu.edu ... /cgi-bin/zwang/wipe2_show.cgi (demo)
- http://www-db.stanford.edu ... /pub/gio/inprogress.html#COPA (testimony)
- jwang@ist.psu.edu (James Wang)
- gio@cs.stanford.edu (Gio Wiederhold)
- michel@db.stanford.edu (Michel Bilello)