Published by Lisa Shona Barker. Modified over 8 years ago.
1
N-Gram-based Dynamic Web Page Defacement Validation
Woonyon Kim
Aug. 23, 2004
NSRI, Korea
2
Contents
- Introduction
- Related Works
- N-Gram Frequency Index
- N-Gram-based Index Distance
- Experiments
- Conclusions
3
Introduction
- Defacement of Web Sites
  - CSI/FBI 2001: 38% of web sites were hacked; 21% of hacked sites were not aware of their own defacements.
  - Zone-h: the number of defaced web pages is increasing rapidly year by year (.kr domain: about 200% increase).
- Current solutions
  - Hash-based detection systems for minimizing damage
  - Intrusion-tolerant systems for continuous service
- Problems of current solutions
  - Current solutions use a hash code as the validation metric.
  - A hash code can't support dynamic characteristics.
4
Introduction
- N-Gram-based Index Distance (NGID)
  - A validation metric for dynamically changing web pages
  - The sum of the absolute differences between the frequency probabilities of the N-Grams found in both indexes
  - NGID represents the similarity of two web pages.
  - NGID can be used to validate web pages whether they have dynamic components or are static.
5
Related Works
- Hash-based validation system
  - Detects web page defacements by comparing two hash codes
  - A hash code is a useful metric for large, static web pages.
  - A hash code can't work properly on dynamically changing web pages.
- Intrusion-tolerant system
  - A hash code is used to validate web pages.
  - It is likewise limited on dynamic web pages.
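The limitation of hash comparison on dynamic pages can be illustrated with a minimal sketch. The two page strings and the helper names `md5_of` and `hash_valid` are hypothetical, made up for illustration. MD5 is used because it is the hash reported in the experiments.

```python
import hashlib


def md5_of(page: str) -> str:
    """Return the hex MD5 digest of a page's text (hypothetical helper)."""
    return hashlib.md5(page.encode("utf-8")).hexdigest()


def hash_valid(reference: str, current: str) -> bool:
    """Hash-based validation: a page is valid only if the digests match exactly."""
    return md5_of(reference) == md5_of(current)


# Two made-up snapshots of the same legitimate page; only a timestamp differs.
baseline = "<html><body><h1>News</h1><p>Updated 10:00</p></body></html>"
refreshed = "<html><body><h1>News</h1><p>Updated 10:30</p></body></html>"

print(hash_valid(baseline, baseline))   # True
print(hash_valid(baseline, refreshed))  # False: a benign update looks like a defacement
```

Any change of even one character produces a different digest, so a legitimate content refresh is indistinguishable from a defacement under this check.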
6
N-Gram Frequency Index (1)
- N-Gram
  - An N-character slice of a string
  - For example, the 2-Grams of "TEXT" are TE, EX, XT.
- N-Gram Frequency Index
  - An index file sorted from the most frequent N-Grams to the least frequent ones
  - N-Grams below a particular rank are cut off, so minor changes are ignored.
  - This feature of the N-Gram Frequency Index is what supports dynamic pages.
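The "TEXT" example can be reproduced with a short sketch. The function name `ngrams` is an illustrative choice, not something named in the slides.

```python
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """Return every contiguous n-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("TEXT", 2))  # ['TE', 'EX', 'XT']
```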
7
N-Gram Frequency Index (2)
- How to generate
  1. Count the frequency of every N-Gram in a web page.
  2. Sort the N-Grams from the most frequent to the least.
  3. Cut off the N-Grams below a particular rank.
  4. Sum up the frequencies of the remaining N-Grams.
  5. Compute the probability of each N-Gram frequency.
  6. Save the N-Grams, their frequencies, and their probabilities into an index file.
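The generation steps might be sketched as follows. Several details are assumptions, since the slides do not state them:

- the default values `n=2` and `top_k=300`
- reading "probability" as an N-Gram's frequency divided by the sum over the kept N-Grams
- returning an in-memory dictionary rather than writing an index file

The name `build_index` is hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def build_index(page: str, n: int = 2, top_k: int = 300) -> Dict[str, Tuple[int, float]]:
    """Build an N-Gram frequency index: gram -> (frequency, probability).

    n and top_k are assumed defaults; the slides give no concrete values.
    """
    # Step 1: count every N-Gram in the page.
    counts = Counter(page[i:i + n] for i in range(len(page) - n + 1))
    # Steps 2-3: sort by descending frequency and cut off below rank top_k.
    kept = counts.most_common(top_k)
    # Step 4: sum the frequencies of the remaining N-Grams.
    total = sum(freq for _, freq in kept)
    # Steps 5-6: attach a probability to each kept N-Gram (assumed: freq / total).
    return {gram: (freq, freq / total) for gram, freq in kept}


index = build_index("ABABAB", n=2, top_k=2)
for gram, (freq, prob) in index.items():
    print(gram, freq, round(prob, 2))
```

For "ABABAB" the 2-Grams are AB (3 times) and BA (2 times), so the kept frequencies sum to 5 and the probabilities are 0.6 and 0.4.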
8
N-Gram-based Index Distance (NGID)
- The sum of the absolute differences between the frequency probabilities of the same N-Grams found in both web pages
- A metric for detecting whether a web page has been defaced
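Written as a formula, the verbal definition might read as below. The notation is an assumption introduced here, not taken from the slides: I_A and I_B denote the sets of N-Grams kept in the indexes of pages A and B, and p_A(g), p_B(g) denote the frequency probability of N-Gram g in each index.

```latex
\mathrm{NGID}(A, B) \;=\; \sum_{g \,\in\, I_A \cap I_B} \bigl|\, p_A(g) - p_B(g) \,\bigr|
```

The sum runs only over N-Grams present in both indexes, following the wording of the definition. The slides do not say how N-Grams that appear in only one index are handled.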
9
N-Gram-based Index Distance
- Evaluation is done by comparing NGID to a validation threshold.
- Evaluation
  - Valid: NGID <= Validation Threshold
  - Invalid: NGID > Validation Threshold
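A minimal sketch of the distance and the threshold rule, under stated assumptions. Each index is reduced to a mapping from N-Gram to frequency probability, and N-Grams present in only one index are ignored, following the "found in both" wording of the definition. The function names and the sample probabilities are hypothetical. The default threshold of 0.1 is the value chosen in the experiments.

```python
from typing import Dict


def ngid(reference: Dict[str, float], current: Dict[str, float]) -> float:
    """Sum of absolute probability differences over N-Grams in both indexes."""
    shared = reference.keys() & current.keys()
    return sum(abs(reference[g] - current[g]) for g in shared)


def is_valid(reference: Dict[str, float], current: Dict[str, float],
             threshold: float = 0.1) -> bool:
    """Valid when NGID <= threshold, invalid when NGID > threshold."""
    return ngid(reference, current) <= threshold


# Made-up probabilities for illustration only.
normal_index = {"AB": 0.60, "BA": 0.40}
minor_update = {"AB": 0.58, "BA": 0.42}   # small drift
heavy_change = {"AB": 0.20, "BA": 0.80}   # large shift

print(is_valid(normal_index, minor_update))  # True
print(is_valid(normal_index, heavy_change))  # False
```

Restricting the sum to shared N-Grams is a design choice. A page whose N-Grams are almost entirely replaced would share few entries with the reference, so a practical implementation may need an explicit rule for unmatched N-Grams.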
10
Experiments
- Assumptions
  - Select 100 web pages.
  - Choose 0.1 as the Validation Threshold of NGID.
- Procedure for false positive
  1. Connect to one selected web page at a time from a remote place.
  2. Download the page and save it to a file.
  3. Validate it using NGID.
  4. Validate it using the hash code.
  - The above four steps are repeated every 30 minutes over a day.
- Selected web pages by category

  News Paper  Broadcast  Portal  Public  Total
  38          15         14      33      100
11
Experiments
- False Positive

                                News Paper  Broadcast  Portal  Public  Total
  No. of Web Sites              38          15         14      33      100
  No. of False Positive (MD5)   29          14         12      8       63
  No. of False Positive (NGID)  1           1          0       0       2
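The per-category false positive rates implied by these counts can be computed with a short sketch. The counts are copied from the table. The rates are derived here and are not stated in the slides.

```python
# Counts copied from the false positive table.
sites = {"News Paper": 38, "Broadcast": 15, "Portal": 14, "Public": 33}
fp_md5 = {"News Paper": 29, "Broadcast": 14, "Portal": 12, "Public": 8}
fp_ngid = {"News Paper": 1, "Broadcast": 1, "Portal": 0, "Public": 0}

# False positive rate = false positives / number of sites in the category.
for category, n_sites in sites.items():
    md5_rate = fp_md5[category] / n_sites
    ngid_rate = fp_ngid[category] / n_sites
    print(f"{category:<10}  MD5 {md5_rate:6.1%}  NGID {ngid_rate:6.1%}")

total_sites = sum(sites.values())
total_md5 = sum(fp_md5.values()) / total_sites
total_ngid = sum(fp_ngid.values()) / total_sites
print(f"{'Total':<10}  MD5 {total_md5:6.1%}  NGID {total_ngid:6.1%}")
```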
12
Experiments False Positive
13
Experiments
[Figure: NGID value as time flows; markers 1 and 2 indicate the time of contents update]
14
Experiments
- Procedure for false negative
  1. Collect 50 web pages (normal pages and hacked pages) from zone-h.
  2. Validate them using NGID.
  3. Validate them using the hash code.
- Result of hash code
  - All 50 web pages are detected to be defaced.
  - The number of false negatives is 0.
15
Experiments False Negative
16
Conclusions
- N-Gram-based Index Distance
  - A metric for evaluating dynamic web page defacement
  - NGID can validate dynamically changing web pages.
- Future Works
  - A learning model is needed to determine the validation threshold of each web page.
  - A feedback mechanism for the normal index is needed.