Published by Caren Spencer. Modified over 8 years ago.
1
Web Page Classifiers Inmaculada Hernández
2
Roadmap Introduction Classifiers Taxonomy Evaluation Conclusions & Future Work
3
Roadmap Introduction Classifiers Taxonomy Evaluation Conclusions & Future Work
4
Introduction Page categorization
5
Introduction Wrapper Generation
6
Roadmap Introduction Classifiers Taxonomy Evaluation Conclusions & Future Work
7
Web Classification Issues High Dimensionality Which features? How many? Where do we find them? High Speed Training Set Positive & Negative Only Positive None
8
Web classification Framework Focus Feature Analysis Page Representation Baseline Classifier Features Preprocessing Techniques
9
Hubs & Authorities In dynamic pages, we will use “detail” instead of “authority”
10
Classification Focus. Domain: Sports, Movies, Political, Online Stores… Functional: Home, Hubs, Detail, Error, No results… Sentiment: Angry, Sad, Happy… Spam: Authorized mail, Spam…
11
Web classification Framework Focus Feature Analysis Page Representation Baseline Classifier Features Preprocessing Techniques
12
Page Representation: Brad – 2 times; Movie – 4 times; Romance – 1 time; Rating – 1 time
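The word counts on this slide correspond to a bag-of-words page representation. A minimal sketch follows, assuming a simple regular-expression tokenizer; the function name `term_frequencies` and the sample text are hypothetical and were made up only to reproduce the counts shown on the slide.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical page text, invented for illustration only.
page_text = (
    "Movie review: Brad stars in this romance movie. "
    "Movie rating: a movie Brad fans will like."
)
tf = term_frequencies(page_text)

for word in ("brad", "movie", "romance", "rating"):
    print(word, tf[word])
```

The resulting counter can be used directly as a sparse word vector, with one dimension per distinct term.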
13
Web classification Framework Focus Feature Analysis Page Representation Baseline Classifier Features Preprocessing Techniques
14
Classification Features Content Structure Links URL Visual Analysis Hybrid Types On Page Neighbours Location
15
Content-Based Features: Word Frequency, Page Size. Examples: Hotho02, Selamat04, Pierre01. Movies: movie, Brad Pitt, Rating, Characters, Plot, Release Date, David Fincher. Sports: Tennis, Football, Sports, Rafa Nadal, Matches, Golf.
16
Structure-Based Features Not used Templates Trees Regular Expressions Examples Reis04 Vieira06 Crescenzi01
17
Link-Based Features Incoming links Outgoing links Examples Pierre01 Bar-Yossef02 Blanco07 Web Site
18
URL-Based http://www.amazon.com/shoes-men-women-kids-baby/b/ref=sa_menu_shoe9_gw http://www.wordreference.com/es/en/translation.asp?spen=newswire http://www.amazon.com/MP3-Music-Download/b/ref=sa_menu_dmusic2_gw
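URL-based classification uses only the address of a page, so features can be computed without downloading it. The following is a rough sketch of one possible feature extractor; the function name `url_features` and the particular features chosen (host, path depth, path tokens, query keys) are assumptions for illustration, not taken from the slides.

```python
import re
from urllib.parse import urlparse, parse_qs


def url_features(url):
    """Split a URL into simple features usable by a classifier."""
    parts = urlparse(url)
    # Break the path on every non-alphanumeric character to obtain word-like tokens.
    path_tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", parts.path) if t]
    return {
        "host": parts.netloc.lower(),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "path_tokens": path_tokens,
        "query_keys": sorted(parse_qs(parts.query).keys()),
    }


shop = url_features(
    "http://www.amazon.com/shoes-men-women-kids-baby/b/ref=sa_menu_shoe9_gw"
)
dictionary = url_features(
    "http://www.wordreference.com/es/en/translation.asp?spen=newswire"
)
print(shop["path_tokens"])
print(dictionary["query_keys"])
```

Tokens such as `shoes` or `translation` hint at the topic of a page, while path depth and query keys hint at its functional role.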
19
Visual Analysis Features Number of images Line spacing Distance between elements Examples Alvarez07
20
Hybrid Feature Set Links Structure Content Features Combination of structure and content Examples Caverlee05 Markov08
21
Neutral Features Any Examples Yu03 Yu04 SVM
22
Neighbour-Based Features Anchor text Extended anchor text Headings preceding anchor Labels Content Examples Cohen02 Furnkraz02
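Anchor text is the simplest neighbour-based feature: the text of a link on a neighbouring page is used to describe the page it points to. The sketch below collects (href, anchor text) pairs with Python's standard `html.parser`; the class name `AnchorCollector` and the sample HTML are hypothetical, and extended anchor text or preceding headings are not handled.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> elements of a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and normalise whitespace.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


# Hypothetical hub page fragment.
html = '<h2>Films</h2><a href="/m/1">Fight <b>Club</b></a> and <a href="/m/2">Seven</a>'
collector = AnchorCollector()
collector.feed(html)
collector.close()
print(collector.anchors)
```

Each collected text could then be added to the feature vector of the target page rather than to that of the page containing the link.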
23
Web classification Framework Focus Feature Analysis Page Representation Baseline Classifier Features Preprocessing Techniques
24
Feature Analysis Dimensionality Reduction Feature Selection Document Frequency Mutual Information Odds Ratio Cross Entropy Information Gain Chi-square Feature Extraction Latent Semantic Indexing Word Clustering
25
Web classification Framework Focus Feature Analysis Page Representation Baseline Classifier Features Preprocessing Techniques
26
Baseline Classifier Statistical Bayesian Network K-Nearest Neighbour Machine Learning Decision Trees Genetic Algorithms Neural Networks Backpropagation Self-organizing maps Ad-hoc Techniques TemplateTokens UFRE Pagelet
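Of the baseline classifiers listed, k-nearest neighbour is the easiest to sketch. The version below assumes pages are represented as word-count vectors and compared by cosine similarity; the function names, the choice of similarity measure, and the training data are all illustrative assumptions rather than the setup of any cited proposal.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, training, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration.
training = [
    (Counter("movie plot rating brad".split()), "movies"),
    (Counter("movie release date characters".split()), "movies"),
    (Counter("tennis golf matches".split()), "sports"),
    (Counter("football matches sports".split()), "sports"),
]

print(knn_classify(Counter("brad movie rating".split()), training))
print(knn_classify(Counter("golf matches sports".split()), training))
```

With two classes, an odd k avoids tied votes.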
27
Web classification Framework Focus Feature Analysis Page Representation Baseline Classifier Features Preprocessing Techniques
28
Cleansing & Formatting (jTidy) Tokenization Stemming Stop Words Removal of tags Removal of rare words
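The preprocessing steps on this slide can be chained into a small pipeline. The sketch below is only an approximation: it uses Python's `html.parser` in place of jTidy, a tiny hand-picked stop-word list, and a crude suffix stripper as a stand-in for a real stemmer such as Porter's. All function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


class _TextExtractor(HTMLParser):
    """Keep only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def crude_stem(word):
    """Strip a few common English suffixes; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Tag removal, tokenization, stop-word removal, then stemming."""
    tokens = re.findall(r"[a-z]+", strip_tags(html).lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def drop_rare(docs, min_count=2):
    """Remove tokens that occur fewer than min_count times across all documents."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


print(preprocess("<p>The <b>ratings</b> of the movies</p>"))
```

Note that this extractor also keeps the contents of script and style elements, which a production pipeline would normally discard.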
29
Related Work
Author | Classification Type | Approach | Baseline Classifier | Page Representation | Execution | Preprocessing / Feature Analysis
Pierre01 | Domain | Content & Link | K-Nearest Neighbour | Text | Eager | LSI
Hotho02 | Domain | Content | K-Means | Concept Vectors | Lazy | Term Select., COSA (Ontology)
Selamat04 | Domain | Content | Neural Networks | Word Vectors | Eager | Stemming, stop words; PCA, CBPF
Markov08 | Domain | Content & Structural | C4.5, Naïve-Bayes | Graph document representation / Boolean Vector | Eager | Stopwords, Stemming; Minimal frequency threshold for subgraphs
Yu03 & Yu04 | Both | Any | 1-DNF, SVM | Feature Vector | Eager | -
Doorenbos97 | Functional | Structural | - | logical lines | Eager | -
Crescenzi01 | Functional | Structural | UFREs | Tags & Text | Lazy | jTidy, Tokenization
Arasu03 | Functional | Structural | Eq. classes | Tags & Text | Lazy | LFEQ
Bar-Yossef02 | Functional | Structural & Link | Pagelet | Parse Tree | Lazy | -
Reis04 | Functional | Structural | Tree Edit Distance | DOM Tree | Lazy | -
Grumbach99 | Functional | Structural | Mark-up Encoding | Sequences of characters over a finite alphabet | Lazy | -
Flesca05 | Functional | Structural | Disc. Fourier Trans. | DTD | Lazy | Check document conformity to DTD
Vieira06 | Functional | Structural | RTDM-TD | DOM Tree | Lazy | -
Caverlee05 | Functional | Content & Structural & Link | K-Means | Tag Tree | Eager | Tidy
Vidal07 | Functional | Structural & URL | RTDM | DOM Tree | Lazy | -
Blanco07 | Functional | Structural & Link | TemplateTokens | DOM Tree | Eager | Links
30
Roadmap Introduction Classifiers Taxonomy Evaluation Conclusions & Future Work
31
Evaluation Metrics Precision Recall F-Measure Data Set (Training & Testing) Size Source
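The three metrics named on this slide follow directly from the counts of true positives, false positives, and false negatives for one class. A minimal sketch, assuming the balanced F-measure (F1) and a single positive class; the function name and the sample labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for the class named `positive`."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1     # correctly predicted positive
        elif predicted == positive:
            fp += 1     # predicted positive, actually another class
        elif truth == positive:
            fn += 1     # missed positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented labels for five pages of a functional classifier.
truth = ["detail", "detail", "hub", "hub", "detail"]
predicted = ["detail", "hub", "hub", "detail", "detail"]
p, r, f1 = precision_recall_f1(truth, predicted, positive="detail")
print(round(p, 3), round(r, 3), round(f1, 3))
```

For multi-class problems the same function can be applied once per class and the results averaged.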
32
Data Sets Sources. Standard Data Sets: Reuters (21,578 docs), RCV (804,414 docs), TEL-8 (~500 docs), WebKB (8,282 docs), Open Directory (2,656,105 docs). Reference Webs: Amazon, Yahoo!, Sports Pages (?). Query results, Crawled pages, …
33
Roadmap Introduction Classifiers Taxonomy Evaluation Conclusions & Future Work
34
Conclusions Several different proposals Good results in general Classifiers Not comparable Specific Data Sets Classifiers Evaluation Not frequently applied Specific techniques Preprocessing Structural Functional Our focus
35
Crawling vs. Virtual Integration Crawling Virtual Integration
36
Research challenges Classifiers Feature Selection Standard Dataset Link classifiers Navigation Which web page classifiers are better for navigation? Post-filtering
37
Questions?
38
Thanks! Drop by our web site at http://www.tdg-seville.info inmahernandez@us.es