Published by Ilene Doris Banks. Modified over 9 years ago.
1
Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis
Jean-Pierre Norguet
2
Web Communication
Web transaction = request + response
Meta-data in Web logs:
–Request date and time
–Page reference (URI)
–Referral URI
–Client machine information
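The four meta-data fields listed above can be read from a server log line. The following is a minimal sketch, assuming the server writes the widely used Combined Log Format; the slides do not say which log format was used. The sample line, `LOG_PATTERN`, and `parse_log_line` are hypothetical names chosen for illustration.

```python
import re
from datetime import datetime

# Assumption: Combined Log Format, i.e.
# host ident user [date] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the request meta-data as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Request date and time, with the time zone offset preserved.
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Hypothetical log line, made up for this example.
SAMPLE = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] '
          '"GET /cgi/search?q=ontology HTTP/1.1" 200 2326 '
          '"http://www.ulb.ac.be/" "Mozilla/5.0"')

if __name__ == "__main__":
    entry = parse_log_line(SAMPLE)
    print(entry["date"], entry["uri"], entry["referrer"], entry["agent"])
```

Note that the log only records the page reference, not what the page displayed, which is the limitation the following slides discuss.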
3
Web Analytics Process
4
Web Analytics Tools
Results
–Page views
–Number of visitors
–Throughput
–Traffic
Exploitation
–Self-promotion
–Sales planning
–Technical resizing
–Structure optimization
Low semantics
Low-level decisions
5
Organization Structure Web analytics tools
6
Web Analytics Results
Low semantics, low intuitiveness
Too many results
7
Page Ref. Ambiguity (1)
Address: http://www.ulb.ac.be/cgi/search
8
Page Ref. Ambiguity (2)
Address: http://www.ulb.ac.be/cgi/search
9
Page Volatility
Address: http://www.ulb.ac.be/cgi/search
10
Page Synonymy (1)
11
Page Synonymy (2)
12
Page Polysemy
13
Page Temporality (1)
14
Page Temporality (2)
15
Problems Summary
Low semantics, low intuitiveness
Too many results
Page reference ambiguity
Page synonymy
Page polysemy
Page temporality
Page volatility
16
Our solution
Summarized and conceptual results for:
–Chief editors
–Organization managers
Generic solution, independent from:
–Web site content
–Web site language
–Web site technology
Analyze output text content
17
Output Page Collection
Mining points in Web environment:
1.Web logs (+ content journal)
2.Web server
3.Network wire
4.On-screen Web page
18
Lexical Analysis
Output page mining → Web pages
Unformatting → text
Tokenization → terms
Stopwords removal, stemming, term selection → index terms
Occurrence counting → audience metrics
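The lexical analysis chain on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the WASA implementation: the stopword list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and term selection is modelled as an optional filter against a given vocabulary. All function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and ignores the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def unformat(html):
    """Unformatting step: HTML page -> plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(html, vocabulary=None):
    """Tokenize, drop stopwords, stem, select terms, and count occurrences."""
    tokens = re.findall(r"[a-z]+", unformat(html).lower())
    stems = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    if vocabulary is not None:
        # Term selection: keep only stems found in the given vocabulary.
        stems = [s for s in stems if s in vocabulary]
    return Counter(stems)


if __name__ == "__main__":
    page = "<html><body><h1>Mining Web logs</h1><p>The logs of a Web server.</p></body></html>"
    print(index_terms(page))
```

The text extractor keeps every text node, including script and style content, so a production version would need to filter those elements.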
19
Term-Based Metrics
Term occurrence counting in pages:
–Presence: online pages
–Consultation: output pages
–Interest
20
Term-based metrics:
–Consultation
–Presence
–Interest
Limitations:
–Too many terms
–Term synonymy
–Term polysemy
Ontology-based term grouping
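The three term-based metrics can be illustrated with plain occurrence counting. Following the previous slide, presence is counted over the online pages and consultation over the output pages (one entry per page served). The slides do not give a formula for interest; the sketch below assumes it is the ratio of consultation to presence. `term_metrics` and the sample data are hypothetical.

```python
from collections import Counter


def term_metrics(online_pages, output_pages):
    """Compute per-term consultation, presence, and interest.

    Each argument is a list of pages, and each page is a list of index terms.
    online_pages holds one entry per page published on the site;
    output_pages holds one entry per page actually served to a visitor.
    """
    presence = Counter()
    for terms in online_pages:
        presence.update(terms)

    consultation = Counter()
    for terms in output_pages:
        consultation.update(terms)

    # Assumption: interest = consultation / presence (not stated on the slides).
    interest = {t: consultation[t] / presence[t] for t in presence if presence[t] > 0}
    return consultation, presence, interest


if __name__ == "__main__":
    online = [["olap", "ontology"], ["olap"]]
    served = [["olap", "ontology"]] * 3  # the first page was viewed three times
    consultation, presence, interest = term_metrics(online, served)
    print(consultation, presence, interest)
```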
21
Hierarchical Aggregation Consultation Presence
22
Hierarchical Aggregation Consultation Presence Interest (x2)
23
Hierarchical Aggregation Consultation Presence Interest (x2)
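Hierarchical aggregation can be sketched as a roll-up of term counts along the ontology's parent links: each term's count is added to the term itself and to all of its ancestors. The sketch assumes the hierarchy is a tree given as a child-to-parent mapping with no cycles; `aggregate` and the sample hierarchy are hypothetical.

```python
def aggregate(counts, parent):
    """Roll term counts up a hierarchy.

    counts maps a term to its own occurrence count.
    parent maps a term to its parent term (None or missing for a root).
    Returns a dict mapping each term to its own count plus its descendants' counts.
    Assumption: the parent links form a tree, so the walk always terminates.
    """
    totals = {}
    for term, n in counts.items():
        node = term
        while node is not None:
            totals[node] = totals.get(node, 0) + n
            node = parent.get(node)
    return totals


if __name__ == "__main__":
    parent = {
        "olap": "databases",
        "databases": "information systems",
        "information systems": None,
    }
    consultation = {"olap": 3, "databases": 2}
    print(aggregate(consultation, parent))
```

Consultation and presence are additive and can be rolled up this way. A ratio such as interest is not additive, so under the ratio assumption it would be recomputed at each level from the aggregated counts rather than summed.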
24
Data model
Ontology term hierarchy
Number of occurrences: by day, by term
List of days (possibly aggregated)
25
OLAP Model
Parent-child ontology dimension
Time dimension
Measures
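The OLAP model above can be approximated with a small relational schema: a parent-child table for the ontology dimension, a table for the time dimension, and a fact table holding the measures. The deck's experiments use SQL Server Analysis Services; the sketch below uses Python's built-in `sqlite3` only as a stand-in, and every table and column name is hypothetical.

```python
import sqlite3


def build_cube(conn):
    """Create the dimension and fact tables (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE term (
            id INTEGER PRIMARY KEY,
            label TEXT NOT NULL,
            parent_id INTEGER REFERENCES term(id)  -- parent-child ontology dimension
        );
        CREATE TABLE day (
            id INTEGER PRIMARY KEY,
            date TEXT NOT NULL                     -- time dimension
        );
        CREATE TABLE fact (
            term_id INTEGER REFERENCES term(id),
            day_id INTEGER REFERENCES day(id),
            consultation INTEGER NOT NULL,         -- measures: occurrence counts
            presence INTEGER NOT NULL
        );
    """)


def rollup(conn, root_id):
    """Sum the measures over a term and all of its descendants, across all days."""
    return conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT ?
            UNION ALL
            SELECT term.id FROM term JOIN subtree ON term.parent_id = subtree.id
        )
        SELECT COALESCE(SUM(consultation), 0), COALESCE(SUM(presence), 0)
        FROM fact
        WHERE term_id IN (SELECT id FROM subtree)
    """, (root_id,)).fetchone()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_cube(conn)
    conn.executemany("INSERT INTO term VALUES (?, ?, ?)",
                     [(1, "Software", None), (2, "Programming languages", 1)])
    conn.execute("INSERT INTO day VALUES (1, '2005-10-10')")
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)",
                     [(1, 1, 4, 1), (2, 1, 6, 2)])
    print(rollup(conn, 1))
```

Only the additive counts are stored as measures here; a derived metric such as interest would be computed at query time from the aggregated values.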
26
Case Study
Web site: cs.ulb.ac.be
–1,500 pages
–100 page views/day
–Knowledge domain: computer science
Ontology: ACM classification
–Knowledge domain: computer science
–11 top domains
–3 levels
–1230 terms
27
Experimental setting WASA prototype SQL Server OLAP Analysis Service
28
Concept-Based Metrics Y: top ontology domains X: consultation, presence, interest
29
Results
30
Exploitation Process
31
Summary Web analytics Output page mining Lexical analysis Concept-based metrics with OLAP Experiments Conclusion & future work
32
Conclusion
Most Web sites supported
Approach validated by experiments
Topic-based metrics are intuitive
Exploitation at higher decision levels
Limitation: ontology availability
Future work: ontology enrichment
Integration into Web analytics tools
33
Thank you for your attention
34
Q & A
35
Content Journaling
Web logs + content journal
(+) Easy to set up
(+) Minimal storage and computation
(-) Dynamic pages
36
Server Monitoring
Web server plugin
(+) Dynamic pages
(+) Fast
(-) Risky
37
Network Monitoring
TCP/IP packet sniffing
(+) Independent from Web server
(-) Ethernet only
(-) Encrypted content
(-) CPU-intensive
38
Client-Side Collection
Page-embedded program
1.Parses page
2.Sends content to mining server
(+) Distributed workload
(+) Supports client-side XML/XSL
(-) Visibility and vulnerability
39
Output Page Collection
Collection methods alone or in combination
Any Web site output is collectable
1.Implemented: WASA-CJ
2.Implemented: SourceForge mod_trace_output
40
Experiments Experimental settings Visualization Ontology coverage Validation Scalability
41
Experimental setting WASA prototype SQL Server OLAP Analysis Service
42
EUROVOC Thesaurus
European Commission thesaurus
Knowledge domain: EC-related domains
21 top domains
8 levels
6650 terms
43
Eurovoc Example
04 Politics
08 International Relations
10 European Communities
12 Law
16 Economics
20 Trade
24 Finance
28 Social Questions
32 Education and Communications
36 Science
40 Business and Competition
44 Employment and Working Conditions
48 Transport
52 Environment
56 Agriculture, Forestry and Fisheries
60 Agri-Foodstuffs
64 Production, Technology and Research
66 Energy
68 Industry
72 Geography
76 International Organisations
28 SOCIAL QUESTIONS
2806 family
2811 migration
2816 demography and population
2821 social framework
2826 social affairs
2831 culture and religion
–arts
–cultural policy
–culture
–acculturation
–civilization
–cultural difference
–cultural identity
RT: protection of minorities (1236)
RT: socio-cultural group (2821)
–cultural pluralism
–popular culture
–regional culture
–religion
2836 social protection
2841 health
2846 construction and town planning
44
Ontology Coverage
Definition: the percentage of ontology terms that appear in the Web site
ACM classification: 15%
Eurovoc: 0.75%
Characterizes the meaning of the metrics
Ontology enrichment with terms of the Web site
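The coverage definition on this slide translates directly into a set computation. The sketch below assumes that both the ontology terms and the site's index terms have been normalized the same way (for example, stemmed and lowercased) before comparison; `ontology_coverage` and the sample terms are hypothetical.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that appear at least once in the Web site."""
    ontology = set(ontology_terms)
    if not ontology:
        return 0.0
    covered = ontology & set(site_terms)
    return 100.0 * len(covered) / len(ontology)


if __name__ == "__main__":
    ontology = {"olap", "ontology", "compiler", "kernel"}
    site = {"olap", "web", "ontology"}
    print(ontology_coverage(ontology, site))
```

Site terms that are absent from the ontology, such as "web" in the sample, do not affect coverage; they are the candidates for the ontology enrichment mentioned on the slide.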
45
Collaborative Enrichment
46
Methodology Steps
Editor browses his pages
Select new terms
Find enrichment point in the ontology
Insert terms into ontology
Editor sends ontology to chief editor
Chief editor commits the inserts
47
Results
48
Validation
Comparison with WebTrends
Personal Web site
Optimized custom ontology of 1250 terms
Top concepts match the page directories
Results should be comparable
49
Results Urchin WASA
50
Scalability: Case Study
Web site: www.ulb.ac.be
–800,000 pages
–100,000 page views
–Knowledge domain: broad
Ontology: Eurovoc
–Knowledge domain: broad (EC’s interests)
–21 top domains
–8 levels
–6650 terms
Run = 15 hours, linear dependency
Reasonable and applicable to any Web site
51
Experiments Experimental settings Visualization Ontology coverage Validation Scalability
52
Ontologies
Specification of a conceptualisation
Controlled vocabulary of terms and relations
An ontology defines the concepts and relations needed to share, reuse, and represent domain knowledge
Example:
53
Ontology Restriction Ontology concept hierarchy
54
Contents Context & motivations Output page mining Lexical analysis Concept-based metrics with OLAP Experiments Exploitation Conclusion & future work
55
Context Web emergence Web communication analysis Maintenance needs effective decisions Highest organization levels Summarized and conceptual results Web analytics tools inappropriate
56
Exploitation Process
57
Metric Exploitation
High interest
–Search pages about the topic
–Rank pages by consultation
–Optimize pages
Low interest
–Search pages about the topic
–Rank pages by presence
–Question the topic: important/not important
–Drain traffic to the pages/delete pages
58
Future Work
Concept visualisation in semantic space
Automated taxonomy enrichment
Additional OLAP dimensions