Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis Jean-Pierre Norguet.


1 Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis Jean-Pierre Norguet

2 Web Communication Web transaction = request + response Meta-data in Web logs: –Request date and time –Page reference (URI) –Referral URI –Client machine information
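
The four metadata fields listed on this slide can be pulled out of a Web log line with a short parser. This is a minimal sketch that assumes the Apache "combined" log format; the slides do not name a log format, and the sample line and the function name `parse_log_line` are invented for illustration.

```python
import re
from datetime import datetime

# Assumption: Apache "combined" log format (the slides do not specify one).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the metadata of one Web transaction, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        # Request date and time
        "date": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        # Page reference (URI)
        "uri": match.group("uri"),
        # Referral URI
        "referrer": match.group("referrer"),
        # Client machine information
        "client": match.group("host"),
        "agent": match.group("agent"),
    }

# Hypothetical log line, made up for this example.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] '
          '"GET /cgi/search?q=ontology HTTP/1.1" 200 2326 '
          '"http://www.ulb.ac.be/" "Mozilla/5.0"')
entry = parse_log_line(sample)
```

Note that the log line records only the page reference, not the page content, which is the limitation the later slides address.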

3 Web Analytics Process

4 Web Analytics Tools Results: –Page views –Number of visitors –Throughput –Traffic Exploitation: –Self-promotion –Sales planning –Technical resizing –Structure optimization → Low semantics → Low-level decisions

5 Organization Structure Web analytics tools

6 Web Analytics Results Low semantics → low intuitiveness Too many results

7 Page Ref. Ambiguity (1) Address: http://www.ulb.ac.be/cgi/search

8 Page Ref. Ambiguity (2) Address: http://www.ulb.ac.be/cgi/search

9 Page Volatility Address: http://www.ulb.ac.be/cgi/search

10 Page Synonymy (1)

11 Page Synonymy (2)

12 Page Polysemy

13 Page Temporality (1)

14 Page Temporality (2)

15 Problems Summary Low semantics → low intuitiveness Too many results Page reference ambiguity Page synonymy Page polysemy Page temporality Page volatility

16 Our solution Summarized and conceptual results for: –Chief editors –Organization managers Generic solution, independent of: –Web site content –Web site language –Web site technology → analyze output text content

17 Output Page Collection Mining points in Web environment: 1.Web logs (+ content journal) 2.Web server 3.Network wire 4.On-screen Web page

18 Lexical Analysis Output page mining → Web pages Unformatting → text Tokenization → terms Stopword removal Stemming Term selection → index terms Occurrence counting → audience metrics
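
The pipeline on this slide can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the stopword list is a tiny placeholder, `stem` is a crude plural-stripping stand-in for a real stemmer, and the sample page and index terms are invented. None of these come from the slides or from the WASA prototype.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a tiny illustrative stopword list, not the one used by the author.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page (script/style content is not filtered)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def unformat(html):
    """Unformatting: Web page -> text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def tokenize(text):
    """Tokenization: text -> lowercase terms."""
    return re.findall(r"[a-z]+", text.lower())

def stem(term):
    """Crude plural stripping; a placeholder for a real stemming algorithm."""
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]
    return term

def count_index_terms(html, index_terms):
    """Stopword removal, stemming, term selection, and occurrence counting."""
    stems = [stem(t) for t in tokenize(unformat(html)) if t not in STOPWORDS]
    return Counter(t for t in stems if t in index_terms)

# Hypothetical output page and index terms.
page = ("<html><body><h1>Databases</h1>"
        "<p>Mining the database logs and mining networks.</p></body></html>")
counts = count_index_terms(page, {"database", "network"})
```

Here `counts` holds the number of occurrences of each index term in one page; summing such counters over all output pages gives the raw material for the audience metrics.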

19 Term-Based Metrics Term occurrence counting in pages: –Presence (online pages) –Consultation (output pages) –Interest

20 Term-based metrics: –Consultation –Presence –Interest Limitations: –Too many terms –Term synonymy –Term polysemy → Ontology-based term grouping
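
The three term-based metrics can be illustrated with a short sketch. Presence is taken here as the occurrence count of a term in the online pages and consultation as its count in the output pages, following the previous slide. The slides do not give a formula for interest; the ratio consultation / presence used below is an assumption, and the function name and sample counts are hypothetical.

```python
def term_metrics(online_counts, output_counts):
    """Combine per-term occurrence counts into presence, consultation, and interest.

    online_counts: term -> occurrences in the pages published on the site
    output_counts: term -> occurrences in the pages actually served to visitors
    """
    metrics = {}
    for term in set(online_counts) | set(output_counts):
        presence = online_counts.get(term, 0)
        consultation = output_counts.get(term, 0)
        # Assumption: interest is the ratio of consultation to presence.
        interest = consultation / presence if presence else None
        metrics[term] = {
            "presence": presence,
            "consultation": consultation,
            "interest": interest,
        }
    return metrics

# Hypothetical counts for two terms.
metrics = term_metrics(
    online_counts={"database": 40, "network": 10},
    output_counts={"database": 120, "network": 5},
)
```

Under this assumed definition, a term that is consulted much more often than it is present has an interest above 1, and a term that is published but rarely served has an interest below 1.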

21 Hierarchical Aggregation Consultation Presence

22 Hierarchical Aggregation Consultation Presence Interest (x2)

23 Hierarchical Aggregation Consultation Presence Interest (x2)
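
Hierarchical aggregation can be sketched as rolling each term's count up to all of its ancestors in the ontology. The representation below (a child-to-parent dictionary forming a tree without cycles) and the sample terms are assumptions made for illustration; the slides do not show how the hierarchy is stored.

```python
def aggregate(counts, parent):
    """Add each term's count to the term itself and to every ancestor.

    counts: term -> number of occurrences (consultation or presence)
    parent: term -> parent term, with None (or a missing key) for top domains
    """
    totals = {}
    for term, n in counts.items():
        node = term
        while node is not None:
            totals[node] = totals.get(node, 0) + n
            node = parent.get(node)
    return totals

# Hypothetical three-node hierarchy and occurrence counts.
parent = {
    "relational databases": "information systems",
    "data mining": "information systems",
    "information systems": None,
}
consultation = aggregate(
    {"relational databases": 3, "data mining": 2, "information systems": 1},
    parent,
)
```

The same function can be applied separately to consultation and presence counts, so that metrics become available for each concept at every level of the hierarchy.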

24 Data model Ontology term hierarchy Number of occurrences: by day, by term List of days (possibly aggregated)

25 OLAP Model Parent-child ontology dimension Time dimension Measures
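
The data model and OLAP dimensions above can be approximated with a small star schema. The experiments in the talk use SQL Server OLAP Analysis Service; the sketch below swaps in SQLite through Python's `sqlite3` module purely for illustration, and every table name, column name, and data row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Parent-child ontology dimension
CREATE TABLE term (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES term(id)
);
-- Time dimension
CREATE TABLE day (
    id   INTEGER PRIMARY KEY,
    date TEXT NOT NULL
);
-- Fact table: number of occurrences by day, by term
CREATE TABLE occurrence (
    term_id      INTEGER REFERENCES term(id),
    day_id       INTEGER REFERENCES day(id),
    consultation INTEGER NOT NULL,
    presence     INTEGER NOT NULL,
    PRIMARY KEY (term_id, day_id)
);
""")
conn.executemany("INSERT INTO term VALUES (?, ?, ?)", [
    (1, "Software", None),
    (2, "Programming languages", 1),
    (3, "Operating systems", 1),
])
conn.executemany("INSERT INTO day VALUES (?, ?)",
                 [(1, "2005-10-10"), (2, "2005-10-11")])
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?)",
                 [(2, 1, 30, 10), (3, 1, 5, 10), (2, 2, 20, 10)])

# Roll the measures up over a concept and all of its descendants.
ROLLUP = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM term WHERE name = ?
    UNION ALL
    SELECT t.id FROM term t JOIN subtree s ON t.parent_id = s.id
)
SELECT COALESCE(SUM(o.consultation), 0), COALESCE(SUM(o.presence), 0)
FROM occurrence o JOIN subtree s ON o.term_id = s.id
"""
consultation, presence = conn.execute(ROLLUP, ("Software",)).fetchone()
```

An OLAP server performs this kind of roll-up along the parent-child dimension itself; the recursive query only imitates that behaviour on a plain relational store.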

26 Case Study Web site: cs.ulb.ac.be –1,500 pages –100 page views/day –Knowledge domain: computer science Ontology: ACM classification –Knowledge domain: computer science –11 top domains –3 levels –1230 terms

27 Experimental setting WASA prototype SQL Server OLAP Analysis Service

28 Concept-Based Metrics Y: top ontology domains X: consultation, presence, interest

29 Results

30 Exploitation Process

31 Summary Web analytics Output page mining Lexical analysis Concept-based metrics with OLAP Experiments Conclusion & future work

32 Conclusion Most Web sites supported Approach validated by experiments Topic-based metrics are intuitive Exploitation at higher decision levels Limitation: ontology availability Future work: ontology enrichment → Integration into Web analytics tools

33 Thank you for your attention

34 Q & A

35 Content Journaling Web logs + content journal (+) Easy to set up (+) Minimal storage and computation (-) Dynamic pages

36 Server Monitoring Web server plugin (+) Dynamic pages (+) Fast (-) Risky

37 Network Monitoring TCP/IP packet sniffing (+) Independent from Web server (-) Ethernet only (-) Encrypted content (-) CPU-intensive

38 Client-Side Collection Page-embedded program 1.Parses page 2.Sends content to mining server (+) Distributed workload (+) Supports client-side XML/XSL (-) Visibility and vulnerability

39 Output Page Collection Collection methods alone or in combination → any Web site output is collectable 1.Implemented: WASA-CJ 2.Implemented: Sourceforge mod_trace_output

40 Experiments Experimental settings Visualization Ontology coverage Validation Scalability

41 Experimental setting WASA prototype SQL Server OLAP Analysis Service

42 EUROVOC Thesaurus European Commission thesaurus Knowledge domain: EC-related domains 21 top domains 8 levels 6650 terms

43 Eurovoc Example
Top domains:
04 Politics
08 International Relations
10 European Communities
12 Law
16 Economics
20 Trade
24 Finance
28 Social Questions
32 Education and Communications
36 Science
40 Business and Competition
44 Employment and Working Conditions
48 Transport
52 Environment
56 Agriculture, Forestry and Fisheries
60 Agri-Foodstuffs
64 Production, Technology and Research
66 Energy
68 Industry
72 Geography
76 International Organisations
28 SOCIAL QUESTIONS
2806 family
2811 migration
2816 demography and population
2821 social framework
2826 social affairs
2831 culture and religion
–arts
–cultural policy
–culture
–acculturation
–civilization
–cultural difference
–cultural identity
  RT: protection of minorities (1236)
  RT: socio-cultural group (2821)
–cultural pluralism
–popular culture
–regional culture
–religion
2836 social protection
2841 health
2846 construction and town planning

44 Ontology Coverage Definition: the percentage of ontology terms that appear in the Web site ACM classification: 15% Eurovoc: 0.75% Characterizes the meaning of the metrics → ontology enrichment with terms of the Web site
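
The coverage definition on this slide translates directly into a set computation. The function name and the sample term sets below are invented for illustration, and terms are assumed to be already normalized (lowercased and stemmed) on both sides.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that appear in the Web site."""
    ontology_terms = set(ontology_terms)
    if not ontology_terms:
        return 0.0
    found = ontology_terms & set(site_terms)
    return 100.0 * len(found) / len(ontology_terms)

# Hypothetical, already-normalized term sets.
coverage = ontology_coverage(
    ontology_terms={"database", "network", "compiler", "robotics"},
    site_terms={"database", "course", "exam"},
)
```

A low coverage means that few ontology terms contribute to the metrics, which is why the slide proposes enriching the ontology with terms taken from the Web site.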

45 Collaborative Enrichment

46 Methodology Steps Editor browses his pages Select new terms Find enrichment point in the ontology Insert terms into ontology Editor sends ontology to chief editor Chief editor commits the inserts

47 Results

48 Validation Comparison with WebTrends Personal Web site Optimized custom ontology of 1250 terms Top concepts match the page directories → results should be comparable

49 Results Urchin WASA

50 Scalability: Case Study Web site: www.ulb.ac.be –800,000 pages –100,000 page views –Knowledge domain: broad Ontology: Eurovoc –Knowledge domain: broad (EC’s interests) –21 top domains –8 levels –6650 terms Run = 15 hours, linear dependency → reasonable and applicable to any Web site

51 Experiments Experimental settings Visualization Ontology coverage Validation Scalability

52 Ontologies Specification of a conceptualisation Controlled vocabulary of terms and relations An ontology defines the concepts and relations needed to share, reuse, and represent domain knowledge

53 Ontology Restriction Ontology → concept hierarchy

54 Contents Context & motivations Output page mining Lexical analysis Concept-based metrics with OLAP Experiments Exploitation Conclusion & future work

55 Context Web emergence Web communication analysis Maintenance needs effective decisions Highest organization levels Summarized and conceptual results Web analytics tools inappropriate

56 Exploitation Process

57 Metric Exploitation High interest –Search pages about the topic –Rank pages by consultation –Optimize pages Low interest –Search pages about the topic –Rank pages by presence –Question the topic: important/not important –Drain traffic to the pages/delete pages

58 Future Work Concept visualisation in semantic space Automated taxonomy enrichment Additional OLAP dimensions

