1 Data Mining At Tech Journal

2 Agenda
Background
Questions of Interest
Data Overview
Selected Approach
Potential Issues
Current Status
First Results

4 The Company
A US company (“TechJournal”) publishes an online journal (“TechPub”) with content aimed specifically at IT professionals.
TechJournal is 15 years old; TechPub is 5 years old.
Content for TechPub comes from three sources:
–Aggregated content from public sources
–TechJournal-created content
–Peer-contributed content
TechJournal’s core business is to produce a high-end list product for the marketing departments of IT manufacturers.

5 The Journal
The content on the publication website is available to both anonymous and registered users.
Registered users also get access to some premium services.
Most content is free; some whitepapers are for sale.
Three unique features of the site:
–Peer-contributed content
–Auction system: readers get paid to contribute content
–New: personalized content for each reader

6 The Readers
Target: IT professionals involved in their organization’s technology purchasing decisions.
The company continuously tries to stimulate new readership through e-mail campaigns.
Different levels of “readership”:
–E-mail Recipients
–Anonymous Visits
–E-mail Recipients: Visited Site
–E-mail Recipients: Repeat Visitor
–Registered Light Reader
–Registered Heavy Reader
[Figure: number of individuals at each readership level]

7 The Business Model
–“Active Readers Produce Better Lists” Loop
–“Known Readers Make For Better Journal” Loop
–“Success Breeds Success” Loop
–“Buzz Marketing” Loop

9 Focal Areas For Data Mining
–Is TechJournal’s current content taxonomy effective, or would some other content taxonomy be more useful?
–Given email recipient attributes, what is the likelihood of a visit to the website?
–Which content headlines would maximize that visit likelihood?
–Given registered readers’ attributes, which stories will they be interested in?
–Given past stories read, what is a registered reader most likely to also read?
–Given registered readers’ attributes, which will be most active?
Business model loops shown on the slide: “Known Readers Make For Better Journal”, “Active Readers Produce Better Lists”, “Success Breeds Success”

11 The Data
My “Chunk of Data” to Mine:
–Issues Table: 713,110 records
–Issues - Content Linker Table: 2,185,664 records
–Content Items Table: 590 records
–Page Visit Table: 43,580 records
–Recipients Table: 195,455 records
–Taxonomy Click Table: 9,385 records

12 Attributes to Work With
Attribute groups: Reader Attributes, Content Attributes, Format Attributes
Primary Key: Recipient ID, IP Address, Content ID, Issue ID
Data Mining Attributes: Title, City, State, Country, Zip, Phone, IT Budget, Employees, Sales, SIC Code, Industry, Time Sent, Time Opened, Time of Visit, Time Content Click, Abstract, Headline, Main Content Type, Media Type, Author, Content Taxonomy, Click Rate, Template Type, Media Type (HTML or Video)
Legend: highlighted attributes = features that can be used directly, or derived from, for classification

13 Creating Content Classes
TechJournal’s current taxonomy for classifying content:
–Manually derived
–Aggregation of other credible taxonomy fragments
–From a content provider point of view
–Goes out to 21 levels in some cases, others as shallow as three
[Figure: number of classes by taxonomy level. Levels 1, 2, 3, 4, 5, ... 21; class counts 1, 5, 46, 798, 1909, 5000+]
9,750 visits spread over 31 classes

15 A Variety of Approaches
ASSOCIATION ANALYSIS
–Given past stories read, what is a registered reader most likely to also read?
PREDICTIVE MODELING
–Given email recipient attributes, what is the likelihood of a visit to the website?
–Which content headline would maximize that visit likelihood?
–Given registered readers’ attributes, which readers will be most active?
–Given registered readers’ attributes, which types of content will they read?
CLUSTER ANALYSIS
–Is TechJournal’s current content taxonomy effective, or would some other taxonomy be more useful?
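
One common way to approach the "most likely to also read" question is association-style co-occurrence counting. The sketch below is illustrative only and is not the deck's method: the function names and the reading histories are hypothetical, and it scores each candidate story by confidence, i.e. the share of readers of story A who also read story B.

```python
from collections import Counter
from itertools import permutations


def also_read_confidence(histories):
    """Return {(a, b): confidence}, where confidence = P(read b | read a)."""
    item_counts = Counter()
    pair_counts = Counter()
    for stories in histories:
        unique = set(stories)  # count each story once per reader
        item_counts.update(unique)
        pair_counts.update(permutations(sorted(unique), 2))
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}


def recommend(histories, story, top_n=3):
    """Rank other stories by confidence given that `story` was read."""
    conf = also_read_confidence(histories)
    ranked = sorted(
        ((b, c) for (a, b), c in conf.items() if a == story),
        key=lambda bc: (-bc[1], bc[0]),  # highest confidence first, then name
    )
    return ranked[:top_n]


# Hypothetical reading histories, one list per registered reader.
histories = [
    ["storage", "security", "cloud"],
    ["storage", "security"],
    ["storage", "networking"],
    ["security", "cloud"],
]
top = recommend(histories, "storage")
```

Here two of the three readers of "storage" also read "security", so "security" ranks first. On real data, a minimum support threshold would usually be added so that rarely read stories do not produce misleading rules.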

17 Potential Issues
Database evolution produces noisy, dirty, unevenly populated data.
Data comes from multiple sources; producing consistent data has been a challenge.
Still not clear if we will end up with enough data to see anything meaningful.
Content taxonomy is relatively new; it most likely has real problems with how it’s structured.
Taxonomy measures article subject matter, but behavior-stimulating content may be in headlines.
Features are somewhat related:
–Features have a high number of discrete values and need to be put into meaningful groupings
–Under-representation of several feature and class values
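
The "high number of discrete values" and "under-representation" issues are often handled together by lumping rare values into a catch-all group. The following is a minimal sketch of that idea, not the grouping actually used in this project; `group_rare_values`, the `min_count` threshold, and the sample state codes are all assumptions for illustration.

```python
from collections import Counter


def group_rare_values(values, min_count, other_label="Other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical values of a categorical feature such as State.
states = ["CA", "CA", "NY", "CA", "NY", "VT", "WY"]
grouped = group_rare_values(states, min_count=2)
# grouped == ["CA", "CA", "NY", "CA", "NY", "Other", "Other"]
```

A frequency threshold is only one option; domain-driven groupings (for example by region or by job function) can be more meaningful when the analyst knows how the values relate.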

18 Feature Grouping - Location
[Figure: locations assigned to groups numbered 1 through 11, plus “Other”]

19 Feature Grouping - Title
Start with ~1000 distinct self-reported titles in the database.
Most interested in title as it correlates with impact and influence on IT buying decisions.
Reclassify them based on three concepts: Seniority, Function, Employees in Company.
[Figure labels: Functional Area 1 ... Functional Area N; Owner; Chairman/CEO; Assistant; Functional Area 1 ... Functional Area 10; Manager of Managers; Assistant; Manager of; Doer; codes 1; 2, 20 - 29; 3, 30 - 39; 4]
Result: 24 categories

21 Where I Am In The Process
Process steps: Problem Definition, Data Gathering, Data Prep, Data Mining, Results Analysis, Visualization, Sum Up Insights

23 First Results
Q: Given registered readers’ attributes, which readers will be most active?
Method: Decision Tree Induction (training set 599 records, test set 187 records)
MSE on training set = 0.1313
MSE on test set = 0.1451
[Figure: decision tree; highlighted leaves 0.7037 (n = 27) and 0.1429 (n = 7)]
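
For reference, the mean squared error reported for the training and test sets is the average squared difference between actual and predicted values. The sketch below shows only that calculation; the function name and the sample labels are hypothetical, and the two predicted values simply reuse the leaf values visible on the slide.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical activity labels (1 = active, 0 = not active) scored against
# leaf predictions like those shown in the tree figure.
actual = [1, 0, 1, 0]
predicted = [0.7037, 0.1429, 0.7037, 0.1429]
mse = mean_squared_error(actual, predicted)
```

A test-set MSE close to the training-set MSE, as reported here, is usually read as a sign that the tree is not badly overfitting.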

24 First Results
Q: Given the attributes of a registered reader, which content types will they read?
Method: Decision Tree Induction
n= 786
node), split, n, deviance, yval
      * denotes terminal node
 1) root 786 223508.000 29.44402
   2) LocGrpID< 1.5 96 23784.990 24.01042
     4) RIC>=70.5 53 10433.890 19.66038 *
     5) RIC< 70.5 43 11112.050 29.37209
      10) RIC< 66 33 8432.545 25.27273 *
      11) RIC>=66 10 294.900 42.90000 *
   3) LocGrpID>=1.5 690 196494.400 30.20000
     6) RIC< 71.5 438 127844.900 28.34475
      12) RIC>=14.5 411 120569.000 27.69586 *
      13) RIC< 14.5 27 4468.667 38.22222 *
     7) RIC>=71.5 252 64521.570 33.42460
      14) Title_Code>=38 20 4712.950 20.45000 *
      15) Title_Code< 38 232 56151.570 34.54310 *
[Figure: decision tree; highlighted leaves 20.45 (n = 20) and 34.54 (n = 232)]
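
The printed tree can be read as a set of nested rules. The sketch below transcribes the splits and leaf values from the listing into a plain function, then checks that the leaves are consistent with the root node. The parameter names mirror the LocGrpID, RIC and Title_Code columns; what RIC measures is not stated in the deck, so it is treated here as an opaque numeric feature.

```python
def predict(loc_grp_id, ric, title_code):
    """Return the leaf value (yval) the printed tree assigns to one record."""
    if loc_grp_id < 1.5:
        if ric >= 70.5:
            return 19.66038  # node 4
        if ric < 66:
            return 25.27273  # node 10
        return 42.90000      # node 11
    if ric < 71.5:
        if ric >= 14.5:
            return 27.69586  # node 12
        return 38.22222      # node 13
    if title_code >= 38:
        return 20.45000      # node 14
    return 34.54310          # node 15


# (n, yval) for each terminal node, copied from the listing.
LEAVES = [
    (53, 19.66038), (33, 25.27273), (10, 42.90000), (411, 27.69586),
    (27, 38.22222), (20, 20.45000), (232, 34.54310),
]
total_n = sum(n for n, _ in LEAVES)
weighted_mean = sum(n * y for n, y in LEAVES) / total_n
```

The leaf counts add up to the 786 records at the root, and their weighted mean reproduces the root value of 29.44402 to within rounding, which is a quick way to confirm a transcribed tree has no copying errors.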

25 First Results
Q: Given registered reader attributes, which types of content will they read?
Method: Kernel SVM with Gaussian Kernel
Overall training error = 0.569975
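
As background for this slide, a Gaussian (RBF) kernel scores the similarity of two feature vectors as exp(-gamma * squared distance), and the training error is the fraction of training records the fitted model misclassifies. This is a minimal sketch of those two quantities only, not of SVM training; the `gamma` value and the function names are assumptions, since the deck does not give the kernel parameters.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def training_error(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])     # identical points give 1.0
far = gaussian_kernel([0.0, 0.0], [3.0, 4.0])      # distant points approach 0
```

With many content classes, a training error near 0.57 can still beat always guessing the most common class, so the figure is best judged against that baseline rather than against zero.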

26 Defining Project Success
Success for this project could come in different forms:
–Insights gained on any of the six questions within the project’s scope
and/or
–Insight into how TechJournal should modify its data capture policies to facilitate data mining for the answers to these questions in the future

27 Questions/Comments
