Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University.


1 Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University

2 Research Team Students –Lei Dong –Alistair Kennedy –Richong Zhang Faculty –Carolyn Watters –Jack Duffy

3 Overview Introduction Genre Task Summary

4 Introduction
The focus of our current research is the investigation of filtering techniques for the Web.
This includes context-aware retrieval, where context includes:
–Adaptive user modeling
–The user’s “task”: the information need, i.e., what it is the user is trying to do
We are moving to incorporate the notions of genre and task and to evaluate the impact that these have on filtering.

5 Filtering Genre Task User Profiles

6 Motivation for Research The Web has billions of documents Average query is 2-3 words One document will satisfy our information need!

7 But it’s more than just search “Browsing or surfing the Web represents the main model for web use, especially among younger users.” (Hunter)

8 Three general types [Marchionini]
–Directed browsing: explicit information need
–Semi-directed browsing: less well-defined need
–Undirected browsing: there is no real goal and the user is “surfing”
Browsing Continuum: from Surfing to Searching

9 Motivated Behaviour Intrinsically Motivated Behaviour –“… is that which appears to be spontaneously initiated by the person in pursuit of no other goal than the activity itself.” [Enzle, Wright, Redondo] –“… engaging in a task for its enjoyment value…” [Deci, Ryan] Extrinsically Motivated Behaviour –“… motivation is to engage in an activity as a means to an end … participation will result in desirable outcomes such as reward …” [Pintrich, Schunk]

10 Task and Information Need Continuum
–General information gathering: “I’m shopping for a computer”
–Explicit information need: “I want the price on the Dell Inspiron Notebook computer”
So, one document may not satisfy the information need

13 Why look at Genre and Task?

14 Filtering Based on Adaptive User Profiles and IR-type of Task Intrinsic Motivation –Fine-grained filtering of the Web is not feasible when the browsing task is “undirected” Extrinsic Motivation –Fine-grained filtering of the Web is feasible when there is an explicit information need

15 Genre
A genre is a “classifying statement”. It allows us to recognize items that are similar even in the midst of great diversity:
–Newspapers
–Mystery novels
–Office memos
A genre has a socially recognized communicative purpose. It is generally characterized by the tuple <content, form>.

17 Cybergenre
Genre on the web, characterized by the tuple <content, form, functionality>, where functionality is the functionality afforded by the new medium, i.e., the web

18 Cybergenre taxonomy (diagram)
cybergenre
–extant: replicated, variant
–novel: emergent, spontaneous
Examples, in the order shown on the slide: electronic newspaper, multimedia newspaper, personalized newspaper, FAQ

20 Recognizing Genres of Web Pages The number of cybergenres is increasing, with different estimates putting the number at well over 1000 (depends on granularity) It is difficult to know the boundaries of a genre and to know when one has crossed from one genre into another genre It is difficult to know when a web page represents the emergence of a new genre

21 Research Problems How can we identify automatically the genre of a web page? What features should be used in describing web pages? How can we make this adaptive to recognize: –New genre when they emerge? –Genre classes that are fuzzy and genres that slide from one class to another?

22 Research Questions Can we identify home pages? Can we distinguish among the sub-genres: –personal, corporate and organization home pages? What influence does the functionality attribute have in distinguishing these genres and sub-genres?

26 Machine Learning Model and Dataset
The dataset consisted of 321 web pages
–17 were classified manually as belonging to two of the three home page sub-genres
–94 corporate home pages
–93 personal home pages
–74 organization home pages
–77 noise pages
Neural Net Model
–Single classifier with three target output classes
–Three different classifiers, one for each of the three target output classes

27 Features
Content
–Number of Meta tags used
–Does the page contain any phone numbers?
–List of most common words appearing in between 16% and 40% of all documents
Form
–Number of images
–Does the page have its own domain, or is it in a sub-directory within a domain?
–Size of file in bytes
–Number of words in the page
Functionality
–Number of links in the web page
–Number of e-mail links
–Proportion of links that are navigational links to other web pages within the same site
–Proportion of links that are links to locations within the same page
–Proportion of links that are links to other pages on other sites
–Number of form inputs
–Is the first tag a Script tag?
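Several of the form and functionality features above can be counted directly from a page's HTML. The following is a minimal sketch using only Python's standard library; it is not the authors' extractor. The names `extract_features` and `PHONE_RE`, the North American phone pattern, and the rule that any `http(s)://` link counts as external are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: phone numbers look like 902-555-0100 or (902) 555-0100.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


class _Collector(HTMLParser):
    """Collects tag counts, link targets and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.first_tag = None
        self.counts = {"img": 0, "meta": 0, "input": 0}
        self.hrefs = []
        self.text = []
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if self.first_tag is None:
            self.first_tag = tag
        if tag in self.counts:
            self.counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.text.append(data)


def extract_features(html):
    """Return a dict with a subset of the slide's features (hypothetical names)."""
    parser = _Collector()
    parser.feed(html)
    parser.close()

    hrefs = parser.hrefs
    n = len(hrefs)
    email = sum(h.startswith("mailto:") for h in hrefs)
    same_page = sum(h.startswith("#") for h in hrefs)
    # Crude heuristic: absolute URLs are treated as links to other sites.
    external = sum(h.startswith(("http://", "https://")) for h in hrefs)
    same_site = n - email - same_page - external
    text = " ".join(parser.text)

    def prop(k):
        return k / n if n else 0.0

    return {
        "num_meta": parser.counts["meta"],
        "has_phone": bool(PHONE_RE.search(text)),
        "num_images": parser.counts["img"],
        "size_bytes": len(html.encode("utf-8")),
        "num_words": len(text.split()),
        "num_links": n,
        "num_email_links": email,
        "prop_same_site": prop(same_site),
        "prop_same_page": prop(same_page),
        "prop_external": prop(external),
        "num_form_inputs": parser.counts["input"],
        "first_tag_is_script": parser.first_tag == "script",
    }
```

The resulting dictionary could serve as part of the input feature vector; the word-frequency and own-domain features are left out because they need the whole collection or the page URL.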

28 Terms Selected as Features
Class                     Terms
Personal Home Page        my, me, i, t
Corporate Home Page       we, services, service, available, fax, our, us, com, contact, copyright, free, amp
Organization Home Page    events, community, organization, 2004, help, its, members, news, information

29 Neural Net Categorization (diagram)
Data Set of Web Pages of Known Genre Type -> Input Feature Vector -> Neural Net -> Target Categories: Personal Home Page, Organization Home Page, Corporate Home Page

30 Evaluation
Recall
–The proportion of web pages of genre type G_i that are correctly categorized into category C_i
Precision
–The proportion of web pages categorized into category C_i that are of genre type G_i
F-measure(G_i) = the quality of the classifier with respect to web pages of genre type G_i
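The slide does not say how the F-measure combines precision and recall. A common choice, assumed here rather than confirmed by the source, is the balanced F-measure (the harmonic mean of the two). With $n_{ii}$ the number of pages of genre type $G_i$ placed in category $C_i$, the three measures can be written as:

```latex
R_i = \frac{n_{ii}}{|G_i|}, \qquad
P_i = \frac{n_{ii}}{|C_i|}, \qquad
F_i = \frac{2\,P_i\,R_i}{P_i + R_i}
```

Here $|G_i|$ is the number of pages whose true genre is $G_i$ and $|C_i|$ is the number of pages the classifier assigned to $C_i$; the symbol $n_{ii}$ is introduced for illustration only.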

31 10-Fold Cross Validation Used when data set is small in order to obtain statistically valid results

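33 10-Fold Cross Validation (diagram)
Each fold holds out a different 10% of the data as its test set (Test Set 1, Test Set 2, Test Set 3, ...) and uses the remaining 90% as the training set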
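The fold structure can be sketched in a few lines. This is an illustrative splitter, not the authors' code: the function name `k_fold_indices`, the fixed seed, and the use of a plain (unstratified) shuffle are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with the 321-page genre dataset size from the slides.
splits = list(k_fold_indices(321, k=10))
```

Each of the ten iterations trains on roughly 90% of the pages and tests on the rest; reported scores are then averaged over the folds.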

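34 F-measures using separate classifiers with noise pages
(the labels of the two F-measure columns are not preserved in the transcript)
Genre                     F-measure   F-measure   Significant Difference
Personal Home Page        .711        .702        -
Corporate Home Page       .666        .637        .005
Organization Home Page    .553        .555        -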

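35 F-measures using single classifier with noise pages
(the labels of the two F-measure columns are not preserved in the transcript)
Genre                     F-measure   F-measure   Significant Difference
Personal Home Page        .712        .698        .05
Corporate Home Page       .650        .644        -
Organization Home Page    .537        .536        -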

36 Misclassification tables: Single Classifier
                P       C       O       Non-home
Personal        62.2    3.1     8.2     22.2
Corporate       3.7     56.5    14.8    25.4
Organization    4.8     12.2    36.5    25.9
Noise Pages     11.1    7.4     6.7     52.9

37 Genre Summary We can recognize home pages from noise pages We can distinguish personal home pages from corporate and organization home pages, but distinguishing between corporate and organizational home pages is difficult Feature set needs a lot more attention paid to it

38 Open Questions What is an appropriate feature set? Full evaluation of functionality attribute What ML model to use? –Accuracy and scalability Adaptive –Track recognized genres as they evolve –Recognize the introduction of a novel genre not seen previously –Is this like topic detection and tracking?

39 Genre and Task on the Web?
Group: Topics
–Genre: home page, location, special topics
–Task: cultural, shopping, news, health
–Recognition: url only host name, short, lots of graphics
Group: Publications
–Genre: articles, publications, news
–Task: scholarly research, news, financial
–Recognition: hierarchical structure, longer, few graphics
Group: Products
–Genre: product info, reviews, order forms
–Task: shopping, news, computing
–Recognition: short, prices, phone numbers
Group: Educational
–Genre: glossary, course list, instructional material
–Task: educational pursuits
–Recognition: edu domain, education lexicon
Group: FAQ
–Task: health, self-help
–Recognition: metadata and headings, structure
Roussinov, et al., Genre Based Navigation on the Web, HICSS’34

40 Yahoo Directory

41 Yahoo categories are created and maintained manually –Creator of a web site submits a description –Editors review these Can we automatically classify a web page by task?

42 Experiment Creation of data set Data cleaning 10-fold cross validation –Feature selection (IG) –Principal component analysis –Build Decision Tree –Testing

43 Creation of Data Set Selected 120 web pages randomly from Yahoo directories in each of: –Shopping –Health –Education Selected 70 pages (NSHE) not from the Web that are not shopping, health or education Total of 430 Web pages Validated by 3 raters

44 Data Cleaning
Removed:
–XML, HTML tags
–Pictures, audio files, video files
–Scripts
–Stop words
Applied Porter’s stemming algorithm
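A rough sketch of the tag, script and stop-word removal steps follows. It is an assumption-laden illustration rather than the authors' pipeline: the function name `clean_page`, the regular expressions, and the tiny `STOP_WORDS` set are all hypothetical, and Porter's stemmer is not reimplemented here (in a real pipeline it would be applied to the returned tokens).

```python
import re

# Assumption: a deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def clean_page(html):
    """Strip scripts, styles and tags, then drop stop words; returns tokens."""
    # Remove <script>...</script> and <style>...</style> blocks with their content.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Remove any remaining markup tags (this also drops <img>, <audio>, <video>).
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = clean_page("<html><script>var x=1;</script><p>The price of the item</p></html>")
# tokens == ["price", "item"]
```

Regular expressions are a simplification; a production cleaner would more likely use a real HTML parser to cope with malformed markup.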

45 Feature Selection Using the Information Gain (IG) Employed as a term goodness criterion Based on Information Theory –The number of “bits of information” gained by knowing the term is present or absent

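46 Information Gain (IG)
A measure of the importance of a feature for predicting the presence of a class. The information gain of term t is defined to be
$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$
where $\{c_i\}_{i=1}^{m}$ denotes the set of categories in the target space. (The formula itself is missing from the transcript; the standard definition that matches the surrounding wording is shown.)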
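Information gain for a term can be computed from document counts alone, as the class entropy minus the class entropy remaining once the term's presence or absence is known. The sketch below assumes base-2 logarithms (matching the "bits of information" wording) and uses made-up counts; the function names are hypothetical and the output is not meant to reproduce the values reported on the slides.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs_with_term, docs_per_class):
    """IG of a term.

    docs_with_term[i]: documents of class i that contain the term.
    docs_per_class[i]: all documents of class i.
    """
    n = sum(docs_per_class)
    without_term = [total - w for total, w in zip(docs_per_class, docs_with_term)]
    p_term = sum(docs_with_term) / n
    return (
        entropy(docs_per_class)
        - p_term * entropy(docs_with_term)
        - (1 - p_term) * entropy(without_term)
    )


# Hypothetical counts: three classes of 10 documents each.
ig_perfect = information_gain([10, 0, 0], [10, 10, 10])  # term marks one class
ig_useless = information_gain([5, 5, 5], [10, 10, 10])   # term is uninformative
```

Ranking all terms by this score and keeping the top ones is the selection step; a term spread evenly over the classes scores zero and is dropped first.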

47 Information gain (IG)
Number of documents in which the term appears in each category (Health, Shopping, Education), followed by the IG value. The three counts are run together in the transcript and are shown unsplit.
Term         Counts     IG value
Educ         23493      0.352725188
Diseas       4910       0.200911642
Medic        5752       0.19171664
Health       791519     0.188112451
Teacher      0146       0.185452451
School       7660       0.170352452
Price        5501       0.16546535
Item         2527       0.156980483
Ship         1432       0.149329261
Student      6351       0.148850138
Custom       7504       0.133412067
Accessori    0320       0.130532457
Cancer       3200       0.130532457
Doctor       3611       0.124860273
Public       16351      0.12081971
Shop         10559      0.120056849
Heart        3320       0.116157938
Cart         0354       0.114589777
Medicin      3722       0.113854763
Physician    2700       0.10811821
Risk         2600       0.103738132

48 Information gain (IG) 300 features

49 Document Term Matrix 324 Documents (108 in each of Health, Shopping and Education) 300 terms as identified by the Information Gain measure

50 Principal Component Analysis Identifies patterns in data and is a way to express the data in such a way as to highlight their similarities and differences Once these patterns have been found in the data, we can reduce the number of dimensions without much loss of information

51 PCA
–Calculate the covariance matrix of the original data
–Calculate the eigenvalues and eigenvectors of the covariance matrix
–The eigenvector with the largest eigenvalue identifies the principal component
The principal component is the eigenvector that expresses the most significant relationship among the data dimensions

52 Principal Component Eigenvalues First 3 eigenvectors carry most of the information

53 Matrix Projection After determining which components or eigenvectors to use, project the original document-term matrix into this new space
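The covariance, eigenvector and projection steps can be sketched without external libraries. To keep the example short, power iteration is swapped in for a full eigen-decomposition, so only the first principal component is found; this is a stand-in, not necessarily what the authors used. Function names are hypothetical, and the uniform starting vector would fail in the unlucky case where it is orthogonal to the leading eigenvector.

```python
import math


def covariance_matrix(rows):
    """Return (covariance matrix, mean-centred rows) for a list of samples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return cov, centred


def top_eigenvector(matrix, iterations=500):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(x * y for x, y in zip(v, mv))
    return eigenvalue, v


def project(centred, component):
    """Coordinate of each centred sample along the given component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centred]


# Toy data lying on the line y = 2x, so one component captures everything.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov, centred = covariance_matrix(data)
eigenvalue, component = top_eigenvector(cov)
scores = project(centred, component)
```

For the document-term matrix, each row would be a document and each column one of the selected terms; keeping several components would need either deflation or a library routine for the full decomposition.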

56 Decision Tree Flow-chart-like tree structure Each internal node denotes a test on an attribute Each branch represents an outcome of the test Leaf nodes represent classes or class distributions. Used for classification

57 Decision Tree The tree’s generation process can be seen as the generation of rules. First, build a tree from a known training data set. Then, use this tree to make predictions on a new data set. A decision tree makes the rules in the data visible and easy to understand.
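The link between a tree and its rules can be made concrete with a toy example. The tree below is entirely hypothetical: it tests for the presence of three stemmed terms, whereas the experiment described in these slides builds its tree on PCA-projected features. The names `TREE`, `classify` and `rules` are illustrative only.

```python
# Hypothetical hand-written tree: internal nodes test whether a term is
# present; leaves are class labels.
TREE = {
    "test": "educ",
    "yes": "Education",
    "no": {
        "test": "price",
        "yes": "Shopping",
        "no": {"test": "health", "yes": "Health", "no": "NSHE"},
    },
}


def classify(tree, features):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        branch = "yes" if features.get(tree["test"], False) else "no"
        tree = tree[branch]
    return tree


def rules(tree, conditions=()):
    """List every root-to-leaf path as (conditions, label)."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    term = tree["test"]
    return (
        rules(tree["yes"], conditions + ((term, True),))
        + rules(tree["no"], conditions + ((term, False),))
    )


label = classify(TREE, {"price": True})
all_rules = rules(TREE)
```

Each entry of `all_rules` reads as an if-then rule, for example "educ absent and price present implies Shopping", which is what makes the model easy to inspect.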

58 Decision Tree

59 Confusion Matrix (rows: original categories; columns: target categories)
             Health   Shopping   Education   NSHE
Health       10.0     0.8        0.5         0.7
Shopping     0.8      9.9        0.1         1.2
Education    0.9      0.1        9.2         1.8
NSHE         0.6      1.1        1.6         3.7

60 Precision and Recall
             Health   Shopping   Education   NSHE
Precision    0.81     0.83       0.81        0.50
Recall       0.83     (not in transcript)   0.77   0.53
The slide repeats the confusion matrix from slide 59.
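The precision and recall figures follow directly from the confusion matrix. The sketch below recomputes them, assuming rows are the original categories and columns are the categories assigned by the classifier; that reading is an assumption, although it reproduces the precision values on the slide. The names `LABELS`, `CONFUSION` and `precision_recall` are introduced here for illustration.

```python
LABELS = ["Health", "Shopping", "Education", "NSHE"]

# Values from the slide's confusion matrix.
# Rows: original category. Columns: category assigned by the classifier.
CONFUSION = [
    [10.0, 0.8, 0.5, 0.7],
    [0.8, 9.9, 0.1, 1.2],
    [0.9, 0.1, 9.2, 1.8],
    [0.6, 1.1, 1.6, 3.7],
]


def precision_recall(matrix):
    """Per-class precision (column-wise) and recall (row-wise)."""
    n = len(matrix)
    precision = [
        matrix[i][i] / sum(matrix[r][i] for r in range(n)) for i in range(n)
    ]
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    return precision, recall


precision, recall = precision_recall(CONFUSION)
for name, p, r in zip(LABELS, precision, recall):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")
```

Under this reading the Shopping row gives a recall of 9.9 / 12.0, about 0.825, which is the one value the transcript does not show.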

61 Conclusion and Future Work As a filter, this approach would identify 80% of pages in Health, Shopping or Education Evaluate other classifiers System has to be scaled up: –More tasks, such as entertainment and sports –Larger data set with more noise Add form and functionality features to determine if there are recognizable genres of tasks

62 How do I see these filters working? (diagram)
Query -> Search Engine -> Search Results -> Filter By Task (input: Task) -> Filter By Genre (input: Genre) -> Filtered Results

63 Thank You Web Information Filtering Lab http://www.cs.dal.ca/wifl/

