Classifying Web Pages Mark Chavira, Ulises Robles
Motivation World Wide Web is huge. Computers help some, but not enough. Would like computers to help more: “Who is the president of Stanford University?” Problem: WWW designed for human understanding.
Project Highlights Demonstrate a simple way by which knowledge may be extracted from the Web. Classify Web pages from Computer Science Departments. Learner: Naive Bayes. Features: word counts. Ran 60 experiments, each using different values for various parameters.
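The learner named above, Naive Bayes over word counts, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the multinomial formulation, the Laplace smoothing parameter `alpha`, the function names `train_nb` and `classify`, the class labels "course" and "faculty", and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model.

    docs   -- list of token lists (one per page)
    labels -- parallel list of class names
    alpha  -- Laplace smoothing constant (an assumption, not from the slides)
    """
    vocab = {w for doc in docs for w in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for c, n_docs in docs_per_class.items():
        # Log prior: fraction of training pages in class c.
        model["prior"][c] = math.log(n_docs / len(docs))
        # Smoothed log likelihood of each vocabulary word given class c.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["loglik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def classify(model, doc):
    """Return the class with the highest posterior log score for doc."""
    counts = Counter(w for w in doc if w in model["vocab"])

    def score(c):
        return model["prior"][c] + sum(
            n * model["loglik"][c][w] for w, n in counts.items()
        )

    return max(model["prior"], key=score)


# Toy data with hypothetical classes; real pages would be tokenized first.
train_docs = [
    "homework lecture syllabus exam".split(),
    "lecture homework grading".split(),
    "research publications advisor students".split(),
    "publications research teaching".split(),
]
train_labels = ["course", "course", "faculty", "faculty"]

model = train_nb(train_docs, train_labels)
prediction = classify(model, "lecture exam homework".split())
```

Words that never appeared in training are ignored at classification time, which keeps the score well defined for unseen pages.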
Data Set
Some Parameters Which words do we count? Select words using: Pointwise Mutual Information vs. Average Mutual Information vs. χ² (chi-squared). What form do feature values take? “Raw” word counts vs. word counts normalized for page length.
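The three word-selection scores and the length normalization can be sketched as follows. The slides do not give the exact formulas used, so these are the standard textbook definitions computed from a 2x2 table of document counts for one word and one class. The function names, the table layout (`a`, `b`, `c`, `d`), and the choice to normalize by dividing each count by the page's total word count are assumptions.

```python
import math

# Contingency table for one (word, class) pair, in numbers of documents:
#   a: in class,     word present     b: not in class, word present
#   c: in class,     word absent      d: not in class, word absent


def pointwise_mi(a, b, c, d):
    """log P(word, class) / (P(word) * P(class))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + b) * (a + c)))


def average_mi(a, b, c, d):
    """Mutual information averaged over all four cells of the table."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),  # word present, in class
        (b, a + b, b + d),  # word present, not in class
        (c, c + d, a + c),  # word absent,  in class
        (d, c + d, b + d),  # word absent,  not in class
    )
    total = 0.0
    for joint, word_margin, class_margin in cells:
        if joint > 0:
            total += (joint / n) * math.log(joint * n / (word_margin * class_margin))
    return total


def chi_squared(a, b, c, d):
    """Chi-squared statistic for independence of word and class."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def normalize_counts(counts):
    """Divide each raw word count by the page's total count (assumed scheme)."""
    total = sum(counts.values())
    return {w: k / total for w, k in counts.items()} if total else dict(counts)
```

Under any of the three scores, words would be ranked per class and the top-scoring ones kept as features. A word that is independent of the class scores zero under all three measures.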
Number of Experiments (5 data sets) * (2 Feature Types) * (3 Feature Selection Techniques) * (2 Normalization Methods) = 60 Experiments.
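The experiment count is the size of the Cartesian product of the four parameter lists, which can be enumerated directly. The sketch below only illustrates the arithmetic: the data set names and the two feature type names are placeholders, since the slides do not name them.

```python
from itertools import product

# Placeholder names: the slides give only the number of options per parameter.
data_sets = [f"data_set_{i}" for i in range(1, 6)]
feature_types = ["feature_type_A", "feature_type_B"]
selection_techniques = ["pointwise_mi", "average_mi", "chi_squared"]
normalization_methods = ["raw", "length_normalized"]

# One tuple per experiment configuration.
experiments = list(
    product(data_sets, feature_types, selection_techniques, normalization_methods)
)
n_experiments = len(experiments)  # 5 * 2 * 3 * 2
```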
Results
Results (cont.)
Total Results
Best Results: 85% Correct Classification. Using: Feature Selection: Pointwise Mutual Information; Normalization: Normalized for Document Length.