Apriori Algorithm and the World Wide Web Roger G. Doss CIS 734
The purpose of this presentation is to introduce an application of the Apriori algorithm to association rule data mining on data gathered from the World Wide Web. Specifically, a system will be designed that gathers information (web pages) from a user-specified site and performs association rule data mining on that data.
We have already seen the Apriori algorithm applied to textual data in class. Given an implementation that can work with textual data...
We want to use Apriori in the following manner. Given an input of (url, N links, support, confidence, keywords):
* obtain the url
* traverse all adjacent links up to N
* format the data
* compute support and confidence levels for each word in a user-supplied keyword set.
We can envision this system as four components:
Phase 0: User input.
Phase 1: Data Acquisition.
Phase 2: Running Apriori on the data.
Phase 3: User output.
Data Acquisition:
Traverse Web Page(URL, N)
    while N web pages not visited
        Obtain WebPage via HTTP
        Parse information (look for keywords, adjacent links)
        Store keywords in a file
        Store adjacent links to visit
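The "look for keywords" step can be sketched as follows. This is only an illustration under stated assumptions: the function name make_transaction is hypothetical, keywords are assumed to be supplied in lower case, and a "word" is taken to be any run of letters and digits. A set is used because Apriori only needs to know whether a keyword is present on a page, not how many times it occurs.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper: returns the keywords that occur as whole words in
// page_text (case-insensitive). The result is one transaction for one page.
std::set<std::string> make_transaction(const std::string& page_text,
                                       const std::set<std::string>& keywords)
{
    std::set<std::string> transaction;
    std::string word;
    // A trailing space is appended so the last word is also flushed.
    for (char raw : page_text + " ") {
        unsigned char c = static_cast<unsigned char>(raw);
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            // End of a word: keep it only if it is one of the keywords.
            if (keywords.count(word)) transaction.insert(word);
            word.clear();
        }
    }
    return transaction;
}
```

A real page would first need its HTML markup stripped; that step is left out here.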
Running Apriori on the Data: If we treat the initial web page and each adjacent web page as a transaction, then each occurrence of a keyword is an element in that transaction. At this point, the Apriori algorithm can be run on the data, producing a set of association rules based on the desired confidence and support levels.
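To make the two thresholds concrete, here is a minimal sketch of how support and confidence could be computed over page transactions. It shows only the two measures, not Apriori's candidate generation and pruning. The names Itemset, contains_all, support and confidence are illustrative and not taken from any particular implementation.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::set<std::string> Itemset;

// True if every item of `items` is present in `transaction`.
bool contains_all(const Itemset& transaction, const Itemset& items)
{
    for (const std::string& item : items)
        if (transaction.count(item) == 0) return false;
    return true;
}

// Support: fraction of transactions (pages) containing every item in `items`.
double support(const std::vector<Itemset>& transactions, const Itemset& items)
{
    if (transactions.empty()) return 0.0;
    std::size_t hits = 0;
    for (const Itemset& t : transactions)
        if (contains_all(t, items)) ++hits;
    return static_cast<double>(hits) / transactions.size();
}

// Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs).
double confidence(const std::vector<Itemset>& transactions,
                  const Itemset& lhs, const Itemset& rhs)
{
    Itemset both(lhs);
    both.insert(rhs.begin(), rhs.end());
    double s_lhs = support(transactions, lhs);
    return s_lhs == 0.0 ? 0.0 : support(transactions, both) / s_lhs;
}
```

For example, if "data" appears on 3 of 4 pages and "data" together with "mining" on 2 of 4, the rule data => mining has support 0.5 and confidence 0.5 / 0.75, roughly 0.67.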
Some modules that may be needed to implement the system:
* HTTP Client. Accessing a web page from a URL mechanically.
* Data Cleaning. Extracting words that match the keyword list. Extracting hypertext references, i.e., href="
* Apriori Algorithm.
* Web traversal.
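As an illustration of the data cleaning module, the sketch below pulls hypertext references out of a page by scanning for href=" with plain string searches. The function name extract_hrefs is hypothetical. The sketch assumes lower-case, double-quoted attributes and does no real HTML parsing, so it would miss single-quoted or upper-case forms.

```cpp
#include <list>
#include <string>

// Hypothetical helper: collects the values of double-quoted href="..."
// attributes in the order they appear in the page.
std::list<std::string> extract_hrefs(const std::string& html)
{
    const std::string marker = "href=\"";
    std::list<std::string> links;
    std::string::size_type pos = 0;
    while ((pos = html.find(marker, pos)) != std::string::npos) {
        std::string::size_type start = pos + marker.size();
        std::string::size_type end = html.find('"', start);
        if (end == std::string::npos) break;  // unterminated attribute
        links.push_back(html.substr(start, end - start));
        pos = end + 1;                        // continue after the closing quote
    }
    return links;
}
```

Relative references such as /index.html would still need to be resolved against the site address before they can be fetched.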
Building this system gives one a code base that can be used for future research and work. An HTTP client is needed to obtain data from the web, web traversal is important in web crawling, and parsing HTML allows one to extract information from web pages. An interesting problem is how one could traverse a web page and visit N links reachable from that web page. We can view the WWW as a graph. Each URL is a node on that graph. From each page, we have hypertext references that point to other resources, including other web pages. We consider these other web pages as adjacent nodes.
Assume that you have the following primitives:
string get_webpage( string url );
list<string> get_adj_webpages( string webpage );
Using the C++ Standard Template Library, implement Breadth First Search to traverse all adjacent web pages from an initial web page source.
Hint: The following containers might be useful:
map<string,bool> visited;
queue<string> q;
void bfs( string url )
{
    // Maps urls to boolean value indicating
    // if they were visited.
    map<string,bool> visited;
    // FIFO queue of urls.
    queue<string> q;
    // List of adjacent urls.
    list<string> adj;
    // Contains web page results.
    string data;

    // Mark initial url as not visited.
    visited[url] = false;
    // Insert into queue the initial url.
    q.push(url);

    // Traverse the web pages.
    while(!q.empty())
    {
        // Take the next url from the front of the queue and remove it,
        // whether or not it was already visited, so the loop terminates.
        url = q.front();
        q.pop();
        if(visited[url] == false)
        {
            data = get_webpage(url);
            adj = get_adj_webpages(data);
            // Mark as visited.
            visited[url] = true;
            // Insert into queue all adjacent webpages.
            for(list<string>::iterator i = adj.begin(); i != adj.end(); ++i)
            {
                // If we did not already visit this page...
                // (operator[] inserts unseen urls with the value false.)
                if(visited[(*i)] != true)
                {
                    q.push((*i));
                }
            }
        }
    }
}// bfs
We have a given node/url, A, with adjacent nodes/urls B, C, D as follows:
page A adj B,C,D.
page B adj A,E,F.
page C adj G.
page D adj A.
Or as a directed graph:
A -> B, C, D
B -> A, E, F
C -> G
D -> A
(init) visit A
(from A) visit B,C,D
(from B) visit E,F
(from C) visit G
* We do not consider URLs already visited.
* Each time we visit a page, some processing can be done. In this case, we obtain a list of words that we are interested in.
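The trace above can be checked with a small self-contained sketch. As an assumption made only for this example, an in-memory map of adjacency lists stands in for get_webpage and get_adj_webpages, and a max_pages limit stands in for the N of the user input. The names Graph, bfs_order and example_web are hypothetical.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// url -> urls it links to (stand-in for fetching and parsing real pages).
typedef std::map<std::string, std::list<std::string> > Graph;

// Visits at most max_pages pages reachable from start, breadth first,
// and returns the urls in the order they were visited.
std::vector<std::string> bfs_order(const Graph& web, const std::string& start,
                                   std::size_t max_pages)
{
    std::vector<std::string> order;
    std::set<std::string> seen;        // urls already queued or visited
    std::queue<std::string> q;
    q.push(start);
    seen.insert(start);
    while (!q.empty() && order.size() < max_pages) {
        std::string url = q.front();
        q.pop();
        order.push_back(url);          // page processing would happen here
        Graph::const_iterator it = web.find(url);
        if (it == web.end()) continue; // no outgoing links known
        for (const std::string& next : it->second)
            if (seen.insert(next).second) q.push(next);  // queue each url once
    }
    return order;
}

// The example graph: A adj B,C,D; B adj A,E,F; C adj G; D adj A.
Graph example_web()
{
    Graph g;
    g["A"] = {"B", "C", "D"};
    g["B"] = {"A", "E", "F"};
    g["C"] = {"G"};
    g["D"] = {"A"};
    return g;
}
```

Calling bfs_order(example_web(), "A", 100) visits A, B, C, D, E, F, G in that order, matching the trace; with max_pages set to 3 it stops after A, B, C. Marking urls as seen when they are queued, rather than when they are visited, keeps duplicates out of the queue.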
Given that we can extract a set of words from a web page, that we know what URL those words appeared on, and that we can produce support and confidence levels using Apriori, design a simple database using SQL and an RDBMS that models the following information: keyword, site, url, support, confidence. Give an example query where, provided the keyword, support and confidence levels, we can obtain the site and url's that contain that keyword with the desired support and confidence level. Site refers to the WWW address, such as d URL refers to the location, such as /index.html