Published by Gordon Scott. Modified over 9 years ago.
1
Design of a Click-tracking Network for Full-text Search Engine Group 5: Yuan Hu, Yu Ge, Youwen Gong, Zenghui Qiu and Miao Liu
2
Outline Introduction Objective Project diagram –Web Crawling –Indexing schema Ranking strategies –PageRank Algorithms –Neural Network –Content-Based Ranking Software and Reference
3
Introduction Full-text Search Engine –search on key words –rank results What is in a Search Engine? –Crawling –Indexing –Ranking results of query
4
Objective Design a full-text search engine Rank search results in different ways
5
Project Diagram (figure labels): Website; Crawling; Text & urls; Database; Indexing; Query Function; Click-Tracking Network; PageRank Algorithms; Content-Based Ranking; Ranked results
6
Web Crawling Depth 1: crawling all the url links on the main page Depth 2: crawling all the url links found in depth 1 Main page: …… http://en.wikipedia.org/wiki/Machine_learning http://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain http://en.wikipedia.org/wiki/Machine_learning#Decision_tree_learning …… # Implemented with Python urllib2 module and BeautifulSoup API
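The two crawl depths described above can be sketched as a breadth-first loop. This is a minimal illustration, not the group's code: it uses only the Python 3 standard library (`html.parser`, `urllib.parse`) instead of urllib2 and BeautifulSoup, the names `LinkParser`, `extract_links`, `crawl` and the `fetch` callback are hypothetical, and stripping `#fragment` suffixes is an assumption made here so that one page is not visited twice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, without fragments."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Assumption: drop "#section" so the same page is crawled once.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith("http") and url not in links:
            links.append(url)
    return links


def crawl(start_url, fetch, depth=2):
    """Breadth-first crawl.

    fetch(url) is a caller-supplied function (hypothetical) that returns
    the page HTML as a string, or None if the page cannot be read.
    depth=1 follows the links on the main page; depth=2 also follows
    the links found on those pages.
    """
    seen = {start_url}
    frontier = [start_url]
    link_pairs = []  # (from_url, to_url), usable later for link analysis
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            html = fetch(page)
            if html is None:
                continue
            for target in extract_links(page, html):
                link_pairs.append((page, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen, link_pairs
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real run would wrap an HTTP client in that function.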
7
Crawl diagram (figure labels): Main Page, Depth 1, Depth 2; URLs connected by links
8
Schema for Basic Index
–Link: Row_ID, From_ID, To_ID
–Url_list: Row_ID, Url
–Word_location: Url_ID, Word_ID, Location
–Word_list: Row_ID, Word
–Link_words: Word_ID, Link_ID
# Implemented with SQLite
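The index schema above could be created in SQLite roughly as follows. This is a sketch under assumptions: the column types, the lower-case table names, and the helpers `create_index_db`, `get_or_create` and `add_page` are illustrative, and Row_ID is taken to be SQLite's implicit `rowid` rather than a declared column.

```python
import sqlite3

# Table and column names follow the slide; types are assumptions.
SCHEMA = """
CREATE TABLE url_list (url TEXT UNIQUE);
CREATE TABLE word_list (word TEXT UNIQUE);
CREATE TABLE word_location (url_id INTEGER, word_id INTEGER, location INTEGER);
CREATE TABLE link (from_id INTEGER, to_id INTEGER);
CREATE TABLE link_words (word_id INTEGER, link_id INTEGER);
CREATE INDEX word_location_word_idx ON word_location (word_id);
"""


def create_index_db(path=":memory:"):
    """Open a database and create the five index tables."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con


def get_or_create(con, table, column, value):
    """Return the rowid of value in table, inserting it if missing."""
    row = con.execute(
        f"SELECT rowid FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row is not None:
        return row[0]
    return con.execute(
        f"INSERT INTO {table} ({column}) VALUES (?)", (value,)
    ).lastrowid


def add_page(con, url, text):
    """Index one page: record the position of every word in its text."""
    url_id = get_or_create(con, "url_list", "url", url)
    for location, word in enumerate(text.lower().split()):
        word_id = get_or_create(con, "word_list", "word", word)
        con.execute(
            "INSERT INTO word_location (url_id, word_id, location) VALUES (?, ?, ?)",
            (url_id, word_id, location),
        )
    return url_id
```

Splitting on whitespace is the simplest possible tokenizer; a real index would also strip punctuation and ignore very common words.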
9
Results for Multiple-word Query (query function). Figure labels: words combination, same url_id, word location. Note that none of the url_ids returned are ranked yet.
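One way to obtain rows of this shape (a url_id followed by one word location per query word) is a self-join on the word-location table. The sketch below is an assumption about how such a query function might look, not the presenters' implementation; `build_demo_index` and `match_rows` are hypothetical names, and the tiny in-memory index exists only to make the example self-contained.

```python
import sqlite3


def build_demo_index(pages):
    """Build a tiny in-memory index from {url_id: text} (demo helper)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE word_list (word TEXT UNIQUE)")
    con.execute(
        "CREATE TABLE word_location (url_id INTEGER, word_id INTEGER, location INTEGER)"
    )
    for url_id, text in pages.items():
        for location, word in enumerate(text.lower().split()):
            con.execute("INSERT OR IGNORE INTO word_list (word) VALUES (?)", (word,))
            word_id = con.execute(
                "SELECT rowid FROM word_list WHERE word = ?", (word,)
            ).fetchone()[0]
            con.execute(
                "INSERT INTO word_location VALUES (?, ?, ?)",
                (url_id, word_id, location),
            )
    return con


def match_rows(con, query):
    """Return (url_id, loc_word1, loc_word2, ...) for pages containing every word."""
    word_ids = []
    for word in query.lower().split():
        row = con.execute(
            "SELECT rowid FROM word_list WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            return []  # an unindexed word means no page can match
        word_ids.append(row[0])
    if not word_ids:
        return []
    n = len(word_ids)
    fields = ["w0.url_id"] + [f"w{i}.location" for i in range(n)]
    tables = [f"word_location w{i}" for i in range(n)]
    # One alias per query word, all constrained to the same url_id.
    clauses = [f"w{i}.word_id = ?" for i in range(n)]
    clauses += [f"w{i}.url_id = w0.url_id" for i in range(1, n)]
    sql = (
        f"SELECT {', '.join(fields)} FROM {', '.join(tables)} "
        f"WHERE {' AND '.join(clauses)}"
    )
    return con.execute(sql, word_ids).fetchall()
```

The rows come back in no meaningful order, which is why a separate ranking step is needed.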
10
PageRank Algorithm Developed by Larry Page at Stanford University in 1996. PageRank measures how important a page is. The importance of a page is calculated from the importance of all the other pages that link to it. http://www.rasch.org/rmt/rmt232a.htm
11
How to Calculate PR
PR(A) = (1 - d) + d * ( PR(B)/L(B) + ... + PR(D)/L(D) + ... )
d: damping factor, 0 < d < 1, usually set to 0.85.
PR(B), ..., PR(D), ...: PageRank value of each webpage linking to page A.
L(B), ..., L(D), ...: the number of links going out of page B, ..., D, ...
12
Example
PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
= 0.15 + 0.85 * ( 0.5/4 + 0.7/4 + 0.2/1 )
= 0.15 + 0.85 * ( 0.125 + 0.175 + 0.2 )
= 0.15 + 0.85 * 0.5
= 0.575
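The arithmetic in this example can be checked with a few lines of Python. The function name `page_rank` and the `(pr, outgoing_links)` input format are illustrative choices; the numbers are the ones on the slide.

```python
def page_rank(inbound, d=0.85):
    """PageRank of one page from the pages linking to it.

    inbound: list of (pr, outgoing_links) pairs, one per linking page.
    d: damping factor.
    """
    return (1 - d) + d * sum(pr / links for pr, links in inbound)


# Pages B, C and D from the example: (PageRank, number of outgoing links).
pr_a = page_rank([(0.5, 4), (0.7, 4), (0.2, 1)])
print(round(pr_a, 3))
```

The three contributions sum to 0.5, so the result is 0.15 + 0.85 * 0.5 = 0.575.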
13
How to Update the PR Value If we don’t know what the PR values should be to begin with, assign the same initial PR value to every page, then recompute every page's PR from the current values of the pages linking to it. Repeat the update for 20 iterations. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
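A minimal sketch of this iterative update, assuming an initial value of 1.0 for every page and a synchronous update (all pages recomputed from the previous iteration's values). Both are assumptions made for illustration, as are the function name `iterate_pagerank` and the dictionary-based link graph.

```python
def iterate_pagerank(links, d=0.85, iterations=20, initial=1.0):
    """Iteratively estimate PageRank.

    links: {page: [pages it links to]}.
    Returns {page: PageRank value} after the given number of iterations.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: initial for page in pages}

    # Precompute who links to each page and how many links each page has.
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)
    out_count = {page: len(set(links.get(page, ()))) for page in pages}

    for _ in range(iterations):
        # Every source in inbound[...] has at least one outgoing link,
        # so the division is always defined.
        pr = {
            page: (1 - d) + d * sum(pr[s] / out_count[s] for s in inbound[page])
            for page in pages
        }
    return pr


ranks = iterate_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"]})
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

A page with no inbound links, such as D here, settles at the minimum value 1 - d.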
14
Results for PageRank PageRank values
15
Neural Network Why? A neural network can make reasonable guesses about results for queries it has never seen before. Click-tracking: the weights are updated based on which search results the user clicked.
16
Neural Network Step 1: Setting Up the Database Step 2: Feeding Forward Activation Step 3: Training with Backpropagation How does the neural network work? Solid line: strong connection Bold text: active node
17
Step 1: Setting Up the ANN Database Create a table for the hidden layer (red box) Create two tables for the connections (green boxes)
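The three tables could be set up in SQLite along these lines. The table names (`hidden_node`, `word_hidden`, `hidden_url`), the `create_key` column, and the helper functions are assumptions for illustration; the from/to/strength columns mirror the connection fields shown in the results slides.

```python
import sqlite3


def create_ann_tables(con):
    """Create one hidden-layer table and two connection tables."""
    con.executescript(
        """
        CREATE TABLE IF NOT EXISTS hidden_node (create_key TEXT);
        CREATE TABLE IF NOT EXISTS word_hidden
            (from_id INTEGER, to_id INTEGER, strength REAL);
        CREATE TABLE IF NOT EXISTS hidden_url
            (from_id INTEGER, to_id INTEGER, strength REAL);
        """
    )


def _check_table(table):
    # Table names cannot be bound as SQL parameters, so whitelist them.
    if table not in ("word_hidden", "hidden_url"):
        raise ValueError(f"unknown connection table: {table}")


def get_strength(con, table, from_id, to_id, default=0.0):
    """Return the stored connection strength, or default if none exists."""
    _check_table(table)
    row = con.execute(
        f"SELECT strength FROM {table} WHERE from_id = ? AND to_id = ?",
        (from_id, to_id),
    ).fetchone()
    return default if row is None else row[0]


def set_strength(con, table, from_id, to_id, strength):
    """Insert or update one connection strength."""
    _check_table(table)
    row = con.execute(
        f"SELECT rowid FROM {table} WHERE from_id = ? AND to_id = ?",
        (from_id, to_id),
    ).fetchone()
    if row is None:
        con.execute(
            f"INSERT INTO {table} (from_id, to_id, strength) VALUES (?, ?, ?)",
            (from_id, to_id, strength),
        )
    else:
        con.execute(
            f"UPDATE {table} SET strength = ? WHERE rowid = ?", (strength, row[0])
        )
```

Storing the weights in the database lets the network persist between searches, so each click can refine it.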
18
Step 2: Feeding Forward Activation Objective: activate the ANN. –Take words as inputs –Activate the links in the network –Produce an output for each URL Activation function: hyperbolic tangent (tanh). X-axis of the plot: total input to the node
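The feed-forward step can be sketched with plain lists as weight matrices. The function name `feed_forward`, the list-of-lists layout, and the example weights are assumptions; only the tanh activation and the words-in, URLs-out structure come from the slide.

```python
import math


def feed_forward(word_inputs, w_in, w_out):
    """Propagate word activations through one hidden layer.

    word_inputs: activation per input word (1.0 for words in the query).
    w_in[i][j]:  strength from word i to hidden node j.
    w_out[j][k]: strength from hidden node j to URL k.
    Returns (hidden activations, URL outputs).
    """
    n_hidden = len(w_out)
    n_urls = len(w_out[0])
    # Each node applies tanh to the weighted sum of its inputs.
    hidden = [
        math.tanh(sum(word_inputs[i] * w_in[i][j] for i in range(len(word_inputs))))
        for j in range(n_hidden)
    ]
    outputs = [
        math.tanh(sum(hidden[j] * w_out[j][k] for j in range(n_hidden)))
        for k in range(n_urls)
    ]
    return hidden, outputs


# Two query words, one hidden node, two candidate URLs (made-up weights).
hidden, outputs = feed_forward([1.0, 1.0], [[0.5], [0.5]], [[1.0, 0.0]])
print(hidden, outputs)
```

tanh keeps every activation between -1 and 1, so the URL outputs can be compared directly as relevance scores.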
19
Step 3: Training with Backpropagation Train the network every time someone performs a search and chooses one of the links. This is the same backpropagation algorithm covered in class. Learning rate = 0.5
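A sketch of one training step, assuming the standard backpropagation rule for tanh units with a target of 1.0 for the clicked URL and 0.0 for the others. The target encoding, the function name `train_step`, and the starting weights are assumptions; the learning rate of 0.5 is the value given on the slide.

```python
import math


def dtanh(y):
    """Slope of tanh expressed in terms of its output y."""
    return 1.0 - y * y


def train_step(inputs, w_in, w_out, targets, rate=0.5):
    """Run one forward pass, then update w_in and w_out in place.

    w_in[i][j]: word i -> hidden j; w_out[j][k]: hidden j -> URL k.
    targets[k]: desired output per URL (1.0 for the clicked one).
    Returns the URL outputs computed before the update.
    """
    n_in, n_hidden, n_out = len(inputs), len(w_out), len(targets)
    hidden = [
        math.tanh(sum(inputs[i] * w_in[i][j] for i in range(n_in)))
        for j in range(n_hidden)
    ]
    outputs = [
        math.tanh(sum(hidden[j] * w_out[j][k] for j in range(n_hidden)))
        for k in range(n_out)
    ]
    # Error signals, computed for both layers before any weight changes.
    out_delta = [dtanh(outputs[k]) * (targets[k] - outputs[k]) for k in range(n_out)]
    hid_delta = [
        dtanh(hidden[j]) * sum(out_delta[k] * w_out[j][k] for k in range(n_out))
        for j in range(n_hidden)
    ]
    for j in range(n_hidden):
        for k in range(n_out):
            w_out[j][k] += rate * out_delta[k] * hidden[j]
    for i in range(n_in):
        for j in range(n_hidden):
            w_in[i][j] += rate * hid_delta[j] * inputs[i]
    return outputs


# One query word, two hidden nodes, two URLs; the user clicks URL 0.
w_in = [[0.1, -0.1]]
w_out = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(10):
    train_step([1.0], w_in, w_out, targets=[1.0, 0.0])
after = train_step([1.0], w_in, w_out, targets=[1.0, 0.0])
print(after)
```

After repeated clicks on the same result, the output for that URL rises above the others for this query.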
20
Results For Neural Network Step 1: connection tables (From ID, To ID, Strength) and hidden node Step 2: relevance of each URL for the input Step 3: Training with one query
21
Results For Neural Network (contd.) Step 3: Training with more queries
22
Content-Based Ranking Basic Idea: Calculate a score based only on the query and the content of the page –Word frequency –Document location –Word distance
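The three content-based metrics can be sketched as scoring functions over query-match rows of the form (url_id, location of word 1, location of word 2, ...). The row format, the function names, and the normalization to a 0..1 range (including the +1 smoothing for "smaller is better" scores) are assumptions chosen for illustration.

```python
from collections import Counter


def normalize(scores, small_is_better=False):
    """Scale scores so the best page gets 1.0."""
    if not scores:
        return {}
    if small_is_better:
        best = min(scores.values())
        # +1 avoids division by zero when a word sits at location 0.
        return {u: (best + 1.0) / (s + 1.0) for u, s in scores.items()}
    best = max(scores.values())
    return {u: s / best if best else 0.0 for u, s in scores.items()}


def frequency_score(rows):
    """More matches of the query words on a page means a higher score."""
    return normalize(dict(Counter(row[0] for row in rows)))


def location_score(rows):
    """Query words appearing earlier in the document score higher."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        best[row[0]] = min(total, best.get(row[0], total))
    return normalize(best, small_is_better=True)


def distance_score(rows):
    """Query words appearing closer together score higher."""
    if not rows:
        return {}
    if len(rows[0]) <= 2:
        # Single-word query: distance is meaningless, all pages tie.
        return {row[0]: 1.0 for row in rows}
    best = {}
    for row in rows:
        gap = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        best[row[0]] = min(gap, best.get(row[0], gap))
    return normalize(best, small_is_better=True)


# Made-up rows for a two-word query: (url_id, loc_word1, loc_word2).
rows = [(1, 0, 1), (1, 0, 5), (2, 10, 30)]
freq = frequency_score(rows)
loc = location_score(rows)
dist = distance_score(rows)
print(freq, loc, dist)
```

Because every metric is scaled to the same range, the scores can be combined as a weighted sum to produce a single ranking.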
23
Reference –Programming Collective Intelligence, Toby Segaran –SQLite Tutorial, ZetCode –Dive into Python, Mark Pilgrim Software –Ubuntu 11.04 –Python 2.7.3 –SQLite
24
Thank you.