1. Object-Level Vertical Search
Zaiqing Nie, Microsoft Research Asia
2. Outline
Overview
Demo: Libra Academic Search
Core Technologies
– Vision-based Page Segmentation
– Web Object Extraction
– Object Integration
– Object Ranking
– Object Mining
Conclusion
4. Terminology
Web Object
– A collection of (semi-)structured Web information about a real-world object
– e.g. person, product, job, movie, restaurant, …
Object-Level Search
– Search based on Web objects
Vertical Search
– Search information in a specific domain
5. General Web Search (Google)
6. Page-Level Vertical Search (Google Scholar)
7. Object-Level Vertical Search (MSRA Libra)
8. Object-Level Search vs. Page-Level Search
Technology
– Page-Level: Information Retrieval (IR); the page as the unit of retrieval
– Object-Level: Database (DB); the object as the unit of retrieval
Pros
– Page-Level: ease of authoring, ease of use
– Object-Level: powerful query capability, direct answers, aggregate answers
Cons
– Page-Level: limited query capability; sifting through hundreds of (irrelevant) pages
– Object-Level: where and how to get the objects? (A large portion of Web content is inherently (semi-)structured.)
9. Why Vertical Search?
Sites dedicated to a specific domain
Audiences with specific interests
Easier to build object-level search in a domain
– Data in some domains are more structured and uniform
– Easy to define object types and schemas
10-11. Web Search 2.0 -> Web Search 3.0
(Diagram: two axes, general vs. vertical search and page-level vs. object-level. Current Web search is page-level general search; Google Scholar is page-level vertical search; Libra Academic Search is object-level vertical search. From relevance to intelligence.)
12. Goal of Object-Level Vertical Search
Make the Web search engine
– as scalable as an IR system
– as effective as a DB system
13. Commercial Data Statistics
Manually labeled Web pages using a predefined categorization; initial results from manually classifying 51K randomly selected pages (22.5K in English).

Page Category | # English Pages | % of Total | # Pages in 9B index (all languages)
Product transactional pages | 1,835 | 8.15% | 733M
Product list pages | 995 | 4.4% | 396M
Product review pages | 196 | 0.9% | 81M
Commercial services | 196 | 0.9% | 81M
Hotels, travel packages, tickets | 131 | 0.6% | 54M
Location | 112 | 0.5% | 45M
Newspapers | 92 | 0.4% | 36M
Job listings | 18 | 0.08% | 7.2M
Music | 16 | 0.07% | 6.3M
Movie | 15 | 0.07% | 6.3M
14. Architecture
Web object crawling and classification
Extractors: paper, author, conference, location, product
Integration: paper, author, conference, location, product
Warehouses: scientific Web object warehouse, product object warehouse
On top of the Web objects: PopRank (object relevance), object community mining, object categorization
15. Outline: Demo – Libra Academic Search
16-30. Demo screenshots: http://libra.msra.cn
31. Outline: Core Technologies – Vision-based Page Segmentation
32. Motivation
Problems of treating a Web page as an atomic unit
– A Web page usually contains more than pure content: noise such as navigation, decoration, and interaction elements
– A page often covers multiple topics
A Web page has internal structure
– A two-dimensional logical structure and a visual layout presentation
– More structured than a free-text document, less structured than a structured document
Layout: the 3rd dimension of a Web page
– 1st dimension: content
– 2nd dimension: hyperlinks
33. Object Information on the Web
Scientific papers, researchers, product items, business locations, images, jobs
34. Is DOM a Good Representation of Page Structure?
Page segmentation using DOM
– Extract structural tags such as P, TABLE, UL, TITLE, H1-H6, etc.
Page segmentation using DOM, content, and links
– Record boundary discovery by heuristics
– Fine-grained topic distillation by link analysis
Function-based Object Model (FOM)
– Define a function for each object and partition the page based on these functions
DOM is more related to content display and does not necessarily reflect semantic structure
How about XML?
35. Vision-based Content Structure
Goal: extract the content structure of a Web page based on visual cues
– Typical visual cues: position, lines, blank areas, color, font size, images, …
– Extracted from the rendering result of Web browsers
Assumption: content structure based on visual display reflects the semantic partition of the content, and visual cues help to build that content structure
36. Definition of Vision-based Content Structure
A hierarchical structure of layout blocks
– A layout block is a basic object or a group of basic objects
– A basic object is a leaf node in the DOM tree of the page
Can be formally described by a triple (B, S, R):
– B is a finite set of layout blocks
– S is a finite set of visual separators
– R describes the relationship that two layout blocks are separated by a visual separator
Each layout block is a sub-web-page with a similar internal structure
37. An Example of Vision-based Content Structure
A hierarchical structure of layout blocks
A Degree of Coherence (DoC) is defined for each block
– Shows the intra-block coherence
– The DoC of a child block must be no less than its parent's
A Permitted Degree of Coherence (PDoC) can be pre-defined to achieve different granularities of the content structure
– Segmentation stops only when every block's DoC is no less than the PDoC
– The smaller the PDoC, the coarser the content structure
38. VIPS (VIsion-based Page Segmentation): An Algorithm to Effectively Extract Content Structure
Steps, iterated:
Visual block extraction
– Iteratively find all appropriate visual blocks
– Visual cues at this stage: tag, color, text, size
Visual separator detection
– A separator is a horizontal or vertical line that visually crosses no blocks
– Set a weight for each separator according to some patterns
Content structure construction
– Maximally weighted separators are chosen as the real separators
– Blocks that are not separated are merged
– Calculate the DoC for each block
– Check each block against the granularity requirement (DoC > PDoC); iteratively partition those that fail
39. The VIPS Algorithm Flowchart (figure)
40. Step 1: Visual Block Extraction
Find iteratively all appropriate visual blocks contained in the current sub-tree
Visual cues used to decide whether to divide a DOM node:
– Tag cue: tags such as HR are often used to separate different topics visually
– Color cue: DOM nodes with different background colors
– Text cue: if most of the children of a DOM node are text nodes, we prefer not to divide it
– Size cue: a predefined relative size threshold (relative to the whole page or sub-page) for different tags
41. Step 2: Visual Separator Detection
A visual separator is represented by (P_s, P_e)
– P_s is the start pixel and P_e is the end pixel
– Separators are horizontal or vertical lines in a Web page that visually cross no blocks in the pool
Two parts: separator detection and separator weight setting
42. Step 2: Visual Separator Detection (cont.)
Separator detection
– Start with only one separator (P_tl, P_br) covering the whole region
– Add blocks into the pool one by one and update the separators:
  If the block is contained in a separator, split the separator
  If the block crosses a separator, update it
  If the block covers a separator, remove it
– Remove the separators on the border
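The split/update/remove rules above can be sketched in one dimension (a toy model: horizontal separators only, with blocks reduced to their vertical extents; the real algorithm works on 2-D rectangles):

```python
# 1-D sketch of VIPS separator detection. A separator is a (start, end)
# pixel band; we start with one separator spanning the whole page and
# shrink/split/remove it as each block is added to the pool.

def update_separators(separators, block_top, block_bottom):
    """Update the list of horizontal separators after adding one block."""
    updated = []
    for start, end in separators:
        if block_top > end or block_bottom < start:
            updated.append((start, end))          # block does not touch it
        elif block_top > start and block_bottom < end:
            updated.append((start, block_top))    # contained: split in two
            updated.append((block_bottom, end))
        elif block_top <= start and block_bottom >= end:
            pass                                  # block covers it: remove
        elif block_top <= start:
            updated.append((block_bottom, end))   # block crosses the top edge
        else:
            updated.append((start, block_top))    # block crosses the bottom edge
    return updated

def detect_separators(page_height, blocks):
    """blocks: list of (top, bottom) extents; returns inner separators."""
    seps = [(0, page_height)]                     # one separator at the start
    for top, bottom in blocks:
        seps = update_separators(seps, top, bottom)
    # remove the separators that touch the page border
    return [(s, e) for s, e in seps if s > 0 and e < page_height]
```

For two blocks spanning pixels 10-30 and 50-70 on a 100-pixel page, the only inner separator left is the band (30, 50) between them.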
43. Step 2: Visual Separator Detection (cont.)
Setting a weight for each separator, based on:
– The size of the blank area between blocks
– Overlap of the separator with tags such as HR
– Difference of font sizes between the two sides of the separator
– Difference of colors between the two sides of the separator
44. Step 3: Content Structure Construction
Maximally weighted separators are chosen as the real separators
Blocks that are not separated are merged
A content structure is built at this level
Each sub-block is checked against the granularity requirement
– Those that fail are iteratively partitioned
– When all sub-blocks meet the requirement, we have the final content structure for the Web page
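The overall control loop, iterated until every block is coherent enough, can be sketched as follows (a toy model: blocks carry a pre-computed DoC and candidate children, standing in for the real visual analysis and separator-based partitioning):

```python
# Minimal sketch of the VIPS stopping rule: keep partitioning any block
# whose Degree of Coherence (DoC) is below the permitted DoC (PDoC).

def segment(block, pdoc):
    """Return the content structure rooted at `block`."""
    if block["doc"] >= pdoc or not block.get("children"):
        # coherent enough (or atomic): this block becomes a leaf
        return {"name": block["name"], "doc": block["doc"]}
    # not coherent enough: partition and recurse into each sub-block
    return {
        "name": block["name"],
        "doc": block["doc"],
        "children": [segment(child, pdoc) for child in block["children"]],
    }

# Toy page: note each child's DoC is no less than its parent's,
# as the definition requires.
page = {
    "name": "page", "doc": 0.2, "children": [
        {"name": "nav", "doc": 0.9},
        {"name": "body", "doc": 0.5, "children": [
            {"name": "article", "doc": 0.8},
            {"name": "comments", "doc": 0.7},
        ]},
    ],
}
```

With PDoC = 0.6 the page is split into "nav" (a leaf) and "body", and "body" is split again; with PDoC = 0.1 the whole page already satisfies the requirement and stays one block, illustrating that a smaller PDoC yields a coarser structure.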
45. A VIPS Example (figure)
46. Advantages of the VIPS Algorithm
Accuracy: blocks at the semantic level
Efficiency: truly top-down; cuts down page rendering and display time
Scalability: blocks of various granularities
47. Example of Web Page Segmentation (1)
48. Example of Web Page Segmentation (2)
– Can be applied to Web image retrieval: surrounding-text extraction
49. Experiments
Manual evaluation of page segmentation: 140 pages selected from 14 Yahoo! categories

Human judgment | Number of pages
Perfect | 86
Satisfactory | 50
Failed | 4
50. Web Page Block: a Better Information Unit
Page segmentation: vision-based approach (WWW '03 paper)
Block importance modeling: statistical learning (WWW '04 paper)
(Figure: Web page blocks labeled Importance = High / Med / Low)
51. Block Importance (WWW '04)
Page importance
– Important page vs. unimportant page
– HITS, PageRank
Block importance
– Valuable information vs. noisy information
– ?
52. A User Study of Block Importance
Do people have consistent opinions about the importance of the same block in a page?
Subjective importance
– From the users' view
– Attention: concentration of mental powers on an object; close or careful observing or listening
– Affected by users' purposes and preferences
Objective importance
– From the authors' view
– The degree of correlation between a block and the theme of the Web page
53. Settings of the User Study
Data
– 600 Web pages from 405 sites in 3 Yahoo! categories: news, science, and shopping
– Each category includes 200 pages, with diverse layouts and contents
– The 600 pages are segmented into 4,539 blocks using VIPS
Importance labeling
– 5 human assessors manually label each block with a 4-level importance value
– Level 1: noisy information such as advertisements, copyright, decoration, etc.
– Level 2: useful information that is not very relevant to the topic of the page, such as navigation, directories, etc.
– Level 3: information relevant to the theme of the page but not prominently important, such as related topics, topic indexes, etc.
– Level 4: the most prominent part of the page, such as headlines and main content
54. Result Analysis
Users do have consistent opinions when judging the importance of blocks.

Levels | 3/5 agreement | 4/5 agreement | 5/5 agreement
1, 2, 3, 4 | 0.929 | 0.535 | 0.237
1, (2,3), 4 | 0.995 | 0.733 | 0.417
(1,2,3), 4 | 1 | 0.932 | 0.828
55. Result Analysis (cont.)
Levels | 3/5 agreement | 4/5 agreement | 5/5 agreement
(1,2), 3, 4 | 0.965 | 0.76 | 0.562
1, (2,3), 4 | 0.995 | 0.733 | 0.417
1, 2, (3,4) | 0.963 | 0.614 | 0.318
(1,3), 2, 4 | 0.965 | 0.553 | 0.244
1, 3, (2,4) | 0.965 | 0.555 | 0.248
(1,4), 2, 3 | 0.934 | 0.539 | 0.24
Levels 2 and 3 are the blurriest to distinguish.
56. Block Importance Model
A block importance model is formalized as a function f that maps a block, represented by its features, to an importance value.
Block features
– Content features
  Absolute: ImgNum, ImgSize; LinkNum, LinkTextLength; InnerTextLength; InteractionNum, InteractionSize; FormNum, FormSize
  Relative: relative versions of the above
57. Block Importance Model (cont.)
Block features
– Spatial features
  Absolute: BlockCenterX, BlockCenterY, BlockRectWidth, BlockRectHeight
  Relative: BlockCenterX/PageWidth, BlockCenterY/PageHeight, BlockRectWidth/PageWidth, BlockRectHeight/PageHeight
  Window: BlockRectHeight is normalized by WindowHeight, and BlockCenterY is modified accordingly
58. Learning Block Importance
Training set T: labeled blocks (x, y)
– x: feature representation of a block
– y: importance label
Learn a function f such that the error of f(x) against y over T is minimized
Learning algorithms
– Regression by neural network (RBF network)
– Classification by Support Vector Machines (linear kernel, Gaussian RBF kernel)
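As an illustration of the "classification by SVM (linear kernel)" option, here is a toy linear SVM trained by hinge-loss subgradient descent on two hand-made block features (relative vertical position and relative size); the paper's actual setup used the richer feature set above, multi-level labels, and standard SVM tooling:

```python
# Toy linear SVM for block importance: label +1 = important, -1 = noise.
# Features per block: [relative center-Y, relative area].

def train_linear_svm(data, labels, lam=0.01, epochs=200, lr=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:   # inside the margin: hinge-loss subgradient step
                w = [w[i] + lr * (y * x[i] - lam * w[i]) for i in range(2)]
                b += lr * y
            else:            # outside the margin: only the regularizer acts
                w = [w[i] - lr * lam * w[i] for i in range(2)]
    return w, b

def predict(model, x):
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Toy data: blocks near the page center with large area are important;
# blocks near the bottom with small area are noise.
X = [[0.5, 0.6], [0.4, 0.5], [0.95, 0.05], [0.9, 0.1]]
y = [1, 1, -1, -1]
model = train_linear_svm(X, y)
```

After training, the model separates the two toy classes by position and size, mirroring the finding (slide 61) that spatial features alone already carry much of the signal.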
59. Experiments
Experimental setup
– 600 labeled Web pages from 405 sites
– 4,517 blocks for which at least 3 of the 5 assessors agreed on the importance
– 5-fold cross-validation
– Measures: Micro-F1 and Micro-Accuracy
60. 3-level vs. 4-level Importance
For the 4-level model, the precision and recall of levels 2 and 3 are much lower than those of levels 1 and 4. Combining levels 2 and 3 increases performance significantly.

Model | Per-level precision/recall | Micro-F1 | Micro-Acc
4-level | L1 0.708/0.782, L2 0.643/0.658, L3 0.567/0.372, L4 0.826/0.822 | 0.685 | 0.843
3-level | L1 0.763/0.776, L(2,3) 0.796/0.804, L4 0.839/0.770 | 0.790 | 0.859
61. Spatial Features vs. All Features
Content features do provide complementary information to spatial features in measuring block importance.

Features | L1 P/R | L(2,3) P/R | L4 P/R | Micro-F1 | Micro-Acc
Spatial | 0.714/0.684 | 0.754/0.769 | 0.805/0.841 | 0.748 | 0.832
All | 0.763/0.776 | 0.796/0.804 | 0.839/0.770 | 0.790 | 0.859
62. Block Importance Model vs. Human Assessors
Assessor | L1 P/R | L(2,3) P/R | L4 P/R | Micro-F1 | Micro-Acc
Assessor 1 | 0.817/0.856 | 0.871/0.857 | 0.934/0.871 | 0.858 | 0.906
Assessor 2 | 0.756/0.834 | 0.815/0.782 | 0.816/0.715 | 0.792 | 0.861
Assessor 3 | 0.864/0.815 | 0.838/0.881 | 0.852/0.809 | 0.849 | 0.899
Assessor 4 | 0.904/0.684 | 0.797/0.908 | 0.827/0.912 | 0.830 | 0.887
Assessor 5 | 0.849/0.924 | 0.895/0.882 | 0.938/0.762 | 0.882 | 0.921
Average | 0.838/0.823 | 0.843/0.862 | 0.873/0.814 | 0.842 | 0.895
Our model | 0.763/0.776 | 0.796/0.804 | 0.839/0.770 | 0.790 | 0.859
63. Block-level Link Analysis (figure: link structure among pages A, B, C and their blocks)
64. A Sample of User Browsing Behavior (figure)
65. Improving PageRank Using Layout Structure
Z: block-to-page matrix (link structure)
X: page-to-block matrix (layout structure)
Block-level PageRank
– Compute PageRank on the page-to-page graph induced by blocks (composing X and Z)
BlockRank
– Compute PageRank on the block-to-block graph (composing Z and X)
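A minimal sketch of block-level PageRank under this composition, assuming toy X and Z matrices (the real matrices come from layout analysis and link extraction):

```python
# Block-level PageRank sketch: build the page-to-page graph W = X * Z
# (page -> its blocks -> pages those blocks link to), then run PageRank.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def pagerank(W, d=0.85, iters=100):
    """Power iteration on a row-stochastic page-to-page matrix W."""
    n = len(W)
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [(1 - d) / n + d * sum(r[i] * W[i][j] for i in range(n))
             for j in range(n)]
    return r

X = [[0.7, 0.3, 0.0],   # page 0 distributes importance over its two blocks
     [0.0, 0.0, 1.0]]   # page 1 has a single block
Z = [[0, 1],            # block 0 (on page 0) links to page 1
     [1, 0],            # block 1 (on page 0) links to page 0
     [1, 0]]            # block 2 (on page 1) links to page 0
W = matmul(X, Z)        # page-to-page graph induced by blocks
r = pagerank(W)
```

Because X weights each page's outgoing links by block layout importance, two pages with identical raw links can receive different ranks, which is the point of the block-level variant.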
66. Using Block-level PageRank to Improve Search
Block-level PageRank achieves a 15-25% improvement over PageRank (SIGIR '04)
Combined ranking: Score = λ × IR_Score + (1 − λ) × PageRank score, with block-level PageRank substituted for PageRank
67. Outline: Core Technologies – Web Object Extraction (demo first)
68. Extracting Objects from the Web
Zaiqing Nie, Fei Wu, Ji-Rong Wen, and Wei-Ying Ma
69. Collecting Object Information
Data feeds?
– Limited coverage
– Fail to handle the "tail"
Data crawling
– The largest index wins
– Data refreshing
Mining Web objects
– A bridge between unstructured and structured data
– Deals with data of huge volume
– Adapts to the highly diverse and dynamic Web environment
71. Existing Approaches
Basic idea
– Convert HTML into a sequence of tokens or a tag tree
– Discover patterns
Representative methods
– Wrapper generation
  Manually written wrappers
  Wrapper induction [Liu 2000], [Kushmerick 1997]
– Extracting structured data from Web pages that share a common template
  Equivalence classes [Arasu 2003]
  RoadRunner [Crescenzi 2001]
– Extracting data records within a Web page
  OMINI: record-boundary discovery
  IEPAD: pattern discovery on a PAT tree
  MDR: repeated-node discovery
– Extracting data from tables in a Web page
  Classifying tables into genuine and non-genuine tables [Wang 2002]
  Extracting data from data tables [Chen 2002], [Lerman 2001]
72-75. Problems with Existing Approaches (example figures)
76. Vision-based Approach for Web Object Extraction
Pipeline: object blocks → visual element identification → similarity measure & clustering → record identification & extraction
77. Object Block and Object Element (figure)
78. Object-level Information Extraction (IE): the Problem
(Figure: a digital-camera object block with elements e1-e6 to be labeled with attributes a1-a6: name, price, description, brand, rating, image)
79. Object Extraction as Sequence Data Labeling
Sequence characteristics: probability that the first attribute appears before the second

Product (100 product pages, 964 product blocks):
(name, desc) 1.000; (name, price) 0.987; (image, name) 0.941; (image, price) 0.964; (image, desc) 0.977
Researcher (120 researchers' homepages, 120 homepage blocks):
(name, tel) 1.000; (name, email) 1.000; (name, address) 1.000; (address, email) 0.847; (address, tel) 0.906
80. Extended Conditional Random Fields
Our solution is based on an Extended Conditional Random Fields (ECRF) model
81. Example Features in the Extended CRF Model
Text features
– Contains only "$" and digits
– Percentage of digits in the element, …
Vision features
– Font size, color, style, …
– Element size and position
– Separators: lines, blank areas, images, …
Database features
– Matches an attribute of a record
– Matches the key attributes of a record, …
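A sketch of what extracting such per-element features might look like (the element representation and the known-brand lookup table are assumptions for illustration; the real ECRF model consumes features of this kind through feature functions over label assignments):

```python
import re

def element_features(el):
    """Compute text, vision, and database features for one element.
    `el` is assumed to carry the element's text plus a few rendered
    visual properties."""
    text = el["text"]
    digits = sum(ch.isdigit() for ch in text)
    return {
        # text features
        "only_dollar_and_digits": bool(re.fullmatch(r"[$\d.,]+", text)),
        "digit_ratio": digits / max(len(text), 1),
        # vision features
        "font_size": el["font_size"],
        "is_bold": el["bold"],
        "area": el["width"] * el["height"],
        # database feature (hypothetical lookup of known attribute values)
        "matches_known_brand": text.strip() in {"Canon", "Nikon", "Sony"},
    }

f = element_features({"text": "$399.99", "font_size": 12, "bold": False,
                      "width": 60, "height": 14})
```

For the element "$399.99" the price-indicating text features fire while the brand lookup does not, which is exactly the kind of evidence the model uses to label it as a price.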
82. Experiment
Three kinds of objects are selected: paper headers, citations, and homepages
83. Experimental Results (figure)
84. Information Integration
(Figure: digital-camera records with name, price, description, brand, rating, and image, integrated from Website 1 … Website N into a database that is fed back to improve the IE process)
85. Results with DBs of Various Sizes (figure)
86. 2D Conditional Random Fields for Web Information Extraction
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma
August 9, 2005
87. Limitations of Linear-Chain CRFs
Sequentialization: attributes are laid out two-dimensionally
– Which sequentialization is better?
– Two-dimensional interactions are seriously lost
88. 2D Conditional Random Fields (model figure)
89. Inference on Diagonal State Sequences (figure)
90. Modeling an Object Block (figure)
91. Experiment: Dataset
Randomly crawled 572 Web pages and collected 2,500 Web blocks using the vision-based segmentation technology
Two types of Web blocks
– ODS: one-dimensional blocks (information without two-dimensional interactions)
– TDS: two-dimensional blocks (information with two-dimensional interactions)
Training set: 500 Web blocks (400 TDS + 100 ODS)
Testing sets: ODS (1,000) and TDS (1,000)
92. Experiment: Evaluation Criteria
Precision: the percentage of returned elements that are correct
Recall: the percentage of correct elements that are returned
F1: the harmonic mean of precision and recall
Average F1: the average of the F1 values of the different attributes
Block instance accuracy: the percentage of blocks for which the important attributes (name, image, and price) are all labeled correctly
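The first three measures written out (a sketch; `returned` and `correct` are assumed to be sets of extracted-element identifiers):

```python
def precision_recall_f1(returned, correct):
    """Precision, recall, and their harmonic mean over two sets."""
    tp = len(returned & correct)                     # true positives
    p = tp / len(returned) if returned else 0.0      # precision
    r = tp / len(correct) if correct else 0.0        # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f1

p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

Here 3 of the 4 returned elements are correct (precision 0.75) and 3 of the 5 correct elements are returned (recall 0.6), giving F1 = 2/3.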
93-94. Experiment: Results (figures)
95. Simultaneous Record Detection and Attribute Labeling in Web Data Extraction
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma
SIGKDD 2006
96. De-coupled Web Object Extraction
Papers: Nie et al., ICDE '06 and Zhu et al., ICML '05
– Technical Transfer of the Year Award (MSRA 2006)
Basic idea
– Step 1: record detection and element segmentation
– Step 2: attribute labeling
– (Kristjansson et al., AAAI '04; Nie et al., ICDE '06; Zhu et al., ICML '05)
97. Inefficiencies
Error propagation
– Limits overall performance
Lack of semantics in record detection
– Semantics help identify records
Lack of mutual interactions in attribute labeling
– Records in the same page are related and mutually constrained
First-order Markov assumption
– Fails to incorporate long-distance dependencies
98. Vision-based Page Representation
A Web page can be represented as a vision-tree [Cai et al., 2004]
– Makes use of page layout features such as font, color, and size
– Each node represents a data region in the Web page, called a block
99. Joint Web Object Extraction
Definition 1 (record detection): given a vision-tree, record detection is the task of locating the minimum set of blocks that contain the content of a record.
Definition 2 (attribute labeling): for each identified record, attribute labeling is the process of assigning attribute labels to the leaf blocks (or elements) within the record.
Definition 3 (joint optimization of record detection and attribute labeling): let x be the features of all the blocks and y be one possible label assignment of the corresponding blocks; find the assignment that maximizes p(y | x). Let the Hierarchical CRF model do it!
100. Hierarchical CRF Model
Assumptions
– Sibling variables interact directly
– Non-sibling variables are conditionally independent
Parameter estimation and labeling can be solved using the standard junction tree algorithm
Details in our paper
101-103. HCRF Model for Web Object Extraction
Inter-level interactions
– Dependencies between parents and children
– Different from the multi-scale CRF model [He et al., 2004]
Long-distance dependencies
– Through the dependencies at various levels and the inter-level interactions
Flexibility to incorporate any useful feature
– The HCRF model is a conditional model, and also a CRF model
Computational efficiency
(Figures: inner nodes labeled "contains name and price", "contains description", "contains image"; leaf nodes labeled product name, product image, product description, product price)
104. Empirical Evaluation
Two datasets for two types of Web pages
– List dataset (LDST): 771 list pages (200 for training, 571 for testing)
– Detail dataset (DDST): 450 detail pages (150 for training, 300 for testing)
105. Outline: Core Technologies – Object Integration
106. Motivation
Which Lei Zhang are we talking about?
107. Object Identification
Only text similarity is used in existing approaches
Connection strength in the object relationship graph is another important piece of evidence
The author identification problem
– Multiple researchers with the same name (e.g., Lei Zhang)
– Multiple names for the same researcher (e.g., Alon Levy and Alon Y. Halevy)
– Identifying all papers by a researcher through paper and conference connections
108. Web Connections
Local information is incomplete
The Web is a good source for validating connections between objects
– Co-appearance in the same sentence, Web page, or website
109. Outline: Core Technologies – Object Ranking
110. Object-level Ranking: Bringing Order to Web Objects
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma
Microsoft Research Asia (presented by Zaiqing Nie)
111. Object Relationship Graph
Different types of links
– paper→paper, author→paper, conference→paper
– have different semantics
– affect the popularity of the related objects differently
112. Back-Links of an Object on the Web
Add a popularity propagation factor (PPF) to each relationship link
– Links of the same type have the same factor
The popularity of an object is also affected by the popularity of the Web pages containing the object
113. Biased Random Surfing Behavior
A random object finder model
– Starts a random walk on the Web to find the first seed object
– Then follows only the relationship links
– Eventually gets bored
– Restarts the random walk on the Web to find another seed object
The popularity of an object depends on
– the probability of finding the object through the Web graph
– the probability of finding the object through the object relationship graph
114. The PopRank Model
R_X = ε R_EX + (1 − ε) Σ_Y γ_YX M_YX^T R_Y
where
– X = {x_1, …, x_n} and Y = {y_1, …, y_n} are the objects of type X and type Y
– R_X and R_Y are the vectors of popularity rankings of objects of type X and type Y
– M_YX are adjacency matrices: m_yx = 1 / Num_X(y) if there is a relationship link from object y to object x, where Num_X(y) denotes the number of links from object y to objects of type X; m_yx = 0 otherwise
– γ_YX is the popularity propagation factor of relationship links from objects of type Y to objects of type X
– R_EX is the vector of Web popularity of objects of type X
– ε is a damping factor: the probability that the "random object finder" gets bored of following the object relationship links and starts looking for another object through the Web graph
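The fixed point of the equation above can be computed by simple iteration. The sketch below uses a toy graph with one conference and two papers; the γ values and Web-popularity vector R_E are illustrative assumptions:

```python
# Pure-Python PopRank iteration over typed objects and typed links.

def poprank(R_E, edges, gammas, eps=0.15, iters=200):
    """R_E: {type: [web popularity per object]}
    edges: {(src_type, dst_type): [(src_idx, dst_idx), ...]}
    gammas: {(src_type, dst_type): popularity propagation factor}"""
    R = {t: list(v) for t, v in R_E.items()}
    for _ in range(iters):
        # start each update from the Web-popularity term eps * R_E
        new = {t: [eps * e for e in R_E[t]] for t in R_E}
        for (ys, xs), links in edges.items():
            # out-degree of each source object toward type xs (Num_X(y))
            out = {}
            for yy, _ in links:
                out[yy] = out.get(yy, 0) + 1
            for yy, xx in links:
                new[xs][xx] += (1 - eps) * gammas[(ys, xs)] * R[ys][yy] / out[yy]
        R = new
    return R

# Toy graph: conference c0 published papers p0 and p1; p1 cites p0,
# so a paper->paper link carries popularity from p1 to p0.
R = poprank(
    R_E={"conf": [1.0], "paper": [0.5, 0.5]},
    edges={("conf", "paper"): [(0, 0), (0, 1)],
           ("paper", "paper"): [(1, 0)]},
    gammas={("conf", "paper"): 0.3, ("paper", "paper"): 0.3},
)
```

At the fixed point p0 outranks p1: both get the same conference share, but only p0 also receives citation popularity.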
115. How to Assign PPF Factors
Impractical to assign PPF factors manually
Easy to collect partial rankings of the objects from domain experts
– An example: SIGMOD -> VLDB -> ICDE -> ER
A typical parameter-optimization problem
– Select the combination of PPF factors for the PopRank model whose rankings of the training objects match the expert ranking as closely as possible
– Explore the search space using a simulated annealing approach
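A sketch of the simulated-annealing search, with a toy one-dimensional "ranking distance" standing in for the expensive PopRank-plus-expert-ranking evaluation (the single PPF and its optimum at 0.4 are assumptions of the toy objective):

```python
import math
import random

def anneal(objective, init, neighbors, temp=1.0, cooling=0.95, steps=200):
    """Minimize `objective` by simulated annealing: always accept a
    better neighbor, accept a worse one with probability exp(-delta/T)."""
    random.seed(0)  # deterministic run for the sketch
    best = cur = init
    for _ in range(steps):
        cand = random.choice(neighbors(cur))
        delta = objective(cand) - objective(cur)
        if delta < 0 or random.random() < math.exp(-delta / max(temp, 1e-9)):
            cur = cand
        if objective(cur) < objective(best):
            best = cur
        temp *= cooling  # cool down: fewer uphill moves over time
    return best

# Toy "ranking distance": minimized when the PPF equals 0.4.
def ranking_distance(ppf):
    return abs(ppf - 0.4)

def neighbors(ppf):
    return [round(min(1.0, ppf + 0.05), 2), round(max(0.0, ppf - 0.05), 2)]

best = anneal(ranking_distance, 0.0, neighbors)
```

The early high-temperature phase lets the search escape poor regions; as the temperature drops it becomes greedy and settles near the optimum.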
116. Searching for PPF Factors (flowchart)
Start with an initial combination of PPFs; compute PopRank over the link graph; estimate the ranking distance to the expert ranking; if better than the best so far, keep it as the best, otherwise accept the worse combination with some probability; then select a new combination from the neighbors of the best.
117. Challenges Facing Our Learning Approach
It may take hours or days to try and evaluate a single combination of PPF factors on a large graph
– Prohibitively expensive to try hundreds of combinations
The effect decreases as the "relationship distance" increases
– Use a subgraph that includes the training objects and their closely related objects to approximate the full graph
118. Subgraph Selection (flowchart)
Start with an initial relationship distance; compute PopRank on the induced subgraph of the link graph; estimate the ranking distance to the ranking from the full graph; if it is greater than the stop threshold, increase the distance and repeat; otherwise done.
119. Experimental Study
Datasets
– 7 million object relationship links of three different types
– 1 million papers, 650,000 authors, 1,700 conferences, and 480 journals
– 14 partial ranking lists containing ranking information for 67 objects (8 lists for training, 6 for testing)
120. Experimental Results for Different Subgraphs (figures: learning time, ranking distance)
121. Experimental Results for Different Stop Thresholds (figures: learning time, ranking distance)
122. PopRank versus PageRank (figure)
123. Conclusion
A PopRank model for calculating object popularity scores
– Combines the Web graph and the object relationship graph
An automated approach for assigning popularity propagation factors
Effectiveness shown in Libra
– Significantly better than PageRank
Generally applicable to most vertical search domains
– Product search, movie search, …
124. Web Object Retrieval
Information about a Web object is extracted from multiple sources
– Inconsistent copies
– The reliability assumption is no longer valid
Inconsistency example:

Source | Title | Authors
Ground truth | Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives | Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David Nagle, Erik Riedel
CiteSeer | Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives | Christopher R. Lumb, Jiri...
DBLP | Towards Higher Disk Head Utilization: Extracting “Free” Bandwidth from Busy Disk Drives | Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David Nagle, Erik Riedel
125. Unreliability of Information about Web Objects
(Figure: a Web object with attributes 1…n, each with an importance weight imp_i, extracted from object blocks 1…m, each with a confidence conf_j)
The unreliability of objects
– Unreliable data sources
– Incorrect object detection
– Incorrect attribute-value extraction
126. Web Object Retrieval
A language model for object retrieval
Balancing structured and unstructured retrieval
– Block-level unstructured object retrieval
– Attribute-level retrieval
– Use the confidence of the extracted object information as the parameter that finds the balance
(Figure: records extracted from Web sources 1…m with weights α_1…α_m and γ_1…γ_m; attributes 1…n with weights β_1…β_n)
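One way the balance might be sketched: interpolate an attribute-level (structured) score and a record-level (unstructured) score by the extraction confidence. The smoothed unigram scoring below is a stand-in for illustration, not the paper's exact estimator:

```python
import math

def lm_score(query_terms, text, mu=1.0):
    """Log-probability of the query under a smoothed unigram model of text."""
    words = text.lower().split()
    total = len(words) + mu * len(set(query_terms))
    return sum(math.log((words.count(t) + mu) / total) for t in query_terms)

def object_score(query, record_text, attributes, conf):
    """Mix attribute-level (structured) and record-level (unstructured)
    scores, weighted by the extraction confidence `conf` in [0, 1]."""
    q = query.lower().split()
    structured = sum(lm_score(q, v) for v in attributes.values()) / len(attributes)
    unstructured = lm_score(q, record_text)
    return conf * structured + (1 - conf) * unstructured

rec = "towards higher disk head utilization extracting free bandwidth"
attrs = {"title": "Towards Higher Disk Head Utilization",
         "authors": "Christopher Lumb"}
s0 = object_score("disk utilization", rec, attrs, 0.0)   # pure unstructured
s1 = object_score("disk utilization", rec, attrs, 1.0)   # pure structured
sh = object_score("disk utilization", rec, attrs, 0.5)   # balanced
```

When extraction is trusted (high confidence) the attribute-level score dominates; when it is not, the score falls back toward treating the record as one bag of words.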
127. Experimental Results
Models compared
– Bag of Words (BW)
– Unstructured Object Retrieval (UOR)
– Multiple Weighted Fields (MWF)
– Structured Object Retrieval (SOR)
– Balancing Structured and Unstructured Retrieval (BSUR)
128. Outline: Core Technologies – Object Mining
129. Object Mining
Object community mining
Relationship mining
Trend analysis
130. Research Community Mining
Motivation: discovering research communities and their important papers and authors
A community is described as a set of concentric circles
– Core objects in the center
– Affiliated objects surround the core with different ranks
131. Demo
132. Conclusion
An object-level vertical search model is proposed
Key technologies for building an object-level vertical search engine
– Object extraction
– Object identification
– Object popularity ranking
– Object community mining
More applications
– Yellow-page search, job search, mobile search, movie search, …
133. Thank you!