Published by Francine Farmer. Modified over 9 years ago.
1
Searching CiteSeer Metadata Using Nutch
Larry Reeve
INFO624 – Information Retrieval
Dr. Lin – Winter 2005
2
CiteSeer
3
CiteSeer
- Search issues
  - Keyword-based full-text search
  - Boolean search syntax
- How to…
  - search by author name?
  - search author affiliation?
  - search by publication date?
4
CiteSeer
- Example:
  - Suggested author search approach:
    - For authors, list all variants that appear in citations, separated by “OR”
  - Examples:
    - m jordan or michael jordan or m i jordan or michael i jordan
    - howard w/2 white or h w/2 white
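The variant-listing approach above can be sketched in a few lines of Python. This is only an illustration of how such an OR query might be assembled; the function names `author_variants` and `author_query` are hypothetical and not part of CiteSeer or the project.

```python
from itertools import product


def author_variants(first, last, middle=None):
    """Return lowercase name variants: initial or full first name,
    with and without a middle initial."""
    first, last = first.lower(), last.lower()
    firsts = [first[0], first]          # "m", "michael"
    middles = [None]
    if middle:
        middles.append(middle.lower()[0])  # middle initial only
    variants = []
    for mid, f in product(middles, firsts):
        parts = [f] + ([mid] if mid else []) + [last]
        variants.append(" ".join(parts))
    return variants


def author_query(first, last, middle=None):
    """Join all variants with "or", as the suggested approach requires."""
    return " or ".join(author_variants(first, last, middle))


print(author_query("Michael", "Jordan", "I"))
# m jordan or michael jordan or m i jordan or michael i jordan
```

The user has to anticipate every variant by hand, which is the burden the metadata fields in this project are meant to remove.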
5
CiteSeer – phrase search
6
CiteSeer – term search
7
Goal
- Search selected metadata fields
  - Author name
  - Author affiliation
  - Publication date (month, day, year)
  - Title
  - Others…
- Increase precision
8
Methodology - Nutch
- An open-source web search engine
- Includes crawling, indexing, searching
- Technologies: Java, JSP, Tomcat
- Extensible
  - new fields
  - new parsing/indexing facilities
  - adapt UI for searching
9
Methodology - Metadata
10
Methodology
1) Split XML file into HTML documents
   - Each HTML doc contains metadata
   - Allows existing crawler to be used/extended
2) Crawl and index HTML documents on local filesystem
3) Search generated index using JSP page
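Step 1 could look roughly like the Python sketch below. The slides do not show the CiteSeer metadata schema or the HTML layout the project used, so the element names (`record`, `authorfirst`, and so on), the `<meta>` tag layout, the file naming, and the sample values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical input; the real CiteSeer XML layout is not shown in the slides.
SAMPLE_XML = """<records>
  <record>
    <title>Sample title one</title>
    <authorfirst>Peter</authorfirst>
    <authorlast>Lee</authorlast>
    <pubyear>2000</pubyear>
  </record>
  <record>
    <title>Sample title two</title>
    <authorfirst>John</authorfirst>
    <authorlast>Smith</authorlast>
    <pubyear>2001</pubyear>
  </record>
</records>"""


def split_records(xml_text):
    """Return {filename: html}, one small HTML page per <record>.

    Each child element becomes a <meta name="cs_..." content="..."> tag
    so that a crawler can fetch the page and a parse filter can read it.
    """
    root = ET.fromstring(xml_text)
    pages = {}
    for i, record in enumerate(root.iter("record"), start=1):
        metas = []
        for field in record:
            name = escape("cs_" + field.tag, quote=True)
            value = escape((field.text or "").strip(), quote=True)
            metas.append(f'<meta name="{name}" content="{value}">')
        title = escape(record.findtext("title", default=""))
        pages[f"record{i:06d}.html"] = (
            "<html><head>"
            f"<title>{title}</title>"
            + "".join(metas)
            + "</head><body></body></html>"
        )
    return pages


pages = split_records(SAMPLE_XML)
print(len(pages), "pages generated")
```

Writing each entry of `pages` to a directory would give the crawler a local filesystem to walk, as in step 2.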
11
Methodology
[Diagram: XML File (100 records) → Split Program → 100 HTML Documents → Nutch Crawler (Parse Filter, Index Filter) → Nutch Search (JSP) (Query Filter). Legend: "Implemented as part of project".]
12
XML to HTML Split
13
Methodology - Split
14
Methodology – Crawl/Index
- Requires 2 filters to process metadata
- CSParseFilter
  - Parses HTML for metadata values
  - Implements Nutch HtmlParseFilter interface
- CSIndexingFilter
  - Uses metadata generated by ParseFilter
  - Adds metadata to index
  - Implements Nutch IndexingFilter interface
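The division of labor between the two filters can be illustrated with a small Python sketch. It is not the Nutch Java API and not the project's code: `MetaExtractor`, `parse_metadata`, and `add_to_index` are hypothetical names, and the assumption that metadata is carried in `<meta name="cs_...">` tags is made for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Parse-filter role: collect <meta name="cs_..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("cs_"):
            self.metadata[name] = attr.get("content") or ""


def parse_metadata(html):
    """Return the cs_ metadata found in one HTML page."""
    extractor = MetaExtractor()
    extractor.feed(html)
    return extractor.metadata


def add_to_index(index, doc_id, metadata):
    """Index-filter role: file each lowercase term under its field name."""
    for field, value in metadata.items():
        for term in value.lower().split():
            index[field][term].add(doc_id)


# field -> term -> set of document ids
index = defaultdict(lambda: defaultdict(set))
page = (
    '<html><head><meta name="cs_authorfirst" content="Peter">'
    '<meta name="cs_authorlast" content="Lee"></head></html>'
)
add_to_index(index, "doc1", parse_metadata(page))
print(sorted(index["cs_authorlast"]["lee"]))
```

Keeping extraction and indexing separate mirrors the slide: the second step only consumes what the first one produced.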
15
Parse Filter – extract metadata
16
Index Filter
18
Methodology – Query
- Modification of Nutch search page
  - Change URL from filesystem metadata HTML to CiteSeer
  - Change to 20 hits, to match CiteSeer
- Query filter
  - Handles custom fields from index filter
  - Prefixed with cs_
  - Implements Nutch QueryFilter interface
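How a fielded query might be resolved against such an index is sketched below in Python. This is not the Nutch QueryFilter interface; `parse_query`, `search`, and the toy `INDEX` are hypothetical. The sketch assumes that all clauses must match (AND), which is consistent with the evaluation results reported later in the deck.

```python
def parse_query(query):
    """Split a query into (field, term) clauses; only cs_ fields are kept."""
    clauses = []
    for token in query.split():
        field, sep, term = token.partition(":")
        if sep and field.startswith("cs_") and term:
            clauses.append((field, term.lower()))
    return clauses


def search(index, query):
    """Return the ids of documents that match every clause."""
    clauses = parse_query(query)
    if not clauses:
        return set()
    result = None
    for field, term in clauses:
        matches = index.get(field, {}).get(term, set())
        result = set(matches) if result is None else result & matches
    return result


# Hypothetical toy index: field -> term -> document ids
INDEX = {
    "cs_authorfirst": {"peter": {"doc1"}, "anna": {"doc2"}, "david": {"doc3"}},
    "cs_authorlast": {"lee": {"doc1", "doc2", "doc3"}},
}

print(sorted(search(INDEX, "cs_authorlast:lee")))
print(sorted(search(INDEX, "cs_authorlast:lee cs_authorfirst:peter")))
```

Adding the first-name clause narrows three matches to one, which is the precision gain the project is after.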
19
Query Filter
20
Evaluation
- Testing for precision/recall
  - 100 documents
- Stress test
  - 10,000 documents
    - Approx 10 mins to crawl/index
  - 575,000 documents in CiteSeer metadata download
    - (716,797 documents in CiteSeer)
    - 3.5 hours to split XML into HTML
    - 12 hours to crawl/index
    - ~551,000 indexed during crawling
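The stress-test figures imply some rough rates, worked out in the Python snippet below. The inputs are the approximate numbers from the slide, so the results are estimates only; the variable names are illustrative.

```python
# Approximate figures taken from the slide.
small_docs, small_minutes = 10_000, 10
download_docs = 575_000
citeseer_docs = 716_797
indexed_docs = 551_000
crawl_hours = 12

small_rate = small_docs / small_minutes           # docs per minute, 10,000-doc run
full_rate = indexed_docs / (crawl_hours * 60)     # docs per minute, full run
indexed_share = indexed_docs / download_docs      # fraction of download indexed
download_share = download_docs / citeseer_docs    # fraction of CiteSeer in download

print(f"10,000-doc run: {small_rate:.0f} docs/min")
print(f"Full run: {full_rate:.0f} docs/min")
print(f"Indexed share of download: {indexed_share:.1%}")
print(f"Download share of CiteSeer: {download_share:.1%}")
```

On these numbers the full crawl ran somewhat slower per document than the 10,000-document run, and roughly 4% of the downloaded records were not indexed.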
21
Evaluation
- Precision & recall
  - Use first 100 docs (easy to measure recall)
  - Issue queries
    - Author last name
    - Author first & last name
    - Author affiliation
- Precision
  - Use max docs in each system
  - Issue author search queries to both systems
  - Measure precision on each page of 20 hits
22
Evaluation – P & R
- Look for all papers where Peter Lee is an author (1 document)
- cs_authorlast:lee
  - Returns 3 documents, all with last name of Lee
  - P=.33, R=1
- cs_authorlast:lee cs_authorfirst:peter
  - Returns single document
  - P=1, R=1
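The two results above follow from the standard definitions: precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that are returned. A short Python check, using hypothetical document ids, reproduces the slide's numbers.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One relevant document: the single paper by Peter Lee (ids are made up).
relevant = {"doc1"}

# cs_authorlast:lee returns three documents, one of them relevant.
p1, r1 = precision_recall({"doc1", "doc2", "doc3"}, relevant)
print(f"P={p1:.2f}, R={r1:.2f}")  # P=0.33, R=1.00

# cs_authorlast:lee cs_authorfirst:peter returns only the relevant document.
p2, r2 = precision_recall({"doc1"}, relevant)
print(f"P={p2:.2f}, R={r2:.2f}")  # P=1.00, R=1.00
```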
23
Evaluation - Precision
- Author search:
  - Q1: Peter Lee
    - Project: cs_authorfirst:peter cs_authorlast:lee
    - CiteSeer: peter w/2 lee
  - Q2: Jeffrey Ullman
    - Project: cs_authorfirst:jeffrey cs_authorlast:ullman
    - CiteSeer: jeffrey w/2 ullman
  - Q3: John Smith
    - Project: cs_authorfirst:john cs_authorlast:smith
    - CiteSeer: john w/2 smith
24
Evaluation - Precision
25
Search Demo
- Available fields:
  - cs_authorfirst
  - cs_authorlast
  - cs_authoraffiliation
  - cs_pubyear
  - cs_pubmonth