Harvesting useful information on researchers' home pages
Ta Nha Linh
Supervisor: Asst. Prof. Min-Yen Kan
TIM, 13 March 2009

Presentation transcript:

Motivation
Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink.
– These databases are publication-centric.
– How about the authors of those publications?

Motivation
Researcher-centric database?
– Singapore Researchers Database: researchers must sign up and enter their own data, under restricted conditions; Singapore only
– Resilience Alliance Researchers Database: manual submission by researchers; ecological and social sciences only
– Some other similar databases: manually updated, specific to a certain organization

Goal
An automated system to build a researchers database, for multiple disciplines.
Where to get the information? Their home pages.
– Basic information
– Contact information
– Educational history
– Publications

Challenges
Different layouts
– Templates
– Personal pages
Different content
– Pages introducing researchers
– CV-like
– Personal pages
Different content structures
– Tables / lists
– Natural language text

Challenges
Different data presentations, for example email addresses written as:
– hangli at microsoft dot com
– cs.duke.edu, junyang
– erafalin(at)cs.tufts.edu
– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
– wmt then the at-sign then uci dot edu
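The slides do not say how TIM handles these variants. As a minimal sketch, assuming only the two simplest styles ("at"/"(at)" and "dot") need to be undone, a normalizer could look like the following. The function name and the regular expressions are hypothetical, not part of the original system; free-form hints such as "then the at-sign then" are deliberately left unresolved.

```python
import re

# Hypothetical rewrite rules, one per obfuscation style (not from the slides).
_AT = re.compile(r"\s*(?:\(at\)|\[at\]|\bat\b)\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*(?:\(dot\)|\[dot\]|\bdot\b)\s*", re.IGNORECASE)
_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def deobfuscate_email(text):
    """Return a plain address if the simple rules recover one, else None."""
    candidate = _AT.sub("@", text.strip())
    candidate = _DOT.sub(".", candidate)
    return candidate if _EMAIL.match(candidate) else None


if __name__ == "__main__":
    for sample in ("hangli at microsoft dot com", "erafalin(at)cs.tufts.edu"):
        print(sample, "->", deobfuscate_email(sample))
```

Returning None for anything that does not end up looking like an address keeps the rules conservative: a wrong address in the database is worse than a missing one.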

System Architecture
– Home page Identification
– Fields Identification (Tagging Core)
– Post Processing

Fields Identification - Purpose
To map data in the page contents to the corresponding fields in a pre-defined set of desired information. The current set includes:
– Name, Position, Affiliation
– Address, Phone, Fax, Email
– BS year, BS major, BS university
– MS year, MS major, MS university
– PhD year, PhD major, PhD university
– Research Interest, Publications

Fields Identification - Related works
Tang et al (2007), (2008) – ArnetMiner
– Preprocessing: tokenize text into 5 categories
– Tagging of tokens using a Conditional Random Field (CRF)
– F1 = 83.37% (~1,000 researchers)
– Set of features used:
  + Content features (word, morphological, image features)
  + Pattern features (positive word, special token, researcher name features)
  + Term features (term, dictionary features)

Fields Identification - Related works
Tang et al (2007), (2008) – ArnetMiner
– Takes the researcher's name as input. This is important information that can be exploited when parsing other fields. TIM does not have it.
– Based only on the text of the page. Stylistic information can also be of use.

Fields Identification - Related works
Cai et al (2003)
– VIsion-based Page Segmentation (VIPS) algorithm to produce a vision-based content structure of a web page
– Makes use of the DOM tree and visual cues on web pages
– May help in narrowing down relevant sections
– Drawback: needs a browser to get the visual information

Fields Identification - Related works
Lee (2004) – PARCELS Stylistic Engine
– Makes use of some heuristics proposed by Cai et al (2003)
– Parses the DOM tree for text-only and stylistic properties
– Text-only data is passed to another engine for further processing
– Stylistic data is stored in a vector for machine learning, to classify sections with a set of domain-specific tags
– The domain used was the news domain

Fields Identification - Method
Input: a researcher home page
CRF is employed as the automated learning model
Features used
– Global features
– Lexicon features
– Context features
– Dictionary features
– Stylistic features

Fields Identification - Method
Global features: apply to the current token
– Morphological features
– Initials
– Number
– Punctuation
Lexicon features: apply to the current token
– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
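The token-level features above can be sketched as a function that maps one token to a dictionary of boolean features. This is only an illustration: the feature names, the exact tests, and especially the `POSITIVE_WORDS` lists are assumptions, since the slides name the feature groups but not their definitions.

```python
import string

# Hypothetical positive-word lists; the slides do not give the real lexicons.
POSITIVE_WORDS = {
    "position": {"professor", "lecturer", "director"},
    "affiliation": {"university", "department", "institute"},
    "phone": {"phone", "tel"},
    "fax": {"fax"},
}


def token_features(token):
    """Global (shape) and lexicon features for a single token."""
    lower = token.lower().strip(string.punctuation)
    feats = {
        # Morphological shape of the token.
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # An initial such as "M." in a person's name.
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        "is_number": token.isdigit(),
        "is_punctuation": bool(token) and all(c in string.punctuation for c in token),
    }
    # One lexicon feature per annotation field.
    for field, words in POSITIVE_WORDS.items():
        feats["positive_" + field] = lower in words
    return feats


if __name__ == "__main__":
    print(token_features("Professor"))
```

In a CRF toolkit each of these values would become one column (or one observation feature) for the token.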

Fields Identification - Method
Context features: apply to the whole line
– Name context
– Address context
– Phone context: 'phone', 'tel', 'mobile'
– Fax context: 'fax', 'facsimile'
– Email context: ' ', ' '
– Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'
– Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'
– Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'
– Research-interest context: multiple-line property
– Publication context: multiple-line property
– Degree: helps to differentiate BS/MS/PhD information correctly when it is presented in prose style or on the same line
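A minimal sketch of the keyword-based context features, using the trigger words listed above. The regular expressions, the feature names, and the choice of case sensitivity are assumptions; the name, address, and multi-line contexts are not covered because the slides do not describe how they are computed.

```python
import re

# Trigger words are taken from the slide; the patterns themselves are assumed.
CONTEXT_PATTERNS = {
    "phone_context": re.compile(r"\b(?:phone|tel|mobile)\b", re.IGNORECASE),
    "fax_context": re.compile(r"\b(?:fax|facsimile)\b", re.IGNORECASE),
    # Degree abbreviations are matched case-sensitively to avoid false hits.
    "bs_context": re.compile(r"\b(?:B\.S|BS|Bachelor)\b"),
    "ms_context": re.compile(r"\b(?:M\.S|MS|Master)\b"),
    "phd_context": re.compile(r"\b(?:Ph\.D|Doctorate|Doctor of Philosophy)\b"),
}


def line_context_features(line):
    """Return one boolean per context; every token on the line shares them."""
    return {name: bool(pat.search(line)) for name, pat in CONTEXT_PATTERNS.items()}


if __name__ == "__main__":
    print(line_context_features("Tel: 1234 5678   Fax: 1234 5679"))
```

Because the features are computed per line, a token such as a bare number inherits the phone or fax context of the line it sits on.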

Fields Identification - Method
Dictionaries
– ParsCit dictionary: detects male names, female names, popular last names, month names, place names, and publisher names; each is a single feature
– Major dictionary: helps in identifying researchers' majors in their educational history; may also help with Research Interests
– Research dictionary: entries classified into high/mid/low confidence
– Universities dictionary: names of most universities, according to the Open Directory

Fields Identification - Method
Stylistic features
– List feature
– Table features
– Section feature: based on HTML tags, including header tags, list elements, and table tags
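One way such stylistic features could be derived is to walk the HTML and record, for each text chunk, whether it sits inside a list or a table and which header precedes it. The sketch below uses Python's standard `html.parser`; the class name, the flag names, and the rule "the most recent header names the section" are assumptions, not the original implementation.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StyleTagger(HTMLParser):
    """Collects text chunks with list/table/section flags (illustrative)."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0
        self.table_depth = 0
        self.in_header = False
        self.section = None  # text of the most recent header, if any
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag == "table":
            self.table_depth += 1
        elif tag in HEADER_TAGS:
            self.in_header = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.list_depth = max(0, self.list_depth - 1)
        elif tag == "table":
            self.table_depth = max(0, self.table_depth - 1)
        elif tag in HEADER_TAGS:
            self.in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            self.section = text  # header text starts a new section
            return
        self.chunks.append({
            "text": text,
            "in_list": self.list_depth > 0,
            "in_table": self.table_depth > 0,
            "section": self.section,
        })


def stylistic_features(html):
    tagger = StyleTagger()
    tagger.feed(html)
    tagger.close()
    return tagger.chunks
```

This stays text-based and needs no browser, which matches the drawback noted for VIPS, at the cost of ignoring visual cues such as font size.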

Fields Identification - Performance
Data set of 40 home pages, cross validation.
Overall: accuracy 80.09%; precision 80.09%; recall 80.09%

Field               Precision   Recall    FB1
address             78.90%      74.57%
affiliation         30.27%      59.47%
bs-major            88.89%      78.05%
bs-uni              68.67%      57.00%
bs-year             90.00%      72.00%
email               79.31%      70.77%
fax                 47.73%      72.41%
misc                85.23%      92.35%
ms-major            71.43%      32.26%
ms-uni              52.94%      52.94%
ms-year             77.78%      56.00%
name                75.66%      51.34%
phd-major           83.33%      73.17%
phd-uni             74.56%      72.03%
phd-year                        74.07%
phone               53.38%      89.25%
position            79.46%      64.49%
publications        71.05%      43.27%
research-interest   48.48%      36.04%
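FB1 is the F-measure with beta = 1, that is, the harmonic mean of precision and recall. The helper below is a sketch for recomputing it from a precision/recall pair; the function name is hypothetical, and any value it returns is a recomputation from rounded percentages, not a figure taken from the slides.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure; with beta=1 this is the harmonic mean of the two inputs."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both inputs are zero, so the score is defined as zero
    return (1 + b2) * precision * recall / denom


if __name__ == "__main__":
    # Precision and recall reported for the 'address' field.
    print(round(f_beta(78.90, 74.57), 2))
```

For the address field this gives approximately 76.67. When precision and recall are equal, as in the overall figures, the F-measure equals that same value.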

Fields Identification - Discussion
The data fields to be annotated are similar to those of ArnetMiner.
– Extra: Name, Research Areas, Publications
– Missing: Image
Use of stylistic features is minimal.

Fields Identification - Discussion
The F1 value is slightly lower than ArnetMiner's.
– ArnetMiner has the researcher name as input, and uses features referring to the researcher name to identify other fields. TIM has no prior knowledge about the page to be parsed.
– Identifying 'Research Interest' and 'Publications' is the most challenging: they are not always present and, when present, appear in various styles.

Home page Identification - Purpose
Add-on component.
Completes the automation of the system by finding home pages to feed into the Fields Identification component.

Home page Identification – Related works
Ahoy!
– Input: researcher name and institution name (optional)
– Uses MetaCrawler as a 'reference source', cross-filtered by database
– Heuristic-based filter: based entirely on the reference's title, URL, and short textual extract (if supplied by the search engine)
– Ranking: based on 1/ person name match, 2/ institution URL match, 3/ page appears to be a home page
– URL Pattern Extraction and Generation: extract and learn the pattern on success, else generate a URL from the database of URL patterns

Home page Identification – Related works
Ahoy!
– Dynamic search, high performance reported; the use of URL patterns is a good feature
– Does not serve the same purpose as TIM's Home page Identification, which should not take the researcher name as input
– The definition of 'home page' is not the same: Ahoy! classifies based on URL patterns, TIM classifies based on page contents

Home page Identification – Method
– Collect a list of university domains
– Use Yahoo! BOSS to search for professors in the institutions
– For each valid web page, fetch the page and scan for words indicating 'phone', 'mail' and 'professor'. Count the number of appearances.
  #phone < 3 && #mail < 2 && #professor < 5 → Home page
– Home pages are passed to the Fields Identification component.
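The counting rule above can be sketched in a few lines. The thresholds come from the slide; how "words indicating" each concept are matched is not specified, so this sketch assumes a case-insensitive substring count of the three literal words, and both function names are hypothetical.

```python
import re


def count_word(text, word):
    """Case-insensitive count of `word` as a substring (an assumption)."""
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))


def looks_like_home_page(page_text):
    """Apply the slide's rule: few contact words suggest a single person's page."""
    return (
        count_word(page_text, "phone") < 3
        and count_word(page_text, "mail") < 2
        and count_word(page_text, "professor") < 5
    )


if __name__ == "__main__":
    personal = "Professor Jane Doe. Phone: 123. Mail: jane"
    directory = "Professor A, phone, mail\n" * 5
    print(looks_like_home_page(personal), looks_like_home_page(directory))
```

The rule only sets upper bounds, so it separates a personal page from a staff directory that repeats contact details many times; an empty page would also pass, which is one reason the heuristic is listed for improvement.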

Home page Identification – Discussion
The query to Yahoo! BOSS is not optimal, but it covers the majority of cases.
Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher's home page → treated as 2 different records.
– Deduplication needs high confidence in overall system performance, but researcher names are not unique.
– It would be best to eliminate duplication by analyzing URLs, but domain hierarchies differ within a department, between departments, and between institutions.

Post-processing - Purpose
Input: CRF++ output file from Fields Identification.
– Group neighboring tokens identified with the same annotation tag
– Deduplication
– Store into database
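The grouping step can be sketched as follows, assuming the tagger output has one token per line with whitespace-separated columns, the token first and the predicted tag last. The function names and the sample tags are illustrative; deduplication and database storage are not shown.

```python
from itertools import groupby


def parse_tagged_lines(lines):
    """Read (token, predicted_tag) pairs; blank or short lines are skipped."""
    pairs = []
    for line in lines:
        cols = line.split()
        if len(cols) >= 2:
            pairs.append((cols[0], cols[-1]))
    return pairs


def group_tokens(tagged_tokens):
    """Merge runs of adjacent tokens that share a tag into (tag, phrase) pairs."""
    return [
        (tag, " ".join(token for token, _ in run))
        for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    sample = ["Jane name", "Doe name", "Assistant position", "Professor position"]
    print(group_tokens(parse_tagged_lines(sample)))
```

Only adjacent tokens are merged, so the same tag appearing in two separate places on a page yields two phrases, which the later deduplication step would have to reconcile.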

Contribution
– Produced an automated system for fetching researchers' information from the World Wide Web.
– Introduced a number of features for Fields Identification machine learning.

Future improvements
Fields Identification
– Introduce more features, especially stylistic features
– Strengthen features targeting the Name, Research Interest and Publications tags
– Cater for the tag
– Be able to handle pages using HTML frames
– Be able to follow links on the page if necessary
Home page Identification
– Improve heuristics
Post-processing
– Be able to refine output from Fields Identification
A new component providing a front end for users to query the database

Thank you! Questions?