1
Harvesting useful information on researchers' home pages
Ta Nha Linh
Supervisor: Asst. Prof. Min-Yen Kan
TIM, 13 March 2009
2
Outline
– Motivation
– Challenges
– Researchers Information COllector (RICO)
– Contributions
– Future Works
4
Motivation
Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink
These databases are publication-centric. What about the authors of those publications?
5
Motivation
Researcher-centric databases?
– Singapore Researchers Database: researchers must sign up and enter their own data; restricted conditions; Singapore only
– Resilience Alliance Researchers Database: manual submission by researchers; ecological and social sciences only
– Some other similar databases: manually updated, specific to a certain organization
6
Goal: an automated system to build a researcher database covering multiple disciplines
Input: researchers’ home pages
– Basic information
– Contact information
– Educational history
– Publications
8
Challenges
Different layouts
– Templates
– Personal pages
Different content
– Pages introducing researchers
– CV-like
– Personal pages
Different content structures
– Tables / lists
– Natural language text
12
Challenges
Different data presentations
– hangli at microsoft dot com
– cs.duke.edu, junyang
– ASJMZheng@ntu.edu.sg
– erafalin(at)cs.tufts.edu
– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
– wmt then the at-sign then uci dot edu
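To make the obfuscation problem concrete, here is a minimal sketch of a normalizer that handles most of the address styles listed on this slide. It is not part of RICO as presented; the function name `normalize_email` and the pattern list are assumptions made for illustration. The reversed form "cs.duke.edu, junyang" is deliberately left unhandled and returns `None`.

```python
import re
from typing import Optional

# Hypothetical patterns derived only from the examples on the slide,
# ordered from most specific to least specific.
AT_PATTERNS = [
    re.compile(r"\s+then the at-sign then\s+", re.I),
    re.compile(r"\s*-[^-@]*\bat symbol\b[^-@]*-\s*", re.I),
    re.compile(r"\s*[(\[]\s*at\s*[)\]]\s*", re.I),
    re.compile(r"\s+at\s+", re.I),
]
DOT_PATTERN = re.compile(r"\s+dot\s+", re.I)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_email(text: str) -> Optional[str]:
    """Return a plain address, or None if the text cannot be normalized."""
    candidate = text.strip()
    if "@" not in candidate:
        # Replace the first obfuscated "at" marker that matches.
        for pattern in AT_PATTERNS:
            candidate, replaced = pattern.subn("@", candidate, count=1)
            if replaced:
                break
    candidate = DOT_PATTERN.sub(".", candidate)
    return candidate if EMAIL.fullmatch(candidate) else None


if __name__ == "__main__":
    for raw in ["hangli at microsoft dot com", "erafalin(at)cs.tufts.edu"]:
        print(raw, "->", normalize_email(raw))
```

A rule list like this only covers patterns seen so far, which is one reason a learned tagger is attractive for the Email field.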
14
Researchers Information COllector (RICO)
– Field Identification
– Home page Identification
– Post-Processing
15
RICO - Architecture
– Home page Identification
– Field Identification
– Post-Processing
17
Field Identification - Purpose
To map data in the page contents to the corresponding fields in a predefined set of desired information.
Current set includes:
– Name, Position, Affiliation
– Address, Phone, Fax, Email
– BS year, BS major, BS university
– MS year, MS major, MS university
– PhD year, PhD major, PhD university
– Research Interest, Publications
18
Field Identification - Related works
Tang et al. (2007, 2008): ArnetMiner
– Preprocessing: tokenize text into 5 categories
– Tagging of tokens using a Conditional Random Field (CRF)
– F1 = 83.37% (~1,000 researchers)
– Set of features used:
  + Content features (word, morphological, image features)
  + Pattern features (positive word, special token, researcher name features)
  + Term features (term, dictionary features)
19
Field Identification - Related works
Tang et al. (2007, 2008): ArnetMiner
– Takes the researcher’s name as input. This is important information that can be exploited when parsing other fields. Different from TIM.
– Based only on the text of the page. Stylistic information could also be of use.
20
Field Identification - Methodology
Input: a researcher home page
CRF is the learning model
Features used
– Global features
– Lexicon features
– Context features
– Dictionary features
– Stylistic features
21
Field Identification - Methodology
Global features: apply to the current token
– Morphological features
– Initials
– Number
– Punctuation
Lexicon features: apply to the current token
– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
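As an illustration of the token-level global features, the sketch below computes a few of them for a single token. The slides do not give exact definitions, so the feature names and tests here (for example, treating "J." as an initial) are assumptions.

```python
import string


def global_features(token: str) -> dict:
    """Token-level features; names and definitions are illustrative assumptions."""
    return {
        # Morphological cues based on letter case.
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # An initial such as "J.": one uppercase letter followed by a period.
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        # Numeric cues, useful for years, phone and fax numbers.
        "has_digit": any(ch.isdigit() for ch in token),
        "is_number": token.isdigit(),
        # True only when every character is ASCII punctuation.
        "is_punctuation": bool(token) and all(ch in string.punctuation for ch in token),
    }


if __name__ == "__main__":
    for tok in ["J.", "Kan", "2009", ","]:
        print(tok, global_features(tok))
```

In a CRF toolkit each of these would typically become one feature column per token.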
22
Field Identification - Methodology
Context features: apply to the whole line
– Name context
– Address context
– Phone context: 'phone', 'tel', 'mobile'
– Fax context: 'fax', 'facsimile'
– Email context: 'email', 'e-mail'
– Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'
– Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'
– Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'
– Research-interest context: multi-line property
– Publication context: multi-line property
– Degree: helps to differentiate BS/MS/PhD information correctly when it is presented in prose or on the same line
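The keyword-based context features can be sketched as line-level flags. The keywords come from the slide; the regular expressions, the word-boundary matching, and the name `context_features` are assumptions, and the Name, Address, Research-interest and Publication contexts are omitted because the slide does not say how they are detected.

```python
import re

# One pattern per context; keywords are from the slide, matching rules are assumed.
CONTEXT_PATTERNS = {
    "phone": re.compile(r"\b(?:phone|tel|mobile)\b", re.I),
    "fax": re.compile(r"\b(?:fax|facsimile)\b", re.I),
    "email": re.compile(r"\be-?mail\b", re.I),
    # Degree abbreviations are matched case-sensitively to avoid false hits.
    "bs": re.compile(r"\b(?:B\.S\b|BS\b|Bachelor)"),
    "ms": re.compile(r"\b(?:M\.S\b|MS\b|Master)"),
    "phd": re.compile(r"\b(?:Ph\.D|Doctorate|Doctor of Philosophy)"),
}


def context_features(line: str) -> dict:
    """Return one boolean flag per context for a whole line of page text."""
    return {name: bool(pattern.search(line)) for name, pattern in CONTEXT_PATTERNS.items()}


if __name__ == "__main__":
    print(context_features("Tel: +65 1234 5678, Fax: +65 8765 4321"))
    print(context_features("Ph.D in Computer Science, 1995"))
```

Every token on a line would share that line's flags, which is how a line-level property reaches a token-level tagger.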
23
Field Identification - Methodology
Dictionaries
– ParsCit dictionary: detects male names, female names, popular last names, month names, place names and publisher names; each is a single feature
– Major dictionary: helps to identify researchers' majors in their educational history; may also help with Research Interests
– Research dictionary: entries classified into high/mid/low confidence
– Universities dictionary: names of most universities, according to Open Directory
24
Field Identification - Methodology
Stylistic features
– List feature
– Table features
– Section feature: based on HTML tags such as header tags, list elements and tables
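One possible way to derive such stylistic features is to track which HTML elements are open when each piece of text is read. The sketch below uses Python's standard `html.parser`; the class name `StyleTagger`, the tag groupings and the feature names are assumptions, not RICO's actual implementation.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "li"}
TABLE_TAGS = {"table", "tr", "td", "th"}
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StyleTagger(HTMLParser):
    """Collect (text, features) pairs, where features describe the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching tag to tolerate unclosed elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = set(self.open_tags)
            self.chunks.append((text, {
                "in_list": bool(enclosing & LIST_TAGS),
                "in_table": bool(enclosing & TABLE_TAGS),
                "in_header": bool(enclosing & HEADER_TAGS),
            }))


if __name__ == "__main__":
    tagger = StyleTagger()
    tagger.feed("<h1>Jane Doe</h1><ul><li>Data mining</li></ul><p>Hello</p>")
    for text, features in tagger.chunks:
        print(text, features)
```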
25
Field Identification - Performance
Data set of 40 home pages, cross-validation
Overall: Precision 70.66 – Recall 62.73 – F1 64.87

Class              Precision (%)   Recall (%)   F1 (%)
name               75.66           51.34        61.17
phone              53.38           89.25        66.80
fax                47.73           72.41        57.53
email              79.31           70.77        74.80
address            78.90           74.57        76.67
affiliation        30.27           59.47        40.12
position           79.46           64.49        71.20
research-interest  48.48           36.04        41.34
publications       71.05           43.27        53.79
bs-major           88.89           78.05        83.12
bs-uni             68.67           57.00        62.30
bs-year            90.00           72.00        80.00
ms-major           71.43           32.26        44.44
ms-uni             52.94           52.94        52.94
ms-year            77.78           56.00        65.12
phd-major          83.33           73.17        77.92
phd-uni            74.56           72.03        73.28
phd-year           100.00          74.07        85.11
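The per-class F1 values are consistent with the usual harmonic mean of precision and recall, and the overall F1 looks like an unweighted (macro) average over the 18 classes. The slides do not state how the overall figures were computed, so treat the macro-average reading as an inference. A short check, with `f1_score` as a hypothetical helper name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for a few classes, copied from the slide.
SAMPLE_ROWS = {
    "name": (75.66, 51.34, 61.17),
    "phone": (53.38, 89.25, 66.80),
    "email": (79.31, 70.77, 74.80),
    "phd-year": (100.00, 74.07, 85.11),
}

# Reported F1 for all 18 classes, in the order they appear on the slide.
REPORTED_F1 = [
    61.17, 66.80, 57.53, 74.80, 76.67, 40.12, 71.20, 41.34, 53.79,
    83.12, 62.30, 80.00, 44.44, 52.94, 65.12, 77.92, 73.28, 85.11,
]

# Unweighted mean over classes (macro average).
macro_f1 = sum(REPORTED_F1) / len(REPORTED_F1)

if __name__ == "__main__":
    for name, (p, r, reported) in SAMPLE_ROWS.items():
        print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
    print(f"macro-averaged F1: {macro_f1:.2f}")
```

Small differences in the last digit are expected because the slide reports precision and recall already rounded.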
26
Field Identification - Discussion
Data fields to be annotated are similar to those of ArnetMiner.
– Extra: Name, Research Areas, Publications
– Missing: Image
Use of stylistic features is minimal
27
Field Identification - Discussion
F1 value is significantly lower than ArnetMiner’s
– ArnetMiner has the researcher name as input, and uses features referring to the researcher name to identify other fields. RICO has no prior knowledge about the page to be parsed.
  + Heuristic to improve confidence of ‘Name’
  + Make use of Affiliation name input
– Identifying ‘Research Interest’ and ‘Publications’ is challenging.
  + Improve ‘Publications’
29
Home page Identification - Purpose
– Add-on component
– To complete automation of the system
30
Home page Identification – Related works
Ahoy!
– Input: Researcher name and (optional) institution name
– “Home page”: allocated page, classified by URL patterns
RICO
– Input: Institution name
– “Home page”: allocated page with biographical information, classified by contents
31
Home page Identification – Methodology
– Collect a list of university domains
– Use Yahoo! BOSS to search for professors in the institutions
– For each valid web page, fetch the page and scan for words indicating ‘phone’, ‘mail’ and ‘professor’
– Classify by the number of keyword occurrences
– Home pages are passed to the Fields Identification component
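The keyword-counting step can be sketched as follows. The three keywords come from the slide; the substring matching, the threshold `min_hits` and the function names are assumptions, since the slides do not give the actual decision rule. Fetching pages and querying a search API are left out.

```python
KEYWORDS = ("phone", "mail", "professor")


def keyword_counts(page_text: str) -> dict:
    """Count case-insensitive substring occurrences of each keyword."""
    lowered = page_text.lower()
    return {keyword: lowered.count(keyword) for keyword in KEYWORDS}


def looks_like_home_page(page_text: str, min_hits: int = 3) -> bool:
    """Assumed rule: accept the page when the total keyword count reaches min_hits."""
    return sum(keyword_counts(page_text).values()) >= min_hits


if __name__ == "__main__":
    sample = "Professor Jane Doe. Phone: 123. Email: jane@example.edu"
    print(keyword_counts(sample), looks_like_home_page(sample))
```

Substring matching means ‘mail’ also counts "Email" and "e-mail", which seems to be the intent of scanning for words indicating each concept.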
32
Home page Identification – Discussion
– The query used cannot retrieve all relevant pages. Tuned for the majority: professors in institutions.
  + Target researchers in research organizations.
– Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page, which are treated as 2 different records.
  + Need high confidence in overall system performance. But researcher names are not unique.
  + Best if duplication can be eliminated by analyzing URLs. But domain hierarchies differ within a department, between departments, and between institutions.
34
Post-processing - Purpose
– Input: CRF++ output file from Fields Identification
– Group neighboring tokens identified with the same annotation tag
– Deduplication
– Store into database (current size ~ 170,000 researchers)
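The grouping step can be sketched with `itertools.groupby`. This assumes the usual CRF++ test output layout: one token per line, whitespace-separated columns with the predicted tag last, and a blank line between sequences. The tag names, the "O" tag for unlabelled tokens and the function name are hypothetical.

```python
from itertools import groupby


def group_tagged_tokens(crf_output: str, outside_tag: str = "O") -> list:
    """Merge runs of neighboring tokens that share a tag into (tag, text) fields."""
    records = []
    lines = crf_output.splitlines()
    # Blank lines separate sequences (here: one sequence per page or block).
    for is_blank, block in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            continue
        rows = [line.split() for line in block]
        fields = []
        # First column is the token, last column is the predicted tag.
        for tag, run in groupby(rows, key=lambda cols: cols[-1]):
            if tag != outside_tag:
                fields.append((tag, " ".join(cols[0] for cols in run)))
        records.append(fields)
    return records


if __name__ == "__main__":
    sample = "Min-Yen\tname\nKan\tname\nAssistant\tposition\nProfessor\tposition\nat\tO\n"
    print(group_tagged_tokens(sample))
```

Deduplication and database storage would then operate on these grouped fields rather than on individual tokens.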
36
Contribution
– Produced an automated system for fetching researchers’ information from the World Wide Web.
– Introduced a number of features for Field Identification machine learning.
38
Future improvements
Field Identification
– Introduce more features, especially stylistic features
– Strengthen features targeting the Name, Research Interest and Publications tags
– Cater for the tag
– Handle pages that use HTML frames
– Follow links on the page if necessary
Home page Identification
– Improve heuristics
Post-processing
– Refine the output from Fields Identification
A new component providing a front end for users to query the database
39
THANK YOU! Questions?