Published by Adam Hopper. Modified over 9 years ago.
1
Linh
Harvesting useful data from researchers’ homepages
2
15-Aug-08
Outline
  Researchers’ homepages
  Challenges
  Related work
3
Researchers’ homepages
  Lots of useful information about the researchers themselves
    Basic information
    Contact information
    Educational history
    Publications
4
Challenges
  Different layouts
    Templates
    Personal pages
  Different content
    Pages introducing researchers
    CV-like
    Personal pages
  Different content structures
    Tables / lists
    Natural language text
5
Challenges
  Different data presentations
    hangli at microsoft dot com
    cs.duke.edu, junyang
    ASJMZheng@ntu.edu.sg
    erafalin(at)cs.tufts.edu
    Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
    wmt then the at-sign then uci dot edu
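The obfuscated addresses on this slide suggest why a single regular expression is not enough. The following is a minimal sketch, not part of the original talk, of a rule-based normalizer. The rule list `_RULES`, the validity pattern `_EMAIL`, and the function name `normalize_email` are hypothetical and were written by hand to cover only the spellings shown on the slide. A form such as "cs.duke.edu, junyang" is deliberately left unresolved.

```python
import re

# Hypothetical rewrite rules, tried in order. Each pair is (pattern, replacement).
# The long, specific phrases come first so that the generic " at " rule
# does not fire inside them.
_RULES = [
    (re.compile(r"\s*-\s*replace all this by at symbol\s*-\s*", re.I), "@"),
    (re.compile(r"\s+then the at-sign then\s+", re.I), "@"),
    (re.compile(r"\s*\(at\)\s*", re.I), "@"),
    (re.compile(r"\s+at\s+", re.I), "@"),
    (re.compile(r"\s+dot\s+", re.I), "."),
]

# Simplified shape check for the result; this is not a full address validator.
_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def normalize_email(text):
    """Return a plain address, or None if the rules cannot produce one."""
    candidate = text.strip()
    for pattern, replacement in _RULES:
        candidate = pattern.sub(replacement, candidate)
    return candidate if _EMAIL.match(candidate) else None


if __name__ == "__main__":
    samples = [
        "hangli at microsoft dot com",
        "cs.duke.edu, junyang",
        "ASJMZheng@ntu.edu.sg",
        "erafalin(at)cs.tufts.edu",
        "Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk",
        "wmt then the at-sign then uci dot edu",
    ]
    for sample in samples:
        print(repr(sample), "->", normalize_email(sample))
```

Hand-written rules like these break as soon as a page uses a new spelling, which is one motivation for the learned extractors discussed under related work.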
6
Related work - Tang et al. (2008)
  Tang et al. (2008) - ArnetMiner
    Separate text into tokens (5 token types)
    Assign possible tags to each token (CRF)
    Extract profile properties (Amilcare tool and SVM)
    F1 = 83.37% (1,000 researchers)
  Name disambiguation: may be simpler in our case
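For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the standard computation. The function name `f1_score` and the counts in the usage lines are hypothetical and are not taken from Tang et al.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute F1 from raw extraction counts (standard definition)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(f1_score(8, 2, 2), 4))
```

Because F1 penalizes an imbalance between precision and recall, a high score means that few profile fields were missed and few were extracted incorrectly.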
7
Related work - Cai et al. (2003)
  Cai et al. (2003) - Visual-based content structure extraction
    Independent of the underlying document representation
    Vision-based Page Segmentation (VIPS)
      Combines DOM structure and visual cues (tag, color, text, size)
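To give a feel for the tag cue alone, here is a much reduced, DOM-only stand-in for page segmentation. It is not VIPS: it ignores color, size, and position and only splits text wherever a block-level tag opens or closes. The class `BlockSegmenter`, the function `segment`, and the tag set `BLOCK_TAGS` are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags start or end a visual block. Real VIPS also uses
# rendered cues such as color, font size, and position.
BLOCK_TAGS = {"div", "p", "table", "tr", "ul", "ol", "li", "h1", "h2", "h3", "hr"}


class BlockSegmenter(HTMLParser):
    """Collect page text into blocks delimited by block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished text blocks, in document order
        self._buffer = []  # text fragments of the block being built

    def _flush(self):
        # Join fragments and collapse runs of whitespace.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def segment(html):
    """Return the list of text blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text after the last block tag
    return parser.blocks


if __name__ == "__main__":
    page = (
        "<h1>Jane Doe</h1>"
        "<p>Email: jane at example dot org</p>"
        "<ul><li>Paper A</li><li>Paper B</li></ul>"
    )
    for block in segment(page):
        print(block)
```

A tag-only split like this fails on pages that build their layout from nested tables or styling, which is the gap the visual cues are meant to close.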
8
Related work - Cai et al. (2003)
9
Related work - Cai et al. (2003)
  Strengths
    Domain independent
    Layout independent
    No training data required
    Good results in the evaluation report (97% of pages correctly detected)
  Applicability
    Can be used to improve the speed and correctness of retrieval
    Different levels of complexity in homepage layouts
10
References
  J. Tang, D. Zhang, and L. Yao (2007). Social network extraction of academic researchers. In Proc. of ICDM 2007, pp. 292-301.
  D. Cai, S. Yu, J.R. Wen, and W.Y. Ma (2003). Extracting content structure for web pages based on visual representation. In the 5th APWC, pp. 406-417.
  C.H. Lee (2004). PARCELS: PARser for Content Extraction and Logical Structure (Stylistic detection). Honours Thesis, School of Computing, NUS.
  J. Chen and K. Xiao (2008). Perception-oriented online news extraction. In JCDL 2008, p. 363.
  Amilcare webpage - http://nlp.shef.ac.uk/amilcare/amilcare.html
  Wikipedia webpage - http://en.wikipedia.org
  W3Schools webpage - http://www.w3schools.com/default.asp
11
Linh