Linh Harvesting useful data from researchers’ homepages.


15-Aug-08 Outline
- Researchers’ homepages
- Challenges
- Related works

Researchers’ homepages
- Lots of useful information about the researchers themselves
  - Basic information
  - Contact information
  - Educational history
  - Publications

Challenges
- Different layouts
  - Templates
  - Personal pages
- Different content
  - Pages introducing researchers
  - CV-like
  - Personal pages
- Different content structures
  - Tables / lists
  - Natural language text

Challenges
- Different data presentations
  - hangli at microsoft dot com
  - cs.duke.edu, junyang
  - ASJMZheng@ntu.edu.sg
  - erafalin(at)cs.tufts.edu
  - Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
  - wmt then the at-sign then uci dot edu
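The email variants above can be normalized with a handful of rewrite rules. The following is a minimal sketch, not part of the original slides: the function name `normalize_email` and the rule list are hypothetical and cover only the six patterns shown, so real homepages would need more rules or a learned model.

```python
import re

# Hypothetical rules derived only from the six examples on the slide.
# More specific patterns come first so that a bare " at " is tried last.
AT_PATTERNS = [
    r"\s*-\s*replace all this by at symbol\s*-\s*",
    r"\s+then the at-sign then\s+",
    r"\s*\(at\)\s*",
    r"\s+at\s+",
]
DOT_PATTERN = r"\s+dot\s+"
EMAIL_SHAPE = r"[\w.+-]+@[\w-]+(\.[\w-]+)+"


def normalize_email(text):
    """Return a plain user@domain string, or None if no rule applies."""
    s = text.strip()

    # "domain, user" form, e.g. "cs.duke.edu, junyang"
    m = re.fullmatch(r"([\w.-]+\.[a-z]{2,}),\s*([\w.-]+)", s, flags=re.IGNORECASE)
    if m:
        return f"{m.group(2)}@{m.group(1)}"

    # Replace the first spelled-out "at" marker that matches.
    for pattern in AT_PATTERNS:
        s, replaced = re.subn(pattern, "@", s, count=1, flags=re.IGNORECASE)
        if replaced:
            break

    # Replace every spelled-out "dot".
    s = re.sub(DOT_PATTERN, ".", s, flags=re.IGNORECASE)

    # Accept the result only if it now looks like an address.
    return s if re.fullmatch(EMAIL_SHAPE, s) else None
```

For example, `normalize_email("hangli at microsoft dot com")` yields `hangli@microsoft.com`, and text that matches none of the rules yields `None`.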

Related works - Tang et al. (2008)
- Tang et al. (2008) - ArnetMiner
  - Separate text into tokens (5 token types)
  - Assign possible tags to each token (CRF)
  - Extract profile properties (Amilcare tool and SVM)
  - F1 = 83.37% (1,000 researchers)
- Name disambiguation: may be simpler in our case

Related works - Cai et al. (2003)
- Cai et al. (2003) - Visual-based content structure extraction
  - Independent of the underlying documentation presentation
  - Visual-based Page Segmentation (VIPS)
  - Combines DOM structure and visual cues (tag, color, text, size)
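To give a feel for block segmentation, here is a much-simplified sketch that is not VIPS itself: it uses only the tag cue from the DOM and ignores color, text, and size cues, which would require rendering the page. The class name `BlockSegmenter`, the function `segment`, and the `BLOCK_TAGS` set are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block separators.
# VIPS itself relies on richer visual cues than tag names alone.
BLOCK_TAGS = {"div", "p", "table", "tr", "ul", "ol", "li",
              "h1", "h2", "h3", "hr", "br"}


class BlockSegmenter(HTMLParser):
    """Collect the text found between block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty blocks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def segment(html):
    """Return the list of text blocks in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last tag
    return parser.blocks
```

On a homepage fragment such as `<h1>Jane Doe</h1><p>Office 12</p>`, `segment` returns one text block per heading or paragraph, which later extraction steps could process separately.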


Related works - Cai et al. (2003)
- Strengths
  - Domain independent
  - Layout independent
  - No training data required
  - Good results in evaluation report (97% of pages correctly detected)
- Applicability
  - Can be used to improve speed and correctness of the retrieval
  - Different levels of complexity in homepage layouts

References
- J. Tang, D. Zhang, and L. Yao (2007). Social network extraction of academic researchers. In Proc. of ICDM 2007, pp. 292-301.
- D. Cai, S. Yu, J.R. Wen, and W.Y. Ma (2003). Extracting content structure for web pages based on visual representation. In the 5th APWC, pp. 406-417.
- C.H. Lee (2004). PARCELS: PARser for Content Extraction and Logical Structure (Stylistic detection). Honours Thesis, School of Computing, NUS.
- J. Chen and K. Xiao (2008). Perception-oriented online news extraction. In JCDL 2008, pp. 363.
- Amilcare webpage: http://nlp.shef.ac.uk/amilcare/amilcare.html
- Wikipedia webpage: http://en.wikipedia.org
- W3Schools webpage: http://www.w3schools.com/default.asp
