Linking Wikipedia to the Web. Antonio Flores Bernal, Department of Computer Sciences, San Pablo Catholic University, 2010
Authors: Rianne Kaptein, Pavel Serdyukov, Jaap Kamps
Linking Wikipedia to the Web: Introduction, External Link Detection, Conclusion.
Introduction: Wikipedia is a natural starting point for information on almost any topic. Where do we go if we want more information? Only 45% of all Wikipedia pages have an “External Links” section.
Can we automatically find external links for Wikipedia pages? The INEX Link-the-Wiki task consists of finding links between Wikipedia pages. The task here is to find links from Wikipedia pages to external web pages.
Clueweb category B collection; 2009 TREC entity ranking task.
External Link Detection. Task and test collection: given a topic (a Wikipedia page), return its external web pages. A topic set is created. The URLs of the external links are matched with the URLs in the Clueweb collection.
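The URL matching step above can be sketched as follows. This is a minimal illustration, not the procedure the authors used: the normalization rules (dropping the scheme, a leading "www." and a trailing slash) and the function names `normalize_url` and `match_external_links` are assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a simple canonical form (assumed rules, for illustration)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def match_external_links(external_links, collection_urls):
    """Return the external links whose normalized form occurs in the collection."""
    known = {normalize_url(u) for u in collection_urls}
    return [u for u in external_links if normalize_url(u) in known]


# Hypothetical data: two external links from a Wikipedia page, two collection URLs.
links = ["http://www.Example.org/", "https://example.com/about"]
collection = ["http://example.org", "http://other.net/page"]
matched = match_external_links(links, collection)
```

Only links that survive this matching can serve as relevance judgments, since a page outside the collection can never be retrieved.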
External Link Detection: External links on entity pages are split into two parts: a home page and informational pages.
Link Detection Approaches: There are three approaches. The baseline approach is a language model with a full-text index. The second uses an anchor text index, which has proved to work well for home page finding. The third approach exploits information from Delicious.
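To make the idea of an anchor text index concrete, here is a small sketch. It groups the anchor texts of incoming links by target URL and ranks targets by a plain term-match count; that count is a stand-in chosen for brevity, not the language model used in the work. The function names and the example link triples are hypothetical.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Collect anchor-text tokens per target URL.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(list)
    for _source, target, anchor in links:
        index[target].extend(anchor.lower().split())
    return index


def rank_by_anchor_text(query, index):
    """Rank targets by the number of anchor tokens that match a query term."""
    terms = set(query.lower().split())
    scored = []
    for target, tokens in index.items():
        score = sum(1 for tok in tokens if tok in terms)
        if score > 0:
            scored.append((score, target))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [target for _score, target in scored]


# Hypothetical link graph: two pages link to the home page, one to a news article.
links = [
    ("http://a.example", "http://acme.example", "Acme Corporation"),
    ("http://b.example", "http://acme.example", "acme home page"),
    ("http://c.example", "http://news.example/acme", "article about Acme"),
]
index = build_anchor_index(links)
ranking = rank_by_anchor_text("Acme Corporation", index)
```

The home page ranks first here because other pages tend to describe it with the entity name, even when the page itself contains little text.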
A search request is sent to Delicious, and the first 250 results are matched with the URLs in the Clueweb collection to create a ranking. Setup: Indri toolkit, Krovetz stemmer and Dirichlet document smoothing.
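Dirichlet document smoothing estimates the probability of a term w in a document D as (c(w, D) + mu * P(w | C)) / (|D| + mu), where C is the whole collection. The sketch below scores documents by query log-likelihood under that estimate. It is a toy illustration rather than Indri's implementation: the value of `mu`, the handling of terms unseen in the collection, and the example documents are assumptions, and no stemming is applied.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_counts, collection_length, mu=2500.0):
    """Query log-likelihood with Dirichlet smoothing (mu is an assumed setting)."""
    doc_counts = Counter(doc_terms)
    doc_length = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_counts.get(term, 0) / collection_length
        if p_collection == 0.0:
            continue  # Assumption: terms unseen in the collection are skipped.
        p_term = (doc_counts[term] + mu * p_collection) / (doc_length + mu)
        score += math.log(p_term)
    return score


# Hypothetical two-document collection.
docs = {
    "d1": "acme official home page".split(),
    "d2": "news about other company page".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_length = sum(collection_counts.values())

query = ["acme", "home"]
scores = {
    name: dirichlet_score(query, terms, collection_counts, collection_length)
    for name, terms in docs.items()
}
best = max(scores, key=scores.get)
```

Smoothing keeps the score finite for documents that lack a query term, so such documents are ranked lower instead of being excluded.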
Measures: Mean Reciprocal Rank (MRR) and Success at 5.
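Both measures are simple to compute from a ranked list per topic. The sketch below uses the standard definitions: reciprocal rank is 1 divided by the rank of the first relevant result (0 if none is found), and Success at 5 is the fraction of topics with at least one relevant result in the top 5. The function names and the toy runs are made up for illustration.

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, url in enumerate(ranking, start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranking, relevant_set) pairs, one per topic."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def success_at_k(runs, k=5):
    """Fraction of topics with at least one relevant item in the top k."""
    hits = sum(1 for r, rel in runs if any(u in rel for u in r[:k]))
    return hits / len(runs)


# Hypothetical results for four topics.
runs = [
    (["a", "b", "c"], {"b"}),                 # first relevant at rank 2
    (["x", "y", "z", "u", "v", "w"], {"w"}),  # first relevant at rank 6
    (["p"], {"p"}),                           # first relevant at rank 1
    (["q"], {"r"}),                           # nothing relevant retrieved
]
mrr = mean_reciprocal_rank(runs)
s_at_5 = success_at_k(runs, k=5)
```

Both measures suit home page finding, where each topic has very few relevant pages and only the position of the first one matters.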
Link Detection Results: The anchor text index leads to much better results than the full-text index. Modern home pages contain less relevant text.
Three causes for not finding a relevant page: the external link in Wikipedia isn't a home page; the home page is redirected; the Wikipedia title contains ambiguous words.
Using Delicious: it does not return results for all topics; long queries don't return any results; the results contain duplicate pages.
Conclusion: The anchor text index is a very effective method for retrieving home pages. Using Delicious on its own does not lead to very good results, but it does contain valuable information. This kind of system is effective at predicting the external links for Wikipedia pages.