Link Distribution on Wikipedia [0422] KwangHee Park



Table of contents
- Introduction
- Similarity between documents
- Error cases
- Modifying the word bag
- Conclusion

Introduction
- Why focus on links?
  - When someone creates a new article in Wikipedia, they usually start by linking to the other-language source or to similar and related articles. After that, the article is written by others.
- Assumption
  - The link terms in a Wikipedia article are the key terms that represent the specific characteristics of that article.

Introduction
- The problem we want to solve
  - Analyze the latent distribution of a set of target documents by topic modeling

Topic modeling: our approach
- Target
  - Document = Wikipedia article
  - Terms = linked terms in the document
- Modeling method
  - LDA
- Modeling tool
  - LingPipe API
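
As an illustration of "terms = linked terms", the sketch below collects the link targets of one article into a bag of terms. It is only a sketch: it assumes the article is available as raw wikitext with `[[Target]]` or `[[Target|label]]` links, and the names `WIKILINK` and `link_terms` are hypothetical. It is written in Python for brevity and does not show the LingPipe API used in the slides.

```python
import re
from collections import Counter

# Assumed input format: raw wikitext. Matches [[Target]], [[Target|label]]
# and [[Target#Section|label]]; only the target page name is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def link_terms(wikitext):
    """Return the bag (term -> count) of link targets in one article."""
    targets = (m.group(1).strip() for m in WIKILINK.finditer(wikitext))
    return Counter(t for t in targets if t)

# Hypothetical article text, for illustration only.
sample = (
    "[[Cancer]] is a [[disease]]. See [[Cancer|tumours]] "
    "and [[Cell (biology)#Structure|cells]]."
)
bag = link_terms(sample)
```

Each bag can then serve as one document in the LDA input, so no tokenization, stopword removal, or stemming step is needed.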

Advantages of linked terms
- No extra preprocessing is needed
  - Boundary detection
  - Stopword removal
  - Word stemming
- Linked terms carry more semantics
  - Correlation between term and document
  - Ex) "cancer" as a term corresponds to "Cancer" as a document

Preliminary problem
- How well do the link terms in a document represent the specific characteristics of that document?
- Link evaluation
  - Calculate the similarity between documents

Link evaluation
- Similarity-based evaluation
  - Calculate the similarity between documents: Sim_d{doc1, doc2}
  - Calculate the similarity between terms: Sim_t{term1, term2}
  - Compare the two similarities

Similarity between documents
- Sim_d
  - Similarity between documents
  - Significantly affected by the input term set
  - Based on the Kullback-Leibler divergence, where p, q = topic distributions of the two documents
- Data set
  - 1,536 documents
  - Disease domain: 208
  - Settlement domain: 1,328
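
A minimal sketch of the Kullback-Leibler divergence between two topic distributions p and q. The slide does not say whether the plain or a symmetrized form was used, so both are shown. The function names and the smoothing constant `eps` are assumptions added for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is an assumed smoothing constant that avoids log(0) and division
    by zero when a topic has zero probability in q.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so the order does not matter."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Hypothetical topic distributions of two documents over two topics.
p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_sym = symmetric_kl(p, q)
```

A smaller divergence means the two documents have more similar topic distributions, and a value of 0 means they are identical.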

Example – reasonable

Example – not good

Error analysis
- Length problem: the proportion of a topic is overestimated
  - If a document contains only a few link terms, the topic proportions of that document tend to be overestimated
  - Ex) 1950 (year), 1960 (year), Papua New Guinea, cannibalism
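
The length problem can be illustrated with the smoothed estimate commonly used to read topic proportions off per-document topic counts in LDA, (n_k + alpha) / (N + K * alpha). This is a sketch only: the counts and the value of `alpha` are made up, and the slides do not state which estimator or hyperparameters were used.

```python
def topic_proportions(topic_counts, alpha=0.1):
    """Smoothed topic proportions from per-topic term counts of one document.

    alpha is an assumed symmetric Dirichlet prior; the value is illustrative.
    """
    total = sum(topic_counts)
    k = len(topic_counts)
    return [(c + alpha) / (total + k * alpha) for c in topic_counts]

# Hypothetical documents over four topics.
short_doc = topic_proportions([2, 0, 0, 0])     # only two link terms
long_doc = topic_proportions([20, 10, 6, 4])    # forty link terms
```

With only two link terms, nearly all of the probability mass falls on one topic, while the longer document is spread across several topics. Filtering out documents below a minimum number of link terms is one possible mitigation.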

Error analysis
- The link terms of some documents do not describe the document itself
  - Ex) dates, countries, etc.

Demo website
- For the disease domain:
- For the settlement domain:
- For the disease + settlement domains:

Modify word bag
- Include non-link terms
- Exclude noise terms
- Use a weighted score for duplicated terms
- Include incoming links
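
The four modifications above could be combined roughly as follows. This is a sketch under stated assumptions: the function `build_word_bag`, its parameters, and the weights are hypothetical, and "weighted score for duplicated terms" is interpreted here as letting repeated occurrences accumulate weight.

```python
from collections import Counter

def build_word_bag(link_terms, body_terms=(), incoming_links=(),
                   noise_terms=frozenset(),
                   link_weight=2.0, body_weight=1.0, incoming_weight=1.0):
    """Build a weighted bag of terms for one article.

    All weights are illustrative assumptions, not values from the slides.
    """
    bag = Counter()
    for term in link_terms:        # outgoing link terms; repeats add up
        bag[term] += link_weight
    for term in body_terms:        # non-link terms from the article text
        bag[term] += body_weight
    for term in incoming_links:    # titles of articles that link here
        bag[term] += incoming_weight
    for term in noise_terms:       # e.g. dates and country names
        bag.pop(term, None)
    return bag

# Hypothetical article data.
bag = build_word_bag(
    link_terms=["Cancer", "Cancer", "1950"],
    body_terms=["tumour"],
    incoming_links=["Oncology"],
    noise_terms={"1950"},
)
```

Note that LDA implementations generally expect integer counts, so the weighted scores may need to be rounded or scaled before modeling.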

Conclusion
- Topic modeling with the link distribution in Wikipedia
- We need to measure how well the link distribution represents each article's characteristics
- After that, analyze the topic distribution in a variety of ways
- We expect the topic distribution to be applicable to many applications

Thank you