Link Distribution in Wikipedia [0324] KwangHee Park.

Presentation transcript:

Link Distribution in Wikipedia [0324] KwangHee Park

Table of contents
- Introduction
- Cluster using LDA
- Experiment
  - Disease, Settlement
- Demo
- Considering Application

Introduction  Why focused on Link  When someone make new article in Wikipedia, mostly they simply link to other language source or link to similar and related article. After that, that article to be wrote by others  Assumption  Link terms in the Wikipedia articles is the key terms which can represent specific characteristic of articles

Introduction  Problem what we want to solve is  To analyses latent distribution of set of Target document by Clustering of Link term set  Find the Tendency of latent distribution of specific Domain by limiting input document to specific Domain

Process  Terminology  Term set = all of terms in the input documents  Topic = Set of term  {W i,…,W n }  Document = Set of term  {W k,W l,…,W n }  Document = set of part of topic  {T n, T k,…,T m }  {Doc : 1 }  {T n : 0.4, T k : 0.3,… }  Clustering Term set  Find latent distribution of each Document  Group by domain

LDA  The clustering techniques  The LDA model consists of a fixed number of topics  Each topic is modeled as a distribution over words.  A document under LDA is modeled as a distribution over topics. Term Set Topic n Topic Topic 3 Topic 2 Topic 1 Doc 1 Doc2 Doc 3

Experiment  Domain :  Disease  #Doc : 208  #Link terms :  English : 46615, Espanola: 34560, French:, 31747Chinese:, 9286 Korean: 3272  Settlement  #Doc : 1328  #Link term :  English : , Espanola: , French:150921, Chinese:93227, Korean:  Number of Topic  10,20,30,40,50,75,100,125,150,175,200,225,250  Demo site 

Considering Application  Document Classification  Classify domain of target document by calculate similarity between topic distribution of document  Usage : Template recommendation,…  Domain characteristic # of appearance / # of total Doc Topic number Disease Settlement

Template recommendation  Starvation Trenton,_New_Jersey  Starvation  Disease  Trenton,_New_Jersey  Settlement

Thanks
