Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.

Presentation transcript:

Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology

Improving the performance of personal name disambiguation using web directories
Quang Minh Vu, Atsuhiro Takasu, Jun Adachi
IPM, 2008
Presented by Hung-Yi Cai, 2010/09/01

Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation
• Searching for information about a person on the internet is an increasingly common need in information retrieval.
• Search results returned by search engines for a personal name query often contain documents about several different people, because a name is usually shared by several people.
• Because of this name ambiguity problem, users have to inspect the result documents manually to filter out the people in whom they have no interest.

Previous Studies
(Each entry lists the approach, its method in detail, and its disadvantage.)

1) Co-reference with VSM (in news articles)
   Method: First extract the text relevant to a person in a document, then use the VSM to measure similarities between articles.
   Disadvantage: People on the web may have several appearances related to different events.

2) Second-order context vector
   Method: Apply the log-likelihood method together with singular value decomposition to co-occurrence information to calculate context vectors of terms.
   Disadvantage: This method may not work well for people who are not famous, and building the context vectors is difficult.

3) c-value/nc-value
   Method: Extract key phrases related to people, then send the key phrases as queries to search engines and build the key phrases' contexts from snippets of the resulting documents.
   Disadvantage: It requires many query transactions to build contexts for the key phrases.

4) Link information and A/CDC
   Method: One approach uses link information in web pages; the other uses the A/CDC algorithm to group together web pages that have the same topic.
   Disadvantage: When we search for a person on the web, we may not know his or her social network in advance.

5) Pattern matching and databases (DBLP and Amazon)
   Method: Extract personal profiles, or use databases such as DBLP and Amazon to extract authors' names and research keywords.
   Disadvantage: Extracting personal profiles may not work well with web pages other than profile pages, while a dictionary-like database cannot yield terms that are not listed in it.

6) Natural language processing
   Method: Extract named entities in documents.
   Disadvantage: Because web documents contain much noisy information, the extraction of named entities may not work well.

Objectives
• Propose Similarity via Knowledge Base (SKB), which uses web directories to improve the disambiguation performance of a Name Disambiguation System (NDS).
• SKB can be divided into two components:
  ─ Using web directories as a knowledge base to find common contexts in documents, with TF-IDF term weights.
  ─ Using the common-context measure to determine document similarities.

TF-IDF
• Term weights are calculated from the terms' occurrences in the document concerned and in a set of documents.
  ─ tf(t, doc) is the number of times term t appears in the document doc.
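The weighting formula itself does not survive in this transcript, so the following is only a minimal sketch of one common TF-IDF variant (raw term count times log-scaled inverse document frequency). The function names `tf`, `idf`, and `tf_idf` and the toy corpus are illustrative assumptions, not the authors' exact definition.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Number of times `term` appears in the tokenized document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Log-scaled inverse document frequency (assumed variant: log(N / df))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    """Weight of `term` in one document relative to the whole corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus of three tokenized documents (made up for illustration).
corpus = [
    ["database", "lab", "database"],
    ["web", "directory"],
    ["web", "database"],
]
weight = tf_idf("database", corpus[0], corpus)
```

A term that is frequent in one document but rare across the set receives a high weight; a term that never occurs in the set gets weight 0 in this sketch.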

Methodology
• In SKB, web directories are used to measure features of terms in a document.
  1) Measurement of term weights using a knowledge base
     ─ A knowledge base
     ─ Modification of term weights in documents
     ─ Modification of term weights in directories
  2) Measurement of document similarities
     ─ Find directories close in topic to the document
     ─ Measure document similarities

Name Disambiguation System
• The operational details are as follows:
  1) Preprocessing documents
  2) Calculation of document similarities
  3) Discrimination by reranking documents
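The three steps can be sketched as a small pipeline. This is a hedged illustration only: it assumes the user supplies one seed document about the person of interest, and it uses plain term-count cosine similarity as a stand-in for the SKB similarity, whose details are not given in this transcript. The names `preprocess`, `cosine`, and `rerank` are hypothetical.

```python
import math
import re
from collections import Counter

def preprocess(text):
    """Step 1: lowercase and keep alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Step 2: cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rerank(seed_text, candidate_texts):
    """Step 3: order candidates by decreasing similarity to the seed document."""
    seed = Counter(preprocess(seed_text))
    scored = [(cosine(seed, Counter(preprocess(t))), t) for t in candidate_texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Made-up search results for one ambiguous name.
candidates = [
    "Professor of databases at a university",
    "Jazz guitarist releases new album",
    "database research professor",
]
ranked = rerank("database professor research", candidates)
```

Documents about the same person as the seed should move toward the top of `ranked`, so the user no longer has to scan the full result list.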

Experiments
• Step 1. Data sets
  ─ Documents of people
  ─ Creation of pseudo namesake document sets and real namesake document sets

Experiments
• Step 2. Web directory structures

Experiments
• Step 3. Baseline methods
  ─ SKB is compared with two conventional methods:
    VSM:
      · Calculate the weights of terms by TF-IDF
      · Build the feature vectors of documents
    NER:
      · Extract the entity names in the documents with the LingPipe software
      · Use these names to construct feature vectors of the documents (the components of the vectors are binary values)
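As an illustration of the NER baseline's binary feature vectors, here is a minimal sketch. It assumes the entity names have already been extracted (the slides use LingPipe for that step, which is not reproduced here), and it assumes cosine similarity for comparing vectors, since the transcript does not name the measure. `binary_vector` and `binary_cosine` are hypothetical helpers.

```python
import math

def binary_vector(entities, vocabulary):
    """1 if the entity name occurs in the document, else 0, per vocabulary entry."""
    present = set(entities)
    return [1 if name in present else 0 for name in vocabulary]

def binary_cosine(u, v):
    """Cosine similarity between two equal-length 0/1 vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(u)) * math.sqrt(sum(v))  # for 0/1 entries, x*x == x
    return dot / norm if norm else 0.0

# Entity names assumed to be already extracted from two documents (made up).
doc_a = ["Yunlin", "IPM", "Tokyo"]
doc_b = ["Tokyo", "IPM", "Amazon"]
vocabulary = sorted(set(doc_a) | set(doc_b))

vec_a = binary_vector(doc_a, vocabulary)
vec_b = binary_vector(doc_b, vocabulary)
similarity = binary_cosine(vec_a, vec_b)
```

Because the components are binary, an entity mentioned many times counts the same as one mentioned once, unlike the TF-IDF weights of the VSM baseline.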

Experiments
• Step 4. Evaluation metrics
  ─ We recorded the precision values at 11 recall points (0%, 10%, ..., 90%, and 100%) and denoted these as P(doc_i, 0%), P(doc_i, 10%), ..., P(doc_i, 90%), and P(doc_i, 100%), respectively.
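The 11-point precision values for one query document can be computed as sketched below. The transcript does not say how precision is read off at each recall point; this sketch assumes the standard interpolated precision from information retrieval (the highest precision at any recall level at or above the point). The function name and the sample ranking are illustrative.

```python
def precision_at_recall_points(ranked_relevance, total_relevant):
    """Interpolated precision at recall 0%, 10%, ..., 100%.

    ranked_relevance: booleans in ranked order, True if the document at that
    rank is about the same person as the query document.
    total_relevant: number of documents about that person in the collection.
    """
    if total_relevant == 0:
        return [0.0] * 11

    # (recall, precision) observed after each rank position.
    observed = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        observed.append((hits / total_relevant, hits / rank))

    values = []
    for i in range(11):
        level = i / 10
        # Small tolerance guards against floating-point recall comparisons.
        reachable = [p for r, p in observed if r >= level - 1e-12]
        values.append(max(reachable) if reachable else 0.0)
    return values

# Made-up ranking: relevant documents at ranks 1 and 3, two relevant in total.
curve = precision_at_recall_points([True, False, True, False], total_relevant=2)
```

In this example precision is 1.0 up to 50% recall and 2/3 from 60% to 100% recall, because the second relevant document is only reached at rank 3.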

Experiments
• Step 5. Experimental results
  ─ The overall performance for each method
    In this experiment, we set the window size n = 50 and the number of representative directories k = 20. We set the frequency document ratio threshold for SKB2 to r = 5.

Experiments
• Step 5. Experimental results
  ─ Performance of SKB2 when varying the frequency ratio threshold

Experiments
• Step 5. Experimental results
  ─ Performance of SKB systems when varying the window size

Experiments
• Step 5. Experimental results
  ─ Performance of SKBs when varying the number of representative directories

Experiments
• Step 5. Experimental results
  ─ Performance for each method on real namesake document sets

Conclusions
• Disambiguation of people will be a trend in web search, and we propose a new method that uses web directories as a knowledge base to improve the disambiguation performance.
• The experimental results showed a significant improvement with our system over the other methods, and we also verified the robustness of our methods experimentally with different web directory structures and with different parameter values.

Comments
• Advantages
  ─ Requires little preparation
  ─ Applies to a broad range of people
• Disadvantages
  ─ Cost of computation is proportional
  ─ Some mistakes
• Applications
  ─ Information retrieval