Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. Source: Institute of Information Science, Academia Sinica, Taipei

Presentation transcript:

Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval
Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
Students: 陳道輝, 周鉦琪, 葉飛
Advisor: Prof. 黃三益

Abstract
PAT-tree-based adaptive approach
IR applications: automatic term suggestion, domain-specific lexicon construction, book indexing, and document classification

Introduction
Keyphrase (keyword) extraction in Chinese is a critical problem because word segmentation and unknown-word identification are difficult (e.g., 哈電族).

Definition of the Problems
Lexical pattern (LP): a string of more than one successive character that occurs a certain number of times in a text collection from a specific domain.
Example: 關鍵詞抽取 ("keyphrase extraction")
LPs: 關鍵, 鍵詞, 詞抽, 抽取, 關鍵詞, 鍵詞抽, 詞抽取, 關鍵詞抽, 鍵詞抽取, 關鍵詞抽取
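The LP list in the example can be reproduced by enumerating every contiguous substring of two or more characters. The sketch below does only that; the function name is hypothetical, and the occurrence-count condition in the definition is deliberately left out.

```python
from typing import List


def lexical_pattern_candidates(text: str, min_len: int = 2) -> List[str]:
    """Return every contiguous substring of `text` with at least `min_len` characters.

    Substrings are listed shortest first, and left to right within each length.
    The frequency condition from the LP definition is not applied here.
    """
    n = len(text)
    return [
        text[start:start + length]
        for length in range(min_len, n + 1)
        for start in range(n - length + 1)
    ]


candidates = lexical_pattern_candidates("關鍵詞抽取")
print(len(candidates))  # a 5-character string has 4 + 3 + 2 + 1 = 10 candidates
print(candidates)
```

A real extractor would then keep only the candidates whose frequency in the collection is high enough.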

Definition of the Problems (cont)
Complete lexical pattern (CLP): an LP that has a complete meaning and semantically complete lexical boundaries.
Example: 關鍵詞抽取
CLPs: 關鍵, 抽取, 關鍵詞, 關鍵詞抽取

Definition of the Problems (cont)
Significant lexical pattern (SLP): a CLP that is either "specific" or "significant" in the database.
Example: 關鍵詞抽取
SLPs: 關鍵詞, 關鍵詞抽取

Definition of the Problems (cont)
Definition 1: SLP extraction problem
Definition 2: CLP estimation problem
Because SLPs are selected from CLPs, problem 2 must be solved before problem 1.

Definition of the Problems (cont)
Proposed approach: 3 modules
– Text analysis and PAT-tree indexing module
– CLP extraction module
– SLP extraction module

Definition of the Problems (cont)

Estimation of CLP
Most CLPs have strong associations between their composed and overlapping substrings.
Association norm estimation (AE) function
If the AE of a pattern x is large, its substrings y and z occur together in the text collection in many cases.
Example: (關鍵詞抽取, 鍵詞抽取, 關鍵詞抽)
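The AE formula itself is not preserved in the transcript. The sketch below assumes the form AE(x) = f(x) / (f(y) + f(z) - f(x)), where f is frequency and y and z are the two longest substrings of x (x without its last character, and x without its first). That formula, the function name, and the counts in the example are all assumptions made for illustration.

```python
from typing import Dict


def association_estimate(x: str, freq: Dict[str, int]) -> float:
    """Assumed association norm: f(x) / (f(y) + f(z) - f(x)).

    y is x without its last character, z is x without its first character.
    Returns 0.0 when the denominator is not positive.
    """
    y, z = x[:-1], x[1:]
    fx = freq.get(x, 0)
    denominator = freq.get(y, 0) + freq.get(z, 0) - fx
    return fx / denominator if denominator > 0 else 0.0


# Hypothetical counts, not taken from the presentation.
counts = {"關鍵詞抽取": 8, "關鍵詞抽": 8, "鍵詞抽取": 10}
print(association_estimate("關鍵詞抽取", counts))
```

Under this assumed form, AE approaches 1 when y and z almost never occur outside x, which matches the slide's reading of a large AE.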

Estimation of CLP (cont)
AE alone is not enough to check whether x has complete lexical boundaries (e.g., 關鍵詞).
To overcome this, two additional metrics are used: left context dependency (LCD) and right context dependency (RCD) (e.g., 李登輝).
With these metrics:
– x is a CLP iff it has neither LCD nor RCD and its AE exceeds the threshold t3.

Estimation of CLP (cont)
x has LCD if $|L| < t_1$ or $\max_{z \in L} f(zx)/f(x) > t_2$, where $t_1$ and $t_2$ are threshold values, $f(\cdot)$ is frequency, $L$ is the set of left adjacent characters of x, and $|L|$ is the number of unique left adjacent characters of x.
x has RCD if $|R| < t_1$ or $\max_{y \in R} f(xy)/f(x) > t_2$, where $t_1$ and $t_2$ are threshold values, $R$ is the set of right adjacent characters of x, and $|R|$ is the number of unique right adjacent characters of x.
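The context-dependency test and the CLP decision can be sketched as follows. The sketch assumes that a pattern is context dependent when it has fewer than t1 distinct adjacent characters, or when a single adjacent character accounts for more than a fraction t2 of its occurrences. The function names, threshold values, and counts are hypothetical.

```python
from collections import Counter


def has_context_dependency(neighbors: Counter, fx: int, t1: int, t2: float) -> bool:
    """Return True if x depends on its context on one side.

    `neighbors` maps each adjacent character to how often it appears next to x,
    and `fx` is the total frequency of x.
    """
    if fx <= 0:
        raise ValueError("fx must be positive")
    if len(neighbors) < t1:  # too few distinct adjacent characters
        return True
    # One adjacent character dominates the occurrences of x.
    return max(neighbors.values(), default=0) / fx > t2


def is_clp(ae: float, left: Counter, right: Counter, fx: int,
           t1: int, t2: float, t3: float) -> bool:
    """x is a CLP iff it has neither LCD nor RCD and its AE exceeds t3."""
    return (not has_context_dependency(left, fx, t1, t2)
            and not has_context_dependency(right, fx, t1, t2)
            and ae > t3)


# Hypothetical counts: 李登輝 appears in varied contexts on both sides.
left = Counter({"與": 5, "對": 4, "由": 3})
right = Counter({"總": 6, "說": 4, "的": 3})
print(is_clp(0.9, left, right, fx=20, t1=3, t2=0.5, t3=0.5))

# 登輝 is always preceded by 李, so it has LCD and is rejected.
print(has_context_dependency(Counter({"李": 20}), fx=20, t1=3, t2=0.5))
```

The second call shows why AE is not sufficient on its own: a fragment of a name can be strongly associated internally and still lack a complete left boundary.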

Text Analysis and PAT-Tree Indexing
The PAT tree is the primary implementation structure; it is used for both text retrieval and keyphrase extraction.
Delimiters (such as commas, quotation marks, and periods) determine segment boundaries; semi-infinite strings are then built from each segment.
Example: 個人電腦, 人腦
– 個人電腦, 人電腦, 電腦, 腦, 人腦, 腦
Node information: (comparison bit, number of external nodes, frequency)
The PAT tree makes prefix search easy.
The inverse PAT tree (IPAT) makes postfix search easy.
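Segmenting on delimiters and generating semi-infinite strings can be sketched as below. The delimiter set and the function name are assumptions chosen for illustration; the presentation does not give the full list of delimiters.

```python
import re
from typing import List

# Assumed delimiter set: ASCII and full-width punctuation plus whitespace.
DELIMITERS = r'[,，。.、“”"\s]+'


def semi_infinite_strings(text: str) -> List[str]:
    """Split `text` into segments at delimiters and return every suffix of every segment."""
    sistrings = []
    for segment in re.split(DELIMITERS, text):
        for start in range(len(segment)):  # empty segments contribute nothing
            sistrings.append(segment[start:])
    return sistrings


print(semi_infinite_strings("個人電腦, 人腦"))
```

For the slide's example this gives the same six strings in the same order, with 腦 appearing twice because it ends both segments.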

Text Analysis and PAT-Tree Indexing (cont)
Convert the semi-infinite strings to bit sequences.
Build the PAT tree from the bit sequences of the semi-infinite strings and the positions at which they differ.
An inverse PAT tree is also built over the reversed data streams of the database to check the occurrences of LSs and RSs.
Example: (詞鍵關, 詞鍵, 詞鍵關展發, 詞鍵關行進)
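The bit conversion and the position at which two strings first differ can be sketched as follows. The presentation does not state the character encoding, so UTF-16BE (16 bits per common Chinese character) is used here purely as an assumption, and the function names are hypothetical.

```python
from typing import Optional


def to_bits(s: str) -> str:
    """Encode `s` as UTF-16BE (an assumed encoding) and return its bits as a '0'/'1' string."""
    return "".join(f"{byte:08b}" for byte in s.encode("utf-16-be"))


def first_differing_bit(a: str, b: str) -> Optional[int]:
    """Return the 0-based index of the first bit where `a` and `b` differ, or None.

    The shorter bit string is padded with zeros on the right.
    """
    bits_a, bits_b = to_bits(a), to_bits(b)
    width = max(len(bits_a), len(bits_b))
    bits_a, bits_b = bits_a.ljust(width, "0"), bits_b.ljust(width, "0")
    for i, (p, q) in enumerate(zip(bits_a, bits_b)):
        if p != q:
            return i
    return None


print(to_bits("腦"))
print(first_differing_bit("電腦", "人腦"))
print("關鍵詞"[::-1])  # reversed string, as indexed by the inverse PAT tree
```

In a PAT tree, an internal node would store such a bit position as its comparison bit, so that a search inspects only the bits that distinguish the stored strings.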

Text Analysis and PAT-Tree Indexing (cont)
Why use a PAT tree (Patricia tree)?
– The number of key comparisons is low (logarithmic).
– Computation time and space are reduced.
– Search is efficient.
– The PAT tree can be used to check RCD.
– The inverse PAT tree can be used to check LCD.
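The last two points can be illustrated with a much simpler structure than a PAT tree: a sorted list of semi-infinite strings searched with `bisect`. This is a stand-in, not a Patricia tree, but it shows how prefix search on the forward index yields right-adjacent characters and how the same search on the reversed index yields left-adjacent ones. All names and the sample segments are hypothetical.

```python
import bisect
from collections import Counter
from typing import List


def build_index(segments: List[str]) -> List[str]:
    """Sorted list of all semi-infinite strings (a simple stand-in for a PAT tree)."""
    return sorted(seg[i:] for seg in segments for i in range(len(seg)))


def prefix_matches(index: List[str], pattern: str) -> List[str]:
    """Return all indexed strings that start with `pattern`."""
    lo = bisect.bisect_left(index, pattern)
    hi = lo
    # Strings sharing a prefix are contiguous in sorted order.
    while hi < len(index) and index[hi].startswith(pattern):
        hi += 1
    return index[lo:hi]


def adjacent_counts(index: List[str], pattern: str) -> Counter:
    """Count the character that immediately follows each occurrence of `pattern`."""
    return Counter(s[len(pattern)] for s in prefix_matches(index, pattern)
                   if len(s) > len(pattern))


segments = ["發展關鍵詞", "進行關鍵詞抽取", "關鍵詞"]
forward = build_index(segments)
inverse = build_index([seg[::-1] for seg in segments])

print(len(prefix_matches(forward, "關鍵詞")))    # frequency of the pattern
print(adjacent_counts(forward, "關鍵詞"))        # right-adjacent characters (for RCD)
print(adjacent_counts(inverse, "關鍵詞"[::-1]))  # left-adjacent characters (for LCD)
```

Searching the inverse index for 詞鍵關 finds 詞鍵關展發 and 詞鍵關行進, the reversed strings listed on the previous slide.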

Extraction of SLP
A CLP is not always an SLP:
– It may not be significant in the text collection.
– Many CLPs are common in daily use.
Every CLP is checked against a set of lexical rules and a general-domain corpus.
Rules:
– Numbers, adverbs, and time-related terms
– General-domain PAT tree vs. specific-domain PAT tree
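The slide does not say how the two PAT trees are compared. One plausible reading, sketched below, is to reject CLPs that match a lexical rule and then keep those whose relative frequency in the specific-domain corpus is several times their relative frequency in the general-domain corpus. The ratio test, the `min_ratio` value, the stop-term set, and all counts are assumptions, not the authors' method.

```python
from typing import Dict, Set


def is_significant(clp: str,
                   domain_freq: Dict[str, int], domain_total: int,
                   general_freq: Dict[str, int], general_total: int,
                   stop_terms: Set[str], min_ratio: float = 5.0) -> bool:
    """Return True if `clp` passes the assumed lexical rule and frequency-ratio filter."""
    if clp in stop_terms:  # stands in for the number/adverb/time-term rules
        return False
    domain_rate = domain_freq.get(clp, 0) / domain_total
    if domain_rate == 0:
        return False
    general_rate = general_freq.get(clp, 0) / general_total
    if general_rate == 0:  # never seen in the general corpus
        return True
    return domain_rate / general_rate >= min_ratio


# Hypothetical counts for illustration only.
domain = {"關鍵詞": 30, "今天": 12, "非常": 9}
general = {"關鍵詞": 2, "今天": 150, "非常": 200}
stop = {"非常"}  # an adverb

for term in ["關鍵詞", "今天", "非常"]:
    print(term, is_significant(term, domain, 1000, general, 10000, stop))
```

In this example 關鍵詞 is kept because it is far more frequent in the domain corpus, 今天 is dropped as an everyday term, and 非常 is dropped by the rule list.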

Evaluation
Extraction of SLP
– Three people were asked to select CLPs and keyphrases from 50 "seed sentences".
– These test data were used to measure the accuracy of SLP extraction.
Table (values not preserved): Phrase length | Total number of extracted keyphrases | Number of correct keyphrases extracted | Precision (%)

Evaluation (cont)
Speed and Space Requirements
Table (values not preserved): Corpus | Corpus size (KB) | PAT tree size (KB) | Time to construct PAT tree (sec) | Time to extract keyphrases (sec)
Rows: C1-O(10k), C2-O(100k), C3-O(1M), C4-O(10M), C5-O(100M)

Conclusion
This method reduces the difficulty of keyphrase extraction in Chinese, with better performance.

[Figure: the semi-infinite strings of 個人電腦, 人腦 (個人電腦, 人電腦, 電腦, 腦, 人腦, 腦), each listed with its bit string and node number.]

Node labels (comparison bit, number of external nodes, string frequency): (0,6,1), (4,6,1), (5,3,1), (24,2,1), (8,3,2)