Arabic Text Categorization Based on Arabic Wikipedia

Arabic Text Categorization Based on Arabic Wikipedia Presenter : CHANG, SHIH-JIE Authors : ADNAN YAHYA and ALI SALHI 2014. ACM TALIP.

Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments

Motivation: Arabic text categorization is a challenge due to the correlation between certain subcategories and the overlap between main categories. EX:

Objectives: To solve this, we use a categorization algorithm and further adopt two approaches.

CATEGORIZATION CORPORA - Training Data Related Tags Approach

Testing Data 10 categories with 40 documents in each category

Methodology - PREPROCESSING TECHNIQUES Root Extraction (RE) Light Stemming (LS) Special Expressions Extraction

Methodology - CATEGORIZATION PROCESS: Categorize the input text in two phases. Phase one: we categorize the text into one of the main categories. Phase two: we further categorize the input text based on subcategories.
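The two-phase process can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the simple word-overlap score, and the toy English vocabularies are hypothetical and stand in for the Wikipedia-derived Arabic category word lists; the slides do not specify the scoring used.

```python
from collections import Counter


def score(words, category_words):
    """Hypothetical score: occurrences of input words found in a category's word list."""
    counts = Counter(words)
    return sum(counts[w] for w in category_words)


def categorize_two_phase(text, main_vocab, sub_vocab):
    """Pick a main category first, then a subcategory within it."""
    words = text.lower().split()
    # Phase one: choose the best-scoring main category.
    main = max(main_vocab, key=lambda c: score(words, main_vocab[c]))
    # Phase two: choose among the subcategories of that main category only.
    subs = sub_vocab[main]
    sub = max(subs, key=lambda s: score(words, subs[s]))
    return main, sub


# Toy vocabularies (illustrative only).
main_vocab = {
    "sports": {"match", "team", "goal"},
    "economy": {"bank", "market", "price"},
}
sub_vocab = {
    "sports": {"football": {"goal", "striker"}, "tennis": {"serve", "racket"}},
    "economy": {"banking": {"bank", "loan"}, "trade": {"market", "export"}},
}

result = categorize_two_phase(
    "the team scored a late goal in the match", main_vocab, sub_vocab
)
print(result)
```

Restricting phase two to the subcategories of the chosen main category keeps the search small, but it also means a phase-one mistake cannot be recovered in phase two.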

Methodology - Basic Categorization Algorithm (BCA)

Methodology - Percentage and Difference Categorization (PDC) Algorithm has frequency 7 in the 300-word

Methodology - Percentage and Difference Categorization (PDC) Algorithm The category with the highest sum of flag values is considered to be the best match for the input text.
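The final selection step stated on this slide can be sketched as follows. This is only an illustration under assumptions: the flag values are taken as given inputs (how PDC derives them from percentages and differences is not reproduced here), and the function name and sample data are hypothetical.

```python
def best_category(flags_by_category):
    """Return the category whose flag values sum highest, plus all totals."""
    totals = {cat: sum(flags) for cat, flags in flags_by_category.items()}
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical flag values, one per input word, for each candidate category.
flags = {
    "politics": [1, 0, 0, 1],
    "sports": [1, 1, 1, 0],
    "health": [0, 0, 1, 0],
}

best, totals = best_category(flags)
print(best, totals)
```

Ties are broken here by dictionary order, which is an arbitrary choice of this sketch.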

Methodology – PDC Algorithm vs. BCA Algorithm

Methodology – Enhancing Main/Subcategories Grouping. Problem: the possible high correlation between subcategories of different main categories. (1) Overlapping Main Categories for Phase Two

Methodology – Enhancing Main/Subcategories Grouping (2) Replacing Main Categories by Groups of Related Categories

Methodology – Enhancing Main/Subcategories Grouping

Methodology - Word Filtration Techniques within Categories

Methodology - The result of applying the three techniques

Modified PDC with N Scales: 1, 0.5. Define a scaling of 1, 0.75, 0.5, 0.25

Further Testing on the PDC Algorithm. Tool: Root Extraction. Tool: Light Stemming & Light10. Tool: Double Words. Tool: Expressions Extraction.

Using Testing Data from the Reference Categories

Training Data Characteristics

COMPARISON WITH RELATED WORK

Using Testing Data from the Reference Categories

Conclusions: The first method is to use training and testing data from the same source by splitting the corpus into test and training components. This consistently gives better results. However, we believe that the second method (testing data from a different source) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.
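The same-source setup described above amounts to a per-category split of one corpus. A minimal sketch, assuming a hypothetical `split_corpus` helper, an arbitrary 20% test fraction, and a toy corpus whose size is chosen only for illustration:

```python
import random


def split_corpus(docs_by_category, test_fraction=0.2, seed=0):
    """Split each category's documents into training and testing parts."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for category, docs in docs_by_category.items():
        shuffled = docs[:]  # copy, so the caller's list is not reordered
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test += [(category, d) for d in shuffled[:n_test]]
        train += [(category, d) for d in shuffled[n_test:]]
    return train, test


# Toy corpus: 3 categories with 10 placeholder documents each.
corpus = {f"category_{i}": [f"doc_{i}_{j}" for j in range(10)] for i in range(3)}
train, test = split_corpus(corpus)
print(len(train), len(test))
```

In the different-source setup, the test list would instead be loaded from a separate collection and no split of the training corpus would be needed.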

Comments Advantages: To. Applications: Arabic Text Categorization.