Qatar Content Classification Presenter Mohamed Handosa VT, CS6604 May 6, 2014 Client Tarek Kanan 1.

Presentation transcript:

Qatar Content Classification Presenter Mohamed Handosa VT, CS6604 May 6, 2014 Client Tarek Kanan 1

About The Project 2
Funded by QNRF.
Started at VT on 1/1/2013 and running through 12/31/2015.
A project to advance digital libraries in Qatar.
Collaborating institutions: Penn State, Texas A&M, and Qatar University.

Project Plan 3
Build Arabic collections using the Heritrix crawler.
Build a universal taxonomy for Arabic newspapers.
Use different classifiers to classify Arabic documents.
Use Apache Solr to index and search Arabic collections.
Evaluate the performance of the classifiers on Arabic data.

Building Arabic Collections Data Sources 4

Arabic Newspapers Taxonomy 5
Arabic Newspaper
Art: 150 instances (Al-Raya)
Economy: 150 instances (QNA)
Politics: 150 instances (QNA)
Society: 150 instances (Al-Raya)
Sport: 150 instances (QNA)

Collection Preprocessing 6
Stemming Arabic words (optional)
Normalizing Arabic words (optional)
Extracting Arabic words
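The extraction and normalization steps can be illustrated with a minimal Python sketch. The slides do not say which normalization rules were used, so the rules below (dropping diacritics and tatweel, unifying alef variants, mapping ى to ي and ة to ه) are common conventions assumed for illustration, and the function names are hypothetical.

```python
import re

# Assumed rules: the slides do not list the normalization steps actually used.
DIACRITICS = re.compile(r"[\u064B-\u0652]")    # fathatan through sukun
TATWEEL = "\u0640"                             # elongation character
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # runs of Arabic letters


def normalize(text: str) -> str:
    """Apply character-level normalization to a word or a whole text."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> yeh
    text = text.replace("\u0629", "\u0647")  # teh marbuta -> heh
    return text


def extract_arabic_words(text: str) -> list:
    """Return the Arabic-letter tokens, skipping Latin text, digits, punctuation."""
    return ARABIC_WORD.findall(text)


def preprocess(text: str) -> list:
    # Normalize first so that diacritics do not split a word during extraction.
    return extract_arabic_words(normalize(text))
```

Normalizing before extracting is a design choice of this sketch: the word pattern covers letters only, so a diacritic left in place would break a token in two.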

Stemming Arabic Words 7
Root Stemmers – too abstract. “مقصد”, “الاقتـصاد” → “قصد”
Light Stemmers – widely used. “مباحث”, “المباحثـات” → “مباحث”
P-Stemmer – even better. “مباحث” → “مباحث”; “المباحثات” → “مباحثات”
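The difference between the last two rows can be sketched in Python. The slide's example suggests that P-Stemmer strips the prefix but keeps the suffix, whereas a light stemmer strips both; the sketch below assumes exactly that. The affix lists and the minimum-length guard are assumptions borrowed from common Arabic light stemmers, not the lists used in the project.

```python
# Assumed affix lists; the slides do not give the ones P-Stemmer uses.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]  # longest first
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM = 3  # hypothetical guard against over-stripping short words


def strip_prefix(word: str) -> str:
    """Remove the first matching prefix if enough letters remain."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            return word[len(prefix):]
    return word


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix if enough letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word


def light_stem(word: str) -> str:
    """Light stemming: strip one prefix, then one suffix."""
    return strip_suffix(strip_prefix(word))


def p_stem(word: str) -> str:
    """Prefix-only stemming, as the slide's P-Stemmer example suggests."""
    return strip_prefix(word)
```

Under these assumptions, `light_stem("المباحثات")` collapses the word to “مباحث”, while `p_stem("المباحثات")` keeps the plural ending and returns “مباحثات”, matching the slide.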

Multiclass Classification (21 Classifiers) 8

Binary Classification (15 Classifiers) 9

Arabic Newspapers Classification Framework 10
[Framework diagram; recoverable labels: DocumentRanker, SVM (SMO), Naïve Bayes, Random Forest, F-Measure]
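The framework lists F-Measure, presumably as the metric for comparing the classifiers. As a reminder of how it is computed, here is a minimal Python sketch of the standard definition from true-positive, false-positive, and false-negative counts. The function name and the example counts are hypothetical and are not results from the project.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one class: 120 correct, 30 false alarms, 30 misses.
print(round(f_measure(120, 30, 30), 3))  # 0.8
```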

Uploading Collections to Apache Solr 11
1700 newspaper PDF files:
Split into PDF text.
Converted to clean Arabic text.
Stemmed using the proposed P-Stemmer.
Classified using 5 SVM binary classifiers.
Classified using an SVM multiclass classifier.
Solr Cores
11 Solr cores were created.
For each of the five binary classifiers, positive instances were uploaded to a core.
For the multiclass classifier, instances of each class were uploaded to a core.
All instances were uploaded to the last core.
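The core layout described above (5 binary cores, 5 multiclass cores, and 1 core for everything) can be sketched as a routing step in Python. The core names, the document fields, and the function name are all hypothetical; the sketch only plans which document goes where and does not talk to Solr.

```python
CLASSES = ["Art", "Economy", "Politics", "Society", "Sport"]


def plan_cores(docs):
    """Map each (hypothetical) core name to the ids of the documents it receives.

    Each doc is a dict with:
      'id'         - document identifier
      'multiclass' - the single label chosen by the multiclass classifier
      'binary'     - labels whose binary classifier marked the doc as positive
    """
    cores = {f"binary_{c.lower()}": [] for c in CLASSES}
    cores.update({f"multi_{c.lower()}": [] for c in CLASSES})
    cores["all"] = []

    for doc in docs:
        cores["all"].append(doc["id"])  # the last core holds every instance
        cores[f"multi_{doc['multiclass'].lower()}"].append(doc["id"])
        for label in doc["binary"]:  # a doc may be positive for several classes
            cores[f"binary_{label.lower()}"].append(doc["id"])
    return cores
```

With five classes this yields 11 cores, two per class plus the catch-all, which matches the count on the slide.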

Contributions 12
Building a collection for Arabic newspapers.
Developing a set of tools to process Arabic text.
Proposing P-Stemmer, an Arabic light stemmer.
Comparing different text classification techniques.
Proposing a framework for Arabic Newspapers Classification.
Creating 11 Solr cores, 2 per class and 1 containing all instances.

Future Work 13
Prepare a testing set using Solr:
Upload all instances to a Solr core.
Execute a query related to a given class.
Label search outputs as belonging to that class.
Use the labeled instances to test the classifiers.
Evaluate classifiers using Solr.
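The query-and-label steps of this plan can be sketched in Python. The sketch assumes Solr's standard select handler and JSON response layout; the base URL, core name, query text, and `id` field are placeholders, and the HTTP request itself is left out so that the example stays self-contained.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str, rows: int = 100) -> str:
    """Build a query URL for a core's select handler (all arguments are placeholders)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def label_results(solr_response: dict, label: str, id_field: str = "id") -> list:
    """Label every returned document as belonging to the queried class."""
    docs = solr_response["response"]["docs"]
    return [(doc[id_field], label) for doc in docs]


# Hypothetical usage with a hand-written response instead of a live server.
url = build_select_url("http://localhost:8983/solr", "all", "sport")
fake_response = {"response": {"numFound": 2, "docs": [{"id": "d1"}, {"id": "d7"}]}}
test_set = label_results(fake_response, "Sport")
```

The resulting `(id, label)` pairs would then serve as the testing set for the classifiers, as the slide proposes.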

Thank You 14