Categorization: Information and Misinformation

Slides:



Advertisements
Similar presentations
Categorization: Information and Misinformation Paul Thompson 7 November 2001.
Advertisements

Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
PolyAnalyst Data and Text Mining tool Your Knowledge Partner TM www
Taxonomies of Knowledge: Building a Corporate Taxonomy Wendi Pohs, Iris Associates
Automatic Classification of Text Databases Through Query Probing Panagiotis G. Ipeirotis Luis Gravano Columbia University Mehran Sahami E.piphany Inc.
Helping people find content … preparing content to be found Enabling the Semantic Web Joseph Busch.
Semi-Supervised, Knowledge-Based Information Extraction for the Semantic Web Thomas L. Packer Funded in part by the National Science Foundation. 1.
WebMiningResearch ASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
CS347 Review Slides (IR Part II) June 6, 2001 ©Prabhakar Raghavan.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Information Retrieval Ch Information retrieval Goal: Finding documents Search engines on the world wide web IR system characters Document collection.
Presented by Zeehasham Rasheed
1 Web Query Classification Query Classification Task: map queries to concepts Application: Paid advertisement 问题:百度 /Google 怎么赚钱?
Automatic Text Classification through Machine Learning David W. Miller Semantic Web Spring 2002 Department of Computer Science University of Georgia
Towards Semantic Web: An Attribute- Driven Algorithm to Identifying an Ontology Associated with a Given Web Page Dan Su Department of Computer Science.
Grid Service Discovery with Rough Sets Maozhen Li, Member, IEEE, Bin Yu, Omer Rana, and Zidong Wang, Senior Member, IEEE IEEE TRANSACTION S ON KNOLEDGE.
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
1 Information Integration and Source Wrapping Jose Luis Ambite, USC/ISI.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
1 Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections Zhao-Yan Ming, Kai Wang and Tat-Seng Chua School of Computing,
SIEVE—Search Images Effectively through Visual Elimination Ying Liu, Dengsheng Zhang and Guojun Lu Gippsland School of Info Tech,
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
MACHINE LEARNING 張銘軒 譚恆力 1. OUTLINE OVERVIEW HOW DOSE THE MACHINE “ LEARN ” ? ADVANTAGE OF MACHINE LEARNING ALGORITHM TYPES  SUPERVISED.
CS523 INFORMATION RETRIEVAL COURSE INTRODUCTION YÜCEL SAYGIN SABANCI UNIVERSITY.
E-Justice is based on e-Law By Anton Tomažič. A New Era in Legal Information So far: “strings of characters” IT can do much more Recognizing the MEANING.
Perception-Based Classification (PBC) System Salvador Ledezma April 25, 2002.
1 LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora Chien-Chung Huang Shui-Lung Chuang Lee-Feng Chien Presented by: Vu LONG.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Satoshi Oyama Takashi Kokubo Toru lshida 國立雲林科技大學 National Yunlin.
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
© Paul Buitelaar – November 2007, Busan, South-Korea Evaluating Ontology Search Towards Benchmarking in Ontology Search Paul Buitelaar, Thomas.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
Reference Collections: Collection Characteristics.
Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.
Class Imbalance in Text Classification
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
Generating Query Substitutions Alicia Wood. What is the problem to be solved?
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
Information Extraction. Extracting Information from Text System : When would you like to meet Peter? User : Let’s see, if I can, I’d like to meet him.
Empowering the Knowledge Worker End-User Software Engineering in Knowledge Management Witold Staniszkis The 17th International.
Data Mining – Intro.
Information Organization: Overview
Sentiment analysis algorithms and applications: A survey
School of Computer Science & Engineering
Text Based Information Retrieval
Adrian Tuhtan CS157A Section1
Taxonomies, Lexicons and Organizing Knowledge
What is Pattern Recognition?
Applying Key Phrase Extraction to aid Invalidity Search
Text Categorization Rong Jin.
CSE 635 Multimedia Information Retrieval
Ying Dai Faculty of software and information science,
Panagiotis G. Ipeirotis Luis Gravano
Information Retrieval
Data and Applications Security Developments and Directions
Information Retrieval and Web Design
Semi-Automatic Data-Driven Ontology Construction System
Deep SEARCH 9 A new tool in the box for automatic content classification: DS9 Machine Learning uses Hybrid Semantic AI ConTech November.
Information Organization: Overview
Information Retrieval
System Model Acquisition from Requirements Text
Presentation transcript:

Categorization: Information and Misinformation Paul Thompson 7 November 2001

Outline Supervised machine learning in an industry setting Categorization of case law Categorization of statutes Categorization applied to misinformation

Categorization of Case Law (ICAIL 2001) Query-Based or k-Nearest N eighbor-like Knowledge-Based, Topical View Queries Machine Learning Decision Trees: C4.5. C5.0 Rule-based: Ripper

Categorization of Case Law (cont.) Routing incoming case to 40 broad topical areas, e.g., bankruptcy Query-based Approach: Treat incoming case as NL query by extracting terms (10-300) Machine Learning Approach: C4.5 training on 7,535 labeled example cases

Categorization of Case Law – Issues with data Started with over 11,000 marked cases, not 7,535 Cases categorized by several means By attorney/editors By topical view queries Other Only want cases categorized by editors

Comparison: C4.5, Ripper, and Topical View Queries

Categorization of Statutes (SIG/CR 1997) Machine Learning Approaches: C4.5 and Ripper Training on 125,180 labeled example statute sections (5 states) Testing on 24,475 labeled statute sections (6th state) 35 Categories

Results for C4.5 Category Precision Recall F b =1/2 Counties 48.09 12.29 30.39 Sales 73.68 13.00 38.11 Worker’s Compensation 78.71 91.73 81.01 Insurance 83.19 84.56 83.46 Warehouse Receipts 92.59 86.21 91.24 Overall 67.62 41.61 56.35

Categorization of Urban Renewal Automatic and Manual

Automatic Categorization of “Urban Renewal” – C4.5

Manual Rule Set for Urban Renewal

Manual Rule Set (cont.)

Questionable Validity of Training and Testing Data Statutes editors: three phases of index assignment Phase 1: list all possible category assignments Phase 2: eliminate categories not suited to print environment Phase 3: concatenate categories hierarchically Arguably automatic categorization is phase 1, but this data not retained

Categorization Applied to Misinformation Text may be deliberately misleading Web search engine optimization (Lynch JASIS&T vol. 52 no. 1, 2001) Computer virus hoaxes Stock market fraud Information warfare Manipulating users’ perceptions Semantic Hacking project – ISTS