Categorization: Information and Misinformation Paul Thompson 7 November 2001.

Slides:



Advertisements
Similar presentations
Text Categorization.
Advertisements

Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen SIMS, UC Berkeley Susan Dumais Adaptive Systems & Interactions Microsoft.
UCB Computer Vision Animals on the Web Tamara L. Berg CSE 595 Words & Pictures.
PolyAnalyst Data and Text Mining tool Your Knowledge Partner TM www
Taxonomies of Knowledge: Building a Corporate Taxonomy Wendi Pohs, Iris Associates
Kansas State University Department of Computing and Information Sciences Laboratory for Knowledge Discovery in Databases (KDD) KDD Group Research Seminar.
Helping people find content … preparing content to be found Enabling the Semantic Web Joseph Busch.
Semi-Supervised, Knowledge-Based Information Extraction for the Semantic Web Thomas L. Packer Funded in part by the National Science Foundation. 1.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
DATA MINING CS157A Swathi Rangan. A Brief History of Data Mining The term “Data Mining” was only introduced in the 1990s. Data Mining roots are traced.
CS347 Review Slides (IR Part II) June 6, 2001 ©Prabhakar Raghavan.
Presented by Zeehasham Rasheed
Automatic Text Classification through Machine Learning David W. Miller Semantic Web Spring 2002 Department of Computer Science University of Georgia
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
1 Information Integration and Source Wrapping Jose Luis Ambite, USC/ISI.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
1 Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections Zhao-Yan Ming, Kai Wang and Tat-Seng Chua School of Computing,
Data Mining By Andrie Suherman. Agenda Introduction Major Elements Steps/ Processes Tools used for data mining Advantages and Disadvantages.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
MACHINE LEARNING 張銘軒 譚恆力 1. OUTLINE OVERVIEW HOW DOSE THE MACHINE “ LEARN ” ? ADVANTAGE OF MACHINE LEARNING ALGORITHM TYPES  SUPERVISED.
Data Mining and Application Part 1: Data Mining Fundamentals Part 2: Tools for Knowledge Discovery Part 3: Advanced Data Mining Techniques Part 4: Intelligent.
Jeyarasan Kasun Amal Iromi Bala Louiqa. 2 Semantic Search Engine for Financial Documents FIBO and taxonomy of certificates of deposit. Ranking Classification.
E-Justice is based on e-Law By Anton Tomažič. A New Era in Legal Information So far: “strings of characters” IT can do much more Recognizing the MEANING.
Perception-Based Classification (PBC) System Salvador Ledezma April 25, 2002.
PLUG IT IN 5 Intelligent Systems. 1.Introduction to intelligent systems 2.Expert Systems 3.Neural Networks 4.Fuzzy Logic 5.Genetic Algorithms 6.Intelligent.
GUI & Optimizer for the Virtual Pipeline Simulation Testbed Walamitien Oyenan November 10, 2003 MSE Presentation (Phase 2)
Knowledge Discovery and Data Mining Evgueni Smirnov.
1 LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora Chien-Chung Huang Shui-Lung Chuang Lee-Feng Chien Presented by: Vu LONG.
© Paul Buitelaar – November 2007, Busan, South-Korea Evaluating Ontology Search Towards Benchmarking in Ontology Search Paul Buitelaar, Thomas.
Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.
Semantic Technologies & GATE NSWI Jan Dědek.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
Data Mining In contrast to the traditional (reactive) DSS tools, the data mining premise is proactive. Data mining tools automatically search the data.
PIER Research Methods Protocol Analysis Module Hua Ai Language Technologies Institute/ PSLC.
Interaction LBSC 734 Module 4 Doug Oard. Agenda Where interaction fits Query formulation Selection part 1: Snippets  Selection part 2: Result sets Examination.
MIS 2000 Information Systems for Management Introduction to Course Section Bob Travica.
Practical Issues for Automated Categorization of Web Sites John M. Pierre Metacode Technologies, Inc. 139 Townsend Street San Francisco,
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
IA Tools to Inform IA Summit 2003 Madonnalisa G. Chan.
Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Automatic vs manual indexing Focus on subject indexing Not a relevant question? –Wherever full text is available, automatic methods predominate Simple.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
Information Extraction. Extracting Information from Text System : When would you like to meet Peter? User : Let’s see, if I can, I’d like to meet him.
Empowering the Knowledge Worker End-User Software Engineering in Knowledge Management Witold Staniszkis The 17th International.
Witold Staniszkis Empowering the Knowledge Worker End-User Software Engineering in Knowledge Management Witold Staniszkis
Information Organization: Overview
Sentiment analysis algorithms and applications: A survey
School of Computer Science & Engineering
Presented by: Hassan Sayyadi
Data Mining 101 with Scikit-Learn
Adrian Tuhtan CS157A Section1
Taxonomies, Lexicons and Organizing Knowledge
Special Topics in Data Mining Applications Focus on: Text Mining
What is Pattern Recognition?
Machine Learning Week 1.
Text Categorization Rong Jin.
Ying Dai Faculty of software and information science,
Categorization: Information and Misinformation
Information Retrieval
Information Retrieval and Web Design
Semi-Automatic Data-Driven Ontology Construction System
Welcome! Knowledge Discovery and Data Mining
Information Organization: Overview
System Model Acquisition from Requirements Text
Presentation transcript:

Categorization: Information and Misinformation Paul Thompson 7 November 2001

Outline Supervised machine learning in an industry setting –Categorization of case law –Categorization of statutes Categorization applied to misinformation

Categorization of Case Law (ICAIL 2001) Query-Based or k-Nearest Neighbor-like Knowledge-Based, Topical View Queries Machine Learning –Decision Trees: C4.5. C5.0 –Rule-based: Ripper

Categorization of Case Law (cont.) Routing incoming case to 40 broad topical areas, e.g., bankruptcy  Query-based Approach: Treat incoming case as NL query by extracting terms (10-300)  Machine Learning Approach: C4.5  training on 7,535 labeled example cases

Categorization of Case Law – Issues with data Started with over 11,000 marked cases, not 7,535 Cases categorized by several means –By attorney/editors –By topical view queries –Other Only want cases categorized by editors

Comparison: C4.5, Ripper, and Topical View Queries

Categorization of Statutes (SIG/CR 1997)  Machine Learning Approaches: C4.5 and Ripper  Training on 125,180 labeled example statute sections (5 states)  Testing on 24,475 labeled statute sections (6 th state)  35 Categories

Results for C4.5 CategoryPrecisionRecall F b =1/2 Counties Sales Worker’s Compensation Insurance Warehouse Receipts Overall

Categorization of Urban Renewal Automatic and Manual

Automatic Categorization of “Urban Renewal” – C4.5

Manual Rule Set for Urban Renewal

Manual Rule Set (cont.)

Questionable Validity of Training and Testing Data Statutes editors: three phases of index assignment –Phase 1: list all possible category assignments –Phase 2: eliminate categories not suited to print environment –Phase 3: concatenate categories hierarchically Arguably automatic categorization is phase 1, but this data not retained

Categorization Applied to Misinformation Text may be deliberately misleading –Web search engine optimization (Lynch JASIS&T vol. 52 no. 1, 2001) –Computer virus hoaxes –Stock market fraud –Information warfare –Manipulating users’ perceptions Semantic Hacking project – ISTS