Data Mining of Blood Handling Incident Databases. Costas Tsatsoulis, Information and Telecommunication Technology Center.

Presentation transcript:

Data Mining of Blood Handling Incident Databases
Costas Tsatsoulis
Information and Telecommunication Technology Center
Dept. of Electrical Engineering and Computer Science
University of Kansas

Background
Incident reports collected for the handling of blood products
An initial database was collected to allow experimentation
Goals:
–Allow the generation of intelligence from data: unique events, event clusters, event trends, frequencies
–Simplify the job of the QA: similar reports, less need for in-depth causal analysis
–Allow cross-institutional analysis

Annual Accidental Deaths in U.S.A.

Institute of Medicine Recommendation (November 1999)
Establish a national focus of research to enhance the knowledge base about patient safety
Identify and learn from errors through both mandatory and voluntary reporting systems
Raise standards and expectations through oversight organizations
Create safety systems through implementation of safe practices at the delivery level

Near-Miss Event Reporting
A useful database for studying a system's failure points
Many more near misses than actual adverse events
A source of data for studying human recovery
A dynamic means of understanding system operations

The Iceberg Model of Near-Miss Events
1/2,000,000 fatalities
1/38,000 ABO-incompatible transfusions
1/14,000 incorrect units transfused
Near-Miss Events (the far larger, submerged base of the iceberg)

Intelligent Systems
Developed two separate systems:
–Case-Based Reasoning (CBR)
–Information Retrieval (IR)
Goal was to address most of the needs of the users:
–Allow the generation of intelligence from data: unique events, event clusters, event trends, frequencies
–Simplify the job of the QA: similar reports, less need for in-depth causal analysis
–Allow cross-institutional analysis

Case-Based Reasoning
A technique from Artificial Intelligence that solves problems based on previous experiences
Of significance to us:
–CBR must identify a similar situation/problem to know what to do and how to solve the problem
We use CBR's concept of "similarity" to identify:
–similar reports
–report clusters
–frequencies

What is a Case and how do we represent it?
An incident report is a "case"
Cases are represented by:
–indexes: descriptive features of a situation (surface, in-depth, or both)
–their values: symbolic ("Technician"), numerical ("103 rpm"), sets ("{Monday, Tuesday, Wednesday}"), other (text, images, …)
–weights: indicate the descriptive significance of the index
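As a sketch, a case of this shape might be represented as a small data structure. The field names and example values below are illustrative only, not taken from the actual system:

```python
from dataclasses import dataclass, field

@dataclass
class Index:
    """One descriptive feature of an incident report."""
    name: str
    value: object          # symbolic, numeric, a set, or other
    weight: float = 1.0    # descriptive significance of this index

@dataclass
class Case:
    """An incident report represented as a weighted set of indexes."""
    report_id: str
    indexes: list = field(default_factory=list)

# Hypothetical example case built from the value kinds listed above
case = Case("R-001", [
    Index("role", "Technician", weight=2.0),
    Index("centrifuge_speed_rpm", 103, weight=0.5),
    Index("days_observed", {"Monday", "Tuesday", "Wednesday"}, weight=1.0),
])
```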

Finding Similarity
Define degrees of matching between attributes of an event report. For example:
–"Resident" and "MD" are similar
–"MLT," "MT," and "QA/QC" are similar
A value may match perfectly or partially:
–"MLT" to "MLT" (perfect)
–"MLT" to "MT" (partial)
Different attributes of the event report are weighted
The sum of the matching attributes, with their degrees of match and their weights, defines similarity
Cases matching over some predefined degree of similarity are retrieved and considered similar
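A minimal sketch of this weighted matching, assuming hand-defined partial-match groups taken from the examples above; the 0.5 score for a partial match and the weights are assumptions for illustration:

```python
# Groups of attribute values considered similar (from the slide's examples)
PARTIAL_GROUPS = [{"Resident", "MD"}, {"MLT", "MT", "QA/QC"}]

def attribute_match(a, b):
    """Degree of match between two attribute values."""
    if a == b:
        return 1.0                      # perfect match
    for group in PARTIAL_GROUPS:
        if a in group and b in group:
            return 0.5                  # partial match (assumed score)
    return 0.0

def similarity(case_a, case_b, weights):
    """Weighted sum of per-attribute match degrees, normalized by total weight."""
    total = sum(weights.values())
    score = sum(w * attribute_match(case_a[k], case_b[k])
                for k, w in weights.items())
    return score / total

weights = {"role": 2.0, "shift": 1.0}   # illustrative weights
s = similarity({"role": "MLT", "shift": "4-8am"},
               {"role": "MT",  "shift": "4-8am"}, weights)
# role matches partially (0.5 * 2.0), shift matches fully (1.0 * 1.0): s = 2/3
```

A case would then be retrieved when `s` exceeds the predefined similarity threshold.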

Information Retrieval
Index, search, and recall text without any domain information
Preprocess the document:
–remove stop words
–stemming
Use some representation for documents:
–vector-space model: a vector of terms, each with weight = tf * idf
tf = term frequency = (freq of word) / (freq of most frequent word)
idf = inverse document frequency = log10((total docs) / (docs with term))
Use some similarity metric between documents:
–vector algebra to find the cosine of the angle between the vectors
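The weighting and comparison steps above can be sketched directly: tf*idf weights exactly as defined on the slide, and cosine similarity between the resulting vectors. The three-document corpus is made up for illustration:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """tf = freq / freq of most frequent word; idf = log10(N / docs with term)."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())
    n_docs = len(corpus)
    vec = {}
    for term, freq in counts.items():
        tf = freq / max_freq
        df = sum(1 for d in corpus if term in d)
        idf = math.log10(n_docs / df)
        vec[term] = tf * idf
    return vec

def cosine(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical mini-corpus of tokenized incident descriptions
corpus = [["unit", "mislabeled", "unit"],
          ["sample", "mislabeled"],
          ["unit", "transfused"]]
v1 = tfidf_vector(corpus[0], corpus)
v2 = tfidf_vector(corpus[1], corpus)
score = cosine(v1, v2)   # nonzero: the documents share "mislabeled"
```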

CBR for
From the incident report features, selected a subset as indexes
Semantic similarity defined, e.g.:
–(OR, ER, ICU, L&D)
–(12-4am, 4-8am), (8am-12pm, 12-4pm), (4-8pm, 8pm-12am)
Domain-specific details defined
Weights assigned:
–fixed
–conditional: the weight of some causal codes depends on whether they were established using a rough or an in-depth analysis

IR for
No deletion of stop words:
–"or" vs. "OR"
No stemming
Use the vector-space model and the cosine comparison measure


Experiments
Database of approx. 600 cases
Selected 24 reports to match against the case base
CBR retrieval (CBR_match_value): EXPERIMENT 1
IR retrieval (IR_match_value): EXPERIMENT 2
Combined retrieval: EXPERIMENTS 3-11
–score = W_CBR * CBR_match_value + W_IR * IR_match_value
–weights range from 0.9 to 0.1 in increments of 0.1: (0.9, 0.1), (0.8, 0.2), (0.7, 0.3), …, (0.2, 0.8), (0.1, 0.9)
CBR retrieval with all weights set to 1: EXPERIMENT 12
No retrieval threshold set
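The combined score swept in Experiments 3-11 can be sketched as follows; the example CBR and IR match values are made up:

```python
def combined_score(cbr_match, ir_match, w_cbr, w_ir):
    """Weighted combination of the CBR and IR match values."""
    return w_cbr * cbr_match + w_ir * ir_match

# The nine (W_CBR, W_IR) pairs from (0.9, 0.1) down to (0.1, 0.9)
weight_pairs = [(round(w / 10, 1), round(1 - w / 10, 1))
                for w in range(9, 0, -1)]

# Hypothetical match values for one case against one report
scores = [combined_score(0.8, 0.4, wc, wi) for wc, wi in weight_pairs]
# e.g. at (0.9, 0.1): 0.9*0.8 + 0.1*0.4 = 0.76
```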

I NFORMATION AND T ELECOMMUNICATION T ECHNOLOGY C ENTER Evaluation Collected top 5 cases for each report for each experiment Because of duplication, each report had cases retrieved for all 12 experiments A random case was added to the set Results sent to experts to evaluate –Almost Identical –Similar –Not Very Similar –Not Similar At All

Preliminary Analysis
Determine agreement/disagreement with the expert's analysis:
–is a case similar?
–is a case dissimilar?
Establish accuracy (recall is more difficult to measure)
False positives vs. false negatives
What is the influence of the IR component?
Are the weights appropriate?
What is the influence of varying selection thresholds?
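The quantities examined above can be computed from a 2x2 retrieval confusion matrix; a small sketch with made-up counts:

```python
def retrieval_metrics(tp, fn, fp, tn):
    """Accuracy and error counts from a retrieval confusion matrix.

    tp: retrievable cases retrieved      fn: retrievable cases missed
    fp: non-retrievable cases retrieved  tn: correctly left out
    """
    total = tp + fn + fp + tn
    return {"accuracy": (tp + tn) / total,
            "false_positives": fp,
            "false_negatives": fn}

# Hypothetical counts, not from the experiments reported here
m = retrieval_metrics(tp=40, fn=10, fp=5, tn=45)
# accuracy = (40 + 45) / 100 = 0.85
```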

Results with 0.66 threshold
(Table: for each method — CBR, IR, CBR+IR (90::10), CBR with equal weights — cases judged retrievable and non-retrievable are tallied as retrieved vs. not retrieved.)

Results with 0.70 threshold
(Table: for each method — CBR, IR, CBR+IR (90::10), CBR with equal weights — cases judged retrievable and non-retrievable are tallied as retrieved vs. not retrieved.)

Combined Results
(Table: the same retrieved vs. not-retrieved tallies for CBR, IR, CBR+IR (90::10), and CBR with equal weights, shown across an increasing selection threshold.)

Some preliminary conclusions
The weights used in CBR seem to be appropriate and definitely improve retrieval
In CBR, increasing the acceptance threshold improves selection of retrievable cases but also increases the false positives
IR does an excellent job of identifying non-retrievable cases
Even a 10% inclusion of IR into CBR greatly helps in identifying non-retrievable cases

Future work
Plot performance versus acceptance threshold:
–identify the best case-selection threshold
–integrate the analysis of the second expert
Examine how CBR and IR can be combined to exploit each one's strengths:
–CBR performs the initial retrieval
–IR eliminates bad cases retrieved
Look into the temporal distribution of retrieved reports and adjust their matching accordingly
Examine an NLU system for incident reports that have longer textual descriptions
Re-run on different datasets
Obtain large datasets and perform other types of data mining (rule induction, predictive models, probability networks, supervised and unsupervised clustering, etc.)