

Raphael Cohen, Michael Elhadad, Noemie Elhadad

1. If it has to do with human-readable (more or less) text, it’s NLP!
2. Search engines.
3. Information extraction.
4. Helping the government read your e-mails.
5. Topic models.
6. Movie review aggregators.
7. Spell checkers.
8. …

- Detecting collocations: "קפה עלית" (Elite Coffee), "כאב ראש" (headache). Dunning 1994: word occurrences, Chi-Square / Maximum Likelihood.
- Topic modeling: "לידה / הריון" (birth / pregnancy) vs. "טפיל" (parasite). Blei et al.: a mixed generative model acquired using Gibbs sampling over word occurrences in documents.
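A minimal sketch of collocation scoring with a log-likelihood ratio statistic in the spirit of Dunning's method. The function names and the adjacent-bigram setup are assumptions made for illustration; the slides only name the statistic, not an implementation.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: count of (w1, w2)        k12: count of (w1, not w2)
    k21: count of (not w1, w2)    k22: count of (not w1, not w2)
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def score_bigrams(tokens):
    """Score every adjacent word pair in a token list (hypothetical helper)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), count in bigrams.items():
        first[a] += count   # times a appears as the left word
        second[b] += count  # times b appears as the right word
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores
```

Pairs that co-occur more often than their individual frequencies predict get high scores. Because the inputs are raw counts, copy-pasted notes inflate them directly.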

- Hospital data is becoming digital.
- The textual part of the EHR is important: in our Hebrew collection of 900 neurology notes, only 12 prescriptions are indexed.
- This data is used for a variety of purposes: discovering drug side effects (Saadon and Shahar), discovering adverse drug relations, creating summaries for physicians in hospitals, studying diseases and more.

- Observation: physicians like to copy/paste previous visits to save time (they couldn’t do that with paper notes).
- Wrenn et al. showed up to 74% redundancy. It occurs within the same patient’s notes (thank god…), usually within the same form but not always.

- No fear, other interesting datasets are also redundant:
  - News reports (try Google News)
  - Movie reviews
  - Product reviews
  - Talkbacks in Ynet…
- Also, we call ourselves Medical Informatics and have our own conferences.

On average 52% identity, but we can see two document populations.

- Conventional wisdom: the more data, the better the performance of statistical algorithms.
- This usually works for huge corpora (the internet).
- To solve domain-specific problems we have to use smaller corpora (for example, translating CS literature from English to Chinese).
- However, redundancy creates false occurrence counts. With some patients having hundreds of redundant notes, this might create a bias in smaller corpora.

- 22,564 notes of patients with kidney problems.
- 6,131,879 tokens.
- The physician tells us that the most important notes are those from the “primary-health-care-provider” table in the database.
- There are 504 patients with such notes, and 1,618 “primary-provider” notes.

Effect on word counts

- Medical concepts are detected using Health-Term-Finder, an NLP program based on the OpenNLP suite and on UMLS (Unified Medical Language System), a medical concept repository.
- These concepts include drugs, findings, symptoms…
- Hey, you said no bio… Annotations are also used for names of actors (movie reviews / gossip), corporations (news) and terrorists (online forums and chats).

Effect on UMLS concept counts

Effect on co-occurrence in UMLS concepts

- Build a corpus with a controlled amount of redundancy.
- Reminiscent of the non-redundant protein/DNA databases built in the beginning of the last decade [Holm and Sander (1998)].

- Our easy and naïve approach: we have the patients’ ids, so let’s sample a small number of notes from each patient (the “Last” dataset in the graphs we saw).
- Drawbacks:
  a) Anonymized datasets are the future (our Soroka collection is one example), and they ain’t got ids.
  b) Are we throwing out some good data along with the redundant stuff?
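The naïve per-patient sampling can be sketched as follows. The record layout `(patient_id, timestamp, text)` and the function name are assumptions for illustration; the slides only say that a small number of notes is taken from each patient.

```python
from collections import defaultdict


def last_notes_per_patient(notes, n_last=1):
    """Keep only the n_last most recent notes of every patient.

    notes: iterable of (patient_id, timestamp, text) tuples (assumed layout).
    """
    by_patient = defaultdict(list)
    for patient_id, timestamp, text in notes:
        by_patient[patient_id].append((timestamp, text))

    sample = []
    for patient_id in sorted(by_patient):
        # Sort on the timestamp only, so note texts are never compared.
        recent = sorted(by_patient[patient_id], key=lambda pair: pair[0])[-n_last:]
        sample.extend((patient_id, ts, text) for ts, text in recent)
    return sample
```

This needs the patient ids, which is exactly what an anonymized collection does not provide.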

- Align all pairs of sequences (Nimrod showed us how to do that last week) and kick out the redundant ones.
- Problem: alignment costs ~O(n²), so this will take a while.
- Solution: BLAST / FASTA algorithms use short identical fingerprints (substrings) to compare only sequences likely to be similar, cutting O(n²) down to ~O(n) in most cases.
* Experts say that using an algorithm borrowed from another discipline gets you into journals.
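To illustrate only the seeding idea (this is not BLAST or FASTA itself), the sketch below indexes every length-k substring and proposes for alignment only those document pairs that share a seed. The names `seeds` and `candidate_pairs` and the default `k=8` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def seeds(text, k):
    """All distinct substrings of length k (the short 'fingerprints')."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def candidate_pairs(docs, k=8, min_shared=1):
    """Return pairs of document indices that share at least min_shared seeds."""
    index = defaultdict(set)  # seed -> ids of the documents containing it
    for doc_id, text in enumerate(docs):
        for seed in seeds(text, k):
            index[seed].add(doc_id)

    shared = defaultdict(int)
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```

Only the returned pairs would go on to full alignment. A seed that occurs in very many documents still produces many pairs, so the near-linear behaviour holds only when seeds are reasonably specific.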

- The bioinformatics algorithms are optimized for alphabets of 4 or 20 (now 21) letters, and the sequences are shorter (usually less than 5K characters).
- Texts are easier than DNA: they have defined ends of lines and only one reading frame.
- Fingerprinting methods for texts already exist, developed in order to find plagiarism.

Sort documents by size.
For each document:
    Find fingerprints by lines (for each line, break into substrings of length F).
    Add to the corpus if there is no document in the corpus sharing more than Max_redundancy substrings.
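The steps above can be sketched in Python. Several details are assumptions, since the slide does not fix them: documents are processed longest first, each line is cut into consecutive non-overlapping chunks of length `f`, and `max_redundancy` is read as the fraction of a document's fingerprints shared with any single kept document (the slide could also mean an absolute count).

```python
from collections import Counter, defaultdict


def line_fingerprints(text, f):
    """Cut every line into consecutive chunks of length f (assumed scheme)."""
    prints = set()
    for line in text.splitlines():
        line = line.strip()
        for start in range(0, len(line) - f + 1, f):
            prints.add(line[start:start + f])
    return prints


def build_low_redundancy_corpus(docs, f=20, max_redundancy=0.25):
    """Greedily keep documents that are not too similar to any kept one."""
    index = defaultdict(set)  # fingerprint -> positions of kept documents
    kept = []
    # Assumption: longest documents first, so shorter copies get dropped.
    for text in sorted(docs, key=len, reverse=True):
        prints = line_fingerprints(text, f)
        overlap = Counter()  # kept position -> number of shared fingerprints
        for fp in prints:
            for kept_pos in index.get(fp, ()):
                overlap[kept_pos] += 1
        worst = max(overlap.values(), default=0)
        if prints and worst / len(prints) > max_redundancy:
            continue  # too redundant with a document already in the corpus
        for fp in prints:
            index[fp].add(len(kept))
        kept.append(text)
    return kept
```

The inverted index means each new document is compared only against kept documents that share at least one fingerprint with it, which avoids the all-pairs comparison.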

- How long does it take? 5 minutes for our 20K documents, 20 minutes for our 400K documents.
- Is it better than the “Last note” naïve approach?