1 Raphael Cohen, Michael Elhadad, Noemie Elhadad

2
1. If it has to do with human-readable (more or less) text, it's NLP!
2. Search engines.
3. Information extraction.
4. Helping the government read your emails.
5. Topic models.
6. Movie review aggregators.
7. Spell checkers.
8. …

3
• Detecting collocations: "קפה עלית" ("Elite coffee"), "כאב ראש" ("headache"). Dunning 1994: word occurrences, chi-square / maximum likelihood.
• Topic modeling: "לידה / הריון" ("birth / pregnancy") vs. "טפיל" ("parasite"). Blei et al. 2003: a mixed generative model acquired using Gibbs sampling over word occurrences in documents.
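As a rough illustration of the likelihood-based collocation test mentioned on this slide, here is a minimal sketch of the log-likelihood ratio (G²) for a 2×2 table of bigram counts. The function name and the counts in the usage note are made up for illustration; they do not come from the presentation's data.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: w1 immediately followed by w2
    k12: w1 followed by some other word
    k21: some other word followed by w2
    k22: neither w1 first nor w2 second
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

With illustrative counts, `log_likelihood_ratio(50, 5, 5, 940)` is large (the two words almost always appear together), while `log_likelihood_ratio(10, 10, 10, 10)` is zero (the words are independent). Because the statistic is driven by raw counts, copy-pasted notes that repeat the same pair can inflate it.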

4
• Hospital data is becoming digital.
• The textual part of the electronic health record (EHR) is important: in our Hebrew collection of 900 neurology notes, only 12 prescriptions are indexed.
• This data is used for a variety of purposes: discovering drug side effects (Saadon and Shahar), discovering adverse drug relations, creating summaries for physicians in hospitals, studying diseases and more.

5
• Observation: physicians like to copy and paste from previous visits to save time (they couldn't do that with paper notes).
• Wrenn et al. showed up to 74% redundancy. It occurs within the same patient's notes (thank God…), usually within the same form but not always.

6
• No fear, other interesting datasets are also redundant: news reports (try Google News), movie reviews, product reviews, talkbacks on Ynet…
• Also, we call ourselves Medical Informatics, and have our own conferences.

7 On average 52% identity, but we can see two document populations.

8
• Conventional wisdom: the more data, the better the performance of statistical algorithms.
• This usually works for huge corpora (the internet).
• To solve domain-specific problems we have to use smaller corpora (for example, translating CS literature from English to Chinese).
• However, redundancy creates false occurrence counts. With some patients having hundreds of redundant notes, this might create a bias in smaller corpora.

9  22,564 patient notes of patients with kidney problems.  6,131,879 tokens.  The physician tells us that the most important notes are those from the “primary- health-care-provider” table in the database.  There are 504 patients with such notes, and 1,618 “primary-provider” notes.

10 Effect on word counts

11
• Medical concepts are detected using Health-Term-Finder, an NLP program based on the OpenNLP suite and UMLS (Unified Medical Language System), a medical concept repository.
• These concepts include drugs, findings, symptoms…
• "Hey, you said no bio…": the same kind of annotation is used for names of actors (movie reviews / gossip), corporations (news) and terrorists (online forums and chats).

12 Effect on UMLS concept counts

13 Effect on co-occurrence in UMLS concepts

14
• Build a corpus with a controlled amount of redundancy.
• Reminiscent of the non-redundant protein/DNA databases built at the beginning of the last decade [Holm and Sander (1998)].

15
• Our easy and naïve approach: we have the patients' IDs, so let's sample a small number of notes from each patient (the "Last" dataset in the graphs we saw).
• Drawbacks:
  a) Anonymized datasets are the future (our Soroka collection is one example), and they ain't got IDs.
  b) Are we throwing out some good data along with the redundant stuff?
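The naïve approach can be sketched in a few lines. This is only an illustration: the record fields `patient_id`, `date` and `text`, and the parameter `k`, are hypothetical names, not the schema of the actual database.

```python
from collections import defaultdict

def last_notes_per_patient(notes, k=1):
    """Keep only the k most recent notes of each patient.

    Each note is assumed to be a dict with hypothetical keys
    "patient_id", "date" (sortable, e.g. an ISO date string) and "text".
    """
    by_patient = defaultdict(list)
    for note in notes:
        by_patient[note["patient_id"]].append(note)
    sample = []
    for patient_id in sorted(by_patient):
        ordered = sorted(by_patient[patient_id], key=lambda n: n["date"])
        sample.extend(ordered[-k:])  # the last k notes by date
    return sample
```

Note that the grouping step is exactly what breaks on anonymized data: without a patient identifier there is nothing to group by.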

16
• Align all pairs of sequences (Nimrod showed us how to do that last week) and kick out the redundant ones.
• Problem: alignment costs ~O(n²), so this will take a while.
• Solution: BLAST / FASTA algorithms use short identical fingerprints (substrings) to compare only sequences likely to be similar, cutting O(n²) down to ~O(n) in most cases.
* Experts say that using a borrowed algorithm from another discipline gets you into journals.
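The fingerprint idea can be illustrated with an inverted index: only documents that share at least one short substring become candidates for a full alignment. This is a simplified sketch and not BLAST or FASTA themselves; the function names and the fingerprint length `f` are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations

def fingerprints(text, f=8):
    """All substrings of length f (a hypothetical fingerprint choice)."""
    return {text[i:i + f] for i in range(len(text) - f + 1)}

def candidate_pairs(docs, f=8):
    """Pairs of document indices that share at least one fingerprint."""
    index = defaultdict(set)  # fingerprint -> ids of documents containing it
    for doc_id, text in enumerate(docs):
        for fp in fingerprints(text, f):
            index[fp].add(doc_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

For three short notes where only the first two overlap, `candidate_pairs` returns just `{(0, 1)}`, so one alignment is needed instead of three. If a fingerprint is shared by nearly every document, the number of candidate pairs is still quadratic, which is why the saving holds only "in most cases".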

17
• The bioinformatics algorithms are optimized for alphabets of 4 or 20 (now 21) letters, and the sequences are shorter (usually less than 5K characters).
• Texts are easier than DNA: they have defined line ends and only one reading frame.
• Fingerprinting methods for texts already exist, for finding plagiarism.

18
Sort documents by size.
For each document:
  Find fingerprints by lines (for each line, break into substrings of length F).
  Add it to the corpus if no document already in the corpus shares more than Max_redundancy substrings with it.
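The steps on this slide can be sketched as follows. Several details are assumptions, because the slide does not specify them: documents are sorted largest first, each line is cut into non-overlapping chunks of length `f`, and `max_redundancy` is an absolute count of shared chunks. The function names are hypothetical too.

```python
from collections import Counter, defaultdict

def line_fingerprints(text, f):
    """Cut each line into non-overlapping substrings of length f (assumed)."""
    fps = set()
    for line in text.splitlines():
        line = line.strip()
        for start in range(0, len(line) - f + 1, f):
            fps.add(line[start:start + f])
    return fps

def build_corpus(docs, f=10, max_redundancy=2):
    """Greedily keep documents that share few fingerprints with kept ones."""
    index = defaultdict(set)  # fingerprint -> positions of kept documents
    kept = []
    # Assumption: largest documents are considered first.
    for text in sorted(docs, key=len, reverse=True):
        fps = line_fingerprints(text, f)
        shared = Counter()  # kept document -> number of shared fingerprints
        for fp in fps:
            for kept_id in index.get(fp, ()):
                shared[kept_id] += 1
        if any(count > max_redundancy for count in shared.values()):
            continue  # too similar to a document already in the corpus
        kept_id = len(kept)
        kept.append(text)
        for fp in fps:
            index[fp].add(kept_id)
    return kept
```

Because each document is compared only against kept documents that share a fingerprint with it, no patient IDs are needed and most pairs are never examined.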


20
• How long does it take? 5 minutes for our 20K documents, 20 minutes for our 400K documents.
• Is it better than the "Last note" naïve approach?


