Topic Extraction from Biology Literature: Prior, Labeling, and Switching Qiaozhu Mei.



A Sample Topic
Word distribution (language model):
  filaments     0.0410238
  muscle        0.0327107
  actin         0.0287701
  z             0.0221623
  filament      0.0169888
  myosin        0.0153909
  thick         0.00968766
  thin          0.00926895
  sections      0.00924286
  er            0.00890264
  band          0.00802833
  muscles       0.00789018
  antibodies    0.00736094
  myofibrils    0.00688588
  flight        0.00670859
  images        0.00649626
Meaningful labels: actin filaments; flight muscle; flight muscles
Example documents:
  actin filaments in honeybee-flight muscle move collectively
  arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
  identification of a connecting filament protein in insect fibrillar flight muscle
  the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
  structure of thick filaments from insect flight muscle

Topic/Theme Extraction
A theme/topic is represented with a multinomial distribution over words (unigram language models)
  Easier to interpret
  Easy to add prior
  Easy for retrieval
Assumption: K themes in a collection
A document covers multiple themes
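The representation above can be illustrated with a minimal sketch. It assumes a standard mixture formulation, p(w|d) = sum over themes j of coverage(d, j) * p(w|theme j), which the slide does not spell out; the theme names, words, and probabilities are hypothetical toy values.

```python
# Hypothetical toy themes: each theme is a multinomial over words
# (a unigram language model), so its probabilities sum to 1.
themes = {
    "muscle": {"filaments": 0.5, "muscle": 0.3, "actin": 0.2},
    "foraging": {"foraging": 0.6, "nectar": 0.3, "muscle": 0.1},
}


def word_prob(word, coverage, themes):
    """p(w | d) under the assumed mixture: sum_j coverage[j] * p(w | theme_j)."""
    return sum(weight * themes[name].get(word, 0.0)
               for name, weight in coverage.items())


# A document that covers multiple themes: 70% "muscle", 30% "foraging".
coverage = {"muscle": 0.7, "foraging": 0.3}
p_muscle = word_prob("muscle", coverage, themes)  # 0.7 * 0.3 + 0.3 * 0.1
print(p_muscle)
```

Because every theme is just a word distribution, the top words can be read off directly, which is what makes this representation easy to interpret.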

Topic Extraction vs. Clustering
Topic extraction:
  Effective at revealing latent topics and finding the documents most relevant to a topic
  Better interpretation, worse accuracy
  Effective to add priors (control the topics)
Clustering algorithms:
  Effective at assigning documents to non-overlapping clusters
  Better accuracy, worse interpretation
  Hard to control

Topic Extraction (Results)
Word distribution:
  corpora          0.0438967
  allata           0.0315774
  hormone          0.0249687
  juvenile         0.0184049
  insulin          0.0174549
  embryos          0.0165997
  neurosecretory   0.0127734
  embryo           0.0124167
  biosynthesis     0.0118067
  cardiaca         0.00969471
  sexta            0.0088941
  medium           0.00865245
  iran             0.00703376
  mannose          0.00668768
  volume           0.00661038
  synapse          0.00652483
  injected         0.00636151
Related documents:
  44 biosis:199598006316
  44 biosis:200000292072
  44 biosis:199293065558
  44 biosis:199799595920
  44 biosis:199395062782
Example document: stimulatory effect of octopamine on juvenile hormone biosynthesis in honey bees (apis mellifera): physiological and immunocytochemical evidence
May want a more general topic
How to tell the algorithm to find a more general topic, like “behavioral maturation”?

Topic Extraction (Results cont.)
Word distribution:
  pollen        0.467911
  foraging      0.0373205
  foragers      0.0365857
  collected     0.0318249
  grains        0.0314324
  loads         0.025104
  collection    0.0208903
  nectar        0.0185726
  sources       0.0113751
  collecting    0.00999529
  types         0.00978636
  pellets       0.00942175
  germination   0.00733012
  load          0.00646375
  stored        0.00599516
  amount        0.00481306
  trips         0.00478013
Related documents:
  13 biosis:200200039990
  13 biosis:199900297835
  13 biosis:200100318017
  13 biosis:199497516580
  13 biosis:200000045397
Example document: the response of the stingless bee melipona beecheii to experimental pollen stress, worker loss and different levels of information input
Biased towards “Pollen”
Not precisely covering “foraging”
How to tell the algorithm to focus on “foraging”?

Topic Extraction (Full Results)
100 topics from biosis-bee: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic.html
5 themes for query “food” in biosis-bee (500 documents): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-food-5-basic.html

Incorporating Topic Priors
Either topic extraction or clustering:
  Cannot guarantee that the extracted themes are the expected ones
User exploration: the user usually has a preference
  E.g., wants one topic/cluster to be about foraging behavior
Use a prior to guide the theme extraction
  Prior as a simple language model
  E.g., forage 0.2; foraging 0.3; food 0.05; etc.

Incorporating Topic Priors
Original EM: [equation not preserved in the transcript]
Prior: language model; interpreted as pseudo counts
EM with prior: [equation not preserved in the transcript]
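Since the slide's equations did not survive, the following is only a sketch of one common way to treat a prior language model as pseudo counts, not a reconstruction of the author's formulas. It assumes a PLSA-style mixture model; the function name plsa_em_with_prior, the strength parameter mu, and the toy documents are hypothetical. In the M-step, mu * p(w | prior) is added to the expected word counts of the topic that has a prior; a topic without a prior gets the plain EM update.

```python
import random


def plsa_em_with_prior(docs, vocab, k, priors, mu=10.0, iters=30, seed=0):
    """docs: list of {word: count}; priors: {topic index: {word: prob}}.

    Assumes every document is non-empty and all its words are in vocab.
    """
    rng = random.Random(seed)
    # Random initial word distributions p(w | topic j), normalised per topic.
    topics = []
    for _ in range(k):
        raw = {w: rng.random() + 0.01 for w in vocab}
        total = sum(raw.values())
        topics.append({w: v / total for w, v in raw.items()})
    # Uniform initial topic coverage per document.
    coverage = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        counts = [{w: 0.0 for w in vocab} for _ in range(k)]
        new_cov = [[0.0] * k for _ in docs]
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                # E-step: posterior of each topic for word w in document d.
                post = [coverage[d][j] * topics[j][w] for j in range(k)]
                norm = sum(post)
                for j in range(k):
                    share = c * post[j] / norm
                    counts[j][w] += share
                    new_cov[d][j] += share
        for j in range(k):
            prior = priors.get(j, {})
            # M-step: expected counts plus prior pseudo counts
            # (1e-12 only guards against exact zeros).
            pseudo = {w: counts[j][w] + mu * prior.get(w, 0.0) + 1e-12
                      for w in vocab}
            total = sum(pseudo.values())
            topics[j] = {w: v / total for w, v in pseudo.items()}
        coverage = [[x / sum(row) for x in row] for row in new_cov]
    return topics, coverage


docs = [
    {"foraging": 3, "nectar": 2},
    {"gene": 4, "expression": 2},
    {"foraging": 1, "food": 2, "gene": 1},
]
vocab = sorted({w for doc in docs for w in doc})
topics, coverage = plsa_em_with_prior(
    docs, vocab, k=2, priors={0: {"foraging": 0.5, "food": 0.5}})
print(sorted(topics[0].items(), key=lambda kv: -kv[1])[:3])
```

A larger mu pulls the chosen topic harder toward the prior words; with mu = 0 the update reduces to ordinary EM.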

Incorporating Topic Priors (results)
Prior: forage 0.1; foraging 0.1; food 0.1; source 0.1
Extracted topic:
  foraging      0.0498044
  food          0.0472535
  foragers      0.0310718
  dance         0.0266078
  source        0.0254369
  nectar        0.0162739
  distance      0.0141869
  forage        0.0141503
  information   0.0129047
  dances        0.012684
  hive          0.0124987
  landmarks     0.0119087
  dancing       0.0109375
  waggle        0.0101672
  feeder        0.0101266
  rate          0.0085641
  sources       0.00825884
  recruitment   0.00813717
  forager       0.00796914

Incorporating Topic Priors (results: cont.)
Prior: labor 0.2; division 0.2
Extracted topic:
  age          0.0672687
  division     0.0551497
  labor        0.052136
  colony       0.038305
  foraging     0.0357817
  foragers     0.0236658
  workers      0.0191248
  task         0.0190672
  behavioral   0.0189017
  behavior     0.0168805
  older        0.0143466
  tasks        0.013823
  old          0.011839
  individual   0.0114329
  ages         0.0102134
  young        0.00985875
  genotypic    0.00963096
  social       0.00883439

Incorporating Topic Priors (results: cont.)
Prior: brain 0.1; predict 0.1; gene 0.1; expresion 0.1
Extracted topic:
  gene           0.0648303
  expression     0.0486273
  sequence       0.0407999
  sequences      0.0311126
  brain          0.0233977
  drosophila     0.020891
  cdna           0.0186153
  predict        0.0166939
  expressed      0.0166521
  amino          0.0126359
  dna            0.010655
  genome         0.0101629
  conserved      0.0098135
  bp             0.00908649
  nucleotide     0.00906794
  phylogenetic   0.00887771
  encoding       0.00866418
  melanogaster   0.00798409

Incorporating Topic Priors (results: cont.)
Prior: behavioral 0.2; maturation 0.2
Extracted topic:
  behavioral    0.110674
  age           0.0789419
  maturation    0.057956
  task          0.0318285
  division      0.0312101
  labor         0.0293371
  workers       0.0222682
  colony        0.0199028
  social        0.0188699
  behavior      0.0171008
  performance   0.0117176
  foragers      0.0110682
  genotypic     0.0106029
  differences   0.0103761
  polyethism    0.00904816
  older         0.00808171
  plasticity    0.00804363
  changes       0.00794045

Incorporating Topic Priors (Full results)
30 topics from biosis-bee (first 7 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior.html
30 topics from biosis-bee (first 2 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior3.html

Labeling a Topic
Themes (topic models) can be hard to interpret.
Giving meaningful labels to a topic is hard.

What is a Good Label?
Suggests the theme (relevance)
Understandable (phrases?)
High coverage inside the topic
  A theme is often a mixture of concepts
Discriminative across topics
  A theme is usually in the context of k topics
…

Our Method
Guarantee understandability with a pre-processing step
  Use phrases as candidate topic labels
  Other possible choices: entities
Satisfy relevance, coverage, and discriminability with a probabilistic framework
Good labels = Understandable + Relevant + High Coverage + Discriminative

Labeling a Topic: Candidate Labels
Phrase generation: statistically significant 2-grams
  Hypothesis testing
  T-test used; ranked by t-score
Other choices?
  Entities?
  Behavior ontology?
  GO: hard to use, because its terms are not real phrases from the literature.
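The t-test ranking of 2-grams can be sketched as follows. This assumes the textbook collocation t-test, t = (observed - expected) / sqrt(s^2 / N) with the variance s^2 approximated by the observed bigram probability; the slide does not say which exact variant was used. The function name, the min_count threshold, and the toy token list are hypothetical.

```python
import math
from collections import Counter


def rank_bigrams_by_t_score(tokens, min_count=2):
    """Return [((w1, w2), t_score), ...] sorted by descending t-score."""
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to test meaningfully
        observed = count / n_bigrams
        # Expected bigram probability if w1 and w2 were independent.
        expected = (unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)
        # Variance of a Bernoulli with small p is approximately p.
        t_score = (observed - expected) / math.sqrt(observed / n_bigrams)
        scored.append(((w1, w2), t_score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


tokens = ["juvenile", "hormone", "a", "b",
          "juvenile", "hormone", "c", "d",
          "juvenile", "hormone", "e", "f"]
ranked = rank_bigrams_by_t_score(tokens)
print(ranked)
```

The highest-scoring 2-grams would then serve as the candidate label pool for every topic.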

Labeling a Topic: Semantic Relevance
Zero-order: use phrases that cover the top words well
  Latent topic θ (top words): clustering, dimensional, algorithm, birch, shape, …
  Good label: “clustering algorithm”
  Bad label: “body shape”
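A minimal sketch of zero-order scoring, under the assumption that a label is scored by summing log p(w | topic) / p(w) over its words, so that a phrase built from high-probability topic words scores high. The slide gives no formula, so this is one plausible reading; all probabilities below are hypothetical toy values.

```python
import math


def zero_order_score(label, topic, background, floor=1e-6):
    """Sum over label words of log(p(w | topic) / p(w)).

    Words missing from a distribution get a small floor probability.
    """
    return sum(
        math.log(topic.get(w, floor) / background.get(w, floor))
        for w in label.split()
    )


# Hypothetical topic and background (collection-wide) word probabilities.
topic = {"clustering": 0.30, "dimensional": 0.20, "algorithm": 0.20,
         "birch": 0.10, "shape": 0.05}
background = {"clustering": 0.01, "algorithm": 0.02,
              "body": 0.01, "shape": 0.01}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
print(good, bad)
```

“body shape” is penalised because “body” is essentially absent from the topic, even though “shape” is one of its top words.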

Labeling a Topic: Semantic Relevance (cont.)
First-order: use phrases with similar context
  Topic θ, with word distribution P(w|θ): clustering, dimension, partition, algorithm, hash, …
  Context: SIGMOD Proceedings
  Compare P(w|θ) with the label's context distribution P(w|l) using D(θ||l)
  Good label: “clustering algorithm”
  Bad label: “hash join”
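First-order relevance can be sketched as a divergence between the topic's word distribution and the distribution of words that occur around a candidate label. The sketch assumes D is the KL divergence, where a smaller value means a better label; the context distributions are hypothetical toy values rather than estimates from SIGMOD Proceedings.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) = sum_w p(w) * log(p(w) / q(w)), with q floored at eps."""
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), eps))
        for w, pw in p.items() if pw > 0
    )


# Hypothetical topic distribution P(w | theta).
topic = {"clustering": 0.40, "dimension": 0.20, "partition": 0.20,
         "algorithm": 0.15, "hash": 0.05}
# Hypothetical context distributions P(w | l) for two candidate labels.
context_clustering_algorithm = {"clustering": 0.35, "dimension": 0.20,
                                "partition": 0.20, "algorithm": 0.20,
                                "hash": 0.05}
context_hash_join = {"clustering": 0.05, "dimension": 0.05,
                     "partition": 0.20, "algorithm": 0.10, "hash": 0.60}

d_good = kl_divergence(topic, context_clustering_algorithm)
d_bad = kl_divergence(topic, context_hash_join)
print(d_good, d_bad)
```

Unlike the zero-order view, this can reward a label whose own words are not top topic words, as long as it appears in contexts that look like the topic.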

Labeling a Topic (results)
Word distribution:
  female          0.0892427
  females         0.0856834
  male            0.0854142
  males           0.0812643
  sex             0.0577668
  reproductive    0.0214618
  ratio           0.0142873
  alleles         0.0133912
  diploid         0.0125172
  offspring       0.0120271
  sexes           0.0116374
  investment      0.0115359
  mating          0.00902159
  number          0.00823397
  success         0.00785498
  sexual          0.00751456
  determination   0.00663546
  size            0.00633002
Labels:
  sex ratio           (2.49468)   (32)
  male female         (2.29508)   (51)
  sex determination   (2.16534)   (21)
  female flowers      (1.83686)   (23)
  sex alleles         (1.79415)   (16)
  multiple mating     (1.72684)   (19)

Labeling a Topic (results cont.)
Word distribution:
  hormone        0.0536175
  jh             0.0518038
  juvenile       0.0466941
  development    0.0387031
  larval         0.0276814
  hemolymph      0.0216493
  pupal          0.0189934
  stage          0.0188286
  glands         0.0173832
  larvae         0.0169996
  adult          0.0154695
  instar         0.0149492
  haemolymph     0.0140053
  vitellogenin   0.0131076
  caste          0.0124822
  protein        0.0116558
  glucose        0.0112673
  corpora        0.0105111
Labels:
  juvenile hormone   2.44992   117
  hormone jh         1.58432   49
  larval instar      1.53676   20
  worker larvae      1.52398   51
  corpora allata     1.50391   34

Labeling a Topic (results)
Prior:
  0 forage     0.1
  0 foraging   0.1
  0 food       0.1
  0 source     0.1
Word distribution:
  foraging      0.0498044
  food          0.0472535
  foragers      0.0310718
  dance         0.0266078
  source        0.0254369
  nectar        0.0162739
  distance      0.0141869
  forage        0.0141503
  information   0.0129047
  dances        0.012684
  hive          0.0124987
  landmarks     0.0119087
  dancing       0.0109375
  waggle        0.0101672
  feeder        0.0101266
  rate          0.0085641
  recruitment   0.00813717
  forager       0.00796914
Labels:
  food source       -6.72378   107
  nectar foraging   -7.11784   28
  nectar foragers   -7.58965   47
  nectar source     -7.78975   16
  food sources      -7.8487    72
  waggle dance      -8.21514   31

Labeling a Topic (full results)
100 topics from biosis-bee (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic-l.html
100 topics from biosis-fly-genetics (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/fly-100-l.html

Context Switching
Utilize topic extraction for concept switching (two possible ways):
  Label the same topic model with phrases in another context
  Use the topic model from context A as a prior to extract topics from context B
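The second option can be sketched as turning a topic extracted in context A into a prior language model for context B. Keeping only the top-n words and renormalising is an assumption made here for illustration, not something the slide specifies; the function name topic_to_prior is hypothetical, and the word probabilities are taken from the bee foraging topic in this deck.

```python
def topic_to_prior(topic, top_n=10):
    """Keep the top_n most probable words of a topic and renormalise them."""
    top = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}


# First few words of the foraging topic extracted from the bee collection.
bee_topic = {"foraging": 0.142473, "foragers": 0.0582921,
             "forage": 0.0557498, "food": 0.0393453, "nectar": 0.03217}

prior_for_fly = topic_to_prior(bee_topic, top_n=3)
print(prior_for_fly)
```

The resulting dictionary has the same shape as the hand-written priors used earlier (word followed by weight), so it could be fed to a prior-aware extraction run on the second collection.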

Word distribution:
  foraging      0.142473
  foragers      0.0582921
  forage        0.0557498
  food          0.0393453
  nectar        0.03217
  colony        0.019416
  source        0.0153349
  hive          0.0151726
  dance         0.013336
  forager       0.0127668
  information   0.0117961
  feeder        0.010944
  rate          0.0104752
  recruitment   0.00870751
  individual    0.0086414
  reward        0.00810706
  flower        0.00800705
  dancing       0.00794827
  behavior      0.00789228
Labels with bee context:
  foraging trip         2.31174   21
  nectar foragers       2.23428   47
  tremble dance         2.21407   10
  returning foragers    2.18954   16
  food sources          2.14453   72
  food source           2.13647   107
  foraging strategy     2.101     14
  individual foraging   2.08334   16
  waggle dance          2.07836   31
Labels with fly context:
  foraging behavior       2.45263   27
  age related             2.29676   20
  drosophila larvae       2.15361   67
  feeding rate            1.99218   17
  apis mellifera          1.9847    23
  diptera drosophilidae   1.9       25

First word distribution:
  foraging      0.142473
  foragers      0.0582921
  forage        0.0557498
  food          0.0393453
  nectar        0.03217
  colony        0.019416
  source        0.0153349
  hive          0.0151726
  dance         0.013336
  forager       0.0127668
  information   0.0117961
  feeder        0.010944
  rate          0.0104752
  recruitment   0.00870751
  individual    0.0086414
  reward        0.00810706
  flower        0.00800705
  dancing       0.00794827
  behavior      0.00789228
Second word distribution:
  foraging      0.290076
  nectar        0.114508
  food          0.106655
  forage        0.0734919
  colony        0.0660329
  pollen        0.0427706
  flower        0.0400582
  sucrose       0.0334728
  source        0.0319787
  behavior      0.0283774
  individual    0.028029
  rate          0.0242806
  recruitment   0.0200597
  time          0.0197362
  reward        0.0196271
  task          0.0182461
  sitter        0.00604067
  rover         0.00582791
  rovers        0.00306051

Speed of topic extraction
  # documents   # themes   Running time
  500           5          8.3 s
                10         10.6 s
  1000                     17.6 s
  10k           30         350 s
  16k           150        4000 s

Questions? Thanks!