Published by Hugo Newman. Modified over 9 years ago.
1
Concept Clustering, Summarization and Annotation Qiaozhu Mei
2
Outline
Theme extraction
Theme summarization
Concept clustering
Entity annotation
3
Theme extraction
Motivation: Extract subtopics/themes from a collection.
Input: A collection of documents, with an index of terms/phrases.
Output: A set of word distributions, each represented by its highest-probability words.
Future direction: Accept all kinds of priors. Between "know nothing" and "know a lot", we usually know something, and that knowledge comes as different types of information.
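The output format described on this slide can be illustrated with a small Python sketch. It does not implement theme extraction itself (the slides do not name the model); the maximum-likelihood unigram estimate in `unigram_distribution` is only a stand-in that produces a word distribution, and both function names and the toy tokens are assumptions made for illustration.

```python
from collections import Counter

def unigram_distribution(tokens):
    """Stand-in for a theme: maximum-likelihood word distribution over tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def top_words(distribution, k=3):
    """Represent a theme by its k highest-probability words (ties broken alphabetically)."""
    return sorted(distribution.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Toy tokens, invented for this example.
theme = unigram_distribution("venom melittin venom peptide venom melittin toxin".split())
for word, prob in top_words(theme, k=2):
    print(f"{word}\t{prob:.3f}")
```

A real theme would come from a fitted model over the whole collection; only the "distribution in, top words out" shape is taken from the slide.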
4
Theme summarization
Motivation: The themes produced by extraction are not easy to interpret. Use phrases to represent a theme (use k phrases to summarize a theme).
Input: A text collection and a set of themes.
Output: A ranked list of phrases for each theme.
Future direction: Automatically generated phrases vs. a parser (evaluation).
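One way to picture "a ranked list of phrases for each theme" is the sketch below. The scoring rule (average log-probability of a phrase's words under the theme, with a small floor for unseen words) is an assumption chosen for illustration; the slides do not say how phrases are scored. The names `score_phrase` and `summarize_theme` and the toy data are hypothetical.

```python
import math

def score_phrase(phrase, theme, floor=1e-6):
    """Assumed score: mean log-probability of the phrase's words under the theme."""
    words = phrase.split()
    return sum(math.log(theme.get(w, floor)) for w in words) / len(words)

def summarize_theme(theme, candidates, k=2):
    """Return the k candidate phrases that score highest for this theme."""
    return sorted(candidates, key=lambda p: score_phrase(p, theme), reverse=True)[:k]

# Toy theme (word -> probability) and candidate phrases, invented for this example.
theme = {"venom": 0.4, "melittin": 0.3, "peptide": 0.2, "analysis": 0.1}
candidates = ["venom peptide", "melittin", "sequence analysis", "venom"]
print(summarize_theme(theme, candidates, k=2))
```

Averaging over words keeps long phrases from being penalized purely for their length, which is one design choice among several plausible ones.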
5
Concept clustering
Motivation: Group semantically replaceable/similar terms into tight semantic clusters (tight concepts), e.g. synonyms.
Input: A collection of documents and a list of terms.
Output: A set of tight clusters.
Future direction: Apply heuristics to speed up clustering without degrading performance (evaluation).
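A minimal sketch of how tight clusters could be formed, assuming each term is described by a sparse context vector and clusters are merged bottom-up by average cosine similarity until no pair exceeds a tightness threshold. The slides do not state the actual algorithm, so the agglomerative approach, the threshold value, and the names `cosine` and `tight_clusters` are assumptions. The tree strings use the same nested-parenthesis style as the results slides.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tight_clusters(vectors, threshold=0.8):
    """Average-link agglomerative clustering that stops below the threshold."""
    # Each cluster is (member terms, nested-parenthesis tree string).
    clusters = [([term], f"({term})") for term in vectors]

    def linkage(a, b):
        sims = [cosine(vectors[x], vectors[y]) for x in a[0] for y in b[0]]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) < threshold:
            break  # remaining clusters are not similar enough to merge
        merged = (clusters[i][0] + clusters[j][0],
                  f"({clusters[i][1]} {clusters[j][1]})")
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    # Singletons are not reported as clusters.
    return [c for c in clusters if len(c[0]) > 1]

# Toy context vectors, invented for this example.
vectors = {
    "sequence": {"dna": 2.0, "gene": 1.0},
    "sequences": {"dna": 4.0, "gene": 2.1},
    "venom": {"sting": 3.0},
}
for members, tree in tight_clusters(vectors):
    print(" ".join(members), tree)
```

This exhaustive pairwise search is quadratic per merge step, which is the kind of cost the "apply heuristics to speed up" future direction would target.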
6
Concept clustering: results
GNAME#glutathione GNAME#COII GNAME#PEC ((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC)))
GNAME#GST GNAME#IgG4#ap GNAME#COX#2 (((GNAME#GST) (GNAME#IgG4#ap)) (GNAME#COX#2))
GNAME#alpha#bungarotoxin GNAME#D2 ((GNAME#alpha#bungarotoxin) (GNAME#D2))
GNAME#mrjp1 GNAME#apamin GNAME#E1A (((GNAME#mrjp1) (GNAME#apamin)) (GNAME#E1A))
GNAME#Apis GNAME#ribosomal ((GNAME#Apis) (GNAME#ribosomal))
GNAME#alpha#glucosidases GNAME#somatostatin GNAME#G (((GNAME#alpha#glucosidases) (GNAME#somatostatin)) (GNAME#G))
GNAME#alpha#glucosidase GNAME##alpha##glucosidase ((GNAME#alpha#glucosidase) (GNAME##alpha##glucosidase))
GNAME#16S GNAME#mammalian ((GNAME#16S) (GNAME#mammalian))
GNAME#signal GNAME#mocambique ((GNAME#signal) (GNAME#mocambique))
GNAME#sequence GNAME#sequences ((GNAME#sequence) (GNAME#sequences))
GNAME#D GNAME#E ((GNAME#D) (GNAME#E))
GNAME#of GNAME#from ((GNAME#of) (GNAME#from))
GNAME#mellifera GNAME#specific GNAME#venom#specific (((GNAME#mellifera) (GNAME#specific)) (GNAME#venom#specific))
7
Concept clustering: results (II)
GNAME#nicotinic GNAME#ER ((GNAME#nicotinic) (GNAME#ER))
GNAME#acetylcholine GNAME#green ((GNAME#acetylcholine) (GNAME#green))
GNAME#EFB GNAME#gp120 GNAME#Penncap#M ((GNAME#EFB) ((GNAME#gp120) (GNAME#Penncap#M)))
GNAME#F#actin GNAME#mtDNA ((GNAME#F#actin) (GNAME#mtDNA))
GNAME#tubulin GNAME#mAb ((GNAME#tubulin) (GNAME#mAb))
GNAME#hemolymph GNAME#precursor ((GNAME#hemolymph) (GNAME#precursor))
GNAME#domain GNAME#element ((GNAME#domain) (GNAME#element))
GNAME#Melittin GNAME#mugml#1 GNAME#venom (((GNAME#Melittin) (GNAME#mugml#1)) (GNAME#venom))
GNAME#diastase GNAME#invertase GNAME#CAT ((GNAME#diastase) ((GNAME#invertase) (GNAME#CAT)))
GNAME#peroxidase GNAME#catalase ((GNAME#peroxidase) (GNAME#catalase))
GNAME#Vg GNAME#PKG ((GNAME#Vg) (GNAME#PKG))
GNAME#GABA GNAME#dopamine ((GNAME#GABA) (GNAME#dopamine))
GNAME#TPN GNAME#AMCI#1 GNAME#RJ GNAME#SRs (((GNAME#TPN) (GNAME#AMCI#1)) ((GNAME#RJ) (GNAME#SRs)))
GNAME#nuclear GNAME#CA ((GNAME#nuclear) (GNAME#CA))
GNAME#synthase GNAME#neuron ((GNAME#synthase) (GNAME#neuron))
8
Concept clustering: results (III)
GNAME#immunoglobulin GNAME#DraI GNAME#IgM GNAME#AluI (((GNAME#immunoglobulin) (GNAME#DraI)) ((GNAME#IgM) (GNAME#AluI)))
GNAME#Ig GNAME#TNF#beta ((GNAME#Ig) (GNAME#TNF#beta))
GNAME#neurons GNAME#OBPs ((GNAME#neurons) (GNAME#OBPs))
GNAME#Mdh#1 GNAME#Mdh GNAME#NF#kappaB (((GNAME#Mdh#1) (GNAME#Mdh)) (GNAME#NF#kappaB))
GNAME#MRJP1 GNAME#HGL ((GNAME#MRJP1) (GNAME#HGL))
GNAME#promoter GNAME#enzyme ((GNAME#promoter) (GNAME#enzyme))
GNAME#mitochondrial GNAME#homeobox ((GNAME#mitochondrial) (GNAME#homeobox))
GNAME#AncR#1 GNAME#Nasonov GNAME#Sax1 (((GNAME#AncR#1) (GNAME#Nasonov)) (GNAME#Sax1))
GNAME#transcripts GNAME#isozymes ((GNAME#transcripts) (GNAME#isozymes))
GNAME#glutamate GNAME#malate ((GNAME#glutamate) (GNAME#malate))
GNAME#collagen GNAME#IL#1beta GNAME#IL#4 ((GNAME#collagen) ((GNAME#IL#1beta) (GNAME#IL#4)))
GNAME#binding GNAME#histone ((GNAME#binding) (GNAME#histone))
GNAME#system GNAME#gC GNAME#OBP (((GNAME#system) (GNAME#gC)) (GNAME#OBP))
GNAME#calmodulin GNAME#PhTX GNAME#deltamethrin (((GNAME#calmodulin) (GNAME#PhTX)) (GNAME#deltamethrin))
GNAME#amylase GNAME#sucrase ((GNAME#amylase) (GNAME#sucrase))
GNAME#TNF#alpha GNAME#IgG#ap GNAME#D1 (((GNAME#TNF#alpha) (GNAME#IgG#ap)) (GNAME#D1))
GNAME#A2 GNAME#A#2 ((GNAME#A2) (GNAME#A#2))
GNAME#IFN#gamma GNAME#DTX ((GNAME#IFN#gamma) (GNAME#DTX))
GNAME#MRJP3 GNAME#Mblk#1 ((GNAME#MRJP3) (GNAME#Mblk#1))
GNAME#antigen GNAME#alleles ((GNAME#antigen) (GNAME#alleles))
9
Concept clustering: results (IV)
GNAME#bovine GNAME#aflatoxin ((GNAME#bovine) (GNAME#aflatoxin))
GNAME#albumin GNAME#tryptase ((GNAME#albumin) (GNAME#tryptase))
GNAME#4 GNAME#2 ((GNAME#4) (GNAME#2))
GNAME#region GNAME#site ((GNAME#region) (GNAME#site))
GNAME#AHB GNAME#hexokinase GNAME#rhodopsin (((GNAME#AHB) (GNAME#hexokinase)) (GNAME#rhodopsin))
GNAME#PI GNAME#P1 ((GNAME#PI) (GNAME#P1))
GNAME#pollen GNAME#plants ((GNAME#pollen) (GNAME#plants))
GNAME#lipase GNAME#LDH ((GNAME#lipase) (GNAME#LDH))
GNAME#AL GNAME#SCT GNAME#COI#COII ((GNAME#AL) ((GNAME#SCT) (GNAME#COI#COII)))
GNAME#chymotrypsin GNAME#CAP GNAME#NGF (((GNAME#chymotrypsin) (GNAME#CAP)) (GNAME#NGF))
GNAME#PLA GNAME#trehalase ((GNAME#PLA) (GNAME#trehalase))
GNAME#IgG1 GNAME#IgG4 ((GNAME#IgG1) (GNAME#IgG4))
GNAME#inhibitor GNAME#Phospholipase ((GNAME#inhibitor) (GNAME#Phospholipase))
GNAME##s GNAME#P ((GNAME##s) (GNAME#P))
10
Concept clustering: results (V)
GNAME#restriction GNAME#Z ((GNAME#restriction) (GNAME#Z))
GNAME#PER GNAME#RAST ((GNAME#PER) (GNAME#RAST))
GNAME#PLA2s GNAME#EC ((GNAME#PLA2s) (GNAME#EC))
GNAME#beta#glucosidase GNAME#GIF ((GNAME#beta#glucosidase) (GNAME#GIF))
GNAME#ASP1 GNAME#ASP2 ((GNAME#ASP1) (GNAME#ASP2))
GNAME#PKC GNAME#elastase GNAME#Permethrin ((GNAME#PKC) ((GNAME#elastase) (GNAME#Permethrin)))
GNAME#MLT GNAME#JH#III ((GNAME#MLT) (GNAME#JH#III))
GNAME#RyR GNAME#MHC ((GNAME#RyR) (GNAME#MHC))
GNAME#filaments GNAME#filament ((GNAME#filaments) (GNAME#filament))
GNAME#F1 GNAME#F#1 ((GNAME#F1) (GNAME#F#1))
GNAME#TPNQ GNAME#EEP GNAME#MDH#1 ((GNAME#TPNQ) ((GNAME#EEP) (GNAME#MDH#1)))
GNAME#c GNAME#b5 ((GNAME#c) (GNAME#b5))
GNAME#scFv GNAME#Dfd ((GNAME#scFv) (GNAME#Dfd))
GNAME#h2 GNAME#HMAP GNAME#ACh (((GNAME#h2) (GNAME#HMAP)) (GNAME#ACh))
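Each result entry on these slides lists the member terms followed by a nested-parenthesis tree that appears to record the order in which the members were grouped. The sketch below reads one such tree into nested Python lists and recovers its leaves. The function names `parse_tree` and `leaves` are hypothetical, and the reading of the notation as a merge tree is an interpretation of the slides, not something they state.

```python
def parse_tree(text):
    """Parse '((a) ((b) (c)))' into nested lists; a leaf is a plain string."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] != "(":
            raise ValueError(f"expected '(' at offset {pos}")
        pos += 1
        children = []
        while text[pos] != ")":
            if text[pos] == "(":
                children.append(node())
            elif text[pos] == " ":
                pos += 1
            else:
                start = pos
                while text[pos] not in "() ":
                    pos += 1
                children.append(text[start:pos])
        pos += 1  # consume the closing parenthesis
        # '(term)' is a leaf; anything else is an internal node.
        if len(children) == 1 and isinstance(children[0], str):
            return children[0]
        return children

    return node()

def leaves(tree):
    """Flatten a parsed tree into its member terms, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

tree = parse_tree("((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC)))")
print(tree)
print(leaves(tree))
```

Comparing `leaves(tree)` with the member list printed before each tree is a quick consistency check on an entry.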
11
Entity Annotation
Motivation: Annotate an entity (term, biological entity, concept, etc.) with different types of structured information. Generate a dictionary-like entry for each entity.
Input: A text collection and an index of sentences.
Output: A dictionary-like annotation entry for each entity.
Future direction: Tune each component of the annotator.
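The shape of a dictionary-like entry can be sketched as follows, assuming a sentence index that maps sentence IDs to token lists. Related terms are ranked here by plain sentence-level co-occurrence counts, which is a stand-in: the slides do not describe how their related-term or sentence scores are computed. The function name `annotate`, the entry keys, and the toy index are all hypothetical.

```python
from collections import Counter

def annotate(entity, sentences, k=3):
    """Build a dictionary-like entry for an entity from a sentence index."""
    containing = {sid: toks for sid, toks in sentences.items() if entity in toks}
    # Count each co-occurring term once per sentence (assumed relatedness signal).
    related = Counter(t for toks in containing.values() for t in set(toks) if t != entity)
    return {
        "entity": entity,
        "sentence_count": len(containing),
        "related_terms": sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))[:k],
        "example_sentences": sorted(containing),
    }

# Toy sentence index (sentence ID -> tokens), invented for this example.
sentences = {
    1: ["GNAME#Mdh#1", "GNAME#locus", "linkage"],
    2: ["GNAME#Mdh#1", "GNAME#Hk#1", "GNAME#locus"],
    3: ["GNAME#venom", "GNAME#Melittin"],
}
entry = annotate("GNAME#Mdh#1", sentences, k=2)
for key, value in entry.items():
    print(f"{key}: {value}")
```

Each field is produced by a separate step, so an individual component (related terms, example sentences) can be swapped or tuned without touching the others.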
12
Entity Annotation: results
GNAME#Mdh#1 11
Related terms:
GNAME#Hk#1 0.000612038
GNAME#locus 0.000449124
GNAME#Est#6 0.000424242
GNAME#Pgm#1 0.000291602
GNAME#Est#1a 0.000291602
ligustica 0.000288879
GNAME#Est#5 0.000265296
linkage 0.000191466
spinula 0.00017993
characterize 0.000174218
GNAME#dehydrogenase 0.000172905
Segregational 0.000160911
Aegean 0.000160911
GNAME#Adh#1 0.000160911
Marginal 0.000160911
Liguria 0.000160911
GNAME#Mdh#1A 0.000160911
13
Example Sentences:
12182 0.207504 : Segregational analyses demonstrated the absence of close linkage between Lap-D and GNAME#Est#1a, GNAME#Est#2, GNAME#Est#5, GNAME#Est#6, GNAME#Mdh#1, GNAME#Hk#1 and GNAME#Pgm#1 GNAME#loci GNAME#of GNAME#Apis GNAME#mellifera.
19949 0.203663 : Genetic linkage studies showed no close linkage between the GNAME#Est#1a GNAME#locus and the genetic markers GNAME#Est#6, GNAME#Mdh#1 and GNAME#Hk#1.
30357 0.176708 : The tests were conducted primarily with biochemical markers ( GNAME#Adh#1, GNAME#Est#1, GNAME#Est#3, GNAME#Est#5, GNAME#Est#6, GNAME#Hk#1, GNAME#Mdh#1, and GNAME#Pgm#1 ) ; the morphological mutation cordovan ( cd ) is also included.
48736 0.16039 : Marginal populations of A. m. ligustica differ from the central populations of this subspecies in allele frequencies at the GNAME#Mdh#1 GNAME#locus.
45078 0.152925 : Electrophoretic analysis of the GNAME#MDH GNAME#[ GNAME#malate GNAME#dehydrogenase GNAME#] GNAME#enzyme GNAME#system demonstrated that honeybee populations of eastern Liguria belong to A. m. ligustica spinula, while, in the Western populations, the frequency of the GNAME#Mdh#1 GNAME#M GNAME#allele, which is characteristic of French A. m. mellifera L., linearly increases toward the French boundary.
14
Entity Annotation: results (II)
Semantically similar entities:
GNAME#Mdh#1 11 1
GNAME#Hk#1 5 0.94811
GNAME#Est#6 6 0.932399
GNAME#Pgm#1 4 0.922537
GNAME#Est#1a 4 0.922091
GNAME#Adh#1 2 0.913576
GNAME#Mdh#1A 2 0.906424
GNAME#Mdh#1B 2 0.906424
GNAME#M 3 0.899708
GNAME#Lap 1 0.898897
GNAME#Est#1 5 0.898051
GNAME#PGM2 2 0.897837
GNAME#aldehyde 1 0.897415
GNAME#Cypermethrin 1 0.897026
GNAME#ACP1 1 0.896832
GNAME#EstIV 1 0.896831
GNAME#MdhIII 1 0.896831
GNAME#Est#2s 1 0.896803
GNAME#aminopeptidases 1 0.896719
GNAME#Mdh#1C 1 0.896674
15
Future Plan
Summer: With Microsoft Research. Will help Xu integrate synonym extraction into gene summarization.
After summer: Work on the future directions listed for each module.
Two general functionalities: (1) theme extraction, summarization, and theme pattern analysis; (2) synonym extraction.