1
Deriving concept hierarchies from text
Mark Sanderson, University of Sheffield
Bruce Croft, CIIR, University of Massachusetts
2
The question is...
What paper already presented at this SIGIR is most like the one you’re about to see?
We’ll have the answer, right after this!
3
Concept hierarchies from documents?
Hierarchy of concepts (Yahoo-like)
General down to specific
Child under one or more parents
No training data
Why?
Understandable
4
Current methods
Polythetic clustering
5
An alternative?
Monothetic clustering
Clusters based on a single feature
More ‘Yahoo/Dewey decimal’ like?
Easier to understand?
» Preferable to users?
What about hierarchies of clusters?
6
How to arrange cluster terms?
Existing techniques
WordNet
» earthquake, volcano (eruption?)
Key phrases (Hearst 1998)
» “such as”, “especially”
Phrase classification (Grefenstette 1997)
» NP head or modifier: “types of research” from “research things”
Hierarchical phrase analysis (Woods 1997)
» Head modifier again: “car washing” under “washing”, not “car”
7
WordNet (aside)
1 sense of earthquake
Sense 1
earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
» geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth)
» natural phenomenon, nature -- (all non-artificial phenomena)
» phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)
8
WordNet (aside)
5 senses of eruption
Sense 1
volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material)
» discharge -- (the sudden giving off of energy)
» happening, occurrence, natural event -- (an event that happens)
» event -- (something that happens at a given place and time)
9
Start with something simpler?
Term clustering?
Simple monothetic clusters
No ordering.
10
Use subsumption
Initially using subsumption.
Finds related terms
Decides which is more general, which is more specific (idf?)
Strict interpretation
» X s Y (X subsumes Y) iff P(X|Y) = 1, P(Y|X) < 1
In practice
» X s Y iff P(X|Y) > 0.8, P(Y|X) < 1
» P(X|Y) > 0.8, P(Y|X) < P(X|Y)
(Two diagrams labelled x and y not reproduced)
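The relaxed rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the probabilities are estimated from document co-occurrence, so that P(x|y) is the fraction of documents containing y that also contain x. The function name `subsumes` and the set-of-document-ids representation are hypothetical.

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """Return True if term x subsumes term y under the relaxed rule.

    docs_x and docs_y are sets of ids of the documents containing each term.
    Assumption: P(x|y) = |docs_x & docs_y| / |docs_y|, and likewise for P(y|x).
    """
    if not docs_x or not docs_y:
        return False  # a term that occurs nowhere cannot take part
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)
    p_y_given_x = both / len(docs_x)
    # Relaxed rule from the slide: P(x|y) > 0.8 and P(y|x) < 1
    return p_x_given_y > threshold and p_y_given_x < 1


# x occurs in every document that y occurs in, plus two more,
# so x is the more general term.
general = {1, 2, 3, 4, 5}
specific = {1, 2, 3}
print(subsumes(general, specific))
print(subsumes(specific, general))
```

Setting `threshold` to a value just below 1 approximates the strict interpretation; the slide's second relaxed variant would replace the `< 1` test with `p_y_given_x < p_x_given_y`.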
11
How to build a “hierarchy”
X s Y, X s Z, X s M, X s N, Y s Z
A s B, A s Z, B s Z
(Diagram of nodes X, Y, Z, M, N, A, B not reproduced)
Really it’s a DAG
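To see why the result is a DAG rather than a tree, the subsumption pairs listed on this slide can be loaded into parent and child maps. The sketch below is illustrative only; `build_dag` and `roots` are hypothetical helper names, and each pair is read as (parent, child).

```python
from collections import defaultdict

# The subsumption pairs from the slide, read as (parent, child).
PAIRS = [
    ("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"), ("Y", "Z"),
    ("A", "B"), ("A", "Z"), ("B", "Z"),
]


def build_dag(pairs):
    """Return (children, parents) maps for a list of (parent, child) pairs."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)
        parents[child].add(parent)
    return children, parents


def roots(children, parents):
    """Terms that subsume others but are not themselves subsumed."""
    return sorted(term for term in children if not parents.get(term))


children, parents = build_dag(PAIRS)
print(roots(children, parents))   # the top-level terms
print(sorted(parents["Z"]))       # Z has several parents, so this is not a tree
```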
12
How to display it?
DAGs were big
Unlikely to get all on screen
Only want to see current focus plus the route taken to get there?
Use a method users are familiar with
Hierarchical menus
(Menu diagram not reproduced; its labels are X, Y, Z, M, N, A, B, Z)
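One way to read the menu idea is to flatten the DAG into nested menus by repeating a shared child under each of its parents. The slide's diagram labels list Z twice, which is consistent with this reading, but the slide does not spell out the rule, so the sketch below is an assumption. `CHILDREN` and `render_menu` are hypothetical names, and the data reuses the example pairs from the previous slide.

```python
# Child map for the example DAG (parent -> set of children).
CHILDREN = {
    "X": {"Y", "Z", "M", "N"},
    "Y": {"Z"},
    "A": {"B", "Z"},
    "B": {"Z"},
}


def render_menu(children, node, depth=0):
    """Return indented menu lines for node and everything beneath it.

    A term with several parents is repeated under each of them.
    The recursion terminates because the structure is acyclic.
    """
    lines = ["  " * depth + node]
    for child in sorted(children.get(node, ())):
        lines.extend(render_menu(children, child, depth + 1))
    return lines


for top in ("X", "A"):
    print("\n".join(render_menu(CHILDREN, top)))
```

Repeating nodes trades screen space for familiarity: each menu path is a plain tree walk, at the cost of showing Z more than once.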
13
What about ambiguity?
Monothetic clusters of ambiguous terms?
Derive hierarchy from retrieved documents
Take a query and retrieve on it, take top 500 documents, build hierarchy from them.
Topics/concepts are words/phrases taken from
» Query
» Retrieved documents
Comparison of frequencies
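The slide says only “Comparison of frequencies”, so the following is a guess at the general shape of that step, not the method used in the talk. It assumes a term is kept when its document frequency rate in the retrieved set is at least `min_ratio` times its rate in the whole collection. The function `select_terms`, its parameters, and the default threshold are all hypothetical.

```python
from collections import Counter


def select_terms(retrieved_docs, collection_df, collection_size, min_ratio=2.0):
    """Pick terms that are relatively more common in the retrieved set.

    retrieved_docs: list of token lists, one per retrieved document.
    collection_df: term -> number of collection documents containing it.
    collection_size: number of documents in the collection.
    min_ratio: assumed cut-off on (retrieved rate / collection rate).
    """
    retrieved_df = Counter()
    for doc in retrieved_docs:
        retrieved_df.update(set(doc))  # count each term once per document
    n = len(retrieved_docs)
    selected = []
    for term, df in retrieved_df.items():
        collection_rate = collection_df.get(term, 0) / collection_size
        if collection_rate == 0:
            continue  # no collection statistics for this term, skip it
        if (df / n) / collection_rate >= min_ratio:
            selected.append(term)
    return sorted(selected)


# Toy data, invented purely for illustration.
docs = [["polio", "vaccine"], ["polio", "the"], ["the", "virus"]]
df = {"polio": 10, "vaccine": 50, "the": 1000, "virus": 400}
print(select_terms(docs, df, collection_size=1000))
```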
14
Poliomyelitis and Post-Polio (TREC topic 302)
(Slides 14 to 34 all carry this title; their remaining content is not in this transcript.)
35
Did you guess the paper?
Bit like Peter Anick’s work?
36
Experiment
Test properties of hierarchy
Does it mimic (in some way) Yahoo-like categories?
Parent related to child?
Parent more general than child?
37
Experimental set-up
Gathered eight subjects
Presented subsumption categories and ‘random’ categories.
Asked if each parent/child pair is ‘interesting’.
» If yes, then what type of relationship (roughly, from WordNet)?
» Aspect of
» Type of
» Same as
» Opposite of
» Don’t know
38
Results
Question of parent/child pairing ‘interesting’ or not
» Random, 51%
» Subsumption, 67%
Difference significant from t-test, p<0.002
If interesting, what is parent/child type?
Odd?
39
Yahoo categories?
40
Results and conclusions
Interesting AND (aspect of OR type of)
» Random, 28% (51% * (47% + 8%))
» Subsumption, 48% (67% * (49% + 23%))
It appears that subsumption and an ordering based on document frequency do a reasonable job.
For term frequency work, see:
» Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21
» Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
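The combined percentages on this slide follow directly from the figures in brackets: the share of pairs judged interesting, multiplied by the share of those labelled “aspect of” or “type of”. A short check of that arithmetic (the function name `interesting_and_typed` is only illustrative):

```python
def interesting_and_typed(p_interesting, p_aspect_of, p_type_of):
    """Share of all pairs judged interesting AND (aspect of OR type of)."""
    return p_interesting * (p_aspect_of + p_type_of)


random_rate = interesting_and_typed(0.51, 0.47, 0.08)
subsumption_rate = interesting_and_typed(0.67, 0.49, 0.23)

print(f"Random:      {random_rate:.0%}")       # 0.51 * 0.55 = 0.2805
print(f"Subsumption: {subsumption_rate:.0%}")  # 0.67 * 0.72 = 0.4824
```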
41
Future work?
More user studies.
Incorporate other term relationship techniques
Other visualisations
Application of techniques to whole document collections.
Presentation of Cross Language IR results?