
1  Deriving concept hierarchies from text
Mark Sanderson, Bruce Croft
University of Sheffield; CIIR, University of Massachusetts

2  The question is...
• What paper already presented at this SIGIR is most like the one you’re about to see?
• We’ll have the answer, right after this!

3  Concept hierarchies from documents?
• Hierarchy of concepts, like Yahoo’s
  – General down to specific
  – Child under one or more parents
• No training data
• Why?
  – Understandable

4  Current methods
• Polythetic clustering

5  An alternative?
• Monothetic clustering
  – Clusters based on a single feature
  – More ‘Yahoo/Dewey decimal’ like?
  – Easier to understand?
    » Preferable to users?
  – What about hierarchies of clusters?

6  How to arrange cluster terms?
• Existing techniques
  – WordNet
    » earthquake, volcano (eruption?)
  – Key phrases (Hearst 1998)
    » “such as”, “especially”
  – Phrase classification (Grefenstette 1997)
    » NP head or modifier: “types of research” from “research things”
  – Hierarchical phrase analysis (Woods 1997)
    » Head modifier again: “car washing” under “washing”, not “car”

7  WordNet (aside)
• 1 sense of earthquake, sense 1
  – earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
    » geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth)
    » natural phenomenon, nature -- (all non-artificial phenomena)
    » phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)

8  WordNet (aside)
• 5 senses of eruption, sense 1
  – volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material)
    » discharge -- (the sudden giving off of energy)
    » happening, occurrence, natural event -- (an event that happens)
    » event -- (something that happens at a given place and time)

9  Start with something simpler?
• Term clustering?
  – Simple monothetic clusters
  – No ordering.

10  Use subsumption
• Initially using subsumption.
  – Finds related terms
  – Decides which is more general, which is more specific (idf?)
• Strict interpretation
  – X s Y (X subsumes Y) iff P(x|y) = 1, P(y|x) < 1
• In practice
  – X s Y iff P(x|y) > 0.8, P(y|x) < 1
  – P(x|y) > 0.8, P(y|x) < P(x|y)
[Diagrams of x and y not reproduced]
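The subsumption test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes P(x|y) is estimated from document co-occurrence (documents containing both terms divided by documents containing y), and it uses the second "in practice" condition from the slide. The function name `subsumption_pairs` and the toy documents are hypothetical.

```python
from itertools import permutations


def subsumption_pairs(docs, threshold=0.8):
    """Return (x, y) pairs where term x subsumes term y.

    docs: iterable of sets of terms, one set per document.
    Assumption: P(x|y) = (# docs with x and y) / (# docs with y).
    """
    postings = {}  # term -> set of ids of the documents containing it
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            postings.setdefault(term, set()).add(doc_id)

    pairs = []
    for x, y in permutations(sorted(postings), 2):
        both = len(postings[x] & postings[y])
        p_x_given_y = both / len(postings[y])
        p_y_given_x = both / len(postings[x])
        # Condition from the slide: P(x|y) > 0.8 and P(y|x) < P(x|y).
        if p_x_given_y > threshold and p_y_given_x < p_x_given_y:
            pairs.append((x, y))
    return pairs


# Toy data, invented for illustration only.
docs = [
    {"polio", "vaccine"},
    {"polio", "vaccine"},
    {"polio", "vaccine", "salk"},
    {"polio"},
    {"polio", "paralysis"},
]
pairs = subsumption_pairs(docs)
print(pairs)
```

Here "polio" occurs in every document that mentions "vaccine" but not the reverse, so "polio" is treated as the more general term.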

11  How to build a “hierarchy”
• X s Y
• X s Z
• X s M
• X s N
• Y s Z
• A s B
• A s Z
• B s Z
[Diagram of nodes X, Y, Z, M, N, A, B not reproduced]
Really it’s a DAG
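As a rough sketch, the pairs listed on this slide can be turned into a parent-to-children graph as follows. The step that drops edges already implied by a longer path (for example X to Z, given X to Y and Y to Z) is an assumption added here for illustration; the slide itself does not state it. All function names are hypothetical.

```python
from collections import defaultdict

# Subsumption pairs from the slide, written as (parent, child).
PAIRS = [("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
         ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z")]


def build_dag(pairs):
    """Map each parent term to the set of terms it subsumes."""
    children = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)
    return children


def reachable(children, start):
    """All nodes reachable from start by following one or more edges."""
    seen, stack = set(), list(children.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def drop_redundant_edges(children):
    """Assumed extra step: remove parent->child edges implied by a longer path."""
    reduced = {}
    for parent, kids in children.items():
        implied = set()
        for kid in kids:
            implied |= reachable(children, kid)
        reduced[parent] = kids - implied
    return reduced


dag = build_dag(PAIRS)
reduced = drop_redundant_edges(dag)
for parent in sorted(reduced):
    print(parent, "->", sorted(reduced[parent]))
```

In the unreduced graph Z has four parents (X, Y, A and B), which is why the structure is a DAG rather than a tree.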

12  How to display it?
• DAGs were big
  – Unlikely to get all on screen
• Only want to see current focus plus the route taken to get there?
• Use a method users are familiar with
• Hierarchical menus
[Diagram of nodes X, Y, Z, M, N, A, B, Z not reproduced]
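One way to picture a DAG shown as hierarchical menus is to repeat a term under each of its parents. The sketch below assumes that reading; the adjacency lists reuse the X, Y, Z, M, N, A, B example, and the names `menu_lines` and `roots` are hypothetical.

```python
# Hypothetical adjacency lists for the example DAG (parent -> ordered children).
CHILDREN = {
    "X": ["Y", "Z", "M", "N"],
    "Y": ["Z"],
    "A": ["B", "Z"],
    "B": ["Z"],
}


def roots(children):
    """Terms that never appear as anyone's child become top-level menus."""
    all_children = {c for kids in children.values() for c in kids}
    return [node for node in children if node not in all_children]


def menu_lines(children, node, depth=0):
    """Indented lines for node and everything below it.

    A term with several parents is emitted once under each parent.
    """
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(menu_lines(children, child, depth + 1))
    return lines


menu = [line for root in roots(CHILDREN) for line in menu_lines(CHILDREN, root)]
print("\n".join(menu))
```

Duplicating a shared child keeps each menu a plain tree, at the cost of the same term showing up in several places.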

13  What about ambiguity?
• Monothetic clusters of ambiguous terms?
• Derive hierarchy from retrieved documents
  – Take a query and retrieve on it,
  – take top 500 documents,
  – build hierarchy from them.
• Topics/concepts are words/phrases taken from
  – Query
  – Retrieved documents
  – Comparison of frequencies
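The slide only says that candidate terms come from a "comparison of frequencies". As a hedged illustration of what such a comparison could look like, the sketch below keeps the query terms plus any term that is noticeably more common in the retrieved set than in the whole collection. The ratio test, the `min_ratio` value and every name in the code are assumptions, not details from the talk.

```python
from collections import Counter


def candidate_terms(retrieved_docs, collection_df, collection_size,
                    query_terms=(), min_ratio=2.0):
    """Select terms over-represented in the retrieved documents.

    retrieved_docs: list of term sets (e.g. the top-ranked documents).
    collection_df: term -> number of collection documents containing it.
    min_ratio: hypothetical cut-off on the retrieved/collection frequency ratio.
    """
    n_retrieved = len(retrieved_docs)
    retrieved_df = Counter(t for doc in retrieved_docs for t in set(doc))

    selected = set(query_terms)
    for term, df in retrieved_df.items():
        p_retrieved = df / n_retrieved
        p_collection = collection_df.get(term, 0) / collection_size
        # Terms with no collection statistics are kept by default.
        if p_collection == 0 or p_retrieved / p_collection >= min_ratio:
            selected.add(term)
    return selected


# Toy figures, invented for illustration only.
retrieved = [{"polio", "vaccine", "the"}, {"polio", "the"}]
collection_df = {"polio": 10, "vaccine": 50, "the": 1000}
terms = candidate_terms(retrieved, collection_df, collection_size=1000,
                        query_terms=("polio",))
print(sorted(terms))
```

A word such as "the" is as frequent in the collection as in the retrieved set, so it is filtered out, while topic-specific words survive.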

14-34  Poliomyelitis and Post-Polio, TREC topic 302
[Slides 14 to 34 all carry this title; the slide images are not reproduced in this transcript.]

35  Did you guess the paper?
• Bit like Peter Anick’s work?

36  Experiment
• Test properties of hierarchy
• Does it mimic (in some way) Yahoo-like categories?
  – Parent related to child?
  – Parent more general than child?

37  Experimental set-up
• Gathered eight subjects
  – Presented subsumption categories and ‘random’ categories.
  – Asked whether each parent-child pair was ‘interesting’.
    » If yes, then what type of relationship, (roughly) from WordNet:
    » Aspect of
    » Type of
    » Same as
    » Opposite of
    » Don’t know

38  Results
• Question of parent/child pairing ‘interesting’ or not
  – Random, 51%
  – Subsumption, 67%
  – Difference significant from t-test, p<0.002
• If interesting, what is parent/child type? Odd?

39  Yahoo categories?

40  Results and conclusions
• Interesting AND (aspect of OR type of)
  – Random, 28% (51% * (47% + 8%))
  – Subsumption, 48% (67% * (49% + 23%))
• Appears that subsumption and an ordering based on document frequency do a reasonable job.
  – For term frequency work, see:
    » Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21
    » Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
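The combined figures on this slide can be checked with a few lines of arithmetic. The sketch simply multiplies the share of pairs judged interesting by the summed shares of the two relationship types, using the percentages quoted above; the function name `combined` is hypothetical.

```python
def combined(interesting, share_a, share_b):
    """Share of all pairs that are interesting AND of either relationship type."""
    return interesting * (share_a + share_b)


random_rate = combined(0.51, 0.47, 0.08)
subsumption_rate = combined(0.67, 0.49, 0.23)
print(f"Random: {random_rate:.0%}, Subsumption: {subsumption_rate:.0%}")
```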

41  Future work?
• More user studies.
• Incorporate other term relationship techniques
• Other visualisations
• Application of techniques to whole document collections.
• Presentation of Cross Language IR results?

