Published by April Thompson. Modified over 9 years ago.
1
1 Constructing Folksonomies from User-Specified Relations on Flickr
Anon Plangprasopchok and Kristina Lerman
2
2 Motivation
[Diagram: users produce, consume, annotate, organize, and discover Web content; the resulting annotation/metadata can be used to organize, search, recommend, leverage, and categorize (hierarchical classification)]
3
3 Motivation
Goal: to induce category knowledge from social annotation produced by many users
- Metadata from an individual user may be too inaccurate and incomplete…
- The metadata from different users may complement each other, making it meaningful in combination.
4
4 Folksonomy
- Original definition: classification emerging from the use of tags by users (Thomas Vander Wal)
- In this work: hidden classification hierarchies from annotation created by many users
5
5 Hierarchical Relations in Social Web
- Appear implicitly: tags (e.g., Insect, Grasshopper, Australian, Macro, Orthoptera)
- Appear explicitly: relations from a folder (collection) to a sub folder (set)
Goal: to induce deeper hierarchies from this metadata
6
6 Outline
- Motivation
- Approaches
- Results
- Discussion
- Related work
7
7 Inducing Hierarchy from Tags
Existing approaches:
- Graph based (Mika05): build a network of associated tags (node = tag, edge = co-occurrence of tags); suggests applying betweenness centrality and set theory to determine broader/narrower relations
- Hierarchical clustering (Brooks06; Heymann06+): tags that appear more frequently have higher centrality and are thus treated as more abstract
- Probabilistic subsumption (Sanderson99+, Schmitz06): x is broader than y if x subsumes y; x subsumes y if p(x|y) > t and p(y|x) < t
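The subsumption rule on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation of the cited papers: the function name `subsumption_relations`, the input format (one tag list per annotated item), and the default threshold `t=0.8` are assumptions made here. Probabilities are estimated from tag co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

def subsumption_relations(tagged_items, t=0.8):
    """Return (broader, narrower) pairs where p(x|y) > t and p(y|x) < t.

    tagged_items: iterable of tag lists, one list per annotated item.
    t: threshold (0.8 is an assumed default, not taken from the slides).
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in tagged_items:
        unique = set(tags)
        tag_count.update(unique)
        # Sorted tuples give each unordered pair a single key.
        pair_count.update(combinations(sorted(unique), 2))

    relations = []
    for (x, y), n_xy in pair_count.items():
        # Test both directions of every co-occurring pair.
        for broad, narrow in ((x, y), (y, x)):
            p_broad_given_narrow = n_xy / tag_count[narrow]
            p_narrow_given_broad = n_xy / tag_count[broad]
            if p_broad_given_narrow > t and p_narrow_given_broad < t:
                relations.append((broad, narrow))
    return relations

items = [
    ["insect", "moth"],
    ["insect", "moth"],
    ["insect", "bee"],
    ["insect"],
    ["insect", "moth"],
]
induced = subsumption_relations(items)
```

In this toy data "insect" appears with every "moth" and every "bee", but not the other way round, so it is induced as the broader term of both.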
8
8 Inducing Hierarchy from Tags
Some difficulties when using tags to induce hierarchy:
- Specificity
- Rarity
- Tags are from different facets*
[Figure: example relations induced using the subsumption approach on tags [Sanderson99+, Schmitz06], between the tag pairs Washington/United States, Car/Automobile, Insect/Hongkong, and Color/Brazilian]
Notation: A -> B (A is broader than B; hypernym relation)
9
9 Inducing Hierarchy from user-specified relations
User-specified relations, e.g.,
- Flickr’s Collection-Set,
- Delicious’ Bundle-Tag,
- Bibsonomy’s Relation-Tag
Key intuition: not so many people specify peculiar relations like
- “automobile” -> “car”, or
- “Washington” -> “United States”
10
10 Simple Strategy
Input: Collection-Set relations, e.g., collection "The Netherlands - Holanda" with set "Blijdorp - Rotterdam"
Tokenize + stem to obtain concept relations
[Diagram: concept relations among the stemmed terms netherland, holanda, blijdorp, rotterdam, and countri]
1. Remove “noisy” relations
- Conflict resolution
- Significance test
2. Link concepts & select path
[Diagram: linked concepts such as countri, netherland, holland, china, and blijdorp]
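The tokenize-and-stem step can be illustrated with a small sketch. Everything named here is an assumption for illustration: `toy_stem` is a crude stand-in for a real stemmer (the stemmed form "countri" on the slide suggests a Porter-style stemmer was used), the stopword list is made up, and pairing every collection token with every set token follows the "Flora & Fauna" example on the later "True parent selection" slide.

```python
import re
from itertools import product

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "in"}

def toy_stem(token):
    # Crude stand-in for a real stemmer: drop one trailing plural "s".
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def tokenize(name):
    # Lowercase, keep runs of letters, drop stopwords, then stem.
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return [toy_stem(tok) for tok in tokens if tok not in STOPWORDS]

def concept_relations(collection_name, set_name):
    # Pair every collection term (parent) with every set term (child).
    pairs = product(tokenize(collection_name), tokenize(set_name))
    return [(parent, child) for parent, child in pairs if parent != child]

relations = concept_relations("The Netherlands - Holanda", "Blijdorp - Rotterdam")
```

Each pair is one candidate parent-to-child relation contributed by one user; the noise-removal steps then decide which candidates to keep.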
11
11 Remove noisy relations: 1st approach
Conflict resolution (when both a->b and b->a appear)
- Relation conflicts occur because of noise
- Voting scheme: keep a->b (and discard b->a) if N_u(a->b) > 1 and N_u(a->b) > N_u(b->a)
- Example: insect->butterfly (10) vs. butterfly->insect (2)
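The voting scheme can be written directly from the rule above. This sketch assumes that N_u is the number of users who specified a relation and that the counts arrive as a dictionary keyed by (parent, child); the function name `resolve_conflicts` is hypothetical.

```python
def resolve_conflicts(user_counts):
    """Keep a->b only if N_u(a->b) > 1 and N_u(a->b) > N_u(b->a).

    user_counts: dict mapping (parent, child) to N_u, assumed here to be
    the number of users who specified that relation.
    """
    kept = {}
    for (a, b), n_ab in user_counts.items():
        n_ba = user_counts.get((b, a), 0)  # 0 if the reverse never occurs
        if n_ab > 1 and n_ab > n_ba:
            kept[(a, b)] = n_ab
    return kept

counts = {
    ("insect", "butterfly"): 10,
    ("butterfly", "insect"): 2,
    ("anim", "moth"): 1,  # single vote, dropped by the N_u > 1 condition
}
kept = resolve_conflicts(counts)
```

With the slide's example counts, insect->butterfly wins the vote and butterfly->insect is discarded.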
12
12 Remove noisy relations: 2nd approach
Significance test
- Use a statistical significance test to decide whether a->b is significant
- Null hypothesis: the observed relation a->b was generated by chance, via the random, independent generation of the individual concepts a and b (according to the binomial distribution)
- Question: is “b” narrower than “a” by chance?
[Plot: # observations against # of a->b, with accept and reject regions]
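One way to read this null hypothesis is as a one-sided binomial test. The sketch below is an interpretation, not the authors' exact procedure: it assumes the chance probability of a->b is the product of how often a appears as a parent and how often b appears as a child, and the significance level `alpha=0.05` and all function and parameter names are hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return max(0.0, 1.0 - lower)

def is_significant(n_ab, n_a_parent, n_b_child, n_total, alpha=0.05):
    """Reject the null hypothesis that a->b arose by chance.

    n_ab: observed count of relation a->b
    n_a_parent: count of relations with a as parent
    n_b_child: count of relations with b as child
    n_total: total number of observed relations
    Assumption: under the null, a and b are drawn independently, so the
    chance probability of a->b is p(a as parent) * p(b as child).
    """
    p_chance = (n_a_parent / n_total) * (n_b_child / n_total)
    return binomial_upper_tail(n_ab, n_total, p_chance) < alpha

# a is a parent in 50 of 1000 relations, b a child in 40 of 1000:
# about 2 co-occurrences are expected by chance alone.
strong = is_significant(10, 50, 40, 1000)
weak = is_significant(2, 50, 40, 1000)
```

Ten observations against an expectation of about two is rejected as chance, while two observations is not. For large counts a library routine such as `scipy.stats.binomtest` would be a more robust choice than this hand-rolled sum.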
13
13 Link concepts and select path
Link concepts: assume that the same terms refer to the same concept.
Select path: linking relations from many users can cause a spaghetti graph.
[Graph: concepts anim, bug, insect, and moth linked by weighted relations; the weights appear in the BN scores below]
Network bottleneck idea: “the flow bottleneck is a minimum flow capacity among all relations in the path”
4 possible paths from anim to moth (a = anim, b = bug, i = insect, m = moth):
1) a -> b -> i -> m [BN score = min(26,1,18) = 1]
2) a -> i -> m [BN score = min(72,18) = 18]
3) a -> m [BN score = min(10) = 10]
4) a -> b -> m [BN score = min(26,4) = 4]
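The bottleneck scores on this slide can be reproduced with a short sketch. The edge weights are read off the BN scores listed above; selecting the path with the largest bottleneck score is an assumption about how the scores are used, and the function names are hypothetical.

```python
def simple_paths(graph, source, target, path=None):
    """Yield every cycle-free path from source to target."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in graph.get(source, {}):
        if nxt not in path:
            yield from simple_paths(graph, nxt, target, path)

def bottleneck(graph, path):
    """BN score: the minimum weight among the relations on the path."""
    return min(graph[a][b] for a, b in zip(path, path[1:]))

def best_path(graph, source, target):
    """Assumed selection rule: keep the path with the largest BN score.

    Raises ValueError if no path exists.
    """
    return max(simple_paths(graph, source, target),
               key=lambda p: bottleneck(graph, p))

# Weighted relations from the slide's example.
graph = {
    "anim": {"bug": 26, "insect": 72, "moth": 10},
    "bug": {"insect": 1, "moth": 4},
    "insect": {"moth": 18},
}
scores = {tuple(p): bottleneck(graph, p)
          for p in simple_paths(graph, "anim", "moth")}
chosen = best_path(graph, "anim", "moth")
```

Under that selection rule, anim -> insect -> moth (score 18) is preferred over the direct anim -> moth relation (score 10).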
14
14 Evaluation & Data Set
Contribution #2: Learning Concept Hierarchies
Hypothesis: the approach that takes explicit relations into account can induce better hierarchies.
- “Better” means more consistent with hand-built hierarchies (ODP ver. 10/08)
- The baseline is the subsumption approach [Schmitz06]; collection and set terms are used instead of tags, making it comparable.
Data set:
- Data from 17 user groups devoted to wildlife and naturalist photography
- 21,792 of 39,922 users specify at least one collection
- 110,543 unique terms (cf. 166,153 unique terms in ODP), 15,495 terms in common
15
15 Evaluation methodology
- ODP has many sub hierarchies: comparing all of them to the induced ones is impractical!
- It’s easier to compare when specifying a “root concept” and “leaf concepts”, i.e., specifying a certain sub tree to compare.
[Diagram: relations (right after tokenization) are induced (remove noise + link) into an induced hierarchy, which is compared with the reference hierarchy (ODP)]
16
16 Metrics
Taxonomic Overlap [adapted from Maedche02+]
- measures structural similarity between two trees
- for each node, determines how many ancestor and descendant nodes overlap with those in the reference tree
Lexical Recall
- measures how well an approach can discover concepts existing in the reference hierarchy (coverage)
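The two metrics can be sketched as follows. The slide does not give the exact formulas, so this is only one plausible reading: Lexical Recall is taken as the fraction of reference concepts that also appear in the induced tree, and Taxonomic Overlap as the average Jaccard overlap, over shared nodes, of each node's ancestors and descendants. Trees are assumed to be child-to-parent dictionaries without cycles; all names are hypothetical.

```python
def ancestors(parent, node):
    """All nodes above `node` in a child -> parent map (assumed acyclic)."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def descendants(parent, node):
    """All nodes that have `node` among their ancestors."""
    return {n for n in parent if node in ancestors(parent, n)}

def nodes(parent):
    return set(parent) | set(parent.values())

def lexical_recall(induced, reference):
    """Fraction of reference concepts that also occur in the induced tree."""
    return len(nodes(induced) & nodes(reference)) / len(nodes(reference))

def taxonomic_overlap(induced, reference):
    """Mean Jaccard overlap of ancestors + descendants over shared nodes."""
    shared = nodes(induced) & nodes(reference)
    if not shared:
        return 0.0
    scores = []
    for n in shared:
        ci = ancestors(induced, n) | descendants(induced, n)
        cr = ancestors(reference, n) | descendants(reference, n)
        union = ci | cr
        scores.append(len(ci & cr) / len(union) if union else 0.0)
    return sum(scores) / len(scores)

# Reference: animal > mammal > rodent > rat; induced tree skips "mammal".
reference = {"mammal": "animal", "rodent": "mammal", "rat": "rodent"}
induced = {"rodent": "animal", "rat": "rodent"}
recall = lexical_recall(induced, reference)
overlap = taxonomic_overlap(induced, reference)
```

Here the induced tree finds three of the four reference concepts, and each shared node agrees with the reference on two of its three related nodes.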
17
17 Quantitative Results
18
18 Quantitative Results
32 root nodes were manually selected.
Taxonomic Overlap:
- 27 of them are better than those by subsumption
- 3 of them get a zero score in both approaches
Lexical Recall:
- 28 of them are better than those by subsumption
- 2 of them get a similar score in both approaches
- for the rest, subsumption induces only the root node
The proposed approach can induce deeper trees.
The proposed approach can induce hierarchies more consistent with ODP in almost all cases.
19
19 Sport hierarchy
20
20 Invertebrate hierarchy
21
21 Country hierarchy
22
22 Discussion
- A simple strategy to aggregate a large number of shallow relations specified by different users into a common, deeper hierarchy
- Induced hierarchies are more consistent with ODP
Future work includes:
- Term ambiguity
- Relation types
- Global path
- Applying to other datasets
23
23 Related Work
Learning concept hierarchy from text data
- Syntactic based [Hearst92, Caraballo99, Pasca04, Cimiano+05, Snow+06]
- Word clustering [e.g. Segal+02, Blei+03]
Inducing concept hierarchy from tags
- Graph-based & clustering based [Mika05, Brooks+06, Heymann+06, Zhou07+]
- Probabilistic subsumption [Schmitz06]
Ontology alignment [Udrea+07]
Exploiting user-specified hierarchy
- GiveALink [Markines06+]
24
24 Questions?
- Is the metric used in evaluation meaningful?
- How scalable is the system?
- WordNet and ODP already exist. Why do we need this system?
- How is this work related to ontology enrichment?
- Is it ethical to collect users’ data?
- ….?
25
25 Spare slides beyond here
26
26 Open Problems
Term ambiguity
- The current approach assumes that the same terms refer to the same concept, but “Victoria” can refer to Canada, a lotus, a person's name, or Australia
- It also has no explicit way to merge synonyms, e.g., Spain and España (there are also many acronyms and colloquial terms in the Social Web)
A possible solution: concept clustering
27
27 Open Problems
Inducing the “related-to” relation
- “Flora” and “Fauna”, “Pet” and “Family”
- Prepositions or some connectors may give some clues, e.g., “flora & fauna” and “Pets – Family”
- Tag distributions may also help
[Diagram: Nature, Flora, Fauna]
28
28 Open Problems
True parent selection
- Tokenizing collection/set names can cause another problem, e.g., "Flora & Fauna" -> "Insect" yields both Fauna -> Insect and Flora -> Insect
A possible solution: conditional probability ratio
29
29 Conclusion
- Proposed statistical approaches for inducing concepts and concept hierarchies from social annotation
- These approaches perform better than existing approaches
Ongoing work to improve the quality of the induced hierarchies includes:
- Resolving term ambiguity
- Inducing “related to” relations
- Selecting the right parent
- Evaluating on more data sets
30
30 Social Web
[Diagram: a user produces, consumes, annotates, organizes, and discovers content]
Adapted from The Social Web: an Information Revolution (courtesy of Kristina Lerman)
31
31 Social Web
- Delicious: 5.3 million users; over 180 million unique URLs [blog.delicious.com, 2009]
- Flickr: 2 billion photos [techcrunch.com, 2007]; 4000+ photos uploaded per minute (1/21/2009, morning)
User activities: produce, consume, annotate & organize
3 basic entities involved: (1) user, (2) content, (3) metadata
32
32 Motivation
Social annotation is potentially a good source of evidence for inducing category knowledge, which is useful in many applications, e.g.,
- Organizing: arranging/visualizing users’ content (e.g., semantic directory)
- Search/Discovery: especially for binary content like photos and videos, where social annotation functions as a semantic index
- Recommendation: learning users’ tastes/interests
- Leveraging knowledge bases: updating lexical systems and ontologies for semantic web applications
- Categorization: understanding how new content fits with existing content
33
33 Motivation Although metadata from an individual user may be too inaccurate and incomplete, those from different users may complement each other, making them meaningful for the tasks. Goal: to induce category knowledge from social annotation produced by many users
34
34 Evaluation methodology
- ODP has many sub hierarchies: comparing all of them to the induced ones is impractical!
- It’s easier to compare when specifying a “root concept” and “leaf concepts”, i.e., specifying a certain sub tree to compare.
35
35 [Pipeline diagram]
Stages: data pre-processing; relation weighting & linking; hierarchy construction; evaluation
Components:
- Flickr Collection-Set relations (user-specified relations)
- Significance Test, Conflict Resolution, Subsumption
- Find ODP root-leaf pairs that overlap with Flickr (Flickr-ODP root-leaf overlaps), e.g., Animal/Mammal/Rodent/Rat
- Compute Taxonomic Overlap and Lexical Recall
36
36 Why does subsumption not work so well?
[Figure: ideal vs. reality, shown for the concepts Countri and China]
37
37 Africa Hierarchy