Stephen Brown Museums and the Web Asia 9-12 December 2013 Hong Kong Where are the pictures? Linking photographic records across collections using fuzzy logic
Research question: Can fuzzy-logic-based data mining algorithms be used to identify matches between different online collections? Answer: yes.
erps16578 [Royal Museum, the court (i.e. Bargello Museum, the courtyard), Florence, Italy]. No person listed. 1 photomechanical print: photochrom, color. [between ca. 1890 and ca. 1900]. From the Library of Congress. The Courtyard of the Bargello, Florence. Henry Little. Bromide (Print). 1895. From the ERPS collection.
Method Data preparation (correcting typographical errors, standardizing data such as dates, removing duplicate entries and mapping data to a common metadata schema). Data aggregation (combining standardized records in a single XML database where they can be mined for similarities). Query expansion (extending the range of keywords that are searched for). Field comparison (comparing the contents of individual fields and combining these to produce an overall similarity metric).
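As an illustration of the data preparation step, here is a minimal sketch of date standardization, using the two date strings from the Bargello records. The helper names `year_range` and `ranges_overlap` are hypothetical, and the crisp overlap test is only a stand-in for whatever date comparison the project actually used.

```python
import re

def year_range(date_text):
    """Return (earliest, latest) four-digit year in a free-text date, or None."""
    years = [int(y) for y in re.findall(r"\b(1[5-9]\d{2}|20\d{2})\b", date_text)]
    if not years:
        return None
    return min(years), max(years)

def ranges_overlap(a, b):
    """True when two (start, end) year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]

# Date fields as they appear in the two example records.
loc_date = year_range("[between ca. 1890 and ca. 1900]")
erps_date = year_range("1895")
print(loc_date, erps_date, ranges_overlap(loc_date, erps_date))
```

Once both records carry a date range in the same form, the date field can be compared like any other field.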
Alternative computing logics. Classical logic: binary (true/false, zero/one), set theory. Fuzzy logic: degrees of truth, fuzzy set theory.
The concept of tall people Ben Youngs 5’10” Toby Flood 6’2” Geoff Parling 6’6”
The concept of tall people Classical approach: Anyone over 6’ is tall Ben Youngs 5’10” Toby Flood 6’2” Geoff Parling 6’6”
Classical computing The membership function of the set tall people [Figure: membership (0 or 1) plotted against height (5’ to 7’), with Ben Youngs 5’10”, Toby Flood 6’2” and Geoff Parling 6’6” marked]
The concept of tall people Fuzzy approach: Everyone is tall to some degree (as measured by the membership function) Ben Youngs 5’10” Toby Flood 6’2” Geoff Parling 6’6”
Fuzzy computing The membership function of the set tall people [Figure: degree of membership (0 to 1) plotted against height (5’ to 7’), with Ben Youngs 5’10”, Toby Flood 6’2” and Geoff Parling 6’6” marked]
Soft computing The membership function of the set tall people [Figure: degree of membership (0 to 1) plotted against height (5’ to 7’), with membership values 0.95, 0.7 and 0.45 marked for Ben Youngs 5’10”, Toby Flood 6’2” and Geoff Parling 6’6”]
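The contrast between the classical and fuzzy membership functions can be sketched in a few lines. This is an illustration only: the slide's curve is not recoverable, so the linear ramp and its end points (62.8 and 78.8 inches) are assumptions, chosen so that the tallest player gets 0.95 and the shortest 0.45, which is itself an assumed reading of the values on the slide.

```python
def feet_inches(feet, inches):
    """Convert a height in feet and inches to inches."""
    return feet * 12 + inches

def tall_crisp(height_in):
    """Classical set: full membership over 6 feet, none otherwise."""
    return 1.0 if height_in > 72 else 0.0

def tall_fuzzy(height_in, low=62.8, high=78.8):
    """Fuzzy set: assumed linear ramp from 0 at `low` to 1 at `high`."""
    return min(1.0, max(0.0, (height_in - low) / (high - low)))

players = {
    "Ben Youngs": feet_inches(5, 10),
    "Toby Flood": feet_inches(6, 2),
    "Geoff Parling": feet_inches(6, 6),
}

for name, height in players.items():
    print(name, tall_crisp(height), round(tall_fuzzy(height), 2))
```

Under the classical function Ben Youngs is simply not tall; under the fuzzy one he is tall to a lesser degree than the other two.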
Fuzzy computing Allows for vagueness in concepts Soft boundaries Partial degrees of truth
Lightweight Semantic Similarity [Figure: unit term vectors A (Chrysanthemum) and B (Flower) drawn on the axes Chrysanthemum and Flower]
Lightweight Semantic Similarity [Figure: unit term vectors A (Chrysanthemum) and B (Flower) drawn on the axes Chrysanthemum and Flower] Cosine of the angle between A and B = 0. Therefore, no similarity between A and B.
Lightweight Semantic Similarity [Figure: axes Chrysanthemum and Flower]
Lightweight Semantic Similarity Fuzzy term vectors using synset similarity values from WordNet [Figure: term vectors A and B drawn on the axes Chrysanthemum and Flower] Cosine of the angle between A and B > 0. Therefore, some similarity between A and B.
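The difference between crisp and fuzzy term vectors can be shown with a small cosine calculation. This is a sketch of one plausible reading of the slides, not the project's implementation: the value `ASSUMED_SIMILARITY = 0.8` is a made-up stand-in for a WordNet synset similarity score, and no WordNet lookup is performed.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Vector components are ordered as (chrysanthemum, flower).
crisp_a = (1.0, 0.0)  # record A contains only "chrysanthemum"
crisp_b = (0.0, 1.0)  # record B contains only "flower"

# Hypothetical similarity between the two terms (not a real WordNet value).
ASSUMED_SIMILARITY = 0.8

# Fuzzy term vectors: each term also contributes partially to related terms.
fuzzy_a = (1.0, ASSUMED_SIMILARITY)
fuzzy_b = (ASSUMED_SIMILARITY, 1.0)

print(cosine(crisp_a, crisp_b))  # orthogonal vectors, so no similarity
print(cosine(fuzzy_a, fuzzy_b))  # greater than zero, so some similarity
```

With crisp vectors two records match only when they share a literal term; with fuzzy vectors related terms pull the vectors towards each other.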
Combined similarity metric IF title is good AND person is good THEN match is good. IF title is good AND (date is good OR process is good) THEN match is ok. IF person is good AND title is bad THEN match is ok. IF title is bad AND person is bad THEN match is bad.
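The four rules above can be evaluated with standard fuzzy operators. The sketch below assumes min for AND, max for OR, and "bad" as one minus "good"; the slides do not say which operators were used, and the function name `match_degrees` and the input scores are hypothetical.

```python
def bad(goodness):
    """Assumed complement: a field is 'bad' to the degree it is not 'good'."""
    return 1.0 - goodness

def match_degrees(title, person, date, process):
    """Fire the four rules on per-field 'good' scores in the range 0 to 1."""
    good = min(title, person)                     # rule 1
    ok_by_title = min(title, max(date, process))  # rule 2
    ok_by_person = min(person, bad(title))        # rule 3
    poor = min(bad(title), bad(person))           # rule 4
    # Two rules conclude "ok", so their strengths are combined with max.
    return {"good": good, "ok": max(ok_by_title, ok_by_person), "bad": poor}

# Made-up scores: strong title match, no person listed, dates agree well.
print(match_degrees(title=0.9, person=0.0, date=0.8, process=0.3))
```

A pair of records with a strong title match but no person data therefore still comes out as a plausible ("ok") match instead of being rejected outright.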
Conclusion Large numbers of small amounts of text are common in collections records. Text volumes are too small for corpus linguistics analysis. Need for query expansion. Text volumes are too small for established Semantic Similarity analysis. Lightweight semantic