CS 430: Information Discovery, Lecture 28: (a) Two Examples of Cluster Analysis, (b) Conclusion


1 CS 430: Information Discovery
Lecture 28
(a) Two Examples of Cluster Analysis
(b) Conclusion

2 Laptop returns
Dates:
  Tuesday, May 8th, 9:00 - 11:00 a.m.
  Monday, May 14th, 1:00 - 3:00 p.m.
  Tuesday, May 15th, 9:00 - 11:00 a.m.
Place: Upson Hall 5130
Receipts: Bring a copy of your receipt to the examination.

3 Example 1: Cluster Analysis of Social Science Journals
In the social sciences, subject boundaries are unclear. Can citation patterns be used to develop criteria for matching information services to the interests of users?
W. Y. Arms and C. R. Arms, Cluster analysis used on social science citations, Journal of Documentation, 34 (1), pp. 1-11, March 1978.

4 Methodology
Assumption: Two journals are close to each other if they are cited by the same source journals, with similar relative frequencies.
Sources of citations: Select a sample of n social science journals as sources.
Citation matrix: Construct an m x n matrix (m cited journals, n source journals) in which the (i, j)th element is the number of citations to journal i from journal j.
Normalization: All data were normalized so that the elements in each row sum to 1.
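
The normalization step can be illustrated with a small sketch. The citation counts and journal labels below are invented for illustration and are not taken from the study; the distance used is the Euclidean distance named on the Algorithm slide. The point is that two journals cited in the same proportions end up at distance 0, regardless of how many citations each receives.

```python
import math

def normalize_rows(matrix):
    """Scale each row so its elements sum to 1 (all-zero rows are left as they are)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([x / total for x in row] if total else list(row))
    return normalized

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts: rows are cited journals i, columns are source journals j.
citations = [
    [30, 10, 0],  # journal A
    [15, 5, 0],   # journal B: same proportions as A, half the volume
    [0, 2, 18],   # journal C: cited mostly by the third source
]

profiles = normalize_rows(citations)
d_ab = euclidean(profiles[0], profiles[1])  # identical profiles
d_ac = euclidean(profiles[0], profiles[2])  # very different profiles
print(d_ab, d_ac)
```

Without the row normalization, A and B would appear far apart simply because A is cited twice as often.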

5 Data
Pilot study: 5,000 citations from the 1970 volumes of 17 major journals from across the social sciences.
Criminology citations: Every fifth citation from a set of criminology journals (3 sets of data, for 1950, 1960, and 1970).
Main file (52,000 citations):
  (a) Every citation from the 1970 volumes of the 48 most cited source journals in the pilot study.
  (b) Every citation from the 1970 volumes of 47 randomly selected journals.

6 Sample sizes

Sample               Source journals   Target journals
Pilot                17                115
Criminology: 1950    10                18
             1960    13                49
             1970    27                108
Main file: ranked    48                495
           random    47                254

Excludes journals that are cited by only one source. These were assumed to cluster with the source.

7 Algorithm
The main analysis used a non-hierarchical method of E. M. L. Beale and M. G. Kendall, based on Euclidean distance.
For comparison, 36 psychology journals were clustered using:
  single-linkage
  complete-linkage
  van Rijsbergen's algorithm
Results:
  The Beale/Kendall algorithm and complete-linkage produced similar results.
  Single-linkage suffered from chaining.
  Van Rijsbergen's algorithm seeks very clear-cut clusters, which were not found in the data.
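
The chaining effect can be seen on a toy example. The sketch below is not the code used in the study: it is a minimal threshold-based agglomerative clustering on one-dimensional points, with the function name `agglomerate`, the data, and the threshold all chosen for illustration. Single-linkage measures the distance between two clusters by their closest pair of members, complete-linkage by their farthest pair.

```python
def agglomerate(points, linkage, threshold):
    """Merge the closest pair of clusters until the closest pair is farther than threshold.

    linkage is "single" (closest members) or "complete" (farthest members).
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, index i, index j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [abs(a - b) for a in clusters[i] for b in clusters[j]]
                d = min(dists) if linkage == "single" else max(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Evenly spaced points: each is close to its neighbour, but the ends are far apart.
points = [0, 1, 2, 3, 4, 5]
single = agglomerate(points, "single", threshold=1.5)
complete = agglomerate(points, "complete", threshold=1.5)
print(single)    # [[0, 1, 2, 3, 4, 5]]
print(complete)  # [[0, 1], [2, 3], [4, 5]]
```

Single-linkage chains all six points into one cluster even though the end points are 5 units apart, while complete-linkage keeps every cluster within the threshold.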

8 Non-hierarchical clusters
[Figure: Economics clusters in the pilot study]

9 Non-hierarchical dendrogram
[Figure: Part of a dendrogram showing non-hierarchical structure]

10 Conclusion
"The overall conclusion must be that cluster analysis is not a practical method of designing secondary services in the social sciences."
Because of skewed distributions, very large amounts of data are required.
Results are complex and difficult to interpret.
Overlap between social sciences leads to results that are sensitive to the precise data and algorithms chosen.

11 Example 2: Concept Spaces for Scientific Terms
Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms.
Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May 1996.

12 Methodology
Approach: Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept.
Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links. [24.5 hours of CPU on a 16-node Silicon Graphics supercomputer.]
Data set 2: 4,000,000 abstracts from the Compendex database covering all of engineering as the collection, partitioned along classification code lines into some 600 community repositories. [Four days of CPU on a 64-processor Convex Exemplar.]

13 Concept Space
A concept space is a similarity matrix based on co-occurrence of terms.
In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000, and the concept space for each set was generated separately. The sets were selected by an existing classification scheme.
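
A toy version of a co-occurrence similarity matrix, and of using it to suggest terms, might look like the sketch below. This is an illustration only: the similarity measure is a simple Jaccard coefficient over the sets of documents containing each term, which is an assumption and not the weighting used by the authors, and the function names `build_concept_space` and `suggest` and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_concept_space(documents):
    """Return {term: {other_term: similarity}} from term co-occurrence.

    Similarity (an assumption for this sketch) is the Jaccard coefficient of
    the sets of documents in which the two terms appear.
    """
    postings = defaultdict(set)  # term -> ids of documents containing it
    for doc_id, terms in enumerate(documents):
        for term in terms:
            postings[term].add(doc_id)
    space = defaultdict(dict)
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:
            sim = shared / len(postings[a] | postings[b])
            space[a][b] = sim
            space[b][a] = sim
    return space

def suggest(space, term, k=3):
    """Return up to k terms most similar to term (ties broken alphabetically)."""
    neighbours = space.get(term, {})
    return sorted(neighbours, key=lambda t: (-neighbours[t], t))[:k]

# Hypothetical documents, each reduced to its set of index terms.
docs = [
    {"bridge", "wind", "fluid dynamics"},
    {"bridge", "wind", "vibration"},
    {"cable", "current", "fluid dynamics"},
    {"cable", "current", "vibration"},
]
space = build_concept_space(docs)
suggestions = suggest(space, "bridge")
print(suggestions)  # ['wind', 'fluid dynamics', 'vibration']
```

Comparing every pair of terms is quadratic in the vocabulary size, which helps explain why the experiments on the Methodology slide needed supercomputer time and why the largest collection was split into sets.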

14 Objectives
Semantic retrieval (using concept spaces for term suggestion)
Semantic interoperability (vocabulary switching across subject domains)
Semantic indexing (concept identification of document content)
Information representation (information units for uniform manipulation)

15 Use of Concept Space: Term Suggestion

16 Future Use of Concept Space: Vocabulary Switching
"I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."

17 Visual thesaurus for browsing large collections of geographic images
Methodology:
  Divide images into small regions.
  Create a similarity measure based on properties of these images.
  Use cluster analysis tools to generate clusters of similar images.
  Provide alternative representations of clusters.
Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May 1997. (http://ai.bpa.arizona.edu/~mramsey/papers/visualThesaurus/visualThesaurus.html)


19 Self-Organizing Maps (SOM)

20 Conclusions

21 Types of Information Discovery
[Diagram; recoverable labels only]
Media type: text; image, video, audio, etc.
Activities: searching, browsing, linking
Methods: statistical; user-in-loop; catalogs, indexes (metadata)
Related courses: CS 502; natural language processing, CS 474

22 Surrogates
Surrogate: a textual catalog record about a non-textual item (photograph).
Text-based methods of information retrieval can search a surrogate for a photograph.

23 Information Discovery
People have many reasons to look for information:
  Known item: Where will I find the wording of the US Copyright Act?
  Facts: What is the capital of Barbados?
  Introduction or overview: How do diesel engines work?
  Related information: Is there a review of this article?
  Comprehensive search: What is known of the effects of global warming on hurricanes?

24 The End

