Published by Allison Dunnaway. Modified over 10 years ago.
1
LINGO Sandra Gama
2
Internet: an endless document collection
4
Search Engines
5
NO question answering
6
FAST access to Web content
7
SENSITIVE to query quality
8
we NEED meaningful RESULTS
10
GROUPING by Similarity
11
Semantic structure
12
Groups
13
Description
14
Luxury car | Feline, panther family
15
Description QUALITY
16
How to cluster?
18
Pipeline (input: user query; output: clustered documents)
1. Preprocessing → filtered docs
2. Phrase extraction → frequent phrases
3. Cluster-label induction → cluster labels
4. Cluster-content allocation → clustered documents
19
STAGE 1/4: PREPROCESSING
20
STAGE 1/4: PREPROCESSING
1. Text segmentation
2. Stemming
3. Ignore stop words
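The three preprocessing steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the presenter's implementation: `STOP_WORDS` is a tiny hypothetical list, and `crude_stem` is a naive suffix stripper standing in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}

def segment(text):
    """Step 1: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(word):
    """Step 2: naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ations", "ation", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Steps 1-3: segment, stem, and ignore stop words.

    Stop words are matched on the unstemmed token so that stemming
    cannot disguise them.
    """
    return [crude_stem(t) for t in segment(text) if t not in STOP_WORDS]

print(preprocess("Linear algebra for intelligent information retrieval"))
# ['linear', 'algebra', 'intelligent', 'inform', 'retrieval']
```

The stop word "for" is dropped and "information" is reduced to "inform", so later stages compare word stems rather than surface forms.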
21
STAGE 2/4: PHRASE EXTRACTION
24
Goal
29
How it works
31
How many non-empty suffixes does "abracadabra" (positions 1 to 11) have?
Start | Suffix
1 | abracadabra
2 | bracadabra
3 | racadabra
4 | acadabra
5 | cadabra
6 | adabra
7 | dabra
8 | abra
9 | bra
10 | ra
11 | a
Answer: 11 suffixes
32
Suffixes of "abracadabra" in sorted order, with their start positions (the string is written abracadabra$, where $ at position 12 marks the end):
Sorted suffix | Index
a | 11
abra | 8
abracadabra | 1
acadabra | 4
adabra | 6
bra | 9
bracadabra | 2
cadabra | 5
dabra | 7
ra | 10
racadabra | 3
33
Suffix array (the Index column of the sorted suffixes): 11 8 1 4 6 9 2 5 7 10 3
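The suffix array for "abracadabra" can be reproduced with a few lines of Python. This is a minimal sketch that sorts the suffixes directly, which is fine for a slide-sized string but much slower than dedicated construction algorithms. The helper `longest_repeat` is a hypothetical addition showing why the structure is useful for phrase extraction: repeated substrings show up as shared prefixes of neighbouring sorted suffixes.

```python
def suffix_array(text):
    """Return the 1-based start positions of the suffixes of `text` in sorted order."""
    # Sorting by the suffix itself is simple but slow for long inputs.
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]

def longest_repeat(text):
    """Return the longest substring occurring at least twice (hypothetical helper)."""
    sa = suffix_array(text)
    best = ""
    for a, b in zip(sa, sa[1:]):
        s, t = text[a - 1:], text[b - 1:]
        n = 0
        # Length of the common prefix of two neighbouring sorted suffixes.
        while n < min(len(s), len(t)) and s[n] == t[n]:
            n += 1
        if n > len(best):
            best = s[:n]
    return best

print(suffix_array("abracadabra"))    # [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(longest_repeat("abracadabra"))  # abra
```

In a phrase-extraction setting the same idea would be applied to sequences of words rather than characters.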
34
STAGE 3/4: CLUSTER-LABEL INDUCTION
37
A: term × document matrix
Find matrices U, Σ, V such that $A = U \Sigma V^{T}$
38
Documents
D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization
Terms
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval
Phrases
P1: Singular value
P2: Information retrieval
39
Term-document matrix A (rows: terms T1 to T5; columns: documents D1 to D7)
     D1    D2    D3    D4    D5    D6    D7
T1   0.00  0.00  0.56  0.56  0.00  0.00  1.00
T2   0.49  0.71  0.00  0.00  0.00  0.71  0.00
T3   0.49  0.71  0.00  0.00  0.00  0.71  0.00
T4   0.72  0.00  0.00  0.00  1.00  0.00  0.00
T5   0.00  0.00  0.83  0.83  0.00  0.00  0.00
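The slide does not say how the matrix entries are weighted. A minimal sketch, assuming tf-idf weights (term frequency times log(N / document frequency)) with each document column scaled to unit length, yields these two-decimal values for the seven example documents. The function name `term_document_matrix` and the simple tokenizer are assumptions made for illustration.

```python
import math

docs = [
    "Large-scale singular value computations",
    "Software for the sparse singular value decomposition",
    "Introduction to modern information retrieval",
    "Linear algebra for intelligent information retrieval",
    "Matrix computations",
    "Singular value cryptogram analysis",
    "Automatic information organization",
]
terms = ["information", "singular", "value", "computations", "retrieval"]

def term_document_matrix(docs, terms):
    """Build a terms x documents matrix of unit-length tf-idf columns."""
    tokenized = [doc.lower().replace("-", " ").split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(t in toks for toks in tokenized) for t in terms}
    columns = []
    for toks in tokenized:
        col = [toks.count(t) * math.log(n_docs / df[t]) for t in terms]
        norm = math.sqrt(sum(w * w for w in col))
        columns.append([w / norm if norm else 0.0 for w in col])
    # Transpose so that rows are terms and columns are documents.
    return [list(row) for row in zip(*columns)]

A = term_document_matrix(docs, terms)
for term, row in zip(terms, A):
    print(f"{term:<13}" + " ".join(f"{w:.2f}" for w in row))
```

The logarithm base does not matter here, because the length normalization cancels any constant factor.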
40
Abstract concept matrix (SVD)
U =
 0.00   0.75   0.00  -0.66   0.00
 0.65   0.00  -0.28   0.00  -0.71
 0.65   0.00  -0.28   0.00   0.71
 0.39   0.00   0.92   0.00   0.00
 0.00   0.66   0.00   0.75   0.00
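Since U comes from a singular value decomposition, its columns should be orthonormal, that is, UᵀU should be close to the identity. The following quick check is a sketch using the two-decimal values as read from the slide, so small rounding deviations are expected; the helper name `gram` is an assumption.

```python
U = [
    [0.00, 0.75, 0.00, -0.66, 0.00],
    [0.65, 0.00, -0.28, 0.00, -0.71],
    [0.65, 0.00, -0.28, 0.00, 0.71],
    [0.39, 0.00, 0.92, 0.00, 0.00],
    [0.00, 0.66, 0.00, 0.75, 0.00],
]

def gram(matrix):
    """Return matrix^T * matrix as nested lists."""
    cols = list(zip(*matrix))
    return [[sum(x * y for x, y in zip(a, b)) for b in cols] for a in cols]

G = gram(U)
size = len(G)
# Largest absolute difference between U^T U and the identity matrix.
max_error = max(
    abs(G[i][j] - (1.0 if i == j else 0.0))
    for i in range(size)
    for j in range(size)
)
print(f"largest deviation from the identity: {max_error:.3f}")
```

A deviation on the order of the rounding precision is consistent with U being an orthogonal matrix printed to two decimals.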
41
P = (rows: terms T1 to T5; columns: candidate labels P1, P2, T1 to T5)
                   P1    P2    T1    T2    T3    T4    T5
T1: Information    0.00  0.56  1.00  0.00  0.00  0.00  0.00
T2: Singular       0.71  0.00  0.00  1.00  0.00  0.00  0.00
T3: Value          0.71  0.00  0.00  0.00  1.00  0.00  0.00
T4: Computations   0.00  0.00  0.00  0.00  0.00  1.00  0.00
T5: Retrieval      0.00  0.83  0.00  0.00  0.00  0.00  1.00
P1: Singular value, P2: Information retrieval
42
M matrix: $M = U_k^{T} P$ (rows: abstract concepts; columns: phrases/single words)
            P1    P2    T1    T2    T3    T4    T5
Concept 1   0.92  0.00  0.00  0.65  0.65  0.39  0.00
Concept 2   0.00  0.97  0.75  0.00  0.00  0.00  0.66
P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval
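The product of the truncated concept matrix with P can be worked through in plain Python. This sketch uses the two-decimal values of U and P as read from the slides and takes k = 2, matching the two rows of M. The function name `concept_label_scores` is hypothetical, and picking the highest-scoring column as the label of each concept is an assumption about how the scores are used.

```python
U = [
    [0.00, 0.75, 0.00, -0.66, 0.00],
    [0.65, 0.00, -0.28, 0.00, -0.71],
    [0.65, 0.00, -0.28, 0.00, 0.71],
    [0.39, 0.00, 0.92, 0.00, 0.00],
    [0.00, 0.66, 0.00, 0.75, 0.00],
]
P = [
    [0.00, 0.56, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.83, 0.00, 0.00, 0.00, 0.00, 1.00],
]
candidates = ["Singular value", "Information retrieval", "Information",
              "Singular", "Value", "Computations", "Retrieval"]

def concept_label_scores(U, P, k):
    """Compute M = U_k^T P: one row per abstract concept, one column per candidate."""
    n_terms, n_candidates = len(U), len(P[0])
    return [
        [sum(U[t][c] * P[t][j] for t in range(n_terms)) for j in range(n_candidates)]
        for c in range(k)
    ]

M = concept_label_scores(U, P, k=2)
for row in M:
    best = max(range(len(row)), key=row.__getitem__)
    print(candidates[best], [round(x, 2) for x in row])
# Singular value [0.92, 0.0, 0.0, 0.65, 0.65, 0.39, 0.0]
# Information retrieval [0.0, 0.97, 0.75, 0.0, 0.0, 0.0, 0.66]
```

For both concepts a phrase scores higher than any single word, which is why phrases make good cluster labels in this example.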
43
Last step
44
Prune overlapping label descriptions: $Z^{T} Z$
45
STAGE 4/4: CLUSTER-CONTENT ALLOCATION
46
Similarity
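The slide names only "Similarity". A common choice, assumed here, is cosine similarity between a label's term vector and each document's term vector, with a document joining every cluster whose label it matches closely enough. In this sketch the threshold of 0.5, the "Other" group for unmatched documents, and the function names are all assumptions; the vectors are the two-decimal values for the seven example documents and two phrase labels.

```python
import math

# Document vectors over terms T1..T5 (columns of the term-document matrix).
documents = {
    "D1": [0.00, 0.49, 0.49, 0.72, 0.00],
    "D2": [0.00, 0.71, 0.71, 0.00, 0.00],
    "D3": [0.56, 0.00, 0.00, 0.00, 0.83],
    "D4": [0.56, 0.00, 0.00, 0.00, 0.83],
    "D5": [0.00, 0.00, 0.00, 1.00, 0.00],
    "D6": [0.00, 0.71, 0.71, 0.00, 0.00],
    "D7": [1.00, 0.00, 0.00, 0.00, 0.00],
}
# Label vectors over the same terms (columns P1 and P2 of P).
labels = {
    "Singular value": [0.00, 0.71, 0.71, 0.00, 0.00],
    "Information retrieval": [0.56, 0.00, 0.00, 0.00, 0.83],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def allocate(documents, labels, threshold):
    """Assign each document to every label it is similar enough to."""
    clusters = {name: [] for name in labels}
    clusters["Other"] = []  # assumed catch-all group for unmatched documents
    for doc_id, vector in documents.items():
        matched = False
        for name, label_vector in labels.items():
            if cosine(vector, label_vector) >= threshold:
                clusters[name].append(doc_id)
                matched = True
        if not matched:
            clusters["Other"].append(doc_id)
    return clusters

print(allocate(documents, labels, threshold=0.5))
# {'Singular value': ['D1', 'D2', 'D6'], 'Information retrieval': ['D3', 'D4', 'D7'], 'Other': ['D5']}
```

Raising the threshold makes clusters purer but pushes more documents into the catch-all group; at 0.6, for instance, D7 would no longer join "Information retrieval".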
47
Cluster Score
48
Evaluation and Results
49
Test data: 10 categories, 4 subjects
50
Subject | # docs | Contents
Movies | 77 | Information about the Blade Runner movie
Movies | 92 | Information about the Lord of the Rings movie
Health Care | 77 | Orthopedic equipment and manufacturers
Photography | 15 | Infrared-photography references
Computer Science | 27 | Articles about data warehouses (integrator DBs)
Computer Science | 42 | MySQL database
Computer Science | 15 | Native XML databases
Computer Science | 38 | PostgreSQL database
Computer Science | 39 | Java programming language tutorials and guides
Computer Science | 37 | VI text editor
51
Identifier | Merged Categories
G1 | LRings, MySQL
G3 | LRings, MySQL, Ortho, Infra
G5 | MySQL, XMLDB, Dware, Postgr, JavaTut, Vi
G6 | MySQL, XMLDB, Dware, Postgr, Ortho
52
Identifier | Merged Categories
G1 | Fan fiction/fan art, image galleries, MySQL, wallpapers, LOTR humour, links
G3 | MySQL, news, information on infrared, image galleries, foot orthotics, Lord of the Rings, movie
G5 | Java tutorial, Vim page, federated data warehouse, native XML database, Web, Postgresql database
G6 | MySQL database, federated data warehouse, foot orthotics, orthopedic products, access Postgresql, Web
53
Analytical evaluation: cluster contamination
54
LINGO vs. Suffix Tree Clustering
58
Future work
59
Pointer
60
Communication!
61
LINGO: Search Results Clustering
Thank you.