Dataware’s Document Clustering and Query-By-Example Toolkits
John Munson
Dataware Technologies
1999 BRS User Group Conference

Document Clustering
Automatically creates clusters of similar documents
General benefit: provides an overview of the range of topics in a set
Multiple specific uses
– Familiarization with database before searching
– Familiarization with a result set after searching
– Assistance in category definition for other uses
  - Category tree construction
  - FAQ construction

Dataware’s Clustering Toolkit
One API function
Source of documents is a BRS result set
– which could be backref 0 for entire database
– Can specify certain fields for analysis
Output indicates member documents for each cluster
Application can specify number and max/min size of clusters, etc.
US PTO (Patent and Trademark Office) plans to do category tree construction

How It Works
Extracts keywords from each document
– using our keyword-generation library
  - which is also in 6.3 keyword generation load filter
Repeats these steps:
– Compare document and cluster pairs using the keyword lists
  - How many keywords do two lists share, and how similar are their weights?
– Combine the most similar pair into one cluster
Stops when n clusters remain (n is configurable)
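The merge loop on this slide can be sketched in a few lines of Python. This is only an illustration, not the toolkit's code: cosine similarity over keyword weights stands in for the unspecified "shared keywords and similar weights" measure, summing weights is an assumed way to combine two keyword lists, and the names `cosine`, `merge`, and `cluster` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two keyword -> weight dicts (assumed measure)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge(a, b):
    """Combine two keyword lists by summing weights (assumed merge rule)."""
    merged = dict(a)
    for k, w in b.items():
        merged[k] = merged.get(k, 0.0) + w
    return merged


def cluster(docs, n):
    """docs: doc_id -> keyword dict. Returns a list of (member_ids, keywords)."""
    # Every document starts out as its own one-member cluster.
    clusters = [([doc_id], dict(kw)) for doc_id, kw in docs.items()]
    while len(clusters) > max(n, 1):
        # Compare all pairs and pick the most similar one.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        combined = (clusters[i][0] + clusters[j][0],
                    merge(clusters[i][1], clusters[j][1]))
        del clusters[j]  # j > i, so delete j first
        del clusters[i]
        clusters.append(combined)
    return clusters
```

Each pass compares every remaining pair, which is why the pairwise step scales poorly as the document set grows.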

How It Works
Output is a list of clusters, including:
– a cluster quality score
  - Measures how cohesive the cluster is
– a ranked list of keywords describing the cluster
– a ranked list of member documents
  - Highest-ranked docs are the most “central”
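One way to picture this output is the sketch below. The slides do not say how quality or centrality are computed, so the formulas here are assumptions: a document's centrality is taken as its cosine similarity to the cluster's summed keyword list, and quality as the mean of those values. The function name `describe_cluster` and the result keys are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two keyword -> weight dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def describe_cluster(members, top_k=5):
    """members: doc_id -> keyword dict for the documents in one cluster."""
    # Summed keyword weights act as the cluster's description (assumption).
    centroid = {}
    for keywords in members.values():
        for k, w in keywords.items():
            centroid[k] = centroid.get(k, 0.0) + w

    # Centrality: how closely each document matches the cluster as a whole.
    centrality = {d: cosine(kw, centroid) for d, kw in members.items()}

    return {
        # Cohesion: mean centrality of the members (assumed definition).
        "quality": sum(centrality.values()) / len(centrality) if centrality else 0.0,
        "keywords": sorted(centroid, key=centroid.get, reverse=True)[:top_k],
        "documents": sorted(centrality, key=centrality.get, reverse=True),
    }
```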

Speed Tricks
Speed is a big issue in clustering
– especially for interactive searching
– Keyword extraction takes time
– Pairwise comparisons don’t scale up well at all
– Thus, we use a couple of speed tricks
  - One trick for database design
  - One trick inside the clustering function
Trick 1: Pre-generate keywords
– Use the BRS 6.3 keyword generation load filter
– The filter produces a keyword paragraph that looks like this...

Speed Tricks
..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12)....
At clustering time, we don’t need to do keyword analysis
– Just retrieve keyword lists from engine
– Cuts execution time in half
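Reading a stored keyword paragraph back into a weighted list is cheap compared with re-analyzing the text. The sketch below parses the sample paragraph shown on this slide; the "word (weight)." layout is inferred from that one sample only, and `parse_keyword_paragraph` is a hypothetical helper, not part of the toolkit.

```python
import re

# One keyword followed by its weight in parentheses, e.g. "mass (156.56)".
# Pattern inferred from the single sample paragraph; multi-word keywords
# are not handled.
KEYWORD_RE = re.compile(r"(\S+)\s+\((\d+(?:\.\d+)?)\)")


def parse_keyword_paragraph(paragraph):
    """Return a keyword -> weight dict from a pre-generated keyword paragraph."""
    return {word: float(weight) for word, weight in KEYWORD_RE.findall(paragraph)}


sample = "..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12)...."
keywords = parse_keyword_paragraph(sample)
```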

Speed Tricks
Trick 2: Cluster a sample of the set (Cutting et al.)
– Create the desired number of clusters from a small sample
– Then compare the remaining documents only to those few clusters, not to all other documents
– Saves a huge amount of execution time
Another trick for result-set clustering:
– Cluster only the top-ranked 100 to 1000 docs
A final speed note: CPU speed helps a lot
– Clustering is very processor-intensive
  - 2x CPU speed gives almost 2x clustering speed
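Trick 2 can be sketched as a two-step function: run the expensive pairwise merging on the sample only, then make a single pass over the remaining documents. As before, this is an illustration under assumptions: cosine similarity is a stand-in measure, cluster keyword lists stay fixed during the assignment pass, and `cluster_with_sample` and its parameters are hypothetical names.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two keyword -> weight dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_with_sample(docs, sample_ids, n):
    """docs: doc_id -> keyword dict; sample_ids: the small sample to cluster."""
    if not sample_ids:
        raise ValueError("sample_ids must not be empty")

    # Step 1: pairwise merging, but only over the sample.
    clusters = [{"members": [d], "keywords": dict(docs[d])} for d in sample_ids]
    while len(clusters) > max(n, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]]["keywords"],
                                 clusters[p[1]]["keywords"]),
        )
        clusters[i]["members"] += clusters[j]["members"]
        for k, w in clusters[j]["keywords"].items():
            clusters[i]["keywords"][k] = clusters[i]["keywords"].get(k, 0.0) + w
        del clusters[j]

    # Step 2: each remaining document is compared to the n clusters only.
    sampled = set(sample_ids)
    for doc_id, keywords in docs.items():
        if doc_id in sampled:
            continue
        best = max(clusters, key=lambda c: cosine(keywords, c["keywords"]))
        best["members"].append(doc_id)
    return clusters
```

With a sample of s documents out of N, step 1 costs on the order of s² comparisons per merge and step 2 only N·n comparisons in total. The sample could be drawn with `random.sample`, or taken from the top-ranked documents of a result set.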

Query-By-Example (QBE)
Allows an example passage or document to serve as a query
Useful when we already have some text or a document about our topic
– “Find more like this”
– No query formulation required
– QBE analyzes the text, then constructs and executes a query

Dataware’s QBE Toolkit
One API function
Source of example text can be:
– a text buffer
  - e.g. text selected with mouse
– a BRS document (or documents) from a result set
  - e.g. selected from a title list
  - Can specify certain fields for analysis
– a word list with weights or occurrence counts
Output is a standard ranked document list

How It Works
Extracts keywords from the example text
– using... all together now... our keyword-generation library, yet again
Keyword selection process likes words that:
– occur frequently in the example text
– are rare in the database as a whole
Getting database statistics can be done:
– using field qualification
  - most accurate but slow
– using no qualification
  - still good, much faster
– not at all (just use occurrence counts in example text)
  - fastest, but trickier
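The selection preference described here (frequent in the example, rare in the database) resembles TF-IDF weighting. The sketch below uses that formula as an assumption; the keyword-generation library's real scoring is not given in the slides. `select_keywords`, `doc_freq`, and `total_docs` are hypothetical names, and the fallback branch mirrors the "not at all" option by using occurrence counts alone.

```python
import math
import re
from collections import Counter


def select_keywords(text, doc_freq=None, total_docs=None, top_k=10):
    """Return the top_k (word, weight) pairs for an example text.

    doc_freq maps a word to the number of database documents containing it;
    total_docs is the database size. Both are optional.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    weights = {}
    for word, tf in counts.items():
        if doc_freq is None or not total_docs:
            # No database statistics: fall back to occurrence counts only.
            weights[word] = float(tf)
        else:
            # Frequent in the example, rare in the database -> high weight.
            df = doc_freq.get(word, 0)
            weights[word] = tf * math.log((total_docs + 1) / (df + 1))
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Without database statistics, common words such as "the" rise to the top, which is one reason the counts-only option is trickier: it would likely need a stopword list or similar filtering.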

How It Works
Performs a ranked search using the keywords and their weights
Flexible fielding:
– Analysis of example document(s) can use one set of BRS paragraphs
– Search can use a different set
Speed trick:
– Generate keyword field for database (load filter)
– Field-level index it
– Use it for QBE searches
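A ranked search over a pre-generated keyword field can be pictured as below. The scoring rule (sum of query weight times document keyword weight over shared keywords) is an assumption for illustration, since the slides do not describe the BRS ranking formula, and `ranked_search` is a hypothetical function operating on an in-memory dict rather than a BRS index.

```python
def ranked_search(query, keyword_field, limit=10):
    """Rank documents against weighted query keywords.

    query: keyword -> weight extracted from the example text.
    keyword_field: doc_id -> keyword -> weight (the pre-generated field).
    """
    scores = {}
    for doc_id, keywords in keyword_field.items():
        # Only keywords shared by the query and the document contribute.
        score = sum(w * keywords[k] for k, w in query.items() if k in keywords)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```

Because the search touches only the short keyword field instead of the full text, each document contributes a handful of terms to the comparison.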
That’s all, folks!