Published by Kelly Gregory. Modified over 7 years ago.
1
Using Collaborative Filtering to Weave an Information Tapestry
David Goldberg, David Nichols, Brian M. Oki, Douglas Terry Xerox Palo Alto Research Center
2
Problems of current mail systems
Think about any newsgroup you subscribe to:
hundreds of new postings arrive every day;
many of them are off topic;
many more are not personally interesting to you.
Finding articles of interest is time-consuming.
3
Solution: Collaborative Filtering
Record people’s reactions to the documents they read; these reactions are called annotations.
Based on other people’s feedback, a filter can be constructed so that you read only those articles that are of interest to you.
This goes a step further than content-based filtering -- it considers not only the document’s contents, but also people’s reactions to it.
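The idea on this slide can be illustrated with a minimal sketch. It is not Tapestry's implementation; the `Annotation` record, the `"endorsement"` label, and the notion of a set of trusted readers are assumptions introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    doc_id: str  # document the reaction refers to
    reader: str  # person who recorded the reaction
    kind: str    # reaction label, e.g. "endorsement" (hypothetical)


def collaborative_filter(doc_ids, annotations, trusted_readers):
    """Keep the documents endorsed by at least one trusted reader."""
    endorsed = {
        a.doc_id
        for a in annotations
        if a.kind == "endorsement" and a.reader in trusted_readers
    }
    # Preserve the original document order.
    return [d for d in doc_ids if d in endorsed]


docs = ["d1", "d2", "d3"]
notes = [
    Annotation("d1", "alice", "endorsement"),
    Annotation("d2", "bob", "dislike"),
    Annotation("d3", "carol", "endorsement"),
]
selected = collaborative_filter(docs, notes, trusted_readers={"alice", "bob"})
print(selected)  # ['d1']
```

Note that the filter never looks at the document contents; it relies only on other readers' reactions, which is what distinguishes it from content-based filtering.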
4
Tapestry architecture
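[Figure: Tapestry architecture. Labeled components: Documents, Indexer, Document store, Annotation store, Filterer, Remailer, Little Box, Appraiser (shown twice), Tapestry Browser, Mail Reader. The diagram is divided into a Server side and a Client side.]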
5
Indexer
Understands the formats of various types of documents -- one indexing program corresponds to one type of document (e.g. the format of NetNews articles differs from that of articles in the New York Times).
Extracts the indexed fields from each document and stores them in the database.
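A small sketch of the "one indexing program per document type" idea follows. The two input formats, the function names, and the `INDEXERS` registry are assumptions made for illustration; the slides do not describe the real parsers.

```python
def index_netnews(raw):
    """Assumed format: 'Header: value' lines, a blank line, then the body."""
    head, _, body = raw.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    groups = headers.get("newsgroups", "").split(",")
    return {
        "sender": headers.get("from", ""),
        "subject": headers.get("subject", ""),
        "newsgroups": {g.strip() for g in groups if g.strip()},
        "words": set(body.lower().split()),
    }


def index_plain_article(raw):
    """Assumed format: the first line is the title, the rest is the body."""
    title, _, body = raw.partition("\n")
    return {"subject": title.strip(), "words": set(body.lower().split())}


# One indexing function per document type (hypothetical registry).
INDEXERS = {"netnews": index_netnews, "article": index_plain_article}


def index_document(doc_type, raw):
    """Dispatch to the indexer that understands this document type."""
    return INDEXERS[doc_type](raw)


post = "From: joe\nSubject: Hello\nNewsgroups: comp.ai, comp.db\n\nCollaborative filtering works"
fields = index_document("netnews", post)
print(fields["sender"], sorted(fields["newsgroups"]))
```

Each indexer returns the same kind of field dictionary, so everything downstream of the indexer can ignore the original document format.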
6
Document and Annotation Stores
Documents must be immutable because of the continuous semantics supported by the filterer -- WORM (write once, read many) disks can be used.
Documents are never deleted, so a large amount of disk storage is required.
Attributes are extensible and can be set-valued, so several relational tables have to be provided.
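Why set-valued attributes lead to several relational tables can be shown with a short sketch using Python's built-in `sqlite3` module. The table and column names are hypothetical, and the append-only behavior is only modeled by exposing insert and read functions and nothing else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        sender TEXT,
        subject TEXT
    );
    -- A set-valued field gets its own table: one row per (document, value).
    CREATE TABLE doc_to (
        doc_id INTEGER REFERENCES documents(id),
        recipient TEXT,
        PRIMARY KEY (doc_id, recipient)
    );
""")


def add_document(sender, subject, recipients):
    """Append a document; this sketch offers no update or delete."""
    cur = conn.execute(
        "INSERT INTO documents (sender, subject) VALUES (?, ?)",
        (sender, subject),
    )
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO doc_to (doc_id, recipient) VALUES (?, ?)",
        [(doc_id, r) for r in sorted(set(recipients))],
    )
    return doc_id


def recipients_of(doc_id):
    """Reassemble the set-valued 'to' field with a lookup on the side table."""
    rows = conn.execute(
        "SELECT recipient FROM doc_to WHERE doc_id = ?", (doc_id,)
    )
    return {row[0] for row in rows}


doc_id = add_document("ann", "CS294-7 reading", ["Joe", "Mike", "Joe"])
print(recipients_of(doc_id))
```

Every additional set-valued field would need another side table like `doc_to`, which is where the extra joins mentioned later in the deck come from.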
7
Appraisers
Further classify and organize messages by priority, by the filter query that selected them, or by any predicate you specify.
They are kept on the client side -- running only over the contents of the little box, rather than over the incoming document stream, improves performance.
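A client-side appraiser could look roughly like the sketch below. The rule format (a list of predicate and priority pairs, first match wins) and the message fields are assumptions for this example, not Tapestry's actual interface.

```python
def appraise(little_box, rules, default_priority=0):
    """Return (priority, message) pairs, highest priority first.

    rules is a list of (predicate, priority) pairs; the first matching
    predicate decides the priority of a message.
    """
    ranked = []
    for msg in little_box:
        priority = default_priority
        for predicate, value in rules:
            if predicate(msg):
                priority = value
                break
        ranked.append((priority, msg))
    # The sort is stable, so messages with equal priority keep arrival order.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# The little box holds only messages already selected by filter queries.
little_box = [
    {"id": 1, "filter": "cs294", "sender": "joe"},
    {"id": 2, "filter": "misc", "sender": "boss"},
    {"id": 3, "filter": "misc", "sender": "ann"},
]
rules = [
    (lambda m: m["sender"] == "boss", 10),   # any predicate you specify
    (lambda m: m["filter"] == "cs294", 5),   # which filter query selected it
]
order = [msg["id"] for _, msg in appraise(little_box, rules)]
print(order)  # [2, 1, 3]
```

Because the appraiser iterates only over the little box, its cost depends on the number of messages already selected for one user, not on the size of the incoming stream.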
8
Interaction with the Tapestry service
Using the Tapestry browser is preferable but not required -- you can continue to use your favorite mail reader.
The Tapestry browser keeps only document identifiers, which is possible because the document store is immutable.
When a message is deleted in the browser, it still exists in the document store.
9
Mechanisms of retrieving documents
[Figure: mechanisms of retrieving documents. Labels: Document arrived, Document store, Filter Queries, ad hoc queries, Appraisers, Browser.]
10
TQL: Tapestry Query Language
Advantages over SQL:
Supports an extensible set of fields in a document.
Supports sets.
Easy to use, because it is specialized.
Disadvantages compared with SQL:
Complicates the implementation: TQL has to be translated to SQL before execution, because Tapestry is built on top of a commercial database that supports only SQL.
11
Common document fields and their types
Field            Type
to               set of strings
cc               set of strings
newsgroups       set of strings
words            set of strings
date             date
sender           string
subject          string
in-reply-to      set of documents
ts (timestamp)   time
12
Annotations
Annotations are separate complex objects -- they are not treated as additional document fields.
The field ‘msg’ in an annotation object links it to its document.
The field ‘type’ in an annotation object defines which complex object it refers to -- each type of annotation has its own structure.
13
Example of TQL
Select all messages that were sent to ‘Joe’ and ‘Mike’, whose subject field or body contains the word ‘CS294-7’, to which neither of them has sent a reply, and which have been endorsed by somebody.

m.to = {‘Joe’, ‘Mike’} AND
(m.subject LIKE ‘%CS294-7%’ OR m.words={‘CS294-7’}) AND
NOT EXISTS (mreply: (mreply.sender=‘Joe’ OR mreply.sender=‘Mike’) AND
    mreply.in_reply_to = {m}) AND
EXISTS (a: a.type=‘endorsement’ AND a.msg=m)
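To make the four conditions of this query concrete, here is a hedged Python equivalent over in-memory data. It rests on an assumption about TQL: a comparison such as `m.to = {...}` or `m.words = {...}` on a set-valued field is read as "the field contains every listed value". The `Message` and `Annotation` classes and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: str
    sender: str
    to: frozenset
    subject: str
    words: frozenset
    in_reply_to: frozenset = frozenset()  # ids of the messages replied to


@dataclass(frozen=True)
class Annotation:
    type: str
    msg: str  # id of the annotated message


def matches(m, messages, annotations):
    """Python reading of the TQL example (set '=' assumed to mean containment)."""
    addressed = {"Joe", "Mike"} <= m.to
    on_topic = "CS294-7" in m.subject or "CS294-7" in m.words
    replied = any(
        r.sender in ("Joe", "Mike") and m.id in r.in_reply_to
        for r in messages
    )
    endorsed = any(
        a.type == "endorsement" and a.msg == m.id for a in annotations
    )
    return addressed and on_topic and not replied and endorsed


both = frozenset({"Joe", "Mike"})
messages = [
    Message("m1", "ann", both, "CS294-7 reading list", frozenset({"paper"})),
    Message("m2", "ann", both, "Room change", frozenset({"CS294-7", "room"})),
    Message("m3", "Joe", frozenset({"ann"}), "Re: Room change",
            frozenset({"ok"}), frozenset({"m2"})),
]
annotations = [Annotation("endorsement", "m1"), Annotation("endorsement", "m2")]

result = [m.id for m in messages if matches(m, messages, annotations)]
print(result)  # ['m1']
```

Message `m2` satisfies the first, second, and fourth conditions but is excluded because Joe replied to it, which is the clause that makes the query's result depend on when it is run.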
14
Filterer: Continuous Semantics
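Problems with periodic execution:
Most of the retrieved messages overlap with those returned by the previous execution.
Unpredictable behavior: consider the query in the previous slide (assume every condition is satisfied once the message arrives).
[Figure: timeline with the events "message arrives" and "Joe replies", marking the Yes/No answer that User A and User B each get at their own execution times. One user's query runs between the two events and returns the message (Yes); the other user's query never does (No). The two users see inconsistent results.]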
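The inconsistency can be reproduced with a few lines of simulation. The time values and function names are made up for illustration; only the "no reply yet" clause of the query is modeled, in line with the slide's assumption that every other condition holds on arrival.

```python
def unreplied_at(t, arrival, reply_time):
    """True if the message has arrived and has no reply yet at time t."""
    has_reply = reply_time is not None and reply_time <= t
    return arrival <= t and not has_reply


def seen_by(execution_times, arrival, reply_time):
    """A user sees the message if any of their periodic runs returned it."""
    return any(unreplied_at(t, arrival, reply_time) for t in execution_times)


ARRIVAL, REPLY = 1, 5  # hypothetical clock ticks

user_a = seen_by([2, 6], ARRIVAL, REPLY)  # one run falls between the events
user_b = seen_by([0, 6], ARRIVAL, REPLY)  # no run falls between the events

print(user_a, user_b)  # True False
```

Both users registered the same query, yet the outcome depends on when each periodic run happened to fire.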
15
Filterer: Continuous Semantics (continued)
Guarantee: every user with the same filter query should see the same result, regardless of when the query is executed (time-independent).
Solution: Continuous Semantics
The result of a filter query is the set of data that would be returned if the query were executed at every instant in time.
16
Filterer: Implementation
Monotone query:
Definition: a query whose result set is non-decreasing over time.
Property: continuous semantics is guaranteed by periodically executing the monotone query.
Implication: the document and annotation stores have to be immutable.
Incremental query:
A query that returns only the new results in a time interval.
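The relationship between the two kinds of query can be sketched as follows. This is a simplified model, assuming an append-only store and a predicate that depends only on the (immutable) document itself; the function names and sample data are hypothetical.

```python
def monotone_query(store, now, predicate):
    """All matching documents that have arrived by time `now`."""
    return {d["id"] for d in store if d["ts"] <= now and predicate(d)}


def incremental_query(store, last_t, now, predicate):
    """Only the matching documents that arrived in the interval (last_t, now]."""
    return {d["id"] for d in store if last_t < d["ts"] <= now and predicate(d)}


store = [
    {"id": "a", "ts": 1, "subject": "CS294-7 notes"},
    {"id": "b", "ts": 3, "subject": "lunch"},
    {"id": "c", "ts": 7, "subject": "CS294-7 homework"},
]


def about_course(d):
    return "CS294-7" in d["subject"]


# Periodic execution: each run delivers only the new results.
delivered = set()
last_t = 0
for now in (2, 5, 8):
    delivered |= incremental_query(store, last_t, now, about_course)
    last_t = now

print(sorted(delivered))  # ['a', 'c']
```

Because documents are never changed or removed, the result at an earlier time is always a subset of the result at a later time, and the union of the incremental results equals the full monotone result no matter when the periodic runs happen.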
17
Filterer: Implementation (continued)
Step 1: Query Transformation (in TQL)
Filter Query -> Monotone Query -> Incremental Query
Step 2: Query Translation
TQL -> SQL
Step 3: Query Optimization
SQL -> Query optimizer -> stored procedure (maintained in the database)
18
Example of Query Transformation
Filter Query -> Monotone Query
Consider the query in slide #13:

m.to = {‘Joe’, ‘Mike’} AND
(m.subject LIKE ‘%CS294-7%’ OR m.words={‘CS294-7’}) AND
m.ts + [2 weeks] <= now() AND
NOT EXISTS (mreply: (mreply.sender=‘Joe’ OR mreply.sender=‘Mike’) AND
    mreply.in_reply_to = {m} AND
    mreply.ts <= m.ts + [2 weeks]) AND
EXISTS (a: a.type=‘endorsement’ AND a.msg=m)

Note: the meaning is slightly different from that of the original query. It returns messages that have not been replied to by ‘Joe’ or ‘Mike’ within 2 weeks.
19
Example of Query Transformation
Monotone Query -> Incremental Query (from last_t to now())
Consider the query in the previous slide:

m.to = {‘Joe’, ‘Mike’} AND
(m.subject LIKE ‘%CS294-7%’ OR m.words={‘CS294-7’}) AND
m.ts + [2 weeks] <= now() AND
(last_t < m.ts + [2 weeks] AND m.ts + [2 weeks] <= now()) AND
NOT EXISTS (mreply: (mreply.sender=‘Joe’ OR mreply.sender=‘Mike’) AND
    mreply.in_reply_to = {m} AND
    mreply.ts <= m.ts + [2 weeks]) AND
EXISTS (a: a.type=‘endorsement’ AND a.msg=m)

Note: the standalone condition m.ts + [2 weeks] <= now() can be eliminated, because the interval condition on the following line already implies it.
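The time-related part of the incremental query can be sketched in Python with `datetime`. Only the 2-week deadline and the "no reply before the deadline" clause are modeled; the recipient, subject, and endorsement conditions are left out, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(weeks=2)


def due_without_reply(m_ts, reply_times, last_t, now):
    """Incremental form of the time conditions.

    True when the message's 2-week deadline falls in (last_t, now] and no
    reply from Joe or Mike arrived on or before that deadline.
    """
    deadline = m_ts + WINDOW
    in_interval = last_t < deadline <= now
    replied_in_time = any(ts <= deadline for ts in reply_times)
    return in_interval and not replied_in_time


m_ts = datetime(2024, 1, 1)  # deadline is 2024-01-15
run1 = (datetime(2024, 1, 10), datetime(2024, 1, 20))
run2 = (datetime(2024, 1, 20), datetime(2024, 1, 30))

print(due_without_reply(m_ts, [], *run1))                       # True
print(due_without_reply(m_ts, [], *run2))                       # False
print(due_without_reply(m_ts, [datetime(2024, 1, 18)], *run1))  # True
```

The message is reported exactly once, by the run whose interval contains its deadline. A reply that arrives after the deadline no longer changes the answer, which is what makes the transformed query monotone.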
20
Discussions
Monotone query transformation: there is a mismatch between what the user expects and the actual result set.
The immutability of the document and annotation stores means inflexibility.
Many relational tables mean more join operations -- the query optimizer is critical for good performance.
Security issues are not addressed.
Complexity of the design -- TQL is layered on top of a relational database.