SNFS: The design and implementation of a Social Network File System Ch. Kaidos, A. Pasiopoulos N. Ntarmos, P. Triantafillou University of Patras.



Shameless plug: if interested, please check out eXO: Decentralized Autonomous Scalable Social Networking, 5th Conference on Innovative Data Systems Research (CIDR 2011), 2011.

Social Networks: Our Take
1. Search for
   - People (friends, experts, …)
   - Content (books, photos, videos, blogs, websites, …)
2. Form entities (collections): friends-lists, content-libs
3. Search for entities, using previously-formed collections…
4. SNFS currently provides the foundation for these…

Tagging
- [Figure labels: Tag 1, Tag 2, Tag 3, Tag 4, Tag 5]
- Profiles: sets of tags describing entities.
- Search: based on profiles.
- Ranked retrieval (top-k).

Current State
- 5,000,000,000 photos; 3,000 photos/min (as of September 2010)
- 2,000,000,000 videos served up each day (May 2010)
- 600,000,000 monthly active users (January 2011)
- 15,000,000 books (October 2010); 130,000,000 by the end of the decade

Current State: need to access published content
- 22,750,000,000 queries in search engines
- 4,000,000,000 queries in YouTube
- 351,000,000 queries in Facebook
- 416,000,000 queries in MySpace
(U.S. market figures, December 2009)

Current State
- How do I find the stuff I want?
- How do I provide interesting objects to my users?

Proposal
- A content-aware file system for Social Network Systems.
- Useful to users, and to service providers too!

Previous Work on File Indexing
- 1991: Semantic File Systems by Gifford
- 1996: BeFS by Giampaolo and Meurillon, part of BeOS; BeOS never had commercial success
- Indexing Service on Windows NT: not needed at the time; a remnant of the Object File System from the unmaterialized Cairo project
- Typically no ranked retrieval
- No user input (tags)
- No user relationships

Desktop Searches
- 2004: Windows Desktop Search, widely popular
- Mac OS X's Spotlight, Google Desktop, Beagle, Strigi, Tracker...
- Typically no ranked retrieval (?)
- No user relationships: relations are not exploited for searching

Problems
- Power tools for power users...
- But for average users... Boolean operators??? SQL-like queries???

Previous Work on Ranked Retrieval
- 1968: SMART system by Salton, which introduced weights in retrieval instead of classical Boolean retrieval
- 1975: Vectors and cosine similarity by Salton
- 1988: Other similarity functions tested and evaluated by Salton and Buckley
- 2003: Fagin proposes and compares several efficient algorithms for top-k retrieval

Design

Design – SNFS
- Tags are extracted from each object and stemmed, and their frequencies are counted.
- Weights are calculated for each tag and document.
- Each object is associated with a unique id in a tree.
- A tf-idf weighting scheme was chosen.
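The weighting step above can be sketched roughly as follows. The slides do not say which tf-idf variant SNFS uses, so this sketch assumes raw tag frequency multiplied by ln(N/df); the function name `tf_idf_weights` and the sample tags are hypothetical, and stemming is left out.

```python
import math
from collections import Counter


def tf_idf_weights(objects):
    """Map each object id to {tag: weight}.

    Assumed variant (not confirmed by the slides):
    weight = tf * ln(N / df), with raw counts as tf.
    """
    n = len(objects)
    # Document frequency: number of objects carrying each tag.
    df = Counter()
    for tags in objects.values():
        df.update(set(tags))
    weights = {}
    for obj_id, tags in objects.items():
        tf = Counter(tags)
        weights[obj_id] = {
            tag: count * math.log(n / df[tag]) for tag, count in tf.items()
        }
    return weights


# Hypothetical tag sets; the keys stand in for the unique object ids.
objects = {
    1: ["beach", "sunset", "beach"],
    2: ["beach", "city"],
    3: ["city", "night"],
}
weights = tf_idf_weights(objects)
```

With this variant, a tag carried by every object gets weight 0, so it does not influence ranking.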

Design – SNFS
- Term weight and object ID are stored in an inverted index.
- Each posting list of the index is a B+Tree stored in secondary memory.
- The position of the root of each B+Tree in the index is stored in a Red-Black Tree.
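As an in-memory illustration of this layout, the sketch below builds one posting list per tag, ordered by decreasing weight. It is only a stand-in: a plain dict plays the role of the Red-Black Tree that locates each posting list, and a sorted Python list plays the role of the on-disk B+Tree. The name `build_inverted_index` and the sample weights are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(weights):
    """Turn {obj_id: {tag: weight}} into {tag: [(weight, obj_id), ...]}.

    Each posting list is sorted by decreasing weight (ties broken by
    object id), the order a threshold algorithm expects.
    """
    index = defaultdict(list)
    for obj_id, tag_weights in weights.items():
        for tag, weight in tag_weights.items():
            index[tag].append((weight, obj_id))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(index)


# Hypothetical per-object tag weights.
sample_weights = {
    1: {"beach": 0.8, "sunset": 1.1},
    2: {"beach": 0.4, "city": 0.4},
    3: {"city": 0.4, "night": 1.1},
}
index = build_inverted_index(sample_weights)
```

Looking up `index["beach"]` then yields the postings for that tag, highest weight first.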

Design – Search and retrieval
- The query is split into terms, which are then stemmed.
- The score of each document is calculated using a threshold algorithm and a tf-idf function.

Threshold Algorithms: NRA (No Random Access) Algorithm
- Input: posting lists sorted on weight (decreasing).
- [Figure: posting lists for terms t1, t2, t3 read in parallel at depths 1, 2, 3 (entries d1, d3, d2 at depth 1; d4, d5, d2 at depth 2; d4, d3, d2 at depth 3), next to a table of doc IDs d1 to d5 with their accumulated scores.]
- Threshold per depth: s1+s2+s3 at depth 1, s4+s5+s6 at depth 2, s7+s8+s9 at depth 3.
- The algorithm halts when no score below the top-k objects can be improved to exceed the threshold.
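A minimal sketch of NRA over in-memory posting lists is given below. It follows the usual lower-bound/upper-bound bookkeeping from Fagin's description, which is an assumption about how the halting rule on the slide is realized in SNFS. The function name `nra_top_k`, the summation scoring, and the sample lists are hypothetical; k is assumed to be at least 1.

```python
def nra_top_k(posting_lists, k):
    """Top-k by summed weight, using sorted access only.

    posting_lists: one list of (weight, obj_id) per query term, each
    sorted by decreasing weight. Returns (top_k, depth_read); the
    reported scores are lower bounds and may be incomplete.
    """
    seen = {}  # obj_id -> {list index: weight seen there}
    max_depth = max((len(p) for p in posting_lists), default=0)

    def lower(obj):
        # Worst case: weights not yet seen count as 0.
        return sum(seen[obj].values())

    for depth in range(max_depth):
        bottoms = []  # last weight read from each list at this depth
        for i, plist in enumerate(posting_lists):
            if depth < len(plist):
                weight, obj = plist[depth]
                seen.setdefault(obj, {})[i] = weight
                bottoms.append(weight)
            else:
                bottoms.append(0.0)  # exhausted list contributes nothing

        def upper(obj):
            # Best case: each unseen weight equals that list's bottom.
            missing = (b for i, b in enumerate(bottoms) if i not in seen[obj])
            return lower(obj) + sum(missing)

        ranked = sorted(seen, key=lambda o: (-lower(o), o))
        if len(ranked) >= k:
            kth = lower(ranked[k - 1])
            # sum(bottoms) bounds the score of any object not seen yet.
            if kth >= sum(bottoms) and all(upper(o) <= kth for o in ranked[k:]):
                return [(o, lower(o)) for o in ranked[:k]], depth + 1

    # All lists exhausted: the lower bounds are now exact.
    ranked = sorted(seen, key=lambda o: (-lower(o), o))
    return [(o, lower(o)) for o in ranked[:k]], max_depth


# Hypothetical posting lists for two query terms.
t1 = [(8.0, "d1"), (7.0, "d2"), (1.0, "d3")]
t2 = [(9.0, "d2"), (5.0, "d1"), (2.0, "d3")]
nra_result, nra_depth = nra_top_k([t1, t2], k=1)
```

In this example the scan stops at depth 2 without ever reading d3. The per-object state in `seen` is the state keeping that the comparison slide weighs against TA's disk accesses.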

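Threshold Algorithms: TA (Threshold Algorithm with random accesses)
- Input: posting lists sorted on weight (decreasing).
- [Figure: posting lists for terms t1, t2, t3 read in parallel at depths 1, 2, 3 (entries d1, d3, d2 at depth 1; d4, d5, d2 at depth 2; d4, d3, d2 at depth 3), next to a table of doc IDs d1 to d5 whose scores are completed through random accesses.]
- Threshold per depth: s1+s2+s3 at depth 1, s4+s5+s6 at depth 2, s7+s8+s9 at depth 3.
- The algorithm halts when the threshold falls to or below the score of the last (k-th) object in the top-k.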
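A comparable sketch of TA follows. Here the random accesses are simulated with per-list dictionaries, whereas in SNFS they would be lookups in the on-disk index; the function name `ta_top_k`, the summation scoring, and the sample lists are hypothetical.

```python
def ta_top_k(posting_lists, k):
    """Top-k by summed weight, using sorted plus random access.

    posting_lists: one list of (weight, obj_id) per query term, each
    sorted by decreasing weight. Returns (top_k, depth_read) with
    exact scores.
    """
    # Stand-in for random access: obj_id -> weight, one table per list.
    lookup = [{obj: w for w, obj in plist} for plist in posting_lists]
    scores = {}  # obj_id -> full score
    max_depth = max((len(p) for p in posting_lists), default=0)

    def top():
        best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return best[:k]

    for depth in range(max_depth):
        bottoms = []  # last weight read from each list at this depth
        for plist in posting_lists:
            if depth < len(plist):
                weight, obj = plist[depth]
                bottoms.append(weight)
                if obj not in scores:
                    # Random access: fetch this object's weight everywhere.
                    scores[obj] = sum(t.get(obj, 0.0) for t in lookup)
            else:
                bottoms.append(0.0)  # exhausted list contributes nothing
        threshold = sum(bottoms)  # bound for any object not seen yet
        current = top()
        if len(current) == k and current[-1][1] >= threshold:
            return current, depth + 1

    return top(), max_depth


# Hypothetical posting lists for two query terms.
t1 = [(8.0, "d1"), (7.0, "d2"), (1.0, "d3")]
t2 = [(9.0, "d2"), (5.0, "d1"), (2.0, "d3")]
ta_result, ta_depth = ta_top_k([t1, t2], k=2)
```

TA keeps only one number per seen object, but each newly seen object costs one lookup per remaining list, which is where the extra disk accesses come from.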

Qualitative Comparison: NRA vs. TA
- Criteria compared: disk accesses, state keeping and computation, system calls.
- We expect TA to perform many more slow disk accesses.
- Can the cost of NRA's large state keeping and computation outweigh TA's disk accesses?
- We implement both, on hard disk and on RAM disk, to find out...

Implementation with FUSE

Testing
- 4 real-world test sets
- Files containing tags from online objects
- The index is normally on secondary memory
- A RAM disk is used to evaluate the effect of disk accesses

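[Figure: number of results demanded vs. time, disk-based index; series: NRA, TA]

[Figure: number of results demanded vs. time, RAM-based index; series: NRA, TA]

[Figure: number of query terms vs. time, disk-based index; series: NRA, TA]

[Figure: number of query terms vs. time, RAM-based index; series: NRA, TA]

[Figures: Beagle vs. NRA; panels: query terms vs. time, results demanded vs. time]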

Conclusions
SNFS:
- Indexing, storage, and ranked retrieval of entities in a social network.
- Study of the efficiency of algorithms and implementations, using real-world data.
- Competitive performance (e.g. against Beagle).
- Many ways of further expansion.

Future Work
- Expansion for distributed systems and clouds
  - Distributed file systems (HDFS)
  - Distributed data structures
- Tagging, indexing, and searching for entity collections: straightforward, as our object implementation/abstraction captures this.
- Establishing entities consisting of relationships between entities, using advanced tagging, and searching for these…