Presented by MySQL AB® & O’Reilly Media, Inc.
Sphinx: High-performance full-text search for MySQL
Andrew Aksyonoff, Peter Zaitsev

Full Text Search with Sphinx. OSCON 2009. Peter Zaitsev, Percona Inc; Andrew Aksyonoff, Sphinx Technologies Inc.

What’s Sphinx?
- FOSS full-text search engine
- Specially designed for indexing databases
- Integrates well with MySQL
- Provides greatly improved full-text search
- Sometimes, can improve non-full-text queries
  - By more efficient processing (in some cases)
  - By distributed processing on a cluster (in all)
- Details later in this talk

Why Sphinx?
- Major reasons
  - Better indexing speed
  - Better searching speed
  - Better relevance
  - Better scalability
- “Minor” reasons
  - Many other features
  - Like fixed RAM usage, “faceted” searching, geo-distance, built-in HTML stripper, morphology support, 1-grams, snippets highlighting, etc.

The meaning of “better”
- Better indexing speed
  - times faster than MySQL FULLTEXT
  - 4-10 times faster than other external engines
- Better searching speed
  - Heavily depends on the mode (boolean vs. phrase) and additional processing (WHERE, ORDER BY, etc.)
  - Up to 1000 (!) times faster than MySQL FULLTEXT in extreme cases (e.g. a large result set with GROUP BY)
  - Up to 2-10 times faster than other external engines

The meaning of “better” 2.0
- Better relevance
  - Sphinx phrase-based ranking in addition to classic statistical BM25
  - Sample query – “To be or not to be”
  - Optional, can be turned off for performance
- Better scalability
  - Vertical – can utilize many CPU cores, many HDDs
  - Horizontal – can utilize many servers
  - Out-of-the-box support
  - Transparent to the app; a matter of server config changes

How does it scale?
- Distributed searching with several machines
- Fully transparent to the calling application
- Biggest known Sphinx cluster
  - 1,200,000,000+ documents (yes, that’s a billion)
  - 1.5 terabytes
  - 1+ million searches/day
  - 7 boxes x 2 dual-core CPUs = 28 cores
- Busiest known Sphinx cluster
  - 30+ million searches/day using 15 boxes

How does it work?
- Two standalone programs
  - indexer – pulls data from DB, builds indexes
  - searchd – uses indexes, answers queries
- Client programs talk to searchd over TCP
  - Via native APIs (PHP, Perl, Python, Ruby, Java)…
  - Via SphinxSE, pluggable MySQL engine
- indexer periodically rebuilds the indexes
  - Typically, using cron jobs
  - Searching works OK during rebuilds
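The cron-based rebuild can be a single crontab line; a sketch, assuming indexer lives at /usr/local/bin/indexer and the config is in its default location (both paths are assumptions). The --all and --rotate flags build shadow copies of every configured index and signal the running searchd to swap them in, which is why searching keeps working during rebuilds:

```shell
# rebuild every index defined in sphinx.conf once an hour;
# --rotate tells the running searchd to pick up the new index files
0 * * * * /usr/local/bin/indexer --all --rotate >/dev/null 2>&1
```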

Indexing workflow
- Data sources – “where to get the data?”
  - MySQL, Postgres, XML pipe…
- Local indexes – “how to index the data?”
  - Also storage location, valid characters list, stop words, stemming, word forms dictionaries, tokenizing exceptions, substring indexing, N-grams, HTML stripping…
- Arbitrary number of indexes
- Arbitrary number of sources per index
  - Can pull data from different DB boxes in a shard

Sample – all eggs in one basket
[Diagram: sources SRC01 and SRC02 on host SEARCHER pull from MySQL hosts DB01 and DB02 and are combined into a single index FULLINDEX on host SEARCHER]
Combining sharded database data for the ease of use

Distributed indexes
- Essentially, lists of local and remote indexes
- All local indexes are searched sequentially
- All remote indexes are searched in parallel
- All results are merged

index dist1
{
    type  = distributed
    local = chunk1
    agent = box02:3312:chunk02
    agent = box03:3312:chunk03
    agent = box04:3312:chunk04
}
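The sequential-local / parallel-remote behavior can be modeled in a few lines of Python; a toy sketch of the merging logic, not searchd’s actual implementation. The names (`search_distributed`, `chunk1`, …) are hypothetical, and each “index” is just a function returning (doc_id, weight) pairs:

```python
from concurrent.futures import ThreadPoolExecutor

def search_distributed(query, local_indexes, remote_agents):
    """Toy model of a Sphinx distributed index: local chunks are
    searched one after another, remote agents concurrently, and
    all match lists are merged by weight."""
    matches = []
    with ThreadPoolExecutor() as pool:
        # fire off remote agents first so they work while we scan locally
        futures = [pool.submit(agent, query) for agent in remote_agents]
        for index in local_indexes:          # sequential local search
            matches.extend(index(query))
        for f in futures:                    # collect remote results
            matches.extend(f.result())
    # merge: best weight first, doc id as a tie-breaker
    return sorted(matches, key=lambda m: (-m[1], m[0]))

# usage with fake one-chunk "indexes"
chunk1 = lambda q: [(1, 10), (2, 5)]
chunk2 = lambda q: [(3, 7)]
chunk3 = lambda q: [(4, 12)]
print(search_distributed("ipod", [chunk1], [chunk2, chunk3]))
# -> [(4, 12), (1, 10), (3, 7), (2, 5)]
```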

Sample – divide and conquer
[Diagram: distributed index DIST01 on host WEB01 queries indexes CHUNK01 through CHUNK10 on hosts GREP01 through GREP10; each chunk index pulls its source data from MySQL host UBERDB]
Sharding full-text indexes to improve searching latency...

Searching 101 – the client side
- Create a client object
- Set up the options
- Fire the query

<?php
include ( "sphinxapi.php" );

$cl = new SphinxClient ();
$cl->SetMatchMode ( SPH_MATCH_PHRASE );
$cl->SetSortMode ( SPH_SORT_EXTENDED, "price desc" );

$res = $cl->Query ( "ipod nano", "products" );
var_dump ( $res );
?>

Searching 102 – match contents
- Matches will always have document ID, weight
- Matches can also have numeric attributes
- No string attributes yet (pull ’em from MySQL)

print_r ( $result["matches"][0] );

Array
(
    [id] => 123
    [weight] =>
    [attrs] => Array
        (
            [group_id] =>
            [added] =>
        )
)

Searching 103 – why attributes
- Short answer – efficiency
- Long answer – efficient filtering, sorting, and grouping for big result sets (over 1,000 matches)
- Real-world example:
  - Using Sphinx for searching only and then sorting just 1000 matches using MySQL – up to 2-3 seconds
  - Using Sphinx for both searching and sorting – improves that to under 0.1 second
  - Random row IO in MySQL, no row IO in Sphinx
  - Now imagine there’s 1,000,000 matches…

Moving parts
- SQL query parts that can be moved to Sphinx
  - Filtering – WHERE vs. SetFilter() or fake keyword
  - Sorting – ORDER BY vs. SetSortMode()
  - Grouping – GROUP BY vs. SetGroupBy()
- Up to 100x (!) improvement vs. the “naïve” approach
- Rule of thumb – move everything you can from MySQL to Sphinx
- Rule of thumb 2.0 – apply sacred knowledge of the Sphinx pipeline (and then move everything)

Searching pipeline in 30 seconds
- Search, WHERE, rank, ORDER/GROUP
- “Cheap” boolean searching first
- Then filters (WHERE clause)
- Then “expensive” relevance ranking
- Then sorting (ORDER BY clause) and/or grouping (GROUP BY clause)

Searching pipeline details
- Query is evaluated as a boolean query
  - CPU and IO, O(sum(docs_per_keyword))
- Candidates are filtered based on their attribute values
  - CPU only, O(sum(docs_per_keyword))
- Relevance rank (weight) is computed
  - CPU and IO, O(sum(hits_per_keyword))
- Matches are sorted and grouped
  - CPU only, O(filtered_matches_count)
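The four stages can be illustrated with a toy in-memory index in Python; a sketch of the ordering only (cheap boolean AND first, attribute filters before expensive ranking, sorting last), with hypothetical names like `DOCS` and `search`, not Sphinx’s actual data structures:

```python
# toy documents: id -> (set of words, attribute dict)
DOCS = {
    1: ({"ipod", "nano", "black"}, {"price": 199}),
    2: ({"ipod", "case"},          {"price": 19}),
    3: ({"ipod", "nano"},          {"price": 149}),
}

def search(keywords, attr_filter, limit):
    # 1) boolean AND: cheap candidate selection
    candidates = [d for d, (words, _) in DOCS.items()
                  if all(k in words for k in keywords)]
    # 2) attribute filtering (the WHERE clause) before ranking
    candidates = [d for d in candidates if attr_filter(DOCS[d][1])]
    # 3) "expensive" ranking only for the survivors (here: word overlap)
    ranked = [(d, len(DOCS[d][0] & set(keywords))) for d in candidates]
    # 4) sort by weight and apply the limit
    return sorted(ranked, key=lambda m: -m[1])[:limit]

print(search(["ipod", "nano"], lambda a: a["price"] < 180, 10))
# -> [(3, 2)]
```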

Filters vs. fake keywords
- The key idea – instead of using an attribute, inject a fake keyword when indexing

sql_query = SELECT id, title, vendor ...
$sphinxClient->SetFilter ( "vendor", array(123) );
$sphinxClient->Query ( "laptop", "products" );

vs.

sql_query = SELECT id, title, CONCAT('_vnd', vendor) ...
$sphinxClient->Query ( "laptop _vnd123", "products" );

Filters vs. fake keywords
- Filters
  - Will eat extra CPU
  - Linear in the pre-filtered candidate count
- Fake keywords
  - Will eat extra CPU and IO
  - Linear in the per-keyword matching document count
  - That is strictly equal (!) to the post-filter match count
- Conclusion
  - Everything depends on selectivity
  - For selective values, keywords are better
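In Python pseudocode, the two halves of the trick (index-time injection, query-time rewriting) look roughly like this; `index_text` and `rewrite_query` are hypothetical helper names, not part of any Sphinx API:

```python
def index_text(title, vendor_id):
    """Index-time side: append a fake keyword encoding the vendor
    attribute, mirroring CONCAT('_vnd', vendor) in sql_query."""
    return "%s _vnd%d" % (title, vendor_id)

def rewrite_query(query, vendor_id=None):
    """Query-time side: a selective vendor filter becomes an extra
    AND-ed keyword instead of a SetFilter() call."""
    if vendor_id is not None:
        query += " _vnd%d" % vendor_id
    return query

print(index_text("cheap laptop", 123))   # -> cheap laptop _vnd123
print(rewrite_query("laptop", 123))      # -> laptop _vnd123
```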

Sorting
- Always optimizes for the “limit”
- Fixed RAM requirements, never an IO
- Controlled by the max_matches setting
  - Both server-side and client-side
  - Defaults to 1000
- Processes all matching rows
  - Keeps at most N best rows in RAM, at all times
- MySQL currently does not optimize that well
  - MySQL sorts everything, then picks up the N best
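The fixed-RAM top-N sort can be sketched with a bounded min-heap in Python; a model of the idea behind max_matches, not searchd’s actual code:

```python
import heapq

def top_n(matches, max_matches=1000):
    """Stream (doc_id, weight) matches, keeping at most max_matches
    best rows in RAM; the heap root is always the worst kept row."""
    heap = []
    for doc_id, weight in matches:
        if len(heap) < max_matches:
            heapq.heappush(heap, (weight, doc_id))
        elif weight > heap[0][0]:
            # better than the current worst kept row: swap it in
            heapq.heapreplace(heap, (weight, doc_id))
    # final ordering: best weight first
    return [(d, w) for w, d in sorted(heap, reverse=True)]

rows = [(1, 10), (2, 50), (3, 30), (4, 70), (5, 20)]
print(top_n(rows, max_matches=3))
# -> [(4, 70), (2, 50), (3, 30)]
```

RAM use is bounded by max_matches regardless of how many rows match, which is exactly why the server refuses to return matches beyond that limit.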

Grouping
- Also in fixed RAM, also IO-less
- Comes at the cost of COUNT(*) precision
  - Fixed RAM usage can cause underestimates
  - Aggregates-only transmission via distributed agents can cause overestimates
- Frequently that’s OK anyway
  - Consider a 10-year per-day report – it will be precise
  - Consider a “choose top-10 destination domains from a 100-million-links graph” query – a 10 to 100 times speedup at the cost of 0.5% error might be acceptable
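The underestimation effect is easy to demonstrate with a toy bounded aggregator in Python; a sketch of the failure mode (an evicted group loses its tally), not Sphinx’s actual grouping algorithm:

```python
def bounded_group_counts(group_ids, max_groups):
    """Count rows per group using at most max_groups slots; when the
    table is full, the smallest group is evicted and its tally lost,
    so final counts can only be exact or underestimated."""
    counts = {}
    for g in group_ids:
        if g in counts:
            counts[g] += 1
        elif len(counts) < max_groups:
            counts[g] = 1
        else:
            # table full: evict the currently smallest group
            victim = min(counts, key=counts.get)
            del counts[victim]
            counts[g] = 1
    return counts

stream = ["a", "a", "b", "a", "c", "b", "b"]
exact = {"a": 3, "b": 3, "c": 1}
approx = bounded_group_counts(stream, max_groups=2)
print(approx)  # "b" lost its earlier tally when evicted for "c"
# -> {'a': 3, 'b': 2}
```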

More optimization possibilities
- Using query statistics
- Using the multi-query interface
- Choosing the proper ranking mode
- Distributing the CPU/HDD load
- Adding stopwords
- etc.

Query statistics
- Applies to migrating from MySQL FULLTEXT
- Total match counts are immediately available – no need to run a 2nd query
- Per-keyword match counts are also available – can be used not just as a minor addition to search results, but also for automatic query rewriting

Multi-query interface
- Send many independent queries in one batch, allowing Sphinx to optimize them internally
- Always saves on network roundtrips
- Sometimes saves on expensive operations
- Most frequent example – same full-text query, different result set “views”

Multi-query sample

$client = new SphinxClient ();
$q = "laptop"; // coming from website user

$client->SetSortMode ( SPH_SORT_EXTENDED, "… desc" );
$client->AddQuery ( $q, "products" );

$client->SetGroupBy ( SPH_GROUPBY_ATTR, "vendor_id" );
$client->AddQuery ( $q, "products" );

$client->ResetGroupBy ();
$client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );
$client->SetLimits ( 0, 10 );
$client->AddQuery ( $q, "products" );

$result = $client->RunQueries ();

Offloading non-full-text queries
- Basic “general” SQL queries can be rewritten in “full-text” form – and run by Sphinx

SELECT * FROM table WHERE a=1 AND b=2
ORDER BY c DESC LIMIT 60,20

$client->SetFilter ( "a", array(1) );
$client->SetFilter ( "b", array(2) );
$client->SetSortMode ( SPH_SORT_ATTR_DESC, "c" );
$client->SetLimits ( 60, 20 );
$result = $client->Query ( "", "table" );

- Syntax disclaimer – we are a full-text engine!
- SphinxQL coming at some point in the future

Why do that?
- Sometimes Sphinx reads outperform MySQL
  - Sphinx always does a RAM-based “full scan”
  - A MySQL index read with bad selectivity can be slower
  - A MySQL full scan will most likely be slower
  - MySQL can’t index every column combination
- Also, Sphinx queries are easier to distribute
- But Sphinx indexes are essentially read-only
  - Well, almost (attribute updates are possible)
- Complementary to MySQL, not a replacement

SELECT war story
- Searches on Sahibinden.com
  - Both full-text and not
  - “Show all auctioned items in the laptops category with sellers from Ankara in the $1000 to $2000 range”
  - “Show matches for ‘ipod nano’ and sort by price”
- Many columns, no way to build covering indexes
- Sphinx full scans turned out to be 1.5-3x better than MySQL full scans or 1-column index reads
- Also, the code for full-text and non-full-text queries was unified

GROUP BY war story
- Domain cross-link report on BoardReader.com
  - “Show top 100 destination domains for the last month”
  - “Show top 100 domains that link to YouTube”
  - ~200 million rows overall
- Key features of the report queries
  - They always group by domain, and sort by counts
  - The result sets are small
  - Approximate results are acceptable – we don’t care whether there were exactly 813,719 or 814,101 links from domain X

GROUP BY war story
- The MySQL prototype took up to 300 seconds/query
- Sphinx queries were much easier to distribute
- URLs are preprocessed, then full-text indexed
  - → test$com, test$com$path, test$com$path$doc, …
- Queries are distributed over a 7-machine cluster
- Now takes within 1-2 seconds in the worst case
- This is not the main cluster load
  - The main load is searching 1.2B documents

Summary
- Discussed the Sphinx full-text engine
- Discussed its pipeline internals – helps to optimize queries
- Discussed how it can be used to offload and/or optimize “general” SQL queries
- Got full-text queries? Try Sphinx
- Got questions? Now is the time!