Adam Koehler
Index Speed Demons - How To Turbo-Charge Your Text-Based Queries Using Full-Text Indexing
About Me: Adam Koehler
- Senior Database Administrator at ScriptPro LLC
- 15 years of progressive experience with SQL Server, from 7.0 to 2014
- E-mail: ajkoehl@gmail.com
- Twitter: @sql_geek
- LinkedIn: https://www.linkedin.com/in/adam-j-koehler
- Blog: https://sqlgeekery.wordpress.com
What are we going to cover?
- Relational & full-text indexes
  - What are they?
  - How are they implemented?
  - What are the benefits?
  - What are the downsides?
- Other search products: Apache Lucene.NET
Relational Indexes – What are they?
- A set of pages organized in a B-tree structure with a multi-level hierarchy
- Can be defined as a clustered or non-clustered index
- Built into the SQL Server Engine itself
- Can be created on tables or views
Relational Indexes – Clustered Indexes
- The physical ordering of the data into an organized structure based on the key values of the index
- Only one allowed per table, because the rows can be stored in only one order (ascending or descending)
- The leaf nodes of a clustered index are the data rows themselves
Relational Indexes – Non-clustered Indexes
- Contain a row locator that points to the clustered index, or to the data row if the table is a heap
- Can create up to 999 on an individual table
- Can add non-key columns as included columns at the leaf level of the index, which allows fully covered queries to execute optimally
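The difference between the two index types can be sketched in a few lines of Python. This is only a conceptual model, assuming a sorted list in place of a real B-tree; the table, its columns, and the function names are hypothetical and do not reflect how SQL Server stores pages.

```python
from bisect import bisect_left

# Hypothetical table, stored in clustered-key order (CustomerId).
rows = [
    (1, "Koehler", "Kansas City"),
    (2, "Adams", "Omaha"),
    (3, "Baker", "Denver"),
    (4, "Adams", "Tulsa"),
]
clustered_keys = [row[0] for row in rows]

def clustered_seek(customer_id):
    """Binary search on the clustered key; the leaf entry is the data row itself."""
    i = bisect_left(clustered_keys, customer_id)
    if i < len(rows) and clustered_keys[i] == customer_id:
        return rows[i]
    return None

# Non-clustered index on LastName: sorted (key, row locator) pairs,
# where the row locator is the clustered key of the row.
nonclustered = sorted((row[1], row[0]) for row in rows)

def nonclustered_seek(last_name):
    """Find matching index entries, then follow each row locator to the data row."""
    i = bisect_left(nonclustered, (last_name,))
    found = []
    while i < len(nonclustered) and nonclustered[i][0] == last_name:
        found.append(clustered_seek(nonclustered[i][1]))
        i += 1
    return found

print(clustered_seek(3))
print(nonclustered_seek("Adams"))
```

The second lookup in `nonclustered_seek` is the extra hop that included columns avoid: if the index entry already carried every column the query needs, the row locator would not have to be followed.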
Relational Indexes – Benefits
- They're easy to implement, no additional code required
- Commonly used, so there's a ton of information out there on indexing strategies
- Work with most data types (the exceptions are listed under Downsides)
Relational Indexes – Downsides
- Depending on the index structure, as the table gets bigger, so do the indexes on the table
- As the indexes get larger, the time to query data through them can increase
- Fragmentation can occur in the indexes, which can increase space usage & slow down queries
- Queries against this data compare row by row and byte by byte, which can be slow depending on the amount of data you're dealing with
- Certain data types cannot be key columns: varchar(max), nvarchar(max), varbinary(max), xml
Relational Indexes – Implementation
- Uses the CREATE CLUSTERED INDEX & CREATE NONCLUSTERED INDEX statements
- Have visibility into indexes using the following DMVs, DMFs, and catalog views:
  - sys.dm_db_index_physical_stats
  - sys.dm_db_index_operational_stats
  - sys.dm_db_index_usage_stats
  - sys.dm_db_partition_stats
  - sys.allocation_units
DEMO
Full-Text Indexing – What is it?
- A token (word)-based index that allows for searching against character and BLOB data types (such as Excel & Word documents)
- Has been a part of SQL Server since SQL Server 7.0
- Significantly updated in SQL Server 2008 to fully integrate into the SQL Server Engine
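To make "token-based" concrete, here is a minimal Python sketch contrasting a row-by-row scan with a lookup in a token index. It is an illustration only: the documents, the `tokenize` rule, and the function names are assumptions, and SQL Server's real full-text index structure is considerably more involved.

```python
import re
from collections import defaultdict

# Hypothetical documents, keyed by the table's unique key.
docs = {
    1: "The quick brown fox",
    2: "Quick queries need good indexes",
    3: "The fox reads the index",
}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, text in documents.items():
        for token in tokenize(text):
            index[token].add(key)
    return index

def scan(documents, word):
    """Unindexed search: every document is read and tokenized on every query."""
    return sorted(k for k, text in documents.items() if word.lower() in tokenize(text))

def lookup(index, word):
    """Indexed search: one dictionary lookup, no document is re-read."""
    return sorted(index.get(word.lower(), set()))

index = build_index(docs)
print(scan(docs, "quick"))
print(lookup(index, "quick"))
```

Both calls return the same keys; the difference is that `scan` does work proportional to the total amount of text, while `lookup` only touches the entry for one token.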
Full-Text Indexes – Architecture
- Consists of two parts:
  - Full-Text Engine in sqlservr.exe
    - Responsible for query compilation and processing
  - Filter daemon host process (fdhost.exe)
    - Responsible for loading the filters that the Full-Text Engine uses
    - Started by the MSSQLFDLauncher service
Full-Text Indexes – Full-Text Engine
- sqlservr.exe is responsible for the following components of Full-Text Search:
  - User tables
  - Full-text gatherer
    - Works with the full-text crawl threads to schedule and execute the population of the indexes and to monitor full-text catalogs
  - Thesaurus files
    - Stored in <sql instance directory>\MSSQL\FTData
  - Stoplist objects
    - Common noise words that are excluded from searches
  - Query processor
    - If a query contains a full-text search, the processor passes it off to the Full-Text Engine for compilation and execution
  - Full-Text Engine index writer
    - Builds the structure used to store the indexed items
Full-Text Indexes – FD Host Process
- Responsible for accessing, filtering, and word breaking data from tables, and for stemming the query input
- Has the following components:
  - Protocol handler
    - Pulls the data from memory for processing and accesses data from user tables
  - Filters
    - Data in varbinary, varbinary(max), image, or xml columns must be filtered before it can be indexed
    - The filters are chosen by document type; they extract chunks of text from the documents, removing formatting and keeping the text and position information
  - Word breakers and stemmers
    - Language-specific components; word breakers find word boundaries based on the lexical rules of a given language
    - Stemmers conjugate verbs and expand words into their inflected forms
    - At indexing time, the filter daemon uses these to perform linguistic analysis on the text data from a given column, based on the language defined on the index itself
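The word breaking, stoplist, and stemming steps can be pictured as a small pipeline. The Python sketch below is a toy under stated assumptions: the stoplist, the suffix list, and the function names are made up for illustration, and the stemmer simply strips suffixes, which is far cruder than the language-specific components SQL Server ships.

```python
import re

# Hypothetical noise words; real stoplists are language specific.
STOPLIST = {"the", "a", "an", "and", "of", "to", "is"}
# Naive suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "es", "ed", "s")

def word_break(text):
    """Return (position, token) pairs, with 1-based word positions."""
    return list(enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1))

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Word-break, drop stoplist words, and stem what is left.

    Positions are assigned before stop words are dropped, so the gaps
    they leave behind are preserved.
    """
    return [(pos, stem(tok)) for pos, tok in word_break(text) if tok not in STOPLIST]

print(analyze("The engine indexes documents"))
```

Keeping the original positions is what later makes phrase and proximity searches possible, since the distance between two words can still be computed after noise words are removed.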
Full-Text Indexes – Search Processing https://docs.microsoft.com/en-us/sql/relational-databases/search/full-text-search
Full-Text Indexes – Benefits
- Allows for semantic search operations against fields in the database
- As long as automatic population is turned on, full-text index maintenance is fairly simple
- The size of the full-text index on the table is usually smaller than that of a relational index
Full-Text Indexes – Downsides
- Requires modification of existing code to support searches
- Only one full-text index allowed per table
- Can only be created on columns of the following data types:
  - char, varchar, nchar, nvarchar
  - text, ntext
  - image
  - xml
  - varbinary and varbinary(max)
Full-Text Indexes – Implementation
- The FDHost launcher service (MSSQLFDLauncher) must be started
- Named Pipes must be an enabled network protocol for SQL Server
- Must create a full-text catalog first in order to group any full-text indexes together (CREATE FULLTEXT CATALOG)
  - Can have multiple catalogs per database
- Must have a unique, single-column, non-nullable key index defined on the table you're going to put the full-text index on (i.e. a primary key or unique index)
Full-Text Indexes – Implementation
- Have visibility into the full-text subsystem via the following catalog views and DMVs/DMFs
- Database level:
  - sys.fulltext_indexes
  - sys.fulltext_catalogs
  - sys.fulltext_stopwords
  - sys.fulltext_stoplists
  - sys.dm_fts_index_keywords
  - sys.dm_fts_index_keywords_by_document
  - sys.dm_fts_index_keywords_position_by_document
- Instance level:
  - sys.dm_fts_active_catalogs
  - sys.dm_fts_fdhosts
  - sys.dm_fts_index_population
  - sys.dm_fts_memory_buffers
  - sys.dm_fts_memory_pools
  - sys.dm_db_fts_index_physical_stats
  - sys.dm_fts_parser
Full-Text Indexes – Usage
- In order to use the full-text index, your query must include one of the following predicates or functions:
  - FREETEXT
  - FREETEXTTABLE
  - CONTAINS
  - CONTAINSTABLE
Full-Text Indexes – CONTAINS
- Used in the WHERE clause of a query
- Searches for precise or less precise matches to single words and phrases
- Can search for the following:
  - Prefix of a word or phrase
  - Word near another word
  - A word that is inflectionally generated from another (i.e. drive, drives, drove, driving, driven)
  - Synonyms of another word using a thesaurus
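Prefix and proximity searches both depend on the index remembering which tokens exist and where they sit in each document. The Python sketch below models that with a positional index; the documents, the function names, and the distance measure (difference in word positions) are assumptions for illustration and are not the exact rules CONTAINS applies.

```python
import re
from collections import defaultdict

# Hypothetical documents, keyed by the table's unique key.
docs = {
    1: "full text indexing speeds up text search",
    2: "relational indexes use a B-tree",
    3: "search the index for text near the start",
}

def build_positions(documents):
    """Map token -> document key -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for key, text in documents.items():
        for pos, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[token][key].append(pos)
    return index

def prefix_search(index, prefix):
    """Documents containing any token that starts with the prefix."""
    hits = set()
    for token, postings in index.items():
        if token.startswith(prefix):
            hits.update(postings)  # adds the document keys
    return sorted(hits)

def near_search(index, first, second, max_distance):
    """Documents where the two tokens occur within max_distance positions."""
    common = set(index.get(first, {})) & set(index.get(second, {}))
    return [
        key
        for key in sorted(common)
        if any(
            abs(a - b) <= max_distance
            for a in index[first][key]
            for b in index[second][key]
        )
    ]

index = build_positions(docs)
print(prefix_search(index, "index"))
print(near_search(index, "text", "search", 1))
```

Widening `max_distance` in `near_search` admits more documents, which mirrors the idea of a "less precise" match.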
Full-Text Indexes – CONTAINSTABLE
- Returns a table of zero, one, or more rows for the columns queried, containing precise or less precise matches to single words and phrases, words within a given distance of one another, or weighted matches
- Used in the FROM clause
- Returns a relevance ranking value and the full-text key in the result set
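The shape of that result set, a key plus a rank per matching row, can be illustrated with a short Python sketch. The ranking here is plain term frequency, which is an assumption chosen for simplicity and not the ranking formula SQL Server uses; the documents and the `ranked_search` name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical documents, keyed by the table's unique key.
docs = {
    10: "text search over text columns",
    20: "binary data only",
    30: "search text",
}

def ranked_search(documents, term):
    """Return (key, rank) rows for matching documents, highest rank first.

    Rank is the number of times the term occurs in the document.
    """
    results = []
    for key, text in documents.items():
        count = Counter(re.findall(r"[a-z0-9]+", text.lower()))[term.lower()]
        if count:
            results.append((key, count))
    return sorted(results, key=lambda row: (-row[1], row[0]))

matches = ranked_search(docs, "text")
print(matches)

# The key column is what lets the result be joined back to the base table.
joined = [(key, rank, docs[key]) for key, rank in matches]
print(joined)
```

Because only keys and ranks come back, the caller decides which base-table columns to pull in and how many of the top-ranked rows to keep.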
Full-Text Indexes – FREETEXT
- Used in the WHERE clause of a query
- Searches for values that match the meaning, not the exact wording, of the search criteria
- Queries using FREETEXT are less precise than those using CONTAINS
- Matches are generated if any term, or any form of any term, is found
Full-Text Indexes – FREETEXTTABLE
- Uses the same search conditions as FREETEXT, but also adds a rank and key value for each row
- Used in the FROM clause of a query, like CONTAINSTABLE
DEMO
Apache Lucene.NET – What is it?
- Port of the Java Lucene search library to .NET
- Based on an inverted index
  - A mapping from content to its locations in files or a database
  - Used in search engine indexing
Apache Lucene.NET – Benefits
- Allows C# developers to index documents and tables while only having to learn basic T-SQL constructs
- Is a module that can be built into C# applications with minimal effort
- It does not interact with the database, except when the query is executed to build and maintain the index files on disk
Apache Lucene.NET – Downsides
- Separate index files exist on disk that must be maintained & backed up with file backups so that the indexing service keeps running
- Cannot tune the queries against the index files without recompiling your application
  - Unless those queries are in stored procedures, in which case you can tune the stored procedures
Apache Lucene.NET – Implementation
- Main components of Lucene.NET:
  - Analyzer: breaks down the search criteria into single words/terms
  - IndexWriter: coordinates with the Analyzer and moves results into storage
  - IndexSearcher: performs the actual search against the index file
  - Document: the entity that is retrieved by the index
    - Comparable to a table in a database
  - Field: metadata that describes a document; this data is what is searchable
    - Comparable to the columns in a table
  - Store Directory: the directory in which the index files are stored
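How these components cooperate can be shown with a small Python analogy. This is not Lucene.NET's API: the class names are borrowed from the list above purely for readability, the method names are hypothetical, and the "directory" is an in-memory object instead of files on disk.

```python
import re
from collections import defaultdict

class Analyzer:
    """Breaks text into lowercase terms."""
    def tokens(self, text):
        return re.findall(r"[a-z0-9]+", text.lower())

class Directory:
    """Stands in for the store directory; holds the index in memory."""
    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> document ids
        self.documents = []               # stored documents, by id

class IndexWriter:
    """Runs each field through the analyzer and records the terms."""
    def __init__(self, directory, analyzer):
        self.directory = directory
        self.analyzer = analyzer

    def add_document(self, document):
        doc_id = len(self.directory.documents)
        self.directory.documents.append(document)
        for field, value in document.items():
            for term in self.analyzer.tokens(value):
                self.directory.postings[(field, term)].add(doc_id)
        return doc_id

class IndexSearcher:
    """Returns documents whose field contains every term of the query."""
    def __init__(self, directory, analyzer):
        self.directory = directory
        self.analyzer = analyzer

    def search(self, field, query):
        terms = self.analyzer.tokens(query)
        if not terms:
            return []
        ids = set.intersection(
            *(self.directory.postings.get((field, t), set()) for t in terms)
        )
        return [self.directory.documents[i] for i in sorted(ids)]

# A document is a set of named fields (here, a dict of strings).
directory, analyzer = Directory(), Analyzer()
writer = IndexWriter(directory, analyzer)
writer.add_document({"title": "Full-Text Indexing", "body": "Token based index"})
writer.add_document({"title": "Relational Indexes", "body": "B-tree pages"})

searcher = IndexSearcher(directory, analyzer)
print(searcher.search("title", "indexing"))
```

Using the same analyzer for writing and searching matters: if the two sides tokenized text differently, query terms would not line up with the terms stored in the index.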
DEMO
Summary
- Relational indexes are the easiest to implement to get good performance boosts on your systems.
- Full-text indexes increase what can be indexed in your database, allow for search engine-like queries against SQL Server, and can speed up your character-based queries dramatically.
- Apache Lucene.NET is nice for C# developers, but not for DBAs to implement.
Links, and Thank you!
- CREATE FULLTEXT INDEX: http://bit.ly/2wwBLhR
- Query with Full-Text Search: http://bit.ly/2xcTpvB
- Apache Lucene.NET: http://bit.ly/2fVhorU
- Lucene.NET main concepts: http://bit.ly/2xXjPka
- Lucene.NET sample application: http://bit.ly/2yFt85l
- E-mail: ajkoehl@gmail.com
- Twitter: @sql_geek
- LinkedIn: http://www.linkedin.com/in/adam-j-koehler