NOTE: The demo indexes and sample application for querying the index are no longer available due to data reasons, sorry!

Full text searching as a Service
Azure Search. Welcome. Over the next hour I want to take you on a whistle-stop tour of Azure Search, to give you a little background on what it is, why you may need to use it and, more importantly, how to get started.

PREMIER SPONSOR, GOLD SPONSORS, SILVER SPONSORS, BRONZE SPONSORS, SUPPORTERS
But first, today wouldn’t be possible without the support of great partners like those mentioned here. Corporate bit over…

About me: Technical Lead for Confused.com, Car Buying and Selling
So who am I? My name is Matthew Fortunka and I am a technical lead at Confused.com (part of the architecture team). Confused.com is probably best known as a car insurance aggregator, the first car insurance aggregator! The aggregator market, as you may have noticed from the constant TV ads, is a very competitive one and each of the big 4 players is constantly looking for new ideas to gain an advantage. About 2 years ago Confused set up a team to create a site dedicated to car buying and selling, in an attempt to do for dealers what we did for insurance 15 years ago.

Used cars: Car Stock → Data Cleaner → Datastore → Web Site
The concept was simple: we get a feed of car stock in every day, look at it, scream in horror at the quality of the data, clean the data, store it and then present it to the user. Nothing too out of the ordinary so far. What surprised us was the magnitude of the data and the number of things you could allow people to search for.

What are Car buyers looking for?
Make, Model, Trim. Price? Has air con? Has climate control? Manual or automatic? Petrol, diesel, electric, hybrid etc. Has Bluetooth? Mileage.
Car data is typically a mix of “highly structured” data such as Make, Model, Mileage and Price, and more fluid data such as dealer descriptions, feature lists and reviews, all of which are typically blocks of text. Since search was going to be at the heart of everything we did, we started to look around for fast solutions for dealing with this sort of data.

WHY NOT SQL?
Full Text Indexing Availability, Isolation from OLTP, Faceting, Rich Querying Syntax
Obligatory meme slide. At the time, full text searching wasn’t available in SQL Azure. We are still using a lightweight SQL Server for transactional operations (saving searches, selling cars etc.), but the site can run without it (albeit with reduced functionality). Faceting.

Full text searching – A LITTLE HISTORY
Apache Lucene: Full Text Search Library, Originally Java (now everywhere!), Search, Ranking, Sorting, Faceting, FAST! Distributed Lucene: Clusters, Highly Available, Scalable, Schema Free.
The idea of full text searching isn’t a new one. Linux bods will happily tell you this is a solved problem and will spend hours regaling you with tales of Lucene. Lucene was originally a Java solution to the problem of querying and presenting text. It offers a comprehensive query syntax with support for faceting (more on that later) and, as an added benefit, is very fast. Projects like ElasticSearch and Solr then came along to build server solutions on top of this framework.

Azure Search. Now, Confused is very much a Microsoft shop. We bet big on Azure, and more specifically on the “X-as-a-service” model, and have been very pleased with the results. Managing servers is something we don’t do a lot of any more, and we are generally happier for it, so the prospect of setting up an ElasticSearch cluster wasn’t something that appealed. At about the same time, Microsoft announced the first public preview of Azure Search, which was built on top of ElasticSearch: essentially ElasticSearch as a service!

Azure Search is NOT ElasticSearch-AS-A-SERVICE
PROS: Tightly integrated with Azure*, Easy to set up, Easily Scalable. CONS: Requires a schema, No support for nested document types, Reduced Query Syntax.
Before I continue, it is worth pointing out that Azure Search is not ElasticSearch as a service. It is a subset. On the plus side it is very nicely integrated with Azure, it is easy to scale, and it is very quick and easy to set up. On the downside you lose the schema-free design of ElasticSearch, and some of the big-ticket features like nested documents are missing. Plus, if you like the ability to tinker with servers, you are out of luck.

Creating a Search Service
(or via Resource Management PowerShell) Free v Basic v Standard, Search Units, Partitions, Replicas
Free v Basic v Standard: Standard isn’t standard (there are multiple versions of Standard, depending on the default partition size and the number of queries per second that are supported). Search Units: parallel to SQL Azure. Replica: a copy of the index (good for reads; 3 replicas are recommended for high availability). Partitions: the index is sharded (good for concurrent writes). Each replica has an equivalent number of partitions, so the number of search units is the number of replicas multiplied by the number of partitions: 3 replicas each with 2 partitions = 6 search units!
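The search unit arithmetic from this slide can be sketched in a few lines of Python. The function name is made up for illustration; the only rule it encodes is the one stated above (replicas multiplied by partitions).

```python
def search_units(replicas, partitions):
    """Search units consumed by a service: replicas multiplied by partitions."""
    if replicas < 1 or partitions < 1:
        raise ValueError("a service needs at least one replica and one partition")
    return replicas * partitions


if __name__ == "__main__":
    # The example from the slide: 3 replicas, each with 2 partitions.
    print(search_units(replicas=3, partitions=2))
```

Adding a replica to a service with 2 partitions therefore costs 2 more search units, not 1.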

CREATE A NEW SEARCH SERVICE - DEMO
Enough talk, let’s create a service. Start in the Azure portal, search for “Search Services”, add new.

ANATOMY of Search: Index, Indexer, Data source
Now that we have our service, these are its core elements. An index is analogous to a SQL table (actually more like a NoSQL store with a schema), and each search service can have multiple indexes. The index is populated from a data source by means of an indexer (a bit like a specialised SSIS package).

INDEX STRUCTURE: Fields, Scoring Profiles, Suggesters, Options
An index is made up of a collection of fields (which map to columns). The index can provide one or more suggesters, which are fields that can be used for autocomplete operations. Scoring profiles are ways of manipulating search results; more on those later. The index will also have some options to control how it can be accessed.

INDEX STRUCTURE - FIELDS
Name, Type, Is a Key field, Is Searchable, Is Filterable, Is Sortable, Is Facetable, Is Retrievable
Fields are essentially like SQL columns but with an additional set of metadata associated with them to control how they participate in search operations. The field types are pretty much what you would expect: strings, bools, ints, datetimes and, more interestingly, geo-spatial. Fields can opt to take part in searching, ordering and selection.
Types: String, Collection of String, Bool, Int32, Int64, Double, DateTimeOffset, GeographyPoint. Searchable: optional on String or Collection of String, invalid for other types. Filterable: can be used in an OData filter. Sortable: not valid for Collection of String. Facetable: not valid for GeographyPoint.
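As a rough illustration of the attribute rules listed on this slide, here is a small Python check over a field definition. It is a sketch only: the function name is hypothetical, and it encodes just the three restrictions mentioned in the notes, not the full set of rules the service enforces.

```python
# Types on which "searchable" is allowed, per the notes above.
STRING_TYPES = {"Edm.String", "Collection(Edm.String)"}


def attribute_problems(field):
    """Return a list of rule violations for one field definition (empty if none)."""
    problems = []
    field_type = field["type"]
    if field.get("searchable") and field_type not in STRING_TYPES:
        problems.append(f"{field['name']}: searchable is only valid on string fields")
    if field.get("sortable") and field_type == "Collection(Edm.String)":
        problems.append(f"{field['name']}: a collection of strings cannot be sortable")
    if field.get("facetable") and field_type == "Edm.GeographyPoint":
        problems.append(f"{field['name']}: a geography point cannot be facetable")
    return problems


if __name__ == "__main__":
    print(attribute_problems({"name": "price", "type": "Edm.Double", "searchable": True}))
```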

INDEX STRUCTURE – FIELDS AND SUGGESTERS
{
  "name": "cars",
  "fields": [
    {"name": "carId", "type": "Edm.String", "key": true, "searchable": false},
    {"name": "make", "type": "Edm.String"},
    {"name": "model", "type": "Edm.String"},
    {"name": "trim", "type": "Edm.String"},
    {"name": "price", "type": "Edm.Double"},
    {"name": "year", "type": "Edm.Int32"},
    {"name": "features", "type": "Collection(Edm.String)"},
    {"name": "description", "type": "Edm.String", "filterable": false},
    {"name": "position", "type": "Edm.GeographyPoint"}
  ],
  "suggesters": [
    {
      "name": "makeModelSuggester",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": ["make", "model"]
    }
  ]
}
The format for defining a simple index uses JSON. The index name is defined (in this case cars) and the collection of fields is declared. By default all string fields will be searchable, and most types will be facetable, sortable and filterable (the rules vary depending on the field type). Note the gotcha: the fields used by a suggester need to be created at the same time as the suggester, otherwise it will complain.

Search OPERATIONS – REST API
Vary by the HTTP verb used. POST/PUT: Create/Update. DELETE. GET: Get or List. HTTP status codes. API keys.
The best way to perform any operations on the index is via the REST API. All operations follow the same pattern: an HTTP verb signifies the type of operation, along with a URL and data (via headers, body or query string). Operation status is signalled via an HTTP status code (e.g. 200/204 for success, 500 for error). For those who don’t like getting their hands dirty with HTTP packets, there is a .NET NuGet package which wraps all the REST calls up in a nice object model. All operations require the use of an API key, which can be found via the Azure portal. There are 2 types of keys: admin keys allow you to control the structure of the search service, query keys let you retrieve data.

Creating the index HTTP POST Content-Type: application/json Body
To create the index we send an HTTP PUT (or POST) to the search service; the body of the message contains our index definition (as seen earlier). (Not sure this slide is needed.) There are 2 types of keys (admin and query). This operation can also be done via the portal. https://[search service name].search.windows.net/indexes/[index name]?api-version=
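To make the shape of the create-index call concrete, here is a minimal Python sketch that builds (but does not send) the request using only the standard library. The service name, key and API version are placeholders supplied by the caller; the `api-key` header name is how the service expects the key to be passed, and the function name is invented for this example.

```python
import json
import urllib.request


def build_create_index_request(service, index_definition, admin_key, api_version):
    """Build a PUT request that creates (or updates) the index named in the definition."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index_definition['name']}?api-version={api_version}"
    )
    body = json.dumps(index_definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "api-key": admin_key,  # an admin key is needed to change index structure
        },
    )


if __name__ == "__main__":
    request = build_create_index_request(
        "myservice", {"name": "cars", "fields": []}, "ADMIN-KEY", "API-VERSION"
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a live service.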

Creating the index - DEMO
Let’s see that in action (DEMO site no longer available)

WORKING WITH DOCUMENTS – REST API
HTTP POST https://[search service name].search.windows.net/indexes/[index name]/docs/index?api-version=
{
  "value": [
    {
      "@search.action": "upload",
      "carId": "1",
      "make": "ford",
      "model": "fiesta",
      "year": 2014,
      "price": [price],
      "position": { "type": "Point", "coordinates": [ [longitude], [latitude] ] },
      "features": ["AIRCON", "BLUETOOTH"]
    }
  ]
}
So now we have an empty index, we need to populate it with data. The REST API can be used to manage adding and removing documents. In this example you can see a single document that will either be inserted or replace the current document (matched on the key field). A request can contain up to 1000 document operations (or 16MB), which is fine for ad hoc document management but gets a little cumbersome with large indexes.
Good for ad hoc document inserts. Up to 1000 documents or 16MB per batch. @search.action: upload (insert or replace all fields), merge (update specified fields only; fails if the document doesn’t exist), mergeOrUpload (merge if it exists, else insert), delete.
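For larger loads, the 1000-operation limit means the documents have to be split into batches. The following Python sketch shows one way to do that; the function name is hypothetical, and it only enforces the document count, not the 16MB size limit.

```python
VALID_ACTIONS = {"upload", "merge", "mergeOrUpload", "delete"}


def build_batches(documents, action="mergeOrUpload", batch_size=1000):
    """Split documents into request bodies of at most batch_size operations each."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown @search.action: {action}")
    batches = []
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        # Each operation is the document plus the action to perform on it.
        batches.append({"value": [{"@search.action": action, **doc} for doc in chunk]})
    return batches


if __name__ == "__main__":
    cars = [{"carId": str(i), "make": "ford"} for i in range(2500)]
    print([len(batch["value"]) for batch in build_batches(cars)])
```

Each returned dictionary can be serialised with `json.dumps` and posted to the `/docs/index` URL shown above.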

WORKING WITH DOCUMENTS – INDEXERS
Data Source → Indexer → Index
Indexers allow the synchronization of documents with an underlying data store. The data source manages the connection; currently supported are direct connections to DocumentDB, SQL Azure, Blob and Table storage. The last 2 are in preview and allow indexing of a wide variety of document formats including HTML and PDFs. For SQL Azure the data source describes which table or view it should synchronize against, which column denotes that the record has changed and which column denotes that a document should be deleted. The indexer links the data source to the index and controls the frequency with which it checks for changes. The indexer can also be run manually via a REST command.
Data Source: DocDB, SQL Azure, Blob (PDF, HTML, XML, ZIP) or Table. How to identify a new document. How to identify a document that should be deleted. (For SQL, either a DateTime and bit field, or integrated change tracking (tables only).) Indexer: how often should we check this data?
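The “DateTime and bit field” approach can be pictured with a toy Python model. This is purely an illustration of the idea, not how the Azure indexer is implemented, and the column names `LastUpdated` and `IsDeleted` are assumptions made up for the example.

```python
from datetime import datetime


def plan_sync(rows, high_water_mark):
    """Decide what to do with each row changed since the last indexer run.

    Rows newer than the high water mark are re-indexed, unless the soft-delete
    bit is set, in which case the document is removed from the index.
    """
    actions = []
    new_mark = high_water_mark
    for row in rows:
        if row["LastUpdated"] <= high_water_mark:
            continue  # unchanged since the previous run
        action = "delete" if row["IsDeleted"] else "mergeOrUpload"
        actions.append((action, row["carId"]))
        new_mark = max(new_mark, row["LastUpdated"])
    return actions, new_mark


if __name__ == "__main__":
    rows = [
        {"carId": "1", "LastUpdated": datetime(2015, 12, 31), "IsDeleted": False},
        {"carId": "2", "LastUpdated": datetime(2016, 1, 2), "IsDeleted": False},
        {"carId": "3", "LastUpdated": datetime(2016, 1, 3), "IsDeleted": True},
    ]
    print(plan_sync(rows, datetime(2016, 1, 1)))
```

The returned mark is what the next run would compare against, which is why rows must have their timestamp updated whenever they change.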

INDEXERS - DEMO
From SQL Azure, show the table, the data source definition and the populated index. Update data in the db, rerun the indexer. If time, show the Blob storage indexer. Worth pointing out that all of the following can be done via the Azure portal (but it will try to create an index for you instead of reusing an existing one). The blob storage one uses the default field names for document properties.

SEARCHING FOR DOCUMENTS – Free TEXT
HTTP GET https://[search service name].search.windows.net/indexes/[index name]/docs?search=[search text]&api-version=
HTTP POST https://[search service name].search.windows.net/indexes/[index name]/docs/search?search=[search text]&api-version=
Returns all fields in the documents. Ordered according to search score. Returns only the top 50 results. Searches in all “searchable” fields. Matches any of the search terms. Uses the default query syntax. Uses the default scoring profile (defined in the index definition).
Now that we have a fully populated index we can start querying it. Here we see the simplest form of query: the index will be searched for documents matching the search text using the default settings shown here. The index can be queried with a POST or a GET request; the POST request allows for larger queries (16MB vs 8KB).
HTTP GET or POST request (GET limited to 8KB, POST to 16MB). Azure Portal. Mention scoring profiles again. Query syntax by default allows + and * wildcard matching, and “any” matching instead of “all”.
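A small Python sketch of building the GET form of this query follows. The function name, the placeholder service name and the placeholder API version are all assumptions for illustration; the 8KB check simply mirrors the limit quoted in the notes.

```python
from urllib.parse import quote, urlencode

GET_LIMIT_BYTES = 8 * 1024  # GET requests are limited to 8KB, per the notes above


def build_search_url(service, index, search_text, api_version, options=None):
    """Build the GET URL for a free-text search, percent-encoding the parameters."""
    params = {"search": search_text}
    params.update(options or {})
    params["api-version"] = api_version
    # Leave "$" and "," readable; everything else unsafe is percent-encoded.
    query = urlencode(params, quote_via=quote, safe="$,")
    url = f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"
    if len(url.encode("utf-8")) > GET_LIMIT_BYTES:
        raise ValueError("query too large for GET; send it as a POST instead")
    return url


if __name__ == "__main__":
    print(build_search_url("myservice", "cars", "ford fiesta", "API-VERSION"))
```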

INDEX StrucTURE – SCORING PROFILES
[Diagram: three documents, each with Field 1, Field 2 and Field 3; “Ford” appears in Field 2 of the first document, Field 1 of the second and Field 3 of the third.]
By default all fields carry the same “weight”. Boosting: Fields, Ranges, Document Age, Geo.
When we search for a term each field is given equal weight, and the accuracy of a match determines the search score for a document. In the example on screen each document would have an equal search score and would be returned to the caller using the default sort order. Scoring profiles allow us to boost the search score based on a few options. We can boost based on matches found in a particular field (e.g. rank results higher if the match is found in Field 2), on ranges (for example, matches in documents with a 5-star review will be ranked higher than those with a 3-star review), by the age of the document and by the distance of a geo field from a known location.
Multiple search fields (e.g. Make, Model, Description): by default each field has equal weight when calculating the validity of the match. Scoring profiles help you tweak this by assigning a higher score if the match is in the make than in the description. Custom profiles include boosting based on: age (newest come higher); location (favour results closer to a known location); specific fields (e.g. Make higher than Description); a specific range (e.g. results with 5-star ratings count for more than results with 1-star ratings).
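The effect of field weights can be shown with a deliberately simplified Python model. This is not how Azure Search actually computes relevance; it only demonstrates why a weight on one field reorders otherwise equal matches. All names here are invented for the example.

```python
def toy_score(document, term, weights=None):
    """Add up the weight of every field whose text contains the term.

    Fields without an explicit weight count as 1.0, mirroring the
    'all fields carry the same weight' default described above.
    """
    weights = weights or {}
    score = 0.0
    for field, text in document.items():
        if term.lower() in text.lower().split():
            score += weights.get(field, 1.0)
    return score


if __name__ == "__main__":
    by_make = {"make": "Ford", "description": "one careful owner"}
    by_description = {"make": "Vauxhall", "description": "cheaper than a ford"}
    for weights in (None, {"make": 5.0}):
        print(toy_score(by_make, "ford", weights), toy_score(by_description, "ford", weights))
```

With no weights both documents score the same; with a weight on `make`, the document that matches in the make field moves ahead.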

SEARCHING FOR DOCUMENTS – Simple QUERY
Operator    Description    Example
+           AND            ford+fiesta
|           OR             Ford|fiesta
-           NOT            ford-fiesta
*           suffix         for*
“ ”         phrase         “ford fiesta”
The search text is parsed by default using the simple query syntax, which allows for basic boolean queries. For example, if I want to search for documents containing both ford and fiesta, the plus symbol joins the two terms. The simple query syntax is useful but doesn’t really offer anything above what could already be done in SQL Server. For more power we can set the queryType to “full” and use a bigger subset of the Lucene syntax.

SEARCHING FOR DOCUMENTS – LUCENE QUERY
Regular Expression searching, Field limitations, Term boosting, Proximity search
search=make:ford^3 OR /07[0-9]{3}\ *[0-9]{6}/&queryType=full
The Lucene syntax gives us fine-grained control over how we search. We can use regular expressions to match patterns, restrict specific search terms to specific fields, boost documents containing some search terms above others and control how close together search terms should be. In the example above we are searching for documents where the make field contains “ford” or any of the searchable fields match the regex for a mobile phone number. The ^3 boost applies to the make:ford term, so matches on the make are scored higher than matches on the phone number alone.
Extension to the simple query syntax. RegExp class compatible regexes (note: uses the Lucene regex syntax, not the .NET one). Field limitations: search within a single field. Term boosting: boost one term over another, e.g. ford^2 bmw will score fords higher than bmw. Proximity search: find me fiesta within 4 words of ford. Sample text shows make equals ford and results containing the text “one careful owner” will be scored higher.
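The pieces of the Lucene syntax on this slide can be composed with a few small Python helpers. The helper names are invented for the example; the syntax they emit (field:term, ^boost, /regex/ and "phrase"~distance) is the Lucene form described in the notes.

```python
from urllib.parse import quote, urlencode


def field_term(field, term, boost=None):
    """Restrict a term to one field, optionally boosting it: make:ford^3."""
    clause = f"{field}:{term}"
    return f"{clause}^{boost}" if boost is not None else clause


def regex(pattern):
    """Wrap a Lucene regular expression in forward slashes."""
    return f"/{pattern}/"


def proximity(phrase, distance):
    """Match the words of a phrase within the given number of words of each other."""
    return f'"{phrase}"~{distance}'


# Rebuild the query from the slide, then percent-encode it for use in a URL.
query = " OR ".join([field_term("make", "ford", boost=3), regex(r"07[0-9]{3}\ *[0-9]{6}")])
params = urlencode({"search": query, "queryType": "full"}, quote_via=quote)

if __name__ == "__main__":
    print(query)
    print(params)
    print(proximity("ford fiesta", 4))
```

Encoding matters here because the regex contains characters such as spaces, brackets and backslashes that cannot appear raw in a query string.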

SEARCHING FOR DOCUMENTS – OPTIONS
Query string                                    Description                                 Example
$select                                         Return only these fields                    $select=carId,make,model
$skip                                           Get the next “page” of results              $skip=10
$top                                            The number of results in a page             $top=100
$count                                          Return the total number of results          $count=true
searchFields                                    Only search for the text in these fields    searchFields=make,model
highlight, highlightPreTag, highlightPostTag    Return results pre-highlighted              highlight=make&highlightPreTag=<div>&highlightPostTag=</div>
To control the search results (using either query syntax) we have a number of options available. These can restrict the number of fields returned and allow for paging of documents. Be warned: parameter names change depending on whether you are using HTTP GET or HTTP POST.
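The GET versus POST naming difference mostly comes down to the `$` prefix: the OData-style options carry it in the query string and drop it in the POST body. A hedged Python sketch of that mapping follows; the function name is hypothetical, and facets (which are passed differently in the two forms) are deliberately not handled.

```python
# Options that take a "$" prefix in a GET query string but not in a POST body.
ODATA_OPTIONS = {"select", "skip", "top", "count", "filter", "orderby"}


def to_get_params(post_options):
    """Convert POST-body style options into GET query string parameters."""
    params = {}
    for name, value in post_options.items():
        key = f"${name}" if name in ODATA_OPTIONS else name
        if isinstance(value, bool):
            value = "true" if value else "false"  # lower-case booleans in URLs
        params[key] = str(value)
    return params


if __name__ == "__main__":
    print(to_get_params({"select": "carId,make,model", "top": 100, "count": True}))
```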

SEARCHING FOR DOCUMENTS – ODATA FILTERS
Logical operators (and, or, not), Comparisons (eq, ne, gt, lt, ge, le), Collection filters, Geospatial filters
search="one careful owner"&$filter=price le 20000 and features/any(t: t eq 'BLUETOOTH')
To further refine the documents that can be searched, we can use OData filters. These allow us to restrict the results to only the documents that match certain criteria. The on-screen example shows searching for documents containing the text “one careful owner” in cars that cost no more than £20000 and have the Bluetooth feature. Filters can also be used to restrict results by geographical area (either a radius from a fixed point or within a polygon of points).
For more SQL-like filtering. Collection filters: collection contains any items, collection contains any items that match a query. Geospatial: only show items within a certain distance, only show items within a polygon.
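Filter strings like the one on this slide are easy to get subtly wrong (quoting, operator spelling), so a few Python helpers can be useful. This is a minimal sketch with invented helper names; it covers only comparisons and the `any` collection filter, and relies on the OData conventions that operators are lower-case and single quotes inside strings are doubled.

```python
COMPARISONS = {"eq", "ne", "gt", "lt", "ge", "le"}


def odata_literal(value):
    """Render a Python value as an OData literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # escape embedded quotes
    return str(value)


def compare(field, operator, value):
    """Build a comparison such as: price le 20000."""
    if operator not in COMPARISONS:
        raise ValueError(f"unknown comparison operator: {operator}")
    return f"{field} {operator} {odata_literal(value)}"


def any_equals(collection, value):
    """Build a collection filter: true if any item equals the value."""
    return f"{collection}/any(t: t eq {odata_literal(value)})"


filter_text = " and ".join([compare("price", "le", 20000), any_equals("features", "BLUETOOTH")])

if __name__ == "__main__":
    print(filter_text)
```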

SEARCHING FOR DOCUMENTS – FACETING
Group by field (e.g. number of documents per Make). Group by interval (e.g. hour, day, month). Group by values (e.g. price ranges £1001-£2000, £2001-£3000).
One of the most powerful features of search indexes is faceting. This is the idea that when I return results, I also return the number of results there would be if the search were restricted by other terms. In the example on screen you can see this in action: when I search for a new TV, I get the number of results for each of the manufacturers along with my current result set. Azure Search allows this data to be returned broken down by distinct items in a field (e.g. manufacturer name), by an interval (e.g. per day) or by a range (prices in bands of 1000).
Mention controlling sort (e.g. top 5 makes sorted by number of documents, asc or desc).
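In a GET query, each facet is passed as its own `facet` parameter, with options appended as comma-separated `name:value` pairs. The Python sketch below builds a few of these; the helper name is invented, and the search text and field names are just the running cars example.

```python
from urllib.parse import urlencode


def facet(field, **options):
    """Build one facet expression, e.g. make,count:5,sort:count."""
    parts = [field] + [f"{name}:{value}" for name, value in options.items()]
    return ",".join(parts)


# A list of pairs (rather than a dict) lets the "facet" key repeat.
params = [
    ("search", "ford"),
    ("facet", facet("make", count=5, sort="count")),  # top 5 makes by document count
    ("facet", facet("price", interval=1000)),         # price in bands of 1000
]
query_string = urlencode(params)

if __name__ == "__main__":
    print(facet("price", values="1000|2000|3000"))  # explicit range boundaries
    print(query_string)
```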

Working WITH DOCUMENTS – DEMO
Now that we have all the pieces in place, let’s see it in action. Simple query, Lucene query. Control an option (e.g. top). Scoring profile (will need to update the index). OData filter. Faceting. Highlighting.

If you want to KNOW MORE…
So why should you use Azure Search? If you need to do complex querying on large amounts of text but don’t want the headaches of managing servers, give it a look! Azure Search integrates simply with other Azure services and, with a little work, can bring in data from any source.

Questions?

Please give us your feedback:
sqlrelay.co.uk/feedback Thank you

END OF THE LINE
