NOTE: The demo indexes and sample application for querying the index are no longer available due to data reasons, sorry!

Azure Search: Full text searching as a Service. Welcome. Over the next hour I want to take you on a whistle-stop tour of Azure Search, to give you a little background on what it is, why you may need to use it and, more importantly, how to get started.

PREMIER SPONSOR, GOLD SPONSORS, SILVER SPONSORS, BRONZE SPONSORS, SUPPORTERS. But first, today wouldn’t be possible without the support of great partners like those mentioned here. Corporate bit over…

About me: Technical Lead for Confused.com Car Buying and Selling (https://www.confused.com/buying-selling). So who am I? My name is Matthew Fortunka and I am a technical lead at Confused.com (part of the architecture team). Confused.com is probably best known as a car insurance aggregator, the first car insurance aggregator! The aggregator market, as you may have noticed from the constant TV ads, is a very competitive one, and each of the big 4 players is constantly looking for new ideas to gain an advantage. About 2 years ago Confused set up a team to create a site dedicated to car buying and selling, in an attempt to do for dealers what we did for insurance 15 years ago.

Used cars: Car Stock, Data Cleaner, Datastore, Web Site. The concept was simple: we get a feed of car stock in every day. We look at it, scream in horror at the quality of the data, clean the data, store it and then present it to the user. Nothing too out of the ordinary so far. What surprised us was the magnitude of the data and the number of things you could allow people to search for.

What are car buyers looking for? Make, Model, Trim, Price? Has air con? Has climate control? Manual or automatic? Petrol, diesel, electric, hybrid etc. Has Bluetooth? Mileage. Car data is typically a mix of “highly structured” data such as make, model, mileage and price, and more fluid data such as dealer descriptions, feature lists and reviews, all of which are typically blocks of text. Since search was going to be at the heart of everything we did, we started to look around for fast solutions for dealing with this sort of data.

WHY NOT SQL? Full Text Indexing Availability; Isolation from OLTP; Faceting; Rich Querying Syntax. Obligatory meme slide. At the time full text searching wasn’t available in SQL Azure. We are still using a lightweight SQL server for transactional operations (saving searches, selling cars etc.) but the site can run without it (albeit with reduced functionality). Faceting.

Full text searching – A LITTLE HISTORY. Apache Lucene: Full Text Search Library; originally Java (now everywhere!); Search, Ranking, Sorting, Faceting; FAST! Distributed Lucene: Clusters; Highly Available; Scalable; Schema Free. The idea of full text searching isn’t a new one. Linux bods will happily tell you this is a solved problem and will spend hours regaling you with tales of Lucene. Lucene was originally a Java solution to the problem of querying and presenting text. It offers a comprehensive query syntax with support for faceting (more on that later) and, as an added benefit, is very fast. Projects like ElasticSearch and Solr then came along to build server solutions on top of this library.

Azure Search. Now, Confused is very much a Microsoft shop. We bet big on Azure, and more specifically on the “X-as-a-service” model, and have been very pleased with the results. Managing servers is something we don’t do a lot of any more and we are generally happier for it, so the prospect of setting up an ElasticSearch cluster wasn’t something that appealed. At about the same time, Microsoft announced the first public preview of Azure Search, which was built on top of ElasticSearch: essentially ElasticSearch as a service!

Azure Search is NOT ElasticSEARCH-AS-A-SERVICE. PROS: Tightly integrated with Azure*; Easy to set up; Easily scalable. CONS: Requires a schema; No support for nested document types; Reduced query syntax. Before I continue it is worth pointing out that Azure Search is not ElasticSearch as a service. It is a subset. On the plus side it is very nicely integrated with Azure, it is easy to scale and it is very quick and easy to set up. On the downside you lose the schema-free design of ElasticSearch and some of the big ticket features like nested documents are missing. Plus, if you like the ability to tinker with servers, you are out of luck.

Creating a Search Service: https://portal.azure.com (or via Resource Management PowerShell). Free v Basic v Standard; Search Units; Partitions; Replicas. Free v Basic v Standard: Standard isn’t standard (there are multiple versions of Standard, depending on the default partition size and the number of queries per second that are supported). Search Units: parallel to SQL Azure. Replica: a copy of the index (good for reads; 3 replicas are recommended for high availability). Partitions: the index is sharded (good for concurrent writes). Each replica has an equivalent number of partitions. Search units are the number of replicas multiplied by the number of partitions, so 3 replicas each with 2 partitions = 6 search units!
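
The search unit arithmetic from this slide can be sketched in a few lines of Python. This is only an illustration of the replicas-times-partitions rule described in the notes; the function name is made up for the example.

```python
def search_units(replicas: int, partitions: int) -> int:
    """Return the number of search units for a service layout.

    Every replica holds a full copy of the index, and every copy is split
    into the same number of partitions, so the total is the product.
    """
    if replicas < 1 or partitions < 1:
        raise ValueError("a service needs at least 1 replica and 1 partition")
    return replicas * partitions


# The example from the slide: 3 replicas, each with 2 partitions.
print(search_units(3, 2))  # 6
```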

CREATE A NEW SEARCH SERVICE - DEMO. Enough talk, let’s create a service. Start in the Azure portal, search for “Search Services”, add new.

ANATOMY of Search: Index, Indexer, Data source. Now that we have our service, these are its core elements. An index is analogous to a SQL table (actually more like a NoSQL store with a schema), and each search service can have multiple indexes. The index is populated from a data source by means of an indexer (a bit like a specialised SSIS package).

INDEX STRUCTURE: Fields, Scoring Profiles, Suggesters, Options. An index is made up of a collection of fields (which map to columns). The index can provide one or more suggesters; a suggester names the fields that can be used for autocomplete operations. Scoring profiles are ways of manipulating search results, more on those later. The index will also have some options to control how it can be accessed.

INDEX STRUCTURE - FIELDS: Name; Type; Is a Key field; Is Searchable; Is Filterable; Is Sortable; Is Facetable; Is Retrievable. Fields are essentially like SQL columns but with an additional set of metadata associated with them to control how they participate in search operations. The field types are pretty much what you would expect: strings, bools, ints, datetimes and, more interestingly, geo-spatial. Fields can opt to take part in searching, ordering and selection.
Types: String, Collection of String, Bool, Int32, Int64, Double, DateTimeOffset, GeographyPoint.
Searchable: optional on String or Collection of String, invalid for other types.
Filterable: ODATA filter.
Sortable: not valid for Collection of String.
Facetable: not valid for GeographyPoint.
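
As a rough sketch, the three attribute rules listed in the notes can be turned into a small pre-flight check before a definition is sent to the service. The function and constant names are hypothetical, and the check covers only the rules mentioned on this slide, not everything the service validates.

```python
from typing import Dict, List

# Field types on which "searchable" is allowed, per the slide notes.
STRING_TYPES = {"Edm.String", "Collection(Edm.String)"}


def attribute_problems(field: Dict) -> List[str]:
    """Return a list of rule violations for one field definition."""
    problems = []
    field_type = field["type"]
    if field.get("searchable") and field_type not in STRING_TYPES:
        problems.append("searchable is only valid on string fields")
    if field.get("sortable") and field_type == "Collection(Edm.String)":
        problems.append("sortable is not valid for a collection of strings")
    if field.get("facetable") and field_type == "Edm.GeographyPoint":
        problems.append("facetable is not valid for a geography point")
    return problems


print(attribute_problems({"name": "price", "type": "Edm.Double", "searchable": True}))
print(attribute_problems({"name": "make", "type": "Edm.String", "searchable": True}))
```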

INDEX STRUCTURE – FIELDS AND SUGGESTERS
{
  "name" : "cars",
  "fields" : [
    {"name" : "carId", "type" : "Edm.String", "key" : true, "searchable" : false},
    {"name" : "make", "type" : "Edm.String"},
    {"name" : "model", "type" : "Edm.String"},
    {"name" : "trim", "type" : "Edm.String"},
    {"name" : "price", "type" : "Edm.Double"},
    {"name" : "year", "type" : "Edm.Int32"},
    {"name" : "features", "type" : "Collection(Edm.String)"},
    {"name" : "description", "type" : "Edm.String", "filterable" : false},
    {"name" : "position", "type" : "Edm.GeographyPoint"}
  ],
  "suggesters" : [
    {
      "name" : "makeModelSuggester",
      "searchMode" : "analyzingInfixMatching",
      "sourceFields" : ["make", "model"]
    }
  ]
}
The format for defining a simple index is JSON. The index name is defined (in this case cars) and the collection of fields is declared. By default all string fields will be searchable, and most types will be facetable, sortable and filterable (the rules vary depending on the field type). Note the gotcha: the fields that a suggester uses must be created at the same time as the suggester, otherwise the service will complain.
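
One part of the suggester gotcha can be checked locally before the definition is sent. The sketch below parses a cut-down version of the cars index and reports any suggester source field that is not declared in the fields collection. The helper name is hypothetical and the check is only an illustration; the service applies its own validation.

```python
import json

# A cut-down version of the "cars" index definition from the slide.
INDEX_JSON = """
{
  "name": "cars",
  "fields": [
    {"name": "carId", "type": "Edm.String", "key": true, "searchable": false},
    {"name": "make", "type": "Edm.String"},
    {"name": "model", "type": "Edm.String"}
  ],
  "suggesters": [
    {"name": "makeModelSuggester",
     "searchMode": "analyzingInfixMatching",
     "sourceFields": ["make", "model"]}
  ]
}
"""


def missing_suggester_fields(index: dict) -> list:
    """Return (suggester name, field name) pairs that point at undeclared fields."""
    declared = {field["name"] for field in index["fields"]}
    return [
        (suggester["name"], source)
        for suggester in index.get("suggesters", [])
        for source in suggester["sourceFields"]
        if source not in declared
    ]


index = json.loads(INDEX_JSON)
print(missing_suggester_fields(index))  # [] because make and model are both declared
```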

Search OPERATIONS – REST API. Vary by the HTTP verb used: POST/PUT – Create/Update; DELETE; GET – Get or List. HTTP status codes. API keys. The best way to perform any operations on the index is via the REST API. All operations follow the same pattern: we have an HTTP verb to signify the type of operation, and a URL plus data (via headers, body or query string). Operation status is signalled via an HTTP status code (e.g. 200/204 for success, 500 for error). For those that don’t like getting their hands dirty with HTTP packets there is a .NET NuGet package which wraps all the REST calls up in a nice object model. All operations require the use of an API key; these can be found via the Azure portal. There are 2 types of keys: admin keys allow you to control the structure of the search service, query keys let you retrieve data.

Creating the index: HTTP POST; Content-Type: application/json; Body. To create the index we send an HTTP PUT (or POST) to the search service; the body of the message will contain our index definition (as seen earlier). (Not sure this slide is needed.) 2 types of keys (admin and query). This operation can also be done via the portal. https://[your search service name].search.windows.net/indexes/[index name]?api-version=2015-02-28
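
For anyone who wants to try this outside the portal, here is a minimal Python sketch that builds the create-index request without sending it. It uses PUT against the /indexes/[index name] URL shown on the slide and passes the admin key in the api-key header. The service name and key are placeholders, and the function name is made up for the example.

```python
import json
import urllib.request

API_VERSION = "2015-02-28"  # the version used throughout these slides


def build_create_index_request(service: str, admin_key: str, index: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates or updates an index."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(index).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# Placeholder service name and key, for illustration only.
request = build_create_index_request(
    "my-service",
    "REPLACE_WITH_ADMIN_KEY",
    {"name": "cars", "fields": [{"name": "carId", "type": "Edm.String", "key": True}]},
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```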

Creating the index - DEMO. Let’s see that in action. (DEMO site no longer available)

WORKING WITH DOCUMENTS – REST API
https://[your search service name].search.windows.net/indexes/[index name]/docs/index?api-version=2015-02-28
{
  "value" : [
    {
      "@search.action" : "upload",
      "carId" : "1",
      "make" : "ford",
      "model" : "fiesta",
      "year" : 2014,
      "price" : 7999.00,
      "position" : { "type" : "Point", "coordinates" : [-122.131577, 49.678581] },
      "features" : ["AIRCON", "BLUETOOTH"]
    }
  ]
}
So now we have an empty index and we need to populate it with data. The REST API can be used to manage adding and removing documents. In this example you can see a single document that will either be inserted or will replace the current document (matched on the key field). A single request can contain up to 1000 document operations (or 16MB), which is fine for ad hoc document management but gets a little cumbersome with large indexes.
Good for ad hoc document inserts. Up to 1000 documents or 16MB per batch. HTTP POST.
@search.action: upload (insert or replace all fields), merge (update specified fields only, will fail if the doc doesn’t exist), mergeOrUpload (merge if it exists, else insert), delete.
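
The 1000-operations-per-request limit means larger loads have to be chunked. A minimal sketch of that chunking follows; the function name is hypothetical, and note that it only enforces the document count, not the 16MB size limit.

```python
from typing import Dict, Iterable, Iterator, List


def to_batches(docs: Iterable[Dict], action: str = "upload",
               batch_size: int = 1000) -> Iterator[Dict]:
    """Yield request bodies for the docs/index endpoint, batch_size documents at a time."""
    batch: List[Dict] = []
    for doc in docs:
        # Each document carries its own action alongside its fields.
        batch.append({"@search.action": action, **doc})
        if len(batch) == batch_size:
            yield {"value": batch}
            batch = []
    if batch:
        yield {"value": batch}


cars = [{"carId": str(i), "make": "ford", "model": "fiesta"} for i in range(2500)]
bodies = list(to_batches(cars))
print([len(body["value"]) for body in bodies])  # [1000, 1000, 500]
```

Each yielded body would then be sent with an HTTP POST to the docs/index URL shown on the slide.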

WORKING WITH DOCUMENTS – INDEXERS: Data Source, Indexer, Index. Indexers allow the synchronization of documents with an underlying data store. The data source manages the connection; currently supported are direct connections to DocumentDB, SQL Azure, Blob and Table storage. The last 2 are in preview and allow indexing of a wide variety of document formats including HTML and PDFs. For SQL Azure the data source describes which table or view it should synchronize against, which column denotes that the record has changed and which column denotes that a document should be deleted. The indexer links the data source to the index and controls how frequently it will check for changes. The indexer can also be run manually via a REST command.
Data Source: DocDB, SQL Azure, Blob (PDF, HTML, XML, ZIP) or Table. How to identify a new document. How to identify a document that should be deleted. (For SQL, either a DateTime column and a bit field, or integrated change tracking (tables only).)
Indexer: how often should we check this data.
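
To make the data source and indexer pairing more concrete, here is a sketch of what the two definitions might look like for a SQL Azure table, written as Python dictionaries ready to be serialised to JSON. The property names and the @odata.type values are written from my recollection of the REST API and should be checked against the current reference. The table, column and resource names are invented for the example.

```python
import json

# Data source: where the rows live and how to spot changed or deleted rows.
data_source = {
    "name": "cars-sql",                      # hypothetical name
    "type": "azuresql",
    "credentials": {"connectionString": "REPLACE_WITH_CONNECTION_STRING"},
    "container": {"name": "CarStock"},       # hypothetical table or view
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "LastUpdated",   # the DateTime column
    },
    "dataDeletionDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": "IsDeleted",        # the bit column
        "softDeleteMarkerValue": "true",
    },
}

# Indexer: links the data source to the index and sets the polling interval.
indexer = {
    "name": "cars-indexer",                  # hypothetical name
    "dataSourceName": data_source["name"],
    "targetIndexName": "cars",
    "schedule": {"interval": "PT1H"},        # ISO 8601 duration: every hour
}

print(json.dumps(indexer, indent=2))
```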

INDEXERS - DEMO. From SQL Azure show the table, the data source definition and the populated index. Update data in the db, rerun the indexer. If there is time, show the Blob storage indexer. Worth pointing out that all of this can be done via the Azure portal (but it will try to create an index for you instead of reusing an existing one). The Blob storage indexer uses the default field names for document properties.

SEARCHING FOR DOCUMENTS – FREE TEXT
HTTP GET https://[your search service name].search.windows.net/indexes/[index name]/docs?search=[search text]&api-version=2015-02-28
HTTP POST https://[your search service name].search.windows.net/indexes/[index name]/docs/search?search=[search text]&api-version=2015-02-28
Returns all fields in the documents. Ordered according to search score. Returns only the top 50 results. Searches in all “searchable” fields. Matches any of the search terms. Uses the default query syntax. Uses the default scoring profile (defined in the index definition).
Now that we have a fully populated index we can start querying it. Here we see the simplest form of query: the index will be searched for documents matching the search text using the default settings shown here. The index can be queried with a POST or a GET request; the POST request allows for larger queries (16MB vs 8KB). The Azure portal can also be used. Mention scoring profiles again. The query syntax by default allows + and * wildcard matching, and uses “any” matching instead of “all”.
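
A small sketch of building the GET form of this query in Python follows. The main thing it illustrates is URL encoding: in a query string a raw + means a space, so a literal + operator from the simple syntax has to be sent as %2B. The service and index names are placeholders and the function name is made up.

```python
from urllib.parse import urlencode

API_VERSION = "2015-02-28"


def search_url(service: str, index: str, text: str) -> str:
    """Build the GET URL for a free text search using the default settings."""
    query = urlencode({"search": text, "api-version": API_VERSION})
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"


# A space is encoded as "+", while a literal "+" operator becomes "%2B".
print(search_url("my-service", "cars", "ford fiesta"))
print(search_url("my-service", "cars", "ford+fiesta"))
```

The request also needs the api-key header (a query key is enough for searching).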

INDEX STRUCTURE – SCORING PROFILES
Field 1: …………….……. Field 2: ………Ford..…. Field 3:…………………...
Field 1: …….Ford……. Field 2: ………...………. Field 3:…………………...
Field 1: …………………. Field 2: ………...………. Field 3:……Ford……...
By default all fields carry the same “weight”. Boosting: Fields, Ranges, Document Age, Geo.
When we search for a term each field is given equal weight, and the accuracy of a match determines the search score for a document. In the example on screen each document would have an equal search score and would be returned to the caller using the default sort order. Scoring profiles allow us to boost the search score based on a few options. We can boost based on matches found in a particular field (e.g. rank results higher if the match is found in Field 2), we can boost on ranges (for example, matches in documents with a 5-star review will be ranked higher than those with a 3-star review), by the age of the document and by the distance of a geo field from a known location.
Multiple search fields (e.g. Make, Model, Description): by default each field has equal weight when calculating the validity of the match. Scoring profiles help you tweak this by assigning a higher score if the match is in the make than in the description. Custom profiles include: boosting based on age (newest come higher); favouring results closer to a known location; specific fields (e.g. Make higher than Description); a specific range (e.g. results with 5-star ratings count for more than results with 1-star ratings).
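
The effect of field weights can be shown with a toy model. To be clear, this is not how Azure Search computes scores; it is a deliberately simplified illustration in which a document scores the weight of every field that contains the term. All names are invented.

```python
from typing import Dict, List, Optional

# Three documents, each with the term in a different field (as on the slide).
DOCS = {
    "ford_in_field1": {"field1": "a ford here", "field2": "nothing", "field3": "nothing"},
    "ford_in_field2": {"field1": "nothing", "field2": "a ford here", "field3": "nothing"},
    "ford_in_field3": {"field1": "nothing", "field2": "nothing", "field3": "a ford here"},
}


def score(doc: Dict[str, str], term: str, weights: Optional[Dict[str, float]] = None) -> float:
    """Sum the weight (default 1.0) of every field whose text contains the term."""
    weights = weights or {}
    return sum(
        weights.get(name, 1.0)
        for name, text in doc.items()
        if term in text.lower().split()
    )


def rank(term: str, weights: Optional[Dict[str, float]] = None) -> List[str]:
    """Return document names ordered by descending score."""
    return sorted(DOCS, key=lambda name: score(DOCS[name], term, weights), reverse=True)


print([score(doc, "ford") for doc in DOCS.values()])  # equal scores with default weights
print(rank("ford", weights={"field2": 3.0}))          # the field2 match now ranks first
```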

SEARCHING FOR DOCUMENTS – SIMPLE QUERY
Operator   Description   Example
+          AND           ford+fiesta
|          OR            ford|fiesta
-          NOT           ford-fiesta
*          suffix        for*
“ ”        phrase        “ford fiesta”
The search text is parsed by default using the simple query syntax, which allows for basic boolean queries. For example, if I want to search for documents containing ford and fiesta, the plus symbol would be used to denote the second term. The simple query syntax is useful but doesn’t really offer anything above what could already be done in SQL Server. For more power we can set the queryType to “full” and use a bigger subset of the Lucene syntax.

SEARCHING FOR DOCUMENTS – LUCENE QUERY: Regular expression searching; Field limitations; Term boosting; Proximity search.
search=make:ford^3 OR /07[0-9]{3}\ *[0-9]{6}/&queryType=full
The Lucene syntax gives us fine-grained control over how we search. We can use regular expressions to match patterns, we can restrict specific search terms to specific fields, we can boost documents containing some search terms above others and we can control how close together search terms should be. In the example above we are searching for documents where the make field contains “ford” or any of the searchable fields match the regex for a mobile phone number; the ^3 boosts matches on make:ford above the others.
Extension to the simple query syntax. RegExp class compatible regexes (note: uses the Lucene regex syntax, not the .NET one). Field limitations: search within a single field. Term boosting: boost one term over another, e.g. ford^2 bmw will score fords higher than bmw. Proximity search: find me fiesta within 4 words of ford. Sample text shows make equals ford, and results containing the text “one careful owner” will be scored higher.
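
The Lucene building blocks mentioned here (field limitation, term boosting, proximity) are plain string patterns, so they are easy to assemble in code. The helper names below are invented for the example; the syntax they produce (field:term, term^n and "phrase"~n) is standard Lucene query syntax.

```python
from urllib.parse import parse_qs, urlencode


def fielded(field: str, term: str) -> str:
    """Restrict a term to a single field, e.g. make:ford."""
    return f"{field}:{term}"


def boosted(term: str, factor: int) -> str:
    """Boost a term relative to the others, e.g. ford^2."""
    return f"{term}^{factor}"


def proximity(phrase: str, distance: int) -> str:
    """Match the words of a phrase within a number of positions of each other."""
    return f'"{phrase}"~{distance}'


search = f'{boosted(fielded("make", "ford"), 3)} OR {proximity("ford fiesta", 4)}'
query_string = urlencode({"search": search, "queryType": "full", "api-version": "2015-02-28"})

print(search)        # make:ford^3 OR "ford fiesta"~4
print(query_string)  # URL-encoded and ready to append to .../docs?
```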

SEARCHING FOR DOCUMENTS – OPTIONS
Query string                                    Description                              Example
$select                                         Return only these fields                 $select=carId,make,model
$skip                                           Get the next “page” of results           $skip=10
$top                                            The number of results in a page          $top=100
$count                                          Return the total number of results       $count=true
searchFields                                    Only search for the text in these fields searchFields=make,model
highlight, highlightPreTag, highlightPostTag    Return results pre-highlighted           highlight=make&highlightPreTag=<div>&highlightPostTag=</div>
To control the search results (using either query syntax) we have a number of options available. These can restrict the fields returned and allow for paging of documents. Be warned: parameter names change depending on whether you are using HTTP GET or HTTP POST.
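
The GET versus POST naming difference is, as far as I recall, that the OData-style options lose their $ prefix when they move from the query string into the JSON body of a POST. The sketch below shows that mapping under that assumption; the function name is made up, and the result should be checked against the REST reference for the API version in use.

```python
import json


def get_params_to_post_body(params: dict) -> dict:
    """Convert GET-style option names ($top, $skip, ...) to POST body names."""
    return {
        (name[1:] if name.startswith("$") else name): value
        for name, value in params.items()
    }


get_style = {
    "search": "ford",
    "$select": "carId,make,model",
    "$skip": 10,
    "$top": 10,
    "$count": True,
    "searchFields": "make,model",
}

body = get_params_to_post_body(get_style)
print(json.dumps(body))
```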

SEARCHING FOR DOCUMENTS – ODATA FILTERS: Logical operators (and, or, not); Comparisons (eq, ne, gt, lt, ge, le); Collection filters; Geospatial filters.
search="one careful owner"&$filter=price le 20000 and features/any(t: t eq 'BLUETOOTH')
To further refine the documents that can be searched we can use ODATA filters. These allow us to restrict the results to only documents that match certain criteria. The onscreen example shows searching for documents containing the text “one careful owner” in cars that cost no more than £20000 and have the Bluetooth feature. Filters can also be used to restrict results based on geographical area (either a radius from a fixed point or within a polygon of points).
For more SQL-like filtering. Collection filters: collection contains any items, or collection contains any items that match a query. Geospatial: only show items within a certain distance, or only show items within a polygon.
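
Filter expressions are just strings, so they can be composed with a few helpers. This sketch builds the filter from the slide plus a radius filter; the helper names are invented, the coordinates are arbitrary, and the geo.distance form is written from my understanding of the OData syntax Azure Search accepts (distance in kilometres, point given as longitude then latitude).

```python
def any_equals(collection: str, value: str) -> str:
    """Match documents whose string collection contains the given value."""
    escaped = value.replace("'", "''")  # OData escapes a quote by doubling it
    return f"{collection}/any(t: t eq '{escaped}')"


def within_km(field: str, lon: float, lat: float, km: float) -> str:
    """Match documents whose geography point lies within km of (lon, lat)."""
    return f"geo.distance({field}, geography'POINT({lon} {lat})') le {km}"


def all_of(*clauses: str) -> str:
    """Combine clauses with the (lowercase) OData 'and' operator."""
    return " and ".join(clauses)


print(all_of("price le 20000", any_equals("features", "BLUETOOTH")))
print(within_km("position", -3.18, 51.48, 25))
```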

SEARCHING FOR DOCUMENTS – FACETING: Group by field (e.g. number of documents per Make); Group by interval (e.g. Hour, Day, Month); Group by values (e.g. price ranges £1001-£2000, £2001-£3000). One of the most powerful features of search indexes is faceting. This is the idea that when I return results, I also return the number of results there would be if the search were restricted by other terms. In the example on screen you can see this in action: when I search for a new TV, I get the number of results for each of the manufacturers along with my current result set. Azure Search allows this data to be returned broken down by distinct items in a field (e.g. manufacturer name), by an interval (e.g. per day) or by a range (e.g. prices in bands of 1000). Mention controlling sort (e.g. top 5 makes sorted by number of documents, asc or desc).
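
Several facets can be requested in one query by repeating the facet parameter. The sketch below builds such a query string; the facet expressions (count, sort, interval and values) are written from my recollection of the REST API and should be verified against the reference, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def facet_query(search: str, facets: list) -> str:
    """Build a query string with one 'facet' parameter per facet expression."""
    params = [("search", search)]
    params += [("facet", expression) for expression in facets]
    params.append(("api-version", "2015-02-28"))
    return urlencode(params)


FACETS = [
    "make,count:5,sort:count",        # top 5 makes by number of documents
    "price,interval:1000",            # prices in bands of 1000
    "price,values:1000|2000|3000",    # explicit price range boundaries
]

query_string = facet_query("*", FACETS)
print(query_string)
```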

WORKING WITH DOCUMENTS – DEMO. Now that we have all the pieces in place, let’s see it in action: simple query, Lucene query; control an option (e.g. top); scoring profile (will need to update the index); OData filter; faceting; highlighting.

If you want to KNOW MORE… https://azure.microsoft.com/en-gb/documentation/services/search/ https://github.com/Azure-Samples/search-dotnet-getting-started https://blogs.technet.microsoft.com/onsearch/ https://searchsamples.azurewebsites.net Twitter: @memleek So why should you use Azure Search? If you need to do complex querying on large amounts of text but don’t want the headaches of managing servers, give it a look! Azure Search integrates simply with other Azure services and, with a little work, can bring in data from any source.

Questions?

Please give us your feedback: sqlrelay.co.uk/feedback Thank you

END OF THE LINE