Finding Islands, Gaps, and Clusters in Complex Data
Diving Into Analytics With TSQL
Edward Pollack, Sr. Database Administrator, Datto
Thank you to our SQL Saturday #892 Sponsors
Edward Pollack
Sr. DBA, Datto
- Lives in Albany, NY with wife Theresa and sons Nolan (3.5yo) & Oliver (0.9yo), and Legos permanently affixed to the bottoms of his feet.
- Has spoken at over 100 events, including SQL Saturdays and PASS Summit.
- Regularly publishes articles for SQL Shack on fun data-related topics.
- Published author of Dynamic SQL: Applications, Performance, and Security, which is now in its 2nd edition.
/ed-pollack
@EdwardPollack
edrick42
Agenda
- Finding Significant Patterns in Complex Data
- Review: Structured/Inorganic Groupings
- Review: Gaps & Islands in Simple Data
- Data Clusters
- Answering Complex Questions
- Performance
- Conclusion
Structured/Inorganic Groupings
The Pros
- Data can be partitioned into segments based on static rules.
- Can segment data by dates or date parts easily.
- Result set is in a predictable size and format.
- Predictable results.
The Cons
- Does not provide mechanisms for learning or feedback.
- Boundaries can divide data into misleading groupings.
- Predictable results.
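A minimal sketch of a structured grouping, not the presenter's demo. It assumes a hypothetical `game_log` table with made-up rows and runs through SQLite from Python so it works without a SQL Server instance. In T-SQL the month bucket would typically come from a function such as DATEPART rather than SQLite's `strftime`.

```python
import sqlite3

# Hypothetical sample data: one row per game with a win/loss flag.
GAMES = [
    ("2019-03-29", "L"),
    ("2019-03-30", "W"),
    ("2019-03-31", "W"),
    ("2019-04-01", "W"),
    ("2019-04-02", "W"),
    ("2019-04-03", "L"),
]


def wins_by_month(rows):
    """Group games by calendar month, a static (structured) boundary."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE game_log (game_date TEXT, result TEXT)")
    conn.executemany("INSERT INTO game_log VALUES (?, ?)", rows)
    query = """
        SELECT strftime('%Y-%m', game_date) AS game_month,
               COUNT(*) AS games,
               SUM(CASE WHEN result = 'W' THEN 1 ELSE 0 END) AS wins
        FROM game_log
        GROUP BY strftime('%Y-%m', game_date)
        ORDER BY game_month
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


monthly = wins_by_month(GAMES)
for game_month, games, wins in monthly:
    print(game_month, games, wins)
```

With these sample rows, both months report 2 wins in 3 games. The four-game winning streak from 2019-03-30 to 2019-04-02 is split across the month boundary and never appears in the output, which is the "misleading groupings" drawback listed under The Cons.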
Structured Groupings Demo
Gaps/Islands Analysis
- Query that joins to the previous/next rows of data to test for the existence of those rows.
- Can locate and report on missing data.
- Great for analysis of outliers or exceptions.
- Can be used to pinpoint streaks, both positive and negative.
- Allows for many types of analytics against numeric data.
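A sketch of the existence-check approach described above, under stated assumptions: a hypothetical `readings` table with one integer `day_number` per row that has data. An island starts at a row with no row for the previous day and ends at a row with no row for the next day. A gap is the range between one island's end and the next island's start. The SQL avoids dialect-specific features, so it should carry over to T-SQL with little change, though that has not been tested here.

```python
import sqlite3

# Hypothetical data: day numbers on which a reading exists.
DAYS = [1, 2, 3, 5, 6, 9]

ISLANDS_SQL = """
    SELECT s.day_number AS island_start,
           (SELECT MIN(e.day_number)
            FROM readings e
            WHERE e.day_number >= s.day_number
              -- an island ends where the next day has no row
              AND NOT EXISTS (SELECT 1 FROM readings n
                              WHERE n.day_number = e.day_number + 1)
           ) AS island_end
    FROM readings s
    -- an island starts where the previous day has no row
    WHERE NOT EXISTS (SELECT 1 FROM readings p
                      WHERE p.day_number = s.day_number - 1)
    ORDER BY island_start
"""

GAPS_SQL = """
    SELECT d.day_number + 1 AS gap_start,
           (SELECT MIN(n.day_number)
            FROM readings n
            WHERE n.day_number > d.day_number) - 1 AS gap_end
    FROM readings d
    -- a gap begins after any row whose next day is missing,
    -- except after the last row in the table
    WHERE NOT EXISTS (SELECT 1 FROM readings x
                      WHERE x.day_number = d.day_number + 1)
      AND d.day_number < (SELECT MAX(day_number) FROM readings)
    ORDER BY gap_start
"""


def find_islands_and_gaps(day_numbers):
    """Return (islands, gaps) as lists of (start, end) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (day_number INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO readings VALUES (?)",
                     [(d,) for d in day_numbers])
    islands = conn.execute(ISLANDS_SQL).fetchall()
    gaps = conn.execute(GAPS_SQL).fetchall()
    conn.close()
    return islands, gaps


islands, gaps = find_islands_and_gaps(DAYS)
print("islands:", islands)
print("gaps:", gaps)
```

For the sample days this reports the islands 1-3, 5-6, and 9-9, and the gaps 4-4 and 7-8.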
Gaps/Islands Analysis Demo
Data Clusters
- Created by using gaps/islands analysis over any type of data.
- Organizing sequential islands of data into meaningful groupings.
- Allows for related events to be easily identified.
- Introduces data proximity into analytics.
- Data groups itself into clusters naturally based on its contents.
- Must develop and experiment with grouping rules prior to analysis.
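A sketch of one possible grouping rule, assuming a hypothetical `events` table that stores each event as an integer minute offset. The rule chosen here for illustration is that events no more than `max_gap` minutes apart belong to the same cluster. A cluster starts at an event with no earlier event inside that window and ends at an event with no later event inside it. The rule itself is the part that needs experimentation.

```python
import sqlite3

# Hypothetical data: minutes (from some reference time) at which events occurred.
EVENT_MINUTES = [0, 4, 9, 40, 45, 120]

CLUSTERS_SQL = """
    WITH bounds AS (
        SELECT s.event_minute AS cluster_start,
               (SELECT MIN(e.event_minute)
                FROM events e
                WHERE e.event_minute >= s.event_minute
                  -- cluster ends at an event with no follower within max_gap
                  AND NOT EXISTS (SELECT 1 FROM events n
                                  WHERE n.event_minute > e.event_minute
                                    AND n.event_minute <= e.event_minute + :max_gap)
               ) AS cluster_end
        FROM events s
        -- cluster starts at an event with no predecessor within max_gap
        WHERE NOT EXISTS (SELECT 1 FROM events p
                          WHERE p.event_minute < s.event_minute
                            AND p.event_minute >= s.event_minute - :max_gap)
    )
    SELECT b.cluster_start, b.cluster_end, COUNT(*) AS event_count
    FROM bounds b
    JOIN events ev
      ON ev.event_minute BETWEEN b.cluster_start AND b.cluster_end
    GROUP BY b.cluster_start, b.cluster_end
    ORDER BY b.cluster_start
"""


def find_clusters(event_minutes, max_gap):
    """Return (cluster_start, cluster_end, event_count) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_minute INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO events VALUES (?)",
                     [(m,) for m in event_minutes])
    rows = conn.execute(CLUSTERS_SQL, {"max_gap": max_gap}).fetchall()
    conn.close()
    return rows


clusters = find_clusters(EVENT_MINUTES, max_gap=10)
for start, end, count in clusters:
    print(start, end, count)
```

Changing `max_gap` changes the clusters: with a 10-minute window the sample events form three clusters, while a window of 35 minutes merges the first two.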
Data Clusters Demo
Answering Tough Questions
- Filters control what data to analyze.
- Existence checks control cluster parameters.
- Join predicates determine what to group together.
- Metrics include: streaks, droughts, performance, unusual patterns, maxima, minima, etc…
- Dynamic SQL: Loop through dimensions to gather automated insights.
Answering Tough Questions Demo
Performance
- Analytics such as these rely on reading large volumes of data, i.e. index/table scans.
- Not intended for OLTP databases/workloads.
- Run on data that is: replicated, AlwaysOn, ETL, OLAP, data copy, etc…
- Helpful Tools:
  - Covering Indexes.
  - Columnstore Indexes.
  - In-Memory OLTP.
  - Automated Analytics.
  - Incremental Data Loads.
  - LEAD/LAG for some data aggregation challenges.
- Performance can be optimized to scale linearly with the size of the data read.
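A sketch of the LEAD/LAG option from the list above, applied to streaks. It assumes a hypothetical `game_log` table with made-up rows. Instead of joining the table to itself, LAG compares each row to the previous one in a single ordered pass, and a running SUM turns the change flags into streak identifiers. This runs on SQLite 3.25 or later; LAG and SUM() OVER are also available in SQL Server 2012 and later, but no timing claims are made here.

```python
import sqlite3

# Hypothetical sample data: one row per game with a win/loss flag.
GAMES = [
    ("2019-03-29", "L"),
    ("2019-03-30", "W"),
    ("2019-03-31", "W"),
    ("2019-04-01", "W"),
    ("2019-04-02", "W"),
    ("2019-04-03", "L"),
]

STREAKS_SQL = """
    WITH flagged AS (
        SELECT game_date,
               result,
               -- 1 when the result differs from the previous game (or there is none)
               CASE WHEN result = LAG(result) OVER (ORDER BY game_date)
                    THEN 0 ELSE 1 END AS is_new_streak
        FROM game_log
    ),
    numbered AS (
        SELECT game_date,
               result,
               -- running total of the flags gives each streak its own id
               SUM(is_new_streak) OVER (ORDER BY game_date
                                        ROWS UNBOUNDED PRECEDING) AS streak_id
        FROM flagged
    )
    SELECT result,
           MIN(game_date) AS streak_start,
           MAX(game_date) AS streak_end,
           COUNT(*) AS games
    FROM numbered
    GROUP BY streak_id, result
    ORDER BY streak_start
"""


def find_streaks(rows):
    """Return (result, streak_start, streak_end, games) for every streak."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE game_log (game_date TEXT PRIMARY KEY, result TEXT)")
    conn.executemany("INSERT INTO game_log VALUES (?, ?)", rows)
    streaks = conn.execute(STREAKS_SQL).fetchall()
    conn.close()
    return streaks


streaks = find_streaks(GAMES)
longest_win = max((s for s in streaks if s[0] == "W"), key=lambda s: s[3])
print(streaks)
print("longest winning streak:", longest_win)
```

On the sample rows the longest winning streak is four games, from 2019-03-30 to 2019-04-02.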
Important Considerations
- Data Quality! How to manage:
  - NULLs
  - Missing Data
  - Unexpected inputs/data values
  - Duplicate data
- The borders of a data cluster within a multi-partitioned data set may require special treatment.
- QA: Thoroughly test all use cases!
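A small sketch of a pre-analysis quality check, assuming a hypothetical `raw_readings` table. It counts NULLs and duplicate values, then returns a de-duplicated, NULL-free list that an islands query could safely consume. Unexpected values and partition borders would need their own rules and are not handled here.

```python
import sqlite3

# Hypothetical raw input containing a duplicate (2) and a NULL.
RAW_DAYS = [1, 2, 2, None, 3, 5]


def check_and_clean(day_numbers):
    """Return (report, cleaned) where report counts NULL and duplicate rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_readings (day_number INTEGER)")
    conn.executemany("INSERT INTO raw_readings VALUES (?)",
                     [(d,) for d in day_numbers])

    total, nulls, duplicates = conn.execute("""
        SELECT COUNT(*),
               SUM(CASE WHEN day_number IS NULL THEN 1 ELSE 0 END),
               -- COUNT(column) skips NULLs, so the difference is the duplicate count
               COUNT(day_number) - COUNT(DISTINCT day_number)
        FROM raw_readings
    """).fetchone()

    cleaned = [row[0] for row in conn.execute("""
        SELECT DISTINCT day_number
        FROM raw_readings
        WHERE day_number IS NOT NULL
        ORDER BY day_number
    """)]
    conn.close()

    report = {"total_rows": total, "null_rows": nulls, "duplicate_rows": duplicates}
    return report, cleaned


report, cleaned = check_and_clean(RAW_DAYS)
print(report)
print(cleaned)
```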
Can This be Done With Other Tools?
- Probably!
- TSQL is a great tool as it can filter and manage data alongside analysis.
- If the data is already well-structured for reporting, then R/Python may be able to provide similar value and performance.
- Decide on tool based on:
  - Performance
  - Filtering/data manipulation required
  - Complexity of analysis
  - Expertise in tools
  - What will happen with this data next?
Conclusion
- Data can be organically grouped, regardless of complexity.
- Results can be used to determine many useful metrics:
  - Winning/losing streaks.
  - Data clusters.
  - Related events.
  - Patterns or abnormalities within a data set.
- Be creative and find innovative solutions to challenging problems!
Learn more from Ed Pollack @EdwardPollack epollack@datto.com