Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.

Slides:



Advertisements
Similar presentations
Chapter 5: Creating and Managing Tables using PROC SQL 1 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Advertisements

Statistical Methods Lynne Stokes Department of Statistical Science Lecture 7: Introduction to SAS Programming Language.
Chapter 9: Introducing Macro Variables 1 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
S ORTING WITH SAS L ONG, VERY LONG AND LARGE, VERY LARGE D ATA Aldi Kraja Division of Statistical Genomics SAS seminar series June 02, 2008.
Comp 335 File Structures Reclaiming and Reusing File Space Techniques for File Maintenance.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Performing Queries Using PROC SQL Chapter 1 1 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 18: Modifying SAS Data Sets and Tracking Changes 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Operating Systems CSE 411 CPU Management Sept Lecture 11 Instructor: Bhuvan Urgaonkar.
Chapter 3: Combining Tables Horizontally using PROC SQL 1 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 10:Processing Macro Variables at Execution Time 1 STAT 541 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Introduction to SAS Essentials Mastering SAS for Data Analytics Alan Elliott and Wayne Woodward SAS ESSENTIALS -- Elliott & Woodward1.
DAY 15: ACCESS CHAPTER 2 Larry Reaves October 7,
DAY 14: ACCESS CHAPTER 1 Tazin Afrin October 03,
Microsoft Access 2003 Define some key Access terminology: Field – A single characteristic or attribute of a person, place, object, event, or idea. Record.
McGraw-Hill Technology Education © 2004 by the McGraw-Hill Companies, Inc. All rights reserved. Office Access 2003 Lab 3 Analyzing Data and Creating Reports.
©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina Chapter 17 supplement: Review of Formatting Data STAT 541.
Exploring Microsoft Office XP - Microsoft Word 2002 Chapter 71 Exploring Microsoft Word Chapter 7 The Expert User: Workgroups, Forms, Master Documents,
Physical DB Issues, Indexes, Query Optimisation Database Systems Lecture 13 Natasha Alechina.
Chapter 15: Combining Data Horizontally 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
SAS Efficiency Techniques and Methods By Kelley Weston Sr. Statistical Programmer Quintiles.
More about Databases. Data Entry through Forms Table View (Data sheet view) is useful for data entry of new records But sometimes customization would.
Access: Queries Ad-hoc Reporting Chapter T. Access Queries Queries Access Properties Sorting Selection Criteria Calculations.
Summer SAS Workshop Lecture 2. Summer Summer SAS Workshop Lecture 2 I’ve got Data…how do I get started? Libname Review How do you do arithmetic.
Chapter 16: Using Lookup Tables to Match Data 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 22: Using Best Practices 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
SAS Basics. Windows Program Editor Write/edit all your statements here. Log Watch this for any errors in program as it runs. Output Will automatically.
The Power of the BY Statement SVSUG Paul Choate, California Developmental Services (& Toby Dunn, U.S. Army Medical Department Center & School)
Chapter 17: Formatting Data 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Using Derrick Rapley Maryland CFUG January 8, 2002.
Chapter 19: Introduction to Efficient SAS Programming 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
SAS Basics. Windows Program Editor Write/edit all your statement here.
SAUSAG 69 – 20 Feb 2014 Smarter Sorts Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS) Beyond the Obvious.
Chapter 4: Combining Tables Vertically using PROC SQL 1 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Creating Indexes on Tables An index provides quick access to data in a table, based on the values in specified columns. A table can have more than one.
Chapter 17 Supplement: Alternatives to IF-THEN/ELSE Processing STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South.
Virtual Memory Pranav Shah CS147 - Sin Min Lee. Concept of Virtual Memory Purpose of Virtual Memory - to use hard disk as an extension of RAM. Personal.
Chapter 21: Controlling Data Storage Space 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
Chapter 6: Modifying and Combining Data Sets  The SET statement is a powerful statement in the DATA step DATA newdatasetname; SET olddatasetname;.. run;
Chapter 14: Combining Data Vertically 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
FILES AND EXCEPTIONS Topics Introduction to File Input and Output Using Loops to Process Files Processing Records Exceptions.
Working Efficiently with Large SAS® Datasets Vishal Jain Senior Programmer.
Chapter 5: Enhancing Your Output with ODS
Chapter 6: Modifying and Combining Data Sets
Decomposition Storage Model (DSM)
Chapter 2: Getting Data into SAS
Chapter 19: Introduction to Efficient SAS Programming
Hash-Based Indexes Chapter 11
Chapter 3: Working With Your Data
Chapter 13: Creating Samples and Indexes
Chapter 18: Modifying SAS Data Sets and Tracking Changes
Former Chapter 23: Selecting Efficient Sorting Strategies
Chapter 11: File System Implementation
Chapter 4: Sorting, Printing, Summarizing
Database Implementation Issues
Lecture 12 Lecture 12: Indexing.
Chapters 5 and 7 supplement
Chapter 24 (4th ed.): Creating Functions
More about Databases.
Hash-Based Indexes Chapter 10
Using Macros to Solve the Collation Problem
Solving Equations Containing Decimals
Overview of Query Evaluation
Chapter 24: Querying Data Efficiently
Chapter 11 Instructor: Xin Zhang
Database Implementation Issues
Nagendra Vemulapalli Access chapters 1&2 Nagendra Vemulapalli
STAT 515 Statistical Methods I Lecture 1 August 22, 2019 Originally prepared by Brian Habing Department of Statistics University of South Carolina.
Presentation transcript:

Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina

2 Outline Avoiding Unnecessary Sorts Avoiding Unnecessary Sorts Using a Threaded Sort Using a Threaded Sort Calculating and Allocating Sort Resources (not covered) Calculating and Allocating Sort Resources (not covered) Handling Large Data Sets Handling Large Data Sets Removing Duplicate Observations Efficiently Removing Duplicate Observations Efficiently

3 Avoiding Unnecessary Sorts Sorts can be avoided in some situations Sorts can be avoided in some situations BY groups with an Index BY groups with an Index –If a data set includes an index, you can use a BY statement on the indexed variable without having used PROC SORT –The BY statement can be used in a DATA step or PROC step –Processing a data set with an index may be less efficient than PROC SORT –Does not index if DESCENDING or NOTSORTED are used or data is pre-sorted

4 Avoiding Unnecessary Sorts The NOTSORTED option groups the data on the BY variable, but doesn’t order groups The NOTSORTED option groups the data on the BY variable, but doesn’t order groups –Useful when sorting on nominal groupings –Results are interesting when data is not pre-grouped proc freq data=stat541.fall2008; by gender notsorted; table major; run;

5 Avoiding Unnecessary Sorts You can actually group on formatted values rather than the variable itself You can actually group on formatted values rather than the variable itself GROUPFORMAT option can only be used in the DATA step GROUPFORMAT option can only be used in the DATA step GROUPFORMAT allows you to create groups without creating a new variable GROUPFORMAT allows you to create groups without creating a new variable The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE

6 Avoiding Unnecessary Sorts PROC CONTENTS can be used to see whether data is already sorted PROC CONTENTS can be used to see whether data is already sorted The SORTED BY option can then be used to include the sort information as a data set attribute The SORTED BY option can then be used to include the sort information as a data set attribute

7 Using a Threaded Sort Threaded sorts can distribute sorting across multiple CPUs Threaded sorts can distribute sorting across multiple CPUs PROC SORT dsname THREADS|NOTHREADS; You can modify or query the number of CPUs with CPUCOUNT You can modify or query the number of CPUs with CPUCOUNT

8 Handling Large Data Sets If a data set is too large to sort (insufficient space for the multiple copies of the data set needed for a sort), the data set can be split into smaller data sets then reassembled, typically with a SET statement/BY statement combination. If a data set is too large to sort (insufficient space for the multiple copies of the data set needed for a sort), the data set can be split into smaller data sets then reassembled, typically with a SET statement/BY statement combination.

9 Handling Large Data Sets Many methods are available Many methods are available  FIRSTOBS= OBS= in DATA step  IF/OUTPUT in DATA step  WHERE in PROC SORT step  WHERE in DATA step A DATA step is better than PROC APPEND for reassembling a large data set A DATA step is better than PROC APPEND for reassembling a large data set

10 Handling Large Data Sets The TAGSORT option saves only the BY variables and observation numbers in temporary files The TAGSORT option saves only the BY variables and observation numbers in temporary files This saves on the space set aside for a SORT This saves on the space set aside for a SORT

11 Removing Duplicate Observations Efficiently NODUPKEY NODUPKEY NODUPRECS NODUPRECS –Checks the entire record, not just the BY variables –Can be limited to post-DROP and post-KEEP variables FIRST. and LAST. FIRST. and LAST. –I’m surprised the book included this choice