Singleton Processing with Limited Memory Peter L. Montgomery Microsoft Research Redmond, WA, USA.

Relations, Ideals, Singletons
Relation: a pair (a, b) with b > 0 and gcd(a, b) = 1.
A relation is smooth if the norm of a − bα is smooth in Q(α)/Q, for each of the two extension fields Q(α).
Ideals are (usually) identified by p and by the ratio a/b mod p, where the prime p divides the norm of a − bα for some extension Q(α).
Singleton: an ideal appearing only once in our data.
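To make the definitions above concrete, here is a minimal Python sketch (Python 3.9 or later). The function names `is_valid_relation` and `ideal_key` are illustrative and not from the slides, and the handling of the case where p divides b is an assumption of this sketch.

```python
from math import gcd

def is_valid_relation(a: int, b: int) -> bool:
    """A relation is a pair (a, b) with b > 0 and gcd(a, b) = 1."""
    return b > 0 and gcd(a, b) == 1

def ideal_key(a: int, b: int, p: int) -> tuple[int, int]:
    """Identify an ideal above the prime p by the pair (p, a/b mod p).

    Assumption: when p divides b the ratio is undefined, and this sketch
    uses p itself as a marker for that projective case.
    """
    if b % p == 0:
        return (p, p)
    # pow(b, -1, p) is the modular inverse of b modulo p.
    return (p, a * pow(b, -1, p) % p)
```

For example, `ideal_key(7, 3, 11)` gives `(11, 6)`, and indeed 7 − 6·3 = −11 is divisible by 11.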

Filter inputs
One or more files of smooth relations.
May contain duplicates (especially when using lattice sieving).
Some norm divisors (perhaps primes > 1M) appear alongside (a, b) in the input files.
Only ideals for those primes will be processed.

Desired filter outputs
A file retaining the useful relations.
Remove duplicates.
Recursively remove all relations with a singleton ideal.
Saved relations may be in any order.

Special Requirements
Input might have 100M relations on 100M ideals (corresponding to large prime bounds of 1000M).
Run on a PC with 1.5 Gbyte of available memory.
Can tolerate 1% false deletions and 5% false retentions.
Desire to identify free relations, where there are several a/b ratios for one p.

Present large arrays – 1
Duplication check (for relations)
  – Hash table, via 32-bit functions h1 and h2.
  – h1 tells where to start looking for h2 within the table.
  – 4 bytes per relation to store h2.
  – An 80% full table needs 4*(100M)/0.8 = 500 Mbyte.
Factor base (ideals)
  – Hash table with (p, a/b mod p, index) triples.
  – index is a 32-bit ordinal unique to this ideal.
  – 12 bytes per entry (more for 64-bit p).
  – An 80% full table needs 12*(100M)/0.8 = 1500 Mbyte.
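The duplication check can be sketched as an open-addressing table that stores only h2, with h1 choosing the starting slot. This is a minimal illustration, not the original program: the choice of BLAKE2 as the hash, linear probing, the value 0 as the empty marker, and the names `hash_pair` and `DuplicateTable` are all assumptions.

```python
from array import array
import hashlib

EMPTY = 0  # assumption: 0 marks an empty slot, so stored h2 values are forced nonzero

def hash_pair(a, b):
    """Return two 32-bit hashes (h1, h2) of the relation (a, b)."""
    digest = hashlib.blake2b(f"{a},{b}".encode(), digest_size=8).digest()
    h1 = int.from_bytes(digest[:4], "little")
    h2 = int.from_bytes(digest[4:], "little") or 1  # avoid the EMPTY marker
    return h1, h2

class DuplicateTable:
    def __init__(self, slots):
        self.slots = slots
        self.used = 0
        # Unsigned ints, 4 bytes each on typical platforms.
        self.table = array("I", [EMPTY]) * slots

    def seen_before(self, a, b):
        """Return True if (a, b) looks like a duplicate; otherwise record it."""
        h1, h2 = hash_pair(a, b)
        i = h1 % self.slots              # h1 says where to start looking
        while self.table[i] != EMPTY:
            if self.table[i] == h2:      # same h2 near the same h1: treat as duplicate
                return True
            i = (i + 1) % self.slots     # linear probing (an assumption)
        if self.used + 1 >= self.slots:  # keep one slot empty so probing terminates
            raise MemoryError("duplicate table is full")
        self.table[i] = h2
        self.used += 1
        return False
```

Because only h2 is stored, two different relations with equal h2 and nearby h1 are reported as duplicates; this is the kind of false deletion the slides budget for.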

Present large arrays – 2
Relations and their ideals
  – Has (line number, index1, index2, ...) of retained relations.
  – Each index i is an ordinal from the factor base table.
  – If six primes/relation, need 28*(100M) = 2800 Mbyte.
Ideal frequencies
  – Indexed by index from the factor base table.
  – Tells how often each ideal appears in the relations table.
  – Counts saturate at 255. Uses 100 Mbyte.
Total: 500 + 1500 + 2800 + 100 = 4900 Mbyte (about 330% of goal).
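The table sizes quoted on these two slides can be rechecked with a few lines of arithmetic. The sketch below assumes that the 28 bytes per relation are a 4-byte line number plus six 4-byte indices, and that each frequency count occupies one byte; the variable names are illustrative.

```python
M = 10**6
GOAL_BYTES = 1500 * M  # the 1.5 Gbyte memory goal

def table_slots(entries: int) -> int:
    """Slots needed so that `entries` items fill 80% of a hash table."""
    return entries * 5 // 4

relations = 100 * M
ideals = 100 * M
primes_per_relation = 6

sizes = {
    "duplication": 4 * table_slots(relations),               # 4-byte h2 per slot
    "factor base": 12 * table_slots(ideals),                 # 12-byte triple per slot
    "relations": (4 + 4 * primes_per_relation) * relations,  # line number + indices
    "frequencies": 1 * ideals,                               # one saturating byte each
}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name:12s} {size // M:5d} Mbyte")
print(f"{'total':12s} {total // M:5d} Mbyte ({100 * total / GOAL_BYTES:.0f}% of goal)")
```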

High-level program flow
Allocate duplication, factor base, and relations tables.
Read inputs. Skip duplicate relations. Insert ideals into the factor base table. Construct the relations table with ideals and source line numbers.
Sort factor base by p. Append free relations to the relations table.
Free duplication and factor base tables. Allocate the frequency table.
Scan the relations table to initialize frequencies.
Repeatedly scan the relations table. Delete all relations with a singleton ideal, while adjusting frequencies.
Reread original inputs. Output file gets all non-free relations which survived in the relations table.
Free relations and frequencies tables.
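The repeated-scan step of this flow can be sketched in memory as follows. The representation (a dict from relation id to a tuple of distinct ideal indices) and the name `remove_singletons` are assumptions for illustration. The sketch also assumes that a count which has saturated at 255 is never decremented, since its true value is unknown.

```python
from collections import Counter

SATURATED = 255  # counts stop increasing here, as on the slides

def remove_singletons(relations):
    """Recursively drop relations that contain a singleton ideal.

    `relations` maps a relation id to a tuple of distinct ideal indices.
    Returns a dict holding only the surviving relations.
    """
    freq = Counter()
    for ideals in relations.values():
        for i in ideals:
            freq[i] = min(freq[i] + 1, SATURATED)

    alive = dict(relations)
    changed = True
    while changed:                      # one iteration corresponds to one scan
        changed = False
        for rid, ideals in list(alive.items()):
            if any(freq[i] == 1 for i in ideals):
                for i in ideals:
                    if freq[i] < SATURATED:  # assumption: saturated counts stay put
                        freq[i] -= 1
                del alive[rid]
                changed = True
    return alive
```

Deleting one relation can turn another ideal into a singleton, which is why the scan repeats until nothing changes. With `{0: (1, 2), 1: (2, 3), 2: (3, 4), 3: (3, 4), 4: (4, 3)}`, relation 0 goes first (ideal 1 is a singleton), which makes ideal 2 a singleton and removes relation 1 as well.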

Idea: Move relations table to disk
While inputs are read, the relations table (RT) is built sequentially.
While RT is scanned sequentially for singletons, the revised RT is written back at the start of the array.
While inputs are reread, RT is read sequentially to identify what to retain.
A sequential disk file meets these needs (use a new file when writing the revised RT).
Variation: multiple, smaller files.
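A minimal sketch of the on-disk variant follows: each scan reads the RT file sequentially and writes the survivors to a new file, which then replaces the old one. The binary record layout (32-bit line number, one-byte ideal count, 32-bit ideal ids), the function names, and the `.next` file suffix are assumptions of this sketch. Saturating counts are omitted for brevity.

```python
import os
import struct
import tempfile
from collections import Counter

def pack_record(line_no, ideals):
    """Hypothetical layout: uint32 line number, uint8 count, then uint32 ideal ids."""
    return struct.pack(f"<IB{len(ideals)}I", line_no, len(ideals), *ideals)

def read_rt(path):
    """Yield (line_no, ideals) records from the RT file in sequential order."""
    with open(path, "rb") as f:
        while True:
            header = f.read(5)
            if len(header) < 5:
                return
            line_no, n = struct.unpack("<IB", header)
            yield line_no, struct.unpack(f"<{n}I", f.read(4 * n))

def scan_once(src, dst, freq):
    """Copy records without a singleton ideal from src to dst; return deletions."""
    deleted = 0
    with open(dst, "wb") as out:
        for line_no, ideals in read_rt(src):
            if any(freq[i] == 1 for i in ideals):
                for i in ideals:
                    freq[i] -= 1
                deleted += 1
            else:
                out.write(pack_record(line_no, ideals))
    return deleted

def filter_on_disk(rt_path, max_scans=10):
    """Rescan the RT file until no relation is deleted or max_scans is reached."""
    freq = Counter(i for _, ideals in read_rt(rt_path) for i in ideals)
    for _ in range(max_scans):
        deleted = scan_once(rt_path, rt_path + ".next", freq)
        os.replace(rt_path + ".next", rt_path)  # the new file becomes the RT
        if deleted == 0:
            break

def demo():
    """Build a tiny RT file, filter it, and return the surviving line numbers."""
    records = [(10, (1, 2)), (11, (2, 3)), (12, (3, 4)), (13, (3, 4))]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rt.bin")
        with open(path, "wb") as f:
            for line_no, ideals in records:
                f.write(pack_record(line_no, ideals))
        filter_on_disk(path)
        return [line_no for line_no, _ in read_rt(path)]
```

Only the frequency counts stay in memory; the relations themselves are touched strictly in file order, which is what makes a plain sequential file sufficient.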

Revised in-memory sizes
Duplication: 500 Mbyte (while reading inputs).
Factor base: 1500 Mbyte (while reading inputs and checking for free relations).
Frequencies: 100 Mbyte (while repeatedly scanning RT).
Still using 2000 Mbyte at peak (duplication plus factor base), 33% above the 1500 Mbyte goal.

Replacing factor base table by functions
While reading inputs, hash each ideal to a 64-bit value hid. Allow 64-bit p.
On-disk RT will store hid, not index.
Enlarge the frequencies table to 500M entries.
On each scan of RT, use a unique mapping from hid to a subscript in [0, 500M − 1].
Frequencies and duplication tables are not needed at the same time.
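One way to picture the table-free scheme: hash each ideal to a 64-bit hid once, then on every scan derive a subscript from hid with a scan-dependent function, so that two ideals colliding on one scan are unlikely to collide on the next. The slides do not specify the hash or the mapping; the BLAKE2 hash, the multiplicative mixing constants, and the names `ideal_hid`, `subscript`, and `count_scan` below are assumptions.

```python
import hashlib

MASK64 = (1 << 64) - 1

def ideal_hid(p: int, r: int) -> int:
    """Hash the ideal (p, a/b mod p) to a 64-bit value hid."""
    digest = hashlib.blake2b(f"{p}:{r}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def subscript(hid: int, scan: int, table_size: int) -> int:
    """Map hid to a subscript in [0, table_size - 1], differently for each scan."""
    # Scan-dependent odd multiplier (assumed constants), then fold the high bits down.
    mult = ((0x9E3779B97F4A7C15 + 2 * scan * 0xBF58476D1CE4E5B9) & MASK64) | 1
    mixed = (hid * mult) & MASK64
    mixed ^= mixed >> 32
    # Scale a 64-bit value into the table range without a modulo.
    return (mixed * table_size) >> 64

def count_scan(hids, scan: int, table_size: int) -> bytearray:
    """Build the saturating one-byte frequency table for one scan of RT."""
    freq = bytearray(table_size)
    for h in hids:
        s = subscript(h, scan, table_size)
        if freq[s] < 255:
            freq[s] += 1
    return freq
```

An ideal whose subscript collides with another ideal on a given scan looks non-singleton for that scan; it is falsely retained only if this happens on every scan.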

Good points
Table sizes reduced to 500 Mbyte, one third of our goal.
Primary cause of false deletions: two relations which hash to the same h2 and to nearby h1, so they look like duplicates.
Primary cause of false retentions: an ideal for which the hid → subscript maps always mate it with something else.

Potential trouble spots
Many cache (and TLB?) misses.
Disk I/O will slow scanning, so perhaps do only 5-10 scans.
Free relations won't be found.
Without an injective mapping from ideal to subscript, it seems hard to accurately count distinct ideals on input and output files (useful summary statistics).

Larger data sets with 1.5 Gbyte?
Duplication table can store the first 300M distinct relations, until 80% full.
Frequencies can saturate at 3. A 0.75 Gbyte array holds 3000M two-bit entries, perhaps 1000M ideals with the table 33% full.
One such array checks for singletons with the current hid → subscript function while another initializes for the next function.
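The two-bit frequency array can be sketched as four saturating counters packed into each byte. The class name `TwoBitCounters` and its methods are illustrative, not from the slides.

```python
class TwoBitCounters:
    """Array of counters that saturate at 3, packed four to a byte."""

    def __init__(self, n: int):
        self.n = n
        self.data = bytearray((n + 3) // 4)  # 2 bits per entry

    def get(self, i: int) -> int:
        """Return counter i, a value in 0..3."""
        return (self.data[i >> 2] >> ((i & 3) * 2)) & 3

    def increment(self, i: int) -> None:
        """Add one to counter i unless it has already saturated at 3."""
        if self.get(i) < 3:
            # The field is below 3, so adding 1 cannot carry into a neighbour.
            self.data[i >> 2] += 1 << ((i & 3) * 2)
```

At this packing, 3000M entries occupy 750M bytes, matching the 0.75 Gbyte figure on the slide. A count of exactly 1 still marks a singleton, and any value of 3 only means "three or more".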