Singleton Processing with Limited Memory
Peter L. Montgomery
Microsoft Research, Redmond, WA, USA
Relations, Ideals, Singletons
Relation: Pair (a, b) with b > 0 and gcd(a, b) = 1.
Relation is smooth if the norm of a − bα is smooth in Q(α)/Q, for each of two extension fields Q(α).
Ideals are (usually) identified by p and by the ratio a/b mod p, where the prime p divides the norm of a − bα for some extension Q(α).
Singleton: An ideal appearing only once in our data.
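The identification of an ideal by p and a/b mod p can be sketched as follows. This is a minimal illustration, not the author's code: the function name `ideal_key` is hypothetical, the choice to encode the case p | b as r = p is an assumed convention, and the key omits which of the two extension fields the ideal belongs to.

```python
from math import gcd

def ideal_key(a, b, p):
    """Return (p, r) with r = a/b mod p for a prime p dividing the norm.

    When p divides b the ratio is "at infinity"; this sketch encodes that
    case as r = p (an assumed convention, not taken from the slides).
    """
    if b <= 0 or gcd(a, b) != 1:
        raise ValueError("a relation needs b > 0 and gcd(a, b) = 1")
    if b % p == 0:
        return (p, p)
    # pow(b, -1, p) is the modular inverse of b (Python 3.8 or later).
    return (p, a * pow(b, -1, p) % p)
```

Two relations share an ideal for p exactly when their keys agree, so a singleton is a key that occurs in only one relation.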
Filter inputs
One or more files of smooth relations.
May contain duplicates (especially when using lattice sieving).
Some norm divisors (perhaps primes > 1M) appear alongside (a, b) in the input files.
Only ideals for those primes will be processed.
Desired filter outputs
A file retaining the useful relations.
Remove duplicates.
Recursively remove all relations with a singleton ideal.
Saved relations may be in any order.
Special requirements
Input might have 100M relations on 100M ideals (corresponding to large prime bounds of 1000M).
Run on a PC with 1.5 Gbyte of available memory.
Can tolerate 1% false deletions and 5% false retentions.
Desire to identify free relations, where there are several a/b ratios for one p.
Present large arrays – 1
Duplication check (for relations)
–Hash table, via 32-bit functions h1 and h2.
–h1 tells where to start looking for h2 within the table.
–4 bytes per relation to store h2.
–An 80% full table needs 4*(100M)/0.8 = 500 Mbyte.
Factor base (ideals)
–Hash table with (p, a/b mod p, index) triples.
–index is a 32-bit ordinal unique to this ideal.
–12 bytes per entry (more for 64-bit p).
–An 80% full table needs 12*(100M)/0.8 = 1500 Mbyte.
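The duplication check might look like the sketch below: a table of 32-bit slots in which h1 picks the starting slot and only h2 is stored. The two mixing functions, the linear probing, the use of 0 as the empty marker, and the class name `DuplicateFilter` are all assumptions made for illustration; the slides do not specify them.

```python
from array import array

MASK32 = 0xFFFFFFFF

def h1(a, b):
    # Hypothetical 32-bit mixing function; selects the starting slot.
    return ((a * 0x9E3779B1) ^ (b * 0x85EBCA6B)) & MASK32

def h2(a, b):
    # Hypothetical second 32-bit function; the stored fingerprint.
    v = ((a * 0xC2B2AE35) + (b * 0x27D4EB2F) + 0x165667B1) & MASK32
    return v or 1  # 0 is reserved for "empty slot"

class DuplicateFilter:
    def __init__(self, slots):
        # Typecode "I" is an unsigned int, 4 bytes on common platforms.
        self.table = array("I", [0]) * slots
        self.slots = slots
        self.used = 0

    def seen_before(self, a, b):
        """Return True if (a, b) looks like a duplicate; otherwise record it."""
        if self.used >= self.slots - 1:
            raise MemoryError("duplication table is full")
        tag = h2(a, b)
        i = h1(a, b) % self.slots
        while self.table[i] != 0:       # probe forward from the h1 position
            if self.table[i] == tag:
                return True             # same h2 near the same h1
            i = (i + 1) % self.slots
        self.table[i] = tag
        self.used += 1
        return False
```

Because only h2 is stored, two different relations with equal h2 and nearby h1 are reported as duplicates, which is the source of the tolerated false deletions.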
Present large arrays – 2
Relations and their ideals
–Has (line number, index_1, index_2, ...) of retained relations.
–Each index_i is an ordinal from the factor base table.
–If six primes/relation, need 28*(100M) = 2800 Mbyte.
Ideal frequencies
–Indexed by index from the factor base table.
–Tells how often each ideal appears in the relations table.
–Counts saturate at 255. Uses 100 Mbyte.
Total = 500 + 1500 + 2800 + 100 = 4900 Mbyte (about 330% of goal).
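The sizes quoted for the four tables can be re-derived with a few lines of integer arithmetic. The variable names are illustrative; the counts, entry sizes, and 80% load factor are the ones given on the slides.

```python
M = 10**6                      # "M" on the slides: one million
GOAL = 1500 * M                # 1.5 Gbyte of available memory

relations = 100 * M
ideals = 100 * M
primes_per_relation = 6
LOAD_NUM, LOAD_DEN = 4, 5      # tables are 80% = 4/5 full

# Bytes needed = bytes per entry * entries / load factor.
duplication = 4 * relations * LOAD_DEN // LOAD_NUM
factor_base = 12 * ideals * LOAD_DEN // LOAD_NUM
# 4-byte line number plus one 4-byte index per prime.
relations_table = (4 + 4 * primes_per_relation) * relations
frequencies = 1 * ideals       # one byte per ideal, saturating at 255

total = duplication + factor_base + relations_table + frequencies
percent_of_goal = 100 * total / GOAL

print(total // M, "Mbyte,", round(percent_of_goal), "% of goal")
```

The exact ratio is about 327%, which the slide rounds to 330%.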
High-level program flow
Allocate duplication, factor base, relations tables.
Read inputs. Skip duplicate relations. Insert ideals into factor base table. Construct relations table with ideals and source line numbers.
Sort factor base by p. Append free relations to relations table.
Free duplication and factor base tables. Allocate frequencies table.
Scan relations table to initialize frequencies.
Repeatedly scan relations table. Delete all relations with a singleton ideal, while adjusting frequencies.
Reread original inputs. Output file gets all non-free relations which survived in relations table.
Free relations and frequencies tables.
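The frequency-initialization and repeated-scan steps of this flow can be sketched in memory as below. It assumes each relation is a (line number, list of ideal indices) pair in which no index repeats, and it uses a dictionary in place of the byte array; the function name `remove_singletons` is hypothetical.

```python
SATURATE = 255  # counts stop at 255, as on the slides

def remove_singletons(relations):
    """Return the relations that survive recursive singleton removal."""
    # Initial scan: count how often each ideal appears.
    freq = {}
    for _, ideals in relations:
        for i in ideals:
            freq[i] = min(freq.get(i, 0) + 1, SATURATE)

    # Repeated scans: stop when a full pass deletes nothing.
    changed = True
    while changed:
        changed = False
        kept = []
        for rel in relations:
            _, ideals = rel
            if any(freq[i] == 1 for i in ideals):
                for i in ideals:
                    # A saturated count is no longer exact, so leave it alone.
                    if freq[i] < SATURATE:
                        freq[i] -= 1
                changed = True
            else:
                kept.append(rel)
        relations = kept
    return relations
```

Deleting one relation can turn another ideal into a singleton, which is why the scan has to be repeated rather than run once.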
Idea: Move relations table to disk
While inputs are read, the relations table (RT) is built sequentially.
While RT is scanned sequentially for singletons, the revised RT is written back at the start of the array.
While inputs are reread, RT is read sequentially to identify what to retain.
A sequential disk file meets these needs (use a new file when writing the revised RT).
Variation: multiple, smaller files.
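A sequential on-disk RT might be handled as in the following sketch, where each scan reads one file front to back and writes the survivors to a new file. The record layout (32-bit line number, one-byte ideal count, 32-bit indices), the function names, and the two-file alternation are assumptions for illustration; saturation of the counts is omitted for brevity.

```python
import struct

HEADER = struct.Struct("<IB")  # source line number, number of ideals

def pack_record(line_no, ideals):
    return HEADER.pack(line_no, len(ideals)) + struct.pack(f"<{len(ideals)}I", *ideals)

def write_rt(path, relations):
    """Build the RT file sequentially, as when the inputs are first read."""
    with open(path, "wb") as f:
        for line_no, ideals in relations:
            f.write(pack_record(line_no, ideals))

def read_rt(path):
    """Yield (line_no, ideals) records in file order."""
    with open(path, "rb") as f:
        while True:
            head = f.read(HEADER.size)
            if not head:
                return
            line_no, n = HEADER.unpack(head)
            yield line_no, list(struct.unpack(f"<{n}I", f.read(4 * n)))

def scan_once(old_path, new_path, freq):
    """One sequential pass; returns the number of relations deleted."""
    deleted = 0
    with open(new_path, "wb") as out:
        for line_no, ideals in read_rt(old_path):
            if any(freq[i] == 1 for i in ideals):
                for i in ideals:
                    freq[i] -= 1
                deleted += 1
            else:
                out.write(pack_record(line_no, ideals))
    return deleted

def prune(path, freq):
    """Alternate between two files until a scan deletes nothing."""
    src, dst = path, path + ".new"
    while scan_once(src, dst, freq):
        src, dst = dst, src
    return dst  # the last file written holds the surviving relations
```

Only the frequency table has to stay in memory; the relation records are touched strictly in order, which suits a disk file.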
Revised in-memory sizes
Duplication: 500 Mbyte (while reading inputs).
Factor base: 1500 Mbyte (while reading inputs and checking for free relations).
Frequencies: 100 Mbyte (while repeatedly scanning RT).
Still using 2000 Mbyte at peak (duplication plus factor base while reading inputs), 33% above the 1500 Mbyte goal.
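The peak figure follows from which tables are alive in which phase. The short sketch below tabulates that; the phase names and dictionary layout are illustrative, while the sizes and lifetimes are those listed on the slide.

```python
MBYTE = 10**6
GOAL = 1500 * MBYTE

SIZES = {
    "duplication": 500 * MBYTE,
    "factor_base": 1500 * MBYTE,
    "frequencies": 100 * MBYTE,
}

# Tables that must be resident during each phase.
PHASES = {
    "read inputs": ["duplication", "factor_base"],
    "check for free relations": ["factor_base"],
    "scan RT for singletons": ["frequencies"],
}

usage = {phase: sum(SIZES[t] for t in tables) for phase, tables in PHASES.items()}
peak_phase = max(usage, key=usage.get)
peak = usage[peak_phase]
percent_over_goal = 100 * (peak - GOAL) // GOAL

print(peak_phase, peak // MBYTE, "Mbyte,", percent_over_goal, "% over goal")
```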
Replacing factor base table by functions
While reading inputs, hash each ideal to a 64-bit value hid. Allow 64-bit p.
On-disk RT will store hid, not index.
Enlarge frequencies table to 500M entries.
On each scan of RT, use a mapping unique to that scan from hid to a subscript in [0, 500M − 1].
Frequencies and duplication tables are not needed at the same time.
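One way to realize the two functions is sketched below: a 64-bit hash of the ideal, and a per-scan mapping from that hash to a table subscript. The use of BLAKE2b, the inclusion of a `side` field, the multiplier constants, and the multiply-and-shift range reduction are all assumptions chosen for illustration; the slides do not say which functions were used.

```python
import hashlib

TABLE_ENTRIES = 500 * 10**6
MASK64 = 2**64 - 1

def hid(p, r, side):
    """64-bit hash of the ideal (p, a/b mod p) on one side (assumed encoding)."""
    data = f"{side}:{p}:{r}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

# One odd 64-bit multiplier per scan; the values are arbitrary illustrations.
MULTIPLIERS = [0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F, 0x165667B19E3779F9]

def subscript(h, scan, entries=TABLE_ENTRIES):
    """Map hid h to [0, entries - 1] using the function chosen for this scan."""
    mixed = (h * MULTIPLIERS[scan % len(MULTIPLIERS)]) & MASK64
    mixed ^= mixed >> 32
    # mixed < 2**64, so the scaled value is always below entries.
    return (mixed * entries) >> 64
```

Changing the mapping between scans means that two ideals sharing a subscript on one scan are unlikely to share it on the next, so a singleton hidden by a collision gets further chances to be detected.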
Good points
Table sizes reduced to 500 Mbyte, one third of our goal.
Primary cause of false deletions is two relations which hash to the same h2 and to nearby h1, so they look like duplicates.
Primary cause of false retentions is a singleton ideal whose subscript, under every hid subscript mapping used, coincides with that of some other ideal, so its count never drops to 1.
Potential trouble spots
Many cache (and TLB?) misses.
Disk I/O will slow scanning, so perhaps do only 5-10 scans.
Free relations won't be found.
Without an injective mapping from ideal to subscript, it seems hard to accurately count distinct ideals in the input and output files (useful summary statistics).
Larger data sets with 1.5 Gbyte?
Duplication table can store the first 300M distinct relations, until 80% full.
Frequencies can saturate at 3.
A 0.75 Gbyte array holds 3000M two-bit entries, perhaps 1000M ideals with the table 33% full.
One such array checks for singletons with the current hid subscript function while another is initialized for the next function.
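A frequencies array with two-bit entries that saturate at 3 could be packed four to a byte, as in this sketch. The class name `TwoBitCounters` and its methods are hypothetical; only the entry width and the saturation value come from the slide.

```python
class TwoBitCounters:
    """Packed array of counters that saturate at 3 (four counters per byte)."""

    def __init__(self, entries):
        self.entries = entries
        self.data = bytearray((entries + 3) // 4)

    def get(self, i):
        return (self.data[i >> 2] >> ((i & 3) * 2)) & 3

    def increment(self, i):
        # Below 3 the two-bit field cannot overflow into its neighbour.
        if self.get(i) < 3:
            self.data[i >> 2] += 1 << ((i & 3) * 2)

    def decrement(self, i):
        c = self.get(i)
        # A count of 3 means "3 or more" and is no longer exact, so keep it.
        if 0 < c < 3:
            self.data[i >> 2] -= 1 << ((i & 3) * 2)
```

With 3000M entries the backing array is 750M bytes, matching the 0.75 Gbyte figure, and two such arrays fit within the 1.5 Gbyte budget.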