Computational Statistics with Application to Bioinformatics

Slides:



Advertisements
Similar presentations
Generating Random Numbers
Advertisements

Unit 0, Pre-Course Math Review Session 0.2 More About Numbers
Chapter 8 – Introduction to Number Theory. Prime Numbers prime numbers only have divisors of 1 and self –they cannot be written as a product of other.
Random number generation Algorithms and Transforms to Univariate Distributions.
Prof. Ramin Zabih (CS) Prof. Ashish Raj (Radiology) CS5540: Computational Techniques for Analyzing Clinical Data.
Random Number Generators. Why do we need random variables? random components in simulation → need for a method which generates numbers that are random.
Probability Densities
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
Statistics.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Computer ArchitectureFall 2008 © August 25, CS 447 – Computer Architecture Lecture 3 Computer Arithmetic (1)
Chapter 8 – Introduction to Number Theory Prime Numbers
Chapter 8 – Introduction to Number Theory Prime Numbers  prime numbers only have divisors of 1 and self they cannot be written as a product of other numbers.
Random Number Generation Pseudo-random number Generating Discrete R.V. Generating Continuous R.V.
ETM 607 – Random Number and Random Variates
MATH 224 – Discrete Mathematics
Probability theory 2 Tron Anders Moger September 13th 2006.
Analysis of Algorithms
CPSC 531: RN Generation1 CPSC 531:Random-Number Generation Instructor: Anirban Mahanti Office: ICT Class Location:
Chapter 7 Random-Number Generation
Basic Concepts in Number Theory Background for Random Number Generation 1.For any pair of integers n and m, m  0, there exists a unique pair of integers.
Random Number Generators 1. Random number generation is a method of producing a sequence of numbers that lack any discernible pattern. Random Number Generators.
Monte Carlo Methods.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
Chapter 10 Hashing. The search time of each algorithm depend on the number n of elements of the collection S of the data. A searching technique called.
Validating a Random Number Generator Based on: A Test of Randomness Based on the Consecutive Distance Between Random Number Pairs By: Matthew J. Duggan,
More on Efficiency Issues. Greatest Common Divisor given two numbers n and m find their greatest common divisor possible approach –find the common primes.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Modular (Remainder) Arithmetic n = qk + r (for some k; r < k) eg 37 = (2)(17) + 3 Divisibility notation: 17 | n mod k = r 37 mod 17 = 3.
4.5 Inverse of a Square Matrix
DATA & COMPUTER SECURITY (CSNB414) MODULE 3 MODERN SYMMETRIC ENCRYPTION.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Chapter 6: Continuous Probability Distributions A visual comparison.
Kernel Regression Prof. Bennett Math Model of Learning and Discovery 1/28/05 Based on Chapter 2 of Shawe-Taylor and Cristianini.
Chapter 6: Continuous Probability Distributions A visual comparison.
1.  How does the computer generate observations from various distributions specified after input analysis?  There are two main components to the generation.
Biostatistics Class 3 Probability Distributions 2/15/2000.
Page : 1 bfolieq.drw Technical University of Braunschweig IDA: Institute of Computer and Network Engineering  W. Adi 2011 Lecture-5 Mathematical Background:
Kernel Regression Prof. Bennett
Linear Algebra Review.
Chapter 7. Classification and Prediction
CSE565: Computer Security Lecture 7 Number Theory Concepts
Hexadecimal, Signed Numbers, and Arbitrariness
Great Theoretical Ideas in Computer Science
Normal Distribution.
Analysis of Algorithms
Introduction to Number Theory
CS1371 Introduction to Computing for Engineers
Lecture 2. Deviates from Other Distributions
Random Number Generators
Chapter 10: Solving Linear Systems of Equations
CSCE 411 Design and Analysis of Algorithms
LECTURE 05: THRESHOLD DECODING
Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Parallel Quadratic Sieve
Chapter 7 Random Number Generation
Stream Ciphers Day 18.
CS 475/575 Slide set 4 M. Overstreet Old Dominion University
Chapter 7 Random-Number Generation
LECTURE 05: THRESHOLD DECODING
Validating a Random Number Generator
Factoring RSA Moduli: Current State of the Art J
LECTURE 05: THRESHOLD DECODING
#22 Uncertainty of Derived Parameters
Random Number Generation
Mathematical Background for Cryptography
What we learn with pleasure we never forget. Alfred Mercier
Biologically Inspired Computing: Operators for Evolutionary Algorithms
Generating Random Variates
Mathematical Background: Extension Finite Fields
Presentation transcript:

Computational Statistics with Application to Bioinformatics Prof. William H. Press Spring Term, 2008 The University of Texas at Austin Unit 5: More on Random Deviate Generation The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Unit 5: More on Random Deviate Generation (Summary) Derive the theory behind Xorshift random generators matrix representation large matrix powers by successive squaring Look at two good empirical tests of randomness GCD test Gorilla test Learn several methods for generating random deviates from arbitrary distributions transformation method rejection method ratio-of-uniforms method (“teardrops”) especially when combined with squeezes, e.g. Leva’s Normal algorithm The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Theory of Xorshift generators General design philosophy: Full period 2N or 2N-1 Bit mixing on every bit Then pass a bunch of empirical tests Represent a word as a vector of N bits. Then the elementary operations are matrix multiplies, mod 2. 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 S-2 = 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 S+4 = So a generator is M = Sk3 Sk2 Sk1 The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

The biggest period we can hope for is 2N-1, since the zero vector is its own cycle of length 1. Test for this “full period” by M 2 N ¡ 1 = M ( 2 N ¡ 1 ) = f 6 for every prime factor f of 2N-1 (Why just these?) These huge powers of M are actually easily done by successive squaring M 2 ; 4 8 : and then assembling from the binary representation of the desired power. (Show an example.) The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Now, finally, just a brute-force loop over all the possible values k1, k2, k3. Select full period ones. Do empirical tests. (Can use this same method to find full-period linear shift register taps = primitive polynomials mod 2.) The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Bit mixing for an actually good (N=64) Xorshift: [i,j] = ndgrid(1:n,1:n); S = @(k) eye(n,n) + (i-j == k) spy(S(17)*S(-31)*S(8)) The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

k » N o r m a l ( ¹ ; ¾ ) ¼ : 8 4 n + 6 5 3 1 P ( j ) = 6 ¼ Examples of powerful empirical tests of randomness (Marsaglia and Tsang, 2002) The gcd test: Compute the gcd of successive random pairs Euclid’s algorithm for gcd: 366 = 1*297 + 69 297 = 4*69 + 21 69 = 3*21 + 6 21 = 3*6 + 3 6 = 2*3 + 0 k=5 j=3 empirical result (what you actually do is measure using a known-good generator) k » N o r m a l ( ¹ ; ¾ 2 ) ¼ : 8 4 n + 6 5 3 1 P ( j ) = 6 ¼ 2 theoretically calculable count in ~100 bins each for j and k chi-square test the results The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

P ( ) = e M » B i n o m a l ( 2 ; e ) ¹ = ¾ p The gorilla test: Example: N=4 on a 12-bit generator, testing position 10 The gorilla test: 010010100101 101001010001 011010101110 111010111010 101000010101 010101001010 010100100101 001001000010 101010101010 101010010101 010010001001 000001111100 101010100101 010100101010 010000100101 011110100101 From a random sequence of 2N words, form the sequence of 2N bits from some particular bit position. The number of times each unique N-bit word should appear somewhere in this sequence is ~Poisson(1) So and the number of words that don’t appear is these bits add 1 to bin 4 P ( ) = e ¡ 1 M » B i n o m a l ( 2 N ; e ¡ 1 ) ¹ = ¾ p this is actually not quite right. Computer scientists can you see why? (Think Boyer-Moore!) (a realistic N might be 20 to 30, 106 to 109 bins) Editorial: In the “old days” it was good enough to convert the N bit integer to a float value between 0 and 1, then test randomness of the resulting “real” values (in various ways). Nowadays, people depend on genuine bitwise randomness, so modern tests look much more number-theoretical, even when their basis is still empirical. The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Random deviates from other univariate distributions There are lots of nice algorithms specific to common distributions, but there are also some fairly general methods: You must be able to calculate the inverse of the cdf of your distribution! Transformation method: The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Rejection method: You must have a bounding function for which you are able to calculate the inverse of the cdf! The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Ratio of Uniforms Method (some of the best features of both xformation and rejection) a line of constant x=v/u The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

jac = jacobian([v/u u^2],[u v]) jac = [ -v/u^2, 1/u] [ 2*u, 0] syms u v; jac = jacobian([v/u u^2],[u v]) jac = [ -v/u^2, 1/u] [ 2*u, 0] abs(det(jac)) ans = 2 The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

Ratio of Uniforms is particularly powerful when combined with squeezes outer squeeze For this particular example the desired distribution is integer valued (binomial deviates), hence the staircases. For a continuous distribution, there would just be a smooth curve between the squeezes (which would typically be too close together to see clearly). inner squeeze you only compute p(v/u) when you are between the squeezes! The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

e.g., Leva’s algorithm for normal deviates: here, only ~1% of the (u,v) area is between the squeezes, requiring the calculation of the log The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press