Back to home

Generating All 32-Bit Primes (Part I)

This article documents my quest to write a C program targeting Linux that generates all prime numbers that fit in a 32-bit unsigned int (uint32_t) as quickly as possible.

In particular, the program should write all 32-bit primes to a file (which, in all my implementations, is called PRIMES) in binary, so that every 4 bytes of the file stores one primes number, with the bytes ordered in a little-endian fashion. In hex, the file should start out: 02 00 00 00 03 00 00 00 05 00 00 00 07 00 00 00, and the correct SHA-256 hash for the file turns out to be: 272eb05aa040ba1cf37d94717998cbbae53cd669093c9fa4eb8a584295156e15.

The algorithm used should able to correctly generate primes up to an arbitrarily large limit (within reasonable hardware constraints). It should not assume the primality of any number larger than 2 without first verifying it. It should not use a huge amount of memory—1GB should be plenty.

Trial Division

The simplest and most obvious way to check a number's primality is trial division. Given a target integer n, where n \ge 3, we check if n is divisible by each prime number less than or equal to its square root. If and only if it is not divisible by any of them, we can conclude that n is prime. In C we could implement trial division like this:


/// Returns `true` iff `n` is a prime number.
///
/// `primes`: an array of prime numbers in order, starting from 2, skipping
/// none, and going at least up to the square root of `n`.
bool is_prime(uint32_t n, uint32_t* primes) {
    for (size_t i=0; ; i++) {
        uint32_t p = primes[i];

        if (n % p == 0) {
            return false;
        }

        // Comparison against 0xFFFF prevents overflow in `p*p`
        if (p >= 0xFFFF || p*p >= n) return true;
    }
}
    

In the worst case, carrying out this algorithm requires \pi(\sqrt n) divisions, where \pi is the prime-counting function. (That is, \pi(x) is the number of primes less than or equal to n). Since \pi(x) is asymptotically equivalent to \frac{x}{\ln x} [1], the trial division algorithm runs in O(\sqrt {n} / \ln \sqrt n) = O(\sqrt{n} / \ln n) time.

Trial division can be used to generate all the prime numbers up to a limit N \ge 3 as follows: we create a growing list of the primes in order, which is initialized to {2}. Then we go through all the integers from 3 to N, checking the primality of each one and adding it to the end of the list if it is found to be prime. In C, for N=2^32-1:


uint32_t* primes = (uint32_t*) malloc(203739216 * sizeof(uint32_t));

primes[0] = 2;
size_t p_idx = 1; // index of the next element to be added to `primes`

for (uint32_t n=3; n<=0xFFFFFFFF; n+=2) {
    if ( is_prime(n, primes) ) {
        primes[p_idx] = n;
        p_idx++;
    }

    // Prevents overflow in `n+=2`
    if (n == 0xFFFFFFFF) break;
}
[2]
    

This algorithm calls is_prime O(N) times, so its time complexity in the worst case is within O(N \sqrt N / \ln N).

The array primes can be written to a file as follows:


FILE* prime_file = fopen("PRIMES", "wb");
fwrite(primes, 4, p_idx, prime_file);
    

Now we have everything we need to write a full program meeting the specifications laid out in this article's introduction. The implementation found here runs in about 24m20s of user time on my system.[3]

Wheel Factorization

Some numbers are obviously not prime, and we are wasting our time by even checking. For instance, all even numbers greater than 2 (or, in other words, all integers greater than 2 and congruent to 0 modulo 2) are clearly not prime. The implementation of the sequential trial division algorithm already takes advantage of this fact, incrementing the for loop with n+=2, instead of n++, to skip all even numbers (n is always greater than 2).

A well-known result is the fact that all primes greater than 3 are congruent to either 1 or 5 mod 6. This is easy to prove by cases: any number congruent to 0, 2, or 4 mod 6 is divisible by 2, and any such number greater than 2 is therefore composite; and any number congruent to 3 mod 6 is divisible by 3, and any such number greater than 3 is therefore composite. Less well known is the fact that all primes greater than 5 are congruent to 1, 7, 11, 13, 17, 19, 23, or 29 mod 30, but this can be proved by cases in a very similar fashion: 30 is 2×3×5, so it is easy to eliminate all the possible remainders mod 30 which imply divisibility by 2, 3, or 5.

Let's generalize. Suppose we know the first k primes, which we shall call p_1, p_2, p_3, \cdots, p_k. Let k\# be the primorial of k. That is, the product of the first k primes, or p_1 \times p_2 \times p_3 \times \cdots \times p_k. Now, take the sequence 0, 1, 2, 3, \cdots, (k\#-1), and delete any numbers divisible by any of p_1, p_2, p_3, \cdots, p_k. All prime numbers greater than p_k will be congruent modulo k\# to one of the numbers still remaining in the sequence.

Here is a diagram of a 'wheel' of size 3\# = 30. The numbers in each 'spoke' form a congruence class modulo 30. The darkened spokes contain exactly those numbers which are divisible by 2, 3, or 5, or, in other words, are not coprime to 30. Clearly, no primes greater than 5 will be found in the darkened spokes. The white spokes contain all natural numbers coprime to 30, and may be called the coprime spokes.

[4]

Note that the properties 'not divisible by any of the first k primes', 'coprime to k\#', and 'found on a coprime spoke of the wheel of size k\#' are all equivalent.

We can represent a wheel in C with the following struct:


struct Wheel {
    size_t size;

    /// Numbers with these remainders modulo `size` are coprime to `size`.
    uint32_t* coprime_spokes;
    /// The number of elements in `spokes`
    size_t n_spokes;
};
    

And initialize it as follows:


/// Initialize a `struct Wheel` using the first `k` prime numbers.
///
/// `primes`: an array of size at least `k` whose first `k` elements are the
///           first `k` primes in order
struct Wheel wheel_init(uint32_t* primes, size_t k) {
    struct Wheel wheel;

    wheel.size = 1;
    for (size_t i=0; i<k; i++)
        wheel.size *= primes[i];

    // All the remainders mod `wheel.size` which imply divisibility by one of
    // the first `k` primes
    uint32_t *coprime_spokes = malloc( wheel.size * sizeof(uint32_t) );
    wheel.n_spokes = 0; // index of the next element to be added to
                        // `coprime_spokes`

    for (size_t i=0; i<wheel.size; i++) {
        for (size_t j=0; j<k; j++) {
            if (i % primes[j] == 0) goto continue_loop;
        }

        coprime_spokes[wheel.n_spokes] = i;
        wheel.n_spokes ++;

continue_loop: ;
    }

    wheel.coprime_spokes = malloc( wheel.n_spokes * sizeof(uint32_t) );
    memcpy(wheel.coprime_spokes, coprime_spokes, wheel.n_spokes * sizeof(uint32_t));
    free(coprime_spokes);

    return wheel;
}
    

To prevent integer overflows, we need to know the largest uint32_t that is coprime to the size of a struct Wheel:


/// Returns the largest `uint32_t` coprime to `wheel.size`
uint32_t wheel_last(struct Wheel wheel) {
    for (uint32_t l = 0xFFFFFFFF;; l--) {
        for(size_t i=0; i<wheel.n_spokes; i++) {
            if ( (l - wheel.coprime_spokes[i]) % wheel.size == 0 )
                return l;
        }
    }
}
    

We can loop through the numbers on the coprime spokes of struct Wheel wheel as follows:


uint32_t last = wheel_last(wheel);
uint32_t turn = 0;
while (true) {
    for (size_t i=0; i<wheel.n_spokes; i++) {
        // `n` is on a coprime spoke of `wheel`, and every number on a coprime
        // spoke of `wheel` will appear as `n` in this loop.
        uint32_t n = wheel.size * turn + wheel.coprime_spokes[i];

        // ...

        if (n == last)
            goto exit_loop;
    }

    turn++;
}
exit_loop: // ...
    

Adding wheel factorization to the sequential trial division algorithm is quite simple. To start, we initialize our growing list of primes to {2}, and use ordinary trial division to extend it until it contains the first k primes. We use these primes to create the wheel of size k\#. For the larger primes, we use the same trial division algorithm before, but only checking numbers which are found on a coprime spike of the wheel. The time complexity of this algorithm is no different from that of the ordinary sequential trial division algorithm, since O(N) numbers are found on the coprime spokes.

The implementation here, which uses a wheel based on the first 5 primes, runs about 23m30s of user time, barely an improvement over not using the wheel. Though the wheel requires us to consider only about 20% of natural numbers as potential primes[5], the ~80% of numbers we avoid checking are the ones that are the fastest to check using trial division anyway, since they are all divisible by very small primes.

The Sieve of Eratosthenes

The previous methods have been based on individually checking the primality of candidate numbers, one by one. Prime sieves, by contrast, work by eliminating composite numbers from a list of natural numbers. The sieve of Eratosthenes operates on the ordered list of integers from 1 to N, where each number on the list can either be crossed out or not crossed out. Crossing out a number multiple times is the same as doing it once. The sieve is based on an operation similar to one used to create the wheel: Given a number n, we cross out all the multiples of n in the list, besides n itself. That is, we cross out 2n, 3n, 4n, \cdots up to the largest cn \le N. We shall call this operation n-sifting.[6]

We start with the ordered list of integers from 1 to N. The number 1 is initially crossed out, but the rest of the list is initially not crossed out. To sift out all composite numbers, we do the following:

  1. Initialize the variable p to the value 2.
  2. Perform p-sifting on the list to remove all composite numbers divisible by p.
  3. Change the value of p to the value of the next number in the list that is not crossed out.
  4. Repeat steps 2 and 3 until p \gt \sqrt N.

The numbers on the list that remain not crossed out after this algorithm terminates, including, but not limited to, all the numbers that served as the value of p at some point, are exactly the prime numbers less than or equal to N.

An n-sifting operation performs about N / n crossing-out operations, and the algorithm performs p-sifting for all primes p \le \sqrt N. In total, therefore, the algorithm performs \sum_{p \le \sqrt N} {N \over p} crossing-out operations. Since \sum_{p}^{M} {1 \over p} is O(\log \log M), the time complexity of the sieve of Eratosthenes is O(N \log \log \sqrt N) = O( N \log \log N), under the assumption that a crossing-out operation is O(1).

The list that the sieve operates on can be implemented with an array of booleans. The crossed-outness of the number i is represented by the truth value of the ith element of the array. We require an array of size N=2^32-1, so if a boolean takes up 1 byte, the entire array will take up an absurd 4.3GB. The space needed can be reduced by a factor of 8, down to about 537MB, by using a bit array, which packs 8 booleans into each byte. The implementation of the bit array is besides the point of this article, but we will have the following at our disposal:

bitarray_init runs in O(N), while bitarray_get and bitarray_set are O(1).

With these operations, we can implement the sieve of Eratosthenes as follows:


// We are assuming, pretty justifiably, that `size_t` is at least 32 bits.
// `list[i] == true` means the number `i` is crossed off (marked as not
// prime).
struct BitArray list = bitarray_init( (size_t) 0xFFFFFFFF );

// 0 and 1 are not prime
bitarray_set(&list, 0, true);
bitarray_set(&list, 1, true);

for (uint32_t p = 2;
     // comparing against `0xFFFF` prevents overflow in `p*p`
     p <= 0xFFFF && p*p <= 0xFFFFFFFF;
     p++
) {
    // skip over `p` if it has already been marked composite
    if (bitarray_get(list, p))
        continue;

    // cross off all the multiples of `p` greater than `p`
    for (size_t i=2; i <= 0xFFFFFFFF / p; i++) {
        bitarray_set(&list, i*p, true);
    }
}
[7]
    

This leaves us with a boolean list with all non-primes crossed out, but to complete the program we have to collect all the prime numbers as uint32_ts and put them in a big array. This can done by simply looping through the entire list:


uint32_t* primes = malloc(203739216 * sizeof(uint32_t));
size_t p_idx = 0;

for (size_t n=2; n<=0xFFFFFFFF; n++) {
    if ( !bitarray_get(list, n) ) {
        primes[p_idx] = (uint32_t) n;
        p_idx ++;
    }
}
    

The array primes can be written to the file PRIMES as before, completing the program. The sieve of Eratosthenes is much faster than the previous two approaches. The implementation here runs in about 32s of user time, over 40 times faster than sequential trial division with wheel factorization. During the time in which they both exist, the 537MB bit array and the ~815MB primes array take up a collective ~1.3GB, which is not ideal, but not entirely unreasonable.

Conclusion

There is a long way to go from here. Kim Walisch's primesieve can generate all 32-bit primes in 0.061s (though this is without writing them to a file), so there are still orders of magnitude to be shaved off. All the C implementations referenced in this article, along with the checksum file for PRIMES, can be found zipped up here or on GitHub here.

This table summarizes the performance of the three algorithms explored in this article:

Algorithm Time Complexity (worst case) Runtime for N=2^32-1:
Trial Division O(N \sqrt N / \ln N) ~24m20s
Trial Division w/ Wheel Factorization O(N \sqrt N / \ln N) ~23m30s (k = 5)
Sieve of Eratosthenes O(N\ln \ln N) ~0m32s

Notes

back [1]: That is, \lim_{x \to \infty} {\pi(x) \over {x/lnx}} = 1 (Prime Number Theorem).

back [2]: We allocate space for 203,739,216 prime numbers, though \pi(2^{32}-1), is actually 203,280,221. This is because I consider using the actual exact value to be cheating, as it requires having already computed many prime numbers. The number 203,739,216 was obtained using the bound \pi(x) < \frac{x}{\ln x - 1.1}, proven by Dusart in "Estimates of Some Functions Over Primes Without R.H.", 2010.

back [3]: All these tests were done on my Framework Laptop 13, which has the following specs:

back [4]: This representation of the wheel is based on the numbers 1 through 30, though the discussion and code in the rest of the article would imply a wheel based on 0 through 29. This doesn't really make a difference, as 0 and 30 are obviously congruent modulo 30. the 30, 60, 90 ... spoke could be replaced by 0, 30, 60, ... without changing anything else.

back [5]: Here is a somewhat informal argument: Obviously, a wheel of size W has \varphi(W) coprime spokes out of W total spokes, where \varphi is Euler's totient function. Therefore, the proportion of the natural numbers found on a coprime spoke is \varphi(W)/W. The wheel based on the first five primes has W = 2 \cdot 3 \cdot 5 \cdot 7 \cdot 11 = 2310, and \varphi(2310) = 480, so the proportion of the natural numbers that lie on the coprimes spokes of this this wheel is 2310/480 = 16/77 \approx 0.2078.

back [6]: This is not a standard term in the literature, but one I just made up.

back [7]: The list here starts at 0 instead of 1. This makes no difference—0 is simply crossed off immediately.