This article documents my quest to write a C program targeting Linux
that generates all prime numbers that fit in a 32-bit unsigned int
(uint32_t) as quickly as possible.
In particular, the program should write all 32-bit primes to a file
(which, in all my implementations, is called PRIMES) in
binary, so that every 4 bytes of the file stores one primes number, with
the bytes ordered in a little-endian fashion. In hex, the file should start
out: 02 00 00 00 03 00 00 00 05 00 00 00 07 00 00 00, and the
correct SHA-256 hash for the file turns out to be:
272eb05aa040ba1cf37d94717998cbbae53cd669093c9fa4eb8a584295156e15.
The algorithm used should able to correctly generate primes up to an arbitrarily large limit (within reasonable hardware constraints). It should not assume the primality of any number larger than 2 without first verifying it. It should not use a huge amount of memory—1GB should be plenty.
The simplest and most obvious way to check a number's primality is trial
division. Given a target integer
,
where
,
we check if
is divisible by each prime number less than or equal to its square root.
If and only if it is not divisible by any of them, we can conclude that
is prime. In C we could implement trial division like this:
/// Returns `true` iff `n` is a prime number.
///
/// `primes`: an array of prime numbers in order, starting from 2, skipping
/// none, and going at least up to the square root of `n`.
bool is_prime(uint32_t n, uint32_t* primes) {
for (size_t i=0; ; i++) {
uint32_t p = primes[i];
if (n % p == 0) {
return false;
}
// Comparison against 0xFFFF prevents overflow in `p*p`
if (p >= 0xFFFF || p*p >= n) return true;
}
}
In the worst case, carrying out this algorithm requires
Trial division can be used to generate all the prime numbers up to a
limit
as follows: we create a growing list of the primes in order, which is
initialized to
{2}. Then we go through all the integers from
to
,
checking the primality of each one and adding it to the end of the list if
it is found to be prime. In C, for
:
uint32_t* primes = (uint32_t*) malloc(203739216 * sizeof(uint32_t));
primes[0] = 2;
size_t p_idx = 1; // index of the next element to be added to `primes`
for (uint32_t n=3; n<=0xFFFFFFFF; n+=2) {
if ( is_prime(n, primes) ) {
primes[p_idx] = n;
p_idx++;
}
// Prevents overflow in `n+=2`
if (n == 0xFFFFFFFF) break;
}
[2]
This algorithm calls is_prime
times, so its time complexity in the worst case is within
.
The array primes can be written to a file as follows:
FILE* prime_file = fopen("PRIMES", "wb");
fwrite(primes, 4, p_idx, prime_file);
Now we have everything we need to write a full program meeting the specifications laid out in this article's introduction. The implementation found here runs in about 24m20s of user time on my system.[3]
Some numbers are obviously not prime, and we are wasting our time by
even checking. For instance, all even numbers greater than 2 (or, in other
words, all integers greater than 2 and congruent to 0 modulo 2) are clearly
not prime. The implementation of the sequential trial division algorithm
already takes advantage of this fact, incrementing the for
loop with n+=2, instead of n++, to skip all even
numbers (n is always greater than 2).
A well-known result is the fact that all primes greater than 3 are congruent to either 1 or 5 mod 6. This is easy to prove by cases: any number congruent to 0, 2, or 4 mod 6 is divisible by 2, and any such number greater than 2 is therefore composite; and any number congruent to 3 mod 6 is divisible by 3, and any such number greater than 3 is therefore composite. Less well known is the fact that all primes greater than 5 are congruent to 1, 7, 11, 13, 17, 19, 23, or 29 mod 30, but this can be proved by cases in a very similar fashion: 30 is 2×3×5, so it is easy to eliminate all the possible remainders mod 30 which imply divisibility by 2, 3, or 5.
Let's generalize. Suppose we know the first
primes, which we shall call
.
Let
be the primorial of
.
That is, the product of the first
primes, or
.
Now, take the sequence
,
and delete any numbers divisible by any of
.
All prime numbers greater than
will be congruent modulo
to one of the numbers still remaining in the sequence.
Here is a diagram of a 'wheel' of size
.
The numbers in each 'spoke' form a congruence class modulo 30. The
darkened spokes contain exactly those numbers which are divisible by 2,
3, or 5, or, in other words, are not coprime to 30. Clearly, no primes
greater than 5 will be found in the darkened spokes. The white spokes
contain all natural numbers coprime to 30, and may be called the
coprime spokes.
Note that the properties 'not divisible by any of the first
primes', 'coprime to
',
and
'found on a coprime spoke of the wheel of size
'
are all equivalent.
We can represent a wheel in C with the following struct:
struct Wheel {
size_t size;
/// Numbers with these remainders modulo `size` are coprime to `size`.
uint32_t* coprime_spokes;
/// The number of elements in `spokes`
size_t n_spokes;
};
And initialize it as follows:
/// Initialize a `struct Wheel` using the first `k` prime numbers.
///
/// `primes`: an array of size at least `k` whose first `k` elements are the
/// first `k` primes in order
struct Wheel wheel_init(uint32_t* primes, size_t k) {
struct Wheel wheel;
wheel.size = 1;
for (size_t i=0; i<k; i++)
wheel.size *= primes[i];
// All the remainders mod `wheel.size` which imply divisibility by one of
// the first `k` primes
uint32_t *coprime_spokes = malloc( wheel.size * sizeof(uint32_t) );
wheel.n_spokes = 0; // index of the next element to be added to
// `coprime_spokes`
for (size_t i=0; i<wheel.size; i++) {
for (size_t j=0; j<k; j++) {
if (i % primes[j] == 0) goto continue_loop;
}
coprime_spokes[wheel.n_spokes] = i;
wheel.n_spokes ++;
continue_loop: ;
}
wheel.coprime_spokes = malloc( wheel.n_spokes * sizeof(uint32_t) );
memcpy(wheel.coprime_spokes, coprime_spokes, wheel.n_spokes * sizeof(uint32_t));
free(coprime_spokes);
return wheel;
}
To prevent integer overflows, we need to know the largest
uint32_t that is coprime to the size of a
struct Wheel:
/// Returns the largest `uint32_t` coprime to `wheel.size`
uint32_t wheel_last(struct Wheel wheel) {
for (uint32_t l = 0xFFFFFFFF;; l--) {
for(size_t i=0; i<wheel.n_spokes; i++) {
if ( (l - wheel.coprime_spokes[i]) % wheel.size == 0 )
return l;
}
}
}
We can loop through the numbers on the coprime spokes of struct
Wheel wheel as follows:
uint32_t last = wheel_last(wheel);
uint32_t turn = 0;
while (true) {
for (size_t i=0; i<wheel.n_spokes; i++) {
// `n` is on a coprime spoke of `wheel`, and every number on a coprime
// spoke of `wheel` will appear as `n` in this loop.
uint32_t n = wheel.size * turn + wheel.coprime_spokes[i];
// ...
if (n == last)
goto exit_loop;
}
turn++;
}
exit_loop: // ...
Adding wheel factorization to the sequential trial division algorithm is
quite simple. To start, we initialize our growing list of primes to
{2}, and use ordinary trial division to extend it until it
contains the first
primes. We use these primes to create the wheel of size
.
For the larger primes, we use the same trial division algorithm before,
but only checking numbers which are found on a coprime spike of the
wheel. The time complexity of this algorithm is no different from that of
the ordinary sequential trial division algorithm, since
numbers are found on the coprime spokes.
The implementation here, which uses a wheel based on the first 5 primes, runs about 23m30s of user time, barely an improvement over not using the wheel. Though the wheel requires us to consider only about 20% of natural numbers as potential primes[5], the ~80% of numbers we avoid checking are the ones that are the fastest to check using trial division anyway, since they are all divisible by very small primes.
The previous methods have been based on individually checking the
primality of candidate numbers, one by one. Prime sieves, by contrast,
work by eliminating composite numbers from a list of natural numbers. The
sieve of Eratosthenes operates on the ordered list of integers from 1 to
,
where each number on the list can either be crossed out or not crossed
out. Crossing out a number multiple times is the same as doing it once.
The sieve is based on an operation similar to one used to create the
wheel: Given a number
,
we cross out all the multiples of
in the list, besides
itself. That is, we cross out
up to the largest
.
We shall call this operation
-sifting.[6]
We start with the ordered list of integers from 1 to
.
The number 1 is initially crossed out, but the rest of the list is
initially not crossed out. To sift out all composite numbers, we do the
following:
The numbers on the list that remain not crossed out after this algorithm
terminates, including, but not limited to, all the numbers that served as
the value of
at some point, are exactly the prime numbers less than or equal to
.
An
-sifting operation performs about
crossing-out operations, and the algorithm performs
-sifting for all primes
.
In total, therefore, the algorithm performs
crossing-out operations. Since
is
,
the time complexity of the sieve of Eratosthenes is
,
under the assumption that a crossing-out operation is
.
The list that the sieve operates on can be implemented with an array of
booleans. The crossed-outness of the number
is represented by the truth value of the
th
element of the array. We require an array of size
,
so if a boolean takes up 1 byte, the entire array will take up an absurd
4.3GB. The space needed can be reduced by a factor of 8, down to about
537MB, by using a bit array, which packs 8 booleans into each byte. The
implementation of the bit array is besides the point of this article, but
we will have the following at our disposal:
struct BitArray, which contains the bit array data
structure. struct BitArray bitarray_init(size_t size), which
creates a struct BitArray that stores size
bits, which are all initialized to 0. bool bitarray_get(struct BitArray ba, size_t i),
which returns true if and only if the ith bit stored
in ba is 1. void bitarray_set(struct BitArray* ba, size_t i, bool value),
which sets the ith bit in ba to
0 if value is false and
1 otherwise. bitarray_init runs in
bitarray_get and bitarray_set are
With these operations, we can implement the sieve of Eratosthenes as follows:
// We are assuming, pretty justifiably, that `size_t` is at least 32 bits.
// `list[i] == true` means the number `i` is crossed off (marked as not
// prime).
struct BitArray list = bitarray_init( (size_t) 0xFFFFFFFF );
// 0 and 1 are not prime
bitarray_set(&list, 0, true);
bitarray_set(&list, 1, true);
for (uint32_t p = 2;
// comparing against `0xFFFF` prevents overflow in `p*p`
p <= 0xFFFF && p*p <= 0xFFFFFFFF;
p++
) {
// skip over `p` if it has already been marked composite
if (bitarray_get(list, p))
continue;
// cross off all the multiples of `p` greater than `p`
for (size_t i=2; i <= 0xFFFFFFFF / p; i++) {
bitarray_set(&list, i*p, true);
}
}
[7]
This leaves us with a boolean list with all non-primes crossed out, but
to complete the program we have to collect all the prime numbers as
uint32_ts and put them in a big array. This can done by simply
looping through the entire list:
uint32_t* primes = malloc(203739216 * sizeof(uint32_t));
size_t p_idx = 0;
for (size_t n=2; n<=0xFFFFFFFF; n++) {
if ( !bitarray_get(list, n) ) {
primes[p_idx] = (uint32_t) n;
p_idx ++;
}
}
The array primes can be written to the file
PRIMES as before, completing the program. The sieve of
Eratosthenes is much faster than the previous two approaches. The
implementation here runs in
about 32s of user time, over 40 times faster than sequential trial division
with wheel factorization. During the time in which they both exist, the
537MB bit array and the ~815MB primes array take up a
collective ~1.3GB, which is not ideal, but not entirely unreasonable.
There is a long way to go from here. Kim Walisch's
primesieve
can generate all 32-bit primes in 0.061s (though this is without writing
them to a file), so there are still orders of magnitude to be shaved off.
All the C implementations referenced in this article, along with the
checksum file for PRIMES, can be found zipped up
here or on GitHub
here.
This table summarizes the performance of the three algorithms explored in this article:
| Algorithm | Time Complexity (worst case) | Runtime for
|
|---|---|---|
| Trial Division | ~24m20s | |
| Trial Division w/ Wheel Factorization | ~23m30s ( |
|
| Sieve of Eratosthenes | ~0m32s |
back
[1]: That is,
(Prime Number Theorem).
back
[2]: We allocate space for 203,739,216 prime numbers, though
,
is actually 203,280,221. This is because I consider using the actual exact
value to be cheating, as it requires having already computed many prime
numbers. The number 203,739,216 was obtained using the bound
,
proven by Dusart in "Estimates of Some Functions Over Primes Without
R.H.", 2010.
back [3]: All these tests were done on my Framework Laptop 13, which has the following specs:
back [4]: This representation of the wheel is based on the numbers 1 through 30, though the discussion and code in the rest of the article would imply a wheel based on 0 through 29. This doesn't really make a difference, as 0 and 30 are obviously congruent modulo 30. the 30, 60, 90 ... spoke could be replaced by 0, 30, 60, ... without changing anything else.
back
[5]: Here is a somewhat informal argument: Obviously, a wheel of size
has
coprime spokes out of
total spokes, where
is Euler's totient function. Therefore, the proportion of the natural
numbers found on a coprime spoke is
.
The wheel based on the first five primes has
,
and
,
so the proportion of the natural numbers that lie on the coprimes spokes
of this this wheel is
.
back [6]: This is not a standard term in the literature, but one I just made up.
back [7]: The list here starts at 0 instead of 1. This makes no difference—0 is simply crossed off immediately.