Friday, March 3, 2023

Pseudorandom numbers in Java, Part 1: The Background


Computing pioneer John von Neumann, who invented one of the first “arithmetical” computer algorithms for pseudorandom numbers, said “Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.”

What is randomness? And how does it relate to Java’s random number generators? Let’s ask the question in context. What is the probability of your winning the top prize in the next national lottery? Lottery ticket buyers must believe that the game is fair, that the winning numbers are in fact chosen randomly, and that each ticket has the same probability of winning. They wouldn’t risk their money if they thought the contest was rigged.

But what exactly is randomness? This article is for software developers who may or may not be mathematics whizzes. To those who are, my apologies for what will, with high probability, be perceived as a lack of precision.

Just for fun, I asked ChatGPT for a definition, and it replied as follows:

Randomness is the quality or state of being unpredictable or lacking a definite pattern or order. In other words, randomness is the absence of predictability or regularity in events or processes. For example, the toss of a coin is a random event because it is impossible to know with certainty whether it will land on heads or tails. Randomness is a fundamental concept in many areas of science and mathematics, and it is often used to model complex systems and to generate random numbers for various purposes.

That’s pretty good. You should expect that lottery numbers will be random, revealing no pattern of any digits appearing more or less frequently than others over the long haul. Random numbers to be used for security purposes must be extremely hard for an aggressor to guess.

There are several measures for testing random number generators. One obvious measurement is speed, that is, how quickly random numbers can be generated. Sometimes an application needs a lot of random numbers, but for the common case of an application needing only one or a few random numbers, speed is irrelevant and outweighed by the main work of the application.

More important and still easy to measure is the randomness or consistency, that is, how reliably the values that are generated are spread throughout the range of possible values. In other words, the generator should generate with equal probability all the possible outcomes. An easy way to measure this is to run a given generator many times and visually examine the distribution. I did this in Java Cookbook. There, a program (Random5) creates a java.util.Random instance and calls its nextDouble() and nextGaussian() methods 100,000 times each. The program then invokes an R script to plot the values, producing the plots in Figure 1.

Oracle Java Exam, Java Prep, Java Preparation, Java Tutorial and Materials, Oracle Java, Oracle Java

Figure 1. The values of nextDouble() and nextGaussian()

You’d probably intuitively expect the nextDouble() histogram to be relatively flat (meaning all values were picked equally) and the nextGaussian() plot to approximate a bell curve. You can see that both predictions were fulfilled. You can download and run the code yourself to replicate this.

Randomness is not the only measure of goodness for random number generators. Another measure is predictability, that is, how hard it is for an outsider to figure out what value a given call will yield. Low predictability is a requirement for use in cryptography, as explained in John Graham-Cumming’s blog post on why cryptography requires the generation of random numbers.

When it comes to predictability, the period is a measure of how many generations can be requested before the values begin to repeat. If a random number generator repeats values too quickly, an attacker might be able to guess with reasonable accuracy the values used in the encryption.

Random number generators can also be jumpable, meaning that you should be able to skip forward without disrupting the sequence to a point in time that is equivalent to invoking and discarding a number of operations. Similarly, a leapable algorithm is one that can take a huge step without disrupting the sequence.

Randomness has many uses in general computing. For example, in UNIX-like systems, you can create reasonably secure temporary files using the following command:

someprogram > $ mktemp /tmp/scriptXXXXXXXXX

The mktemp utility program uses a random character generator (and characters are just numbers within the range of the ASCII or UTF-8 character set). The mktemp invocation above creates a file with a hard-to-guess name such as /tmp/scriptogWqMcWOygi, and the file is readable and writable only by the owner. Such files provide a reasonably secure way of creating files in a public directory. Even-better systems also randomize the process id numbers given to each program that is run. This makes it harder for an attacker to anticipate the process id number of an upcoming command—and that, in turn, makes it harder to attack the system.

True random numbers versus pseudorandom numbers

Most computers do not have truly random random numbers, just as they don’t have real real numbers; truly random numbers are computationally expensive (and thus time consuming), and generally they aren’t needed unless, for example, you are running a lottery.

Instead, many applications leverage faster, more efficient software that can generate pseudorandom numbers using a pseudorandom number generator (PRNG).

PRNGs use an algorithm that has a starting value, or seed, and a permutation. The permutation part of a commonly used linear congruential generator (LCG) algorithm is based on the following formula:

Xn+1 = (a * Xn + c) mod m

In that equation, X0 is the seed, and each value is based on the previous value multiplied by a constant a; that result is added to another constant called c; and the result of that taken modulo a third constant, m. What if Xn gets large or even overflows? No problem. Whatever’s left will still result in a nice pseudorandom value. And, in practice, many implementations using the LCG algorithm also use bit masking on subsequent calls to remove some of the bits that are not changing on subsequent calls.

The LCG algorithm is comparable to hashCode(), which generates a seemingly random number, with the current state of the given object being the seed. You’ve probably seen a typical IDE-generated hashCode method, such as the one shown in Listing 1.

Listing 1. Datum class with IDE-generated hashCode

public class Datum {
    long id;
    String name;
    int yearJoined;

    @Override
    public int hashCode() {
        int result = (int) (id ^ (id >>> 32));
        result = 31 * result + (name != null ? name.hashCode() : 0);
        result = 31 * result + yearJoined;
        return result;
    }
    ...
}

In mutable objects, changing the state changes the hashCode() value, permitting you to take the comparison a bit farther, as shown in Listing 2. If you run this program, you’ll see it generates random-looking numbers, but they are not very good ones.

Listing 2. Running hashCode() as a bad random number generator

import java.util.stream.IntStream;

public class DatumHash {
    public static void main(String[] args) {
        Datum d = new Datum(123, "Ian", 1999);
        // Loop by 175 just to get a good range of values
        for (int i = 1; i <= 2020; i+=175) {
            d.setId(2020 % i); // just generate some value here
            d.setYearJoined(i);
            System.out.println(d +" hashCode: "+d.hashCode());
        });
    }
}

It’s Knuth time


A definitive treatment of PRNGs can be found in Donald Knuth’s The Art of Computer Programming, volume 2, chapter 3—particularly section 3.2.1, which devotes 16 pages of mathematical discussion to LCGs. Meanwhile, section 3.2.2 dedicates 12 more pages to other PRNG algorithms.

This material will interest mathematics and computer science majors but may reach over the head of those who are not mathematically oriented. Even those who don’t care should note the following, however. Knuth points out that if you exercise really bad judgment in selecting the values for X0, a, m, and c in the formula shown previously, for example, by setting X0, a, and c all to 7 and setting m to 10, the formula will generate the following series as output:

7 6 9 0 7 6 9 0 7 6 9 0 ...

That series has a repeating period of 4, which is completely useless for most purposes. Let that be a caution to you: Do not implement your own LCG algorithm unless you really know what you’re doing! Instead, use existing generators provided in the JDK. (Note: Choosing better values for the various values takes up quite a bit of Knuth’s discussion, and it remains a problem for developers building PRNGs for production use.)

Another problem for generating useful random numbers is that the seed must come from somewhere. In the rare case where you are testing the random number function itself, you probably want the seed to be fixed (you can feed the seed into the constructor or use the setSeed() call), so that the algorithm will generate the same set of numbers each time you run your test.

However, in production use of a random number function, you want the seed to be…wait for it…random. And that would require the use of yet another random number function, which in turn would require its own random seed from yet another random number function. You can see where this will (or won’t) end.

What about true randomness? Only a process that doesn’t follow a fixed algorithm can generate truly random numbers.

Lava lamps for randomness?


Oracle Java Exam, Java Prep, Java Preparation, Java Tutorial and Materials, Oracle Java, Oracle Java
One of the first computer-based true random number generators used Lavarand, a lava lamp set in front of a web cam. A lava lamp contains two nonmiscible liquids—wax and a clear liquid—in a clear upright container. The heat from an incandescent bulb causes the wax to rise, where it cools and then falls, in an endless cycle that is slightly different every time.

The idea behind this is that the flow of the wax blobs would be reasonably random and couldn’t be guessed by people who didn’t have a camera focused on the lamp—and even then, they’d not know exactly which part of the lamp the camera was focused on. Since the patent on the original Lavarand has expired, the methodology can be used by anyone and, in fact, a whole wall of lamps is used by internet backbone carrier Cloudflare to secure traffic on the current internet—really.

Most developers probably don’t have lava lamps or another true random number generation system, but they need a random seed value to start. PRNGs tend to use a source of semirandomness such as the low-order bits of the high-precision system clock, which changes every microsecond or nanosecond, making the value almost impossible to guess.

Some operating systems, such as OpenBSD, calculate randomness (termed entropy) from the keyboard, the time, the arrangement of things in the environment, and so on. Such operating systems save some of this randomness to disk when the computer is shut down, so that even when the operating system first boots up, it already has a good source of entropy. Few other operating systems go that far, yet.

Source: oracle.com

Related Posts

0 comments:

Post a Comment