How to accurately determine if a random method is indeed random?
January 30, 2024 8:58 AM

I wanted to do a project where I measured how random the native RNGs are in various programming languages. How can I objectively measure my results? Also, if you run a random number generator say 100 vs 10,000,000 times should the distribution even out the more tests you run?
posted by ascetic to Computers & Internet (14 answers total) 2 users marked this as a favorite
 
There are a couple of libraries for this, including TestU01 and Dieharder, which incorporate a large battery of statistical tests.
posted by jedicus at 9:02 AM on January 30 [6 favorites]


+1 to the above. PractRand is another tool, with source and discussion available.
posted by graphweaver at 9:08 AM on January 30 [2 favorites]


The term you want to search for is entropy analysis. There's a ton of tools and studies out there on the subject. Do you care only about what the output of a single run looks like or whether the generator will spit out the exact same sequence if given the same seed? There are some generators that have fairly good entropy but are cryptographically flawed because if you get a few numbers in sequence, you can predict what the future ones will be.
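
As a purely illustrative sketch of that last point (a toy linear congruential generator with made-up constants, not any particular language's built-in rand()): the histogram of its outputs can look reasonably flat, yet a single observed value lets you predict every later one.

# Toy example: a linear congruential generator (LCG) with arbitrary
# example constants. The outputs look fairly uniform, but anyone who
# knows (or recovers) the constants can predict the entire future
# sequence from a single observed value.
M = 2**31          # modulus (example value)
A = 1103515245     # multiplier (example value)
C = 12345          # increment (example value)

def lcg(state):
    # Returns (next_state, output); the output *is* the full state.
    state = (A * state + C) % M
    return state, state

state = 42
observed = []
for _ in range(5):
    state, value = lcg(state)
    observed.append(value)

# Someone who saw only observed[0] can reproduce the rest exactly:
predicted_state = observed[0]
predictions = []
for _ in range(4):
    predicted_state, value = lcg(predicted_state)
    predictions.append(value)

print(observed[1:] == predictions)  # True: the "random" stream was predictable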

if you run a random number generator say 100 vs 10,000,000 times should the distribution even out the more tests you run?

Yes.

It's easy to do a thought experiment on this. If you get a single random number, you have little idea whether the distribution is even. If you get two, you get a better but still bad idea, four, eight, etc. The larger the data set, the better you can judge how random it is.

Or think about blackjack. Most casinos use multiple decks for their games, partially to make it harder to count cards, but partially to increase the randomness. The odds are stacked in their favor by the rules of the game, so by increasing the entropy, at scale, it increases their odds of making money by removing the outliers in either direction.
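
To put rough numbers on the 100 vs. 10,000,000 question, here is a quick sketch (assuming NumPy is available; the exact figures will vary from run to run):

# Draw integers 0-255 uniformly at two sample sizes and compare how far
# each bin strays from its expected count (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000_000):
    draws = rng.integers(0, 256, size=n)
    counts = np.bincount(draws, minlength=256)
    expected = n / 256
    worst = np.max(np.abs(counts - expected)) / expected
    print(f"n={n:>10,}: worst bin is off by {worst:.1%} of its expected count")
# With 100 draws individual bins are off by well over 100%; with
# 10,000,000 draws the worst bin is off by only a percent or two.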
posted by Candleman at 9:33 AM on January 30 [3 favorites]


But to answer the question in the title of your post, the simplest random functions built into pretty much every language, such as rand() in PHP, are definitely not random in any rigorous sense. They're fine if you want to choose one of seven different colors for the background or something like that, but they are not what would be considered mathematically random. One of the first things I did while learning reverse engineering was taking a Backgammon game that used a terrible built-in RNG and replacing it with a proper PRNG library. Otherwise the sequences were so predictable the game wasn't fun.
posted by Candleman at 9:39 AM on January 30 [3 favorites]


This particular field of pseudo-random number generators is surprisingly important and complicated, seeing as the entire multi-billion-dollar edifice of online shopping and corporate networking relies on the "random" numbers being sufficiently random to be treated as such for practical purposes. For some helpful pictures, see this section about the central limit theorem, where you can see what the sum of the "realizations" (dice rolls) will look like if they are random: the sums will almost certainly approximate a normal distribution. That's a necessary but not sufficient condition, though. You could, pretty easily, write a generator that produces numbers that are random in one sense but quite predictable (repetitive) in another.
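
As a rough illustration of that central-limit picture (assuming NumPy and Matplotlib, in the same spirit as the code further down the thread): sum ten simulated dice many times and the histogram of the sums should come out bell-shaped if the underlying rolls are random.

# Rough illustration of the central limit theorem mentioned above
# (assumes NumPy and Matplotlib are available).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# 100,000 trials, each the sum of 10 six-sided dice
sums = rng.integers(1, 7, size=(100_000, 10)).sum(axis=1)
plt.hist(sums, bins=range(10, 61))
plt.title("Sums of 10 dice: approximately normal if the rolls are random")
plt.show()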

Of course, you won't get exactly the average result, as you note in your question. In the limit, yes (see the law of large numbers), but not for any finite number of realizations. The bad news is that the convergence is slooooowwwwww: the error is (again, on average) inversely proportional to the square root of the number of samples. So say you find a given generator matches the expected distribution to within 1% for a certain number of realizations. You would need 4x as many samples to assess the randomness to 0.5%, and 100x as many to get to even 0.1%. It would take a lot of computer time to detect small amounts of non-randomness.
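
A quick sketch of that square-root convergence (again assuming NumPy; the sample sizes and repeat count here are arbitrary):

# Rough check that the typical error of the sample mean scales like
# 1/sqrt(N): 4x the samples roughly halves it, 100x cuts it ~10x.
import numpy as np

rng = np.random.default_rng(1)
for n in (10_000, 40_000, 1_000_000):
    # average absolute error of the sample mean over 100 repeat runs
    errs = [abs(rng.random(n).mean() - 0.5) for _ in range(100)]
    print(f"n={n:>9,}: typical error {np.mean(errs):.5f}")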
posted by wnissen at 9:40 AM on January 30 [5 favorites]


The standard PRNGs in most languages are well-known to be insufficiently random for cases where it really matters (cryptography, physics simulations).
posted by BungaDunga at 10:24 AM on January 30 [4 favorites]


The big thing is that an RNG can pass one statistical test (say, you measure it as producing a uniform distribution of numbers) and totally fail another (it generates the uniform distribution by counting 1, 2, 3, 4...). So you need to run a whole battery of tests to try to hunt down any correlations between outputs of the algorithm. That's what Dieharder etc. do.
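
A toy demonstration of that point (assuming NumPy and SciPy are available): a counter that just cycles 0..255 has a perfectly flat histogram, so a chi-square uniformity test alone can't distinguish it from a real RNG, but a trivial look at successive differences exposes it instantly. Real batteries like Dieharder, TestU01, and PractRand run many such tests, far more thoroughly than this.

# Counter vs. real generator: both "uniform", only one random
# (assumes NumPy and SciPy).
import numpy as np
from scipy import stats

n = 256 * 1000
counter = np.arange(n) % 256                          # flat but not random
real = np.random.default_rng(2).integers(0, 256, n)

for name, x in (("counter", counter), ("numpy rng", real)):
    chi2_p = stats.chisquare(np.bincount(x, minlength=256)).pvalue
    diffs = np.diff(x) % 256                          # successive differences
    print(f"{name:9s}: uniformity p={chi2_p:.3f}, "
          f"distinct successive differences={len(np.unique(diffs))}")
# The counter passes the uniformity test (p = 1.0) yet every successive
# difference is 1; the real generator shows all 256 possible differences.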
posted by BungaDunga at 10:30 AM on January 30 [3 favorites]


Most casinos use multiple decks for their games, partially to make it harder to count cards, but partially to increase the randomness. The odds are stacked in their favor by the rules of the game, so by increasing the entropy, at scale, it increases their odds of making money by removing the outliers in either direction.

This is wrong, but (surprisingly to me) the conclusion is correct. The casinos don't care about the odds on any one particular hand, because they are dealing with thousands and thousands of hands. So if they were simply tightening (or loosening) the distribution of wins around the same central tendency, they wouldn't care.

However, the house edge (assuming no card counting, and the same rules) increases with additional decks, from 0.16% with one deck to 0.46% with 2 decks, 0.60% with 4, and 0.66% with 8. This is because a blackjack hand is based on a combination of cards, and once you've been dealt a card, the odds of getting one with the same value drops - by more when you have one deck than if you have multiple decks. For example, "If you start your hand with an Ace, 16 of the other 51 cards, or 31.4 percent, are 10 values that will give you a blackjack. If there are six decks in play, 64 of the other 311 cards, or only 30.9 percent, are 10 values."
posted by Mr.Know-it-some at 10:41 AM on January 30 [6 favorites]


As you may have gathered, this is a big subject, with prior art and existing solutions. If you want a high level introduction to the concepts involved, you might find John D. Cook's short chapter on "Testing a Random Number Generator" a useful start.
posted by caek at 10:50 AM on January 30 [5 favorites]


> 64 of the other 311 cards

How did he come up with 64? And how does he get 30.9% from 64/311???
posted by mpark at 4:55 PM on January 30


How did he come up with 64? And how does he get 30.9% from 64/311???
I don't know the answer to your first question, but your second question points to the fix. It should be 96/311=30.9%
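
(For anyone checking the arithmetic: a six-deck shoe has 6 x 16 = 96 ten-value cards, and 6 x 52 - 1 = 311 cards remain unseen once you hold an Ace.)

# Quick arithmetic check of the quoted figures
print(f"single deck: 16/51  = {16 / 51:.1%}")    # 31.4%
print(f"six decks:   96/311 = {96 / 311:.1%}")   # 30.9%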
posted by agentofselection at 5:54 PM on January 30 [2 favorites]


A statistics 101 answer is autocorrelation.
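
A minimal sketch of what that looks like in code (assuming NumPy; the "patterned" stream is just a toy example): the lag-1 autocorrelation of a decent generator's output should be close to zero, while an obviously patterned stream's is not.

# Minimal lag-1 autocorrelation check (assumes NumPy).
import numpy as np

def lag1_autocorr(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

rng = np.random.default_rng(3)
good = rng.random(100_000)
patterned = np.tile([0.1, 0.9], 50_000)   # alternating values

print(f"numpy rng: {lag1_autocorr(good):+.4f}")       # close to 0
print(f"patterned: {lag1_autocorr(patterned):+.4f}")  # close to -1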
posted by SemiSalt at 5:07 AM on January 31


This is because a blackjack hand is based on a combination of cards, and once you've been dealt a card, the odds of getting one with the same value drops - by more when you have one deck than if you have multiple decks. For example, "If you start your hand with an Ace, 16 of the other 51 cards, or 31.4 percent, are 10 values that will give you a blackjack. If there are six decks in play, 64 of the other 311 cards, or only 30.9 percent, are 10 values."

If only there were a term for variance of randomness.
posted by Candleman at 1:56 AM on February 1


I've been thinking about this; it's a fun problem. Let's say we wanted to build up some visual intuition about how various generators perform. Let's assume for simplicity that we want to generate random integers between 0 and 255. Here's some code that samples the standard random number generator from NumPy, a Python library of many mathematical functions, and let's say we also write the world's worst random number generator. Run the code and look at how, in one sense, our bad number generator is more smoothly distributed, but also very much not random. What happens if you change the modulo operation to a prime number? A large (>255) prime number? What if you square the seed before running the modulo? Lots of fun to be had here if you are painfully nerdy. Requires a graphics-capable Python install; try Anaconda.
#!/usr/bin/env python3

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

# Change the number of samples and the range
N_points = 10000
n_bins = 2**8

# Create a pseudo-random number generator with a fixed seed, for reproducibility
rng = np.random.default_rng(19680801)

# Use NumPy's pretty-good (though not for cryptography) generator
# to make a uniform distribution
dist1 = rng.integers(low=0, high=n_bins, size=N_points)

# Make a terrible version
seed = 0
def my_distribution():
    global seed
    random_number = seed
    seed = (seed + 1) % n_bins
    return random_number

# Let 'er rip!
dist2 = np.array([my_distribution() for ii in range(N_points)])

# Plot side by side
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True, figsize=(20, 10))

axs[0].hist(dist1, bins=n_bins)
axs[0].xaxis.set_major_locator(ticker.MultipleLocator(10))
axs[0].xaxis.set_minor_locator(ticker.MultipleLocator(1))
axs[1].hist(dist2, bins=n_bins)
axs[1].xaxis.set_major_locator(ticker.MultipleLocator(10))
axs[1].xaxis.set_minor_locator(ticker.MultipleLocator(1))
plt.show()

posted by wnissen at 12:46 PM on February 1

