The first time I heard about this famous law, it really came as a surprise. I found it unbelievable that the first digit did not have a uniform probability distribution. It looks like a kind of ‘law that controls us’, or a ‘law that is above our humble arbitrariness’. However, curious as it is, there is nothing supernatural about this law, except its own simplicity.
For those who parachuted into this post, Benford’s Law says that the first digit of the numbers recorded in real-world data sources is not uniformly distributed. That is, the digits from 0 to 9 are not equally likely, and neither are the digits from 1 to 9, if you have already noticed that a leading 0 makes no sense. According to the law, the distribution is approximately:
30.1% chance that the first digit is 1;
17.6% chance that the first digit is 2;
12.5% chance that the first digit is 3;
9.7% chance that the first digit is 4;
7.9% chance that the first digit is 5;
6.7% chance that the first digit is 6;
5.8% chance that the first digit is 7;
5.1% chance that the first digit is 8;
4.6% chance that the first digit is 9.
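The percentages above are not arbitrary. In its usual statement, Benford’s Law gives the probability of a leading digit d as log10(1 + 1/d). The post itself does not spell out this formula, so take the following as a minimal sketch for reproducing the list; the function name `benford_probability` is just an illustrative choice.

```python
import math


def benford_probability(digit: int) -> float:
    """Probability that the leading digit equals `digit` under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("a leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the expected share of each leading digit, rounded to one decimal place.
for d in range(1, 10):
    print(f"first digit {d}: {benford_probability(d):.1%}")
```

The nine probabilities add up to 1, because the product of the terms (1 + 1/d) for d from 1 to 9 telescopes to 10.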
There are several uses for this result, such as checking whether an institution’s data has been tampered with. Genuine data is expected to follow this distribution, so when it does not, there is reason to suspect that it has been adulterated.
But where does this seemingly supernatural pattern come from? What is the reason behind the strange pattern called Benford’s Law?
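As a rough illustration of how such a check might look, the sketch below counts the leading digits of a list of non-zero integers and compares them with the Benford proportions using a chi-square statistic. The function names are hypothetical, the choice of a chi-square test is an assumption of this example rather than something the post prescribes, and 15.507 is the standard 5% critical value for 8 degrees of freedom.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_5PCT_8DF = 15.507


def first_digit(n: int) -> int:
    """Leading digit of a non-zero integer (the sign is ignored)."""
    return int(str(abs(n))[0])


def benford_chi_square(values) -> float:
    """Chi-square distance between observed leading digits and Benford's Law."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


# Powers of 2 are a classic sequence whose leading digits follow Benford closely.
benford_like = [2 ** k for k in range(1, 301)]
# Every leading digit appears equally often here, which Benford does not predict.
uniform_like = list(range(100, 1000))

for name, data in [("powers of 2", benford_like), ("100..999", uniform_like)]:
    stat = benford_chi_square(data)
    verdict = "suspicious" if stat > CHI2_CRITICAL_5PCT_8DF else "plausible"
    print(f"{name}: chi-square = {stat:.1f} ({verdict})")
```

A large statistic is only a signal to look closer, not proof of fraud: many legitimate datasets (for example, values confined to a narrow range) do not follow Benford’s Law at all.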
The explanation is simple. Let’s look at a news item that came out today (12/26/2020) on G1 about Dengue (yes, there are still other diseases in Brazil besides COVID).
This news item reports 47 thousand cases.
But before this news came out, there must have been news of 30 thousand cases.
Before that, there must have been news of 20 thousand cases;
Before that, there must have been news of 10 thousand cases;
But Distrito Federal is not the only region with cases of Dengue. We could have had other news (depending on the date) such as:
DF exceeds 10,000 cases in 2020
SP exceeds 10,000 cases in 2020
MG exceeds 10,000 cases in 2020
RJ exceeds 10,000 cases in 2020
However, not all regions are affected in the same way. Suppose that, of the 27 Brazilian federative units (26 states plus Distrito Federal), 25 exceed 20 thousand cases of Dengue. Of these 25, perhaps not all exceed 30 thousand cases; say, for example, that 21 of them have passed 30 thousand. And so on, until only 3 surpass 50 thousand cases of Dengue.
This means that, among the news items about Dengue cases, 27 would report a region passing 10 thousand cases, but only 25 would report a region passing 20 thousand cases, only 21 would report passing 30 thousand, and so on, always decreasing …
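The headline argument can be sketched in a few lines. The final case counts below are made-up numbers for five regions, not real data, and `milestone_headlines` is a hypothetical helper: it lists every multiple of 10 thousand that a region’s running total reached on its way up, which stands in for the “exceeds N cases” headlines.

```python
from collections import Counter

# Hypothetical final case counts per region (invented for illustration only).
final_cases = {"DF": 47_000, "SP": 58_000, "MG": 33_000, "RJ": 21_000, "AC": 12_000}

STEP = 10_000  # one headline for every 10 thousand cases


def milestone_headlines(total: int, step: int = STEP) -> list[int]:
    """Every multiple of `step` that the running total reached on its way to `total`."""
    return list(range(step, total + 1, step))


# Tally the first digit of every milestone that made the news.
tally = Counter()
for region, total in final_cases.items():
    for milestone in milestone_headlines(total):
        tally[int(str(milestone)[0])] += 1

for digit in sorted(tally):
    print(f"headlines starting with {digit}: {tally[digit]}")
```

Every region contributes a headline starting with 1, but only the hardest-hit regions contribute one starting with 4 or 5, so the tally can only decrease as the digit grows.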
The same relationship applies, for example, to the number of children. Most people have their first child before having their second. That is, if we look at the records of families with children, the number 1 will appear as the first digit very often, since before having a second, third, or fourth child, we usually have the first.
Similarly, when a disease arises, before we have the second case of infection, we will have news about the first case. And not only that: each city, each region, each country will have its own first case. So we will have the first case in Campinas, the first case in Bauru, the first case in São Carlos … see how often the 1 appears in the records compared to the other digits. Take, for example, the first case of an extremely rare disease. It may be that in the coming years no one else in that region gets the same disease, so the count stays stuck at the first case.
That is why 1 is so common as the first digit. Thinking a little in terms of conditional probability: to reach 2, we usually need 1 to have occurred first. This makes 1 more frequent than 2, which in turn is more frequent than 3, and so on.
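To make the conditional-probability intuition concrete, here is a small simulation. It rests on an assumption that is mine and not the post’s: every region in the records has its first case, and each further case happens with a fixed probability `p_next` given that the previous one happened. This toy model is not expected to reproduce Benford’s exact percentages, only the decreasing order of the digits.

```python
import random
from collections import Counter


def simulate_final_counts(regions: int, p_next: float, seed: int = 0) -> list[int]:
    """Final case count per region under the toy 'one more case' model."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = []
    for _ in range(regions):
        n = 1  # a region only enters the records once it has its first case
        while rng.random() < p_next:
            n += 1  # case n + 1 can only happen after case n
        counts.append(n)
    return counts


counts = simulate_final_counts(regions=10_000, p_next=0.5)
first_digits = Counter(int(str(n)[0]) for n in counts)

for digit in sorted(first_digits):
    print(f"first digit {digit}: {first_digits[digit]} regions")
```

Since reaching a count of 2 requires passing through 1, and reaching 3 requires passing through 2, each digit necessarily shows up no more often than the one before it in this model.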