In mathematics, correlation is a statistical measure of how similarly two factors vary. It ranges over the interval [-1, 1]. High values (close to 1) ‘may indicate’ that the occurrence of one of the observed factors drives up the occurrence of the other. Likewise, strongly negative values (close to -1) ‘may indicate’ that the occurrence of one factor drives the non-occurrence of the other. Finally, values close to 0 ‘may indicate’ that no relationship can be identified between the occurrences of the two factors.
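The text does not say which correlation coefficient is meant. Assuming it is Pearson's coefficient (which is what Excel's CORREL function computes), the measure for n paired observations can be written as follows, where the symbols x_i, y_i and the means x̄, ȳ are notation introduced here for illustration:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1
```

The numerator is positive when the two factors tend to sit above or below their means together, and negative when they tend to move in opposite directions. The denominator only rescales the result into [-1, 1].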
In the previous paragraph the expression ‘may indicate’ is repeated several times and highlighted on purpose. Because the correlation between any two numerical variables is so easy to calculate (in Excel it takes about 5 mouse clicks), there is a mistaken tendency to treat that value as an indication rather than a possible indication. To illustrate this, I created 1,000 variables in Excel, each with 10 observations drawn at random between -1,000 and 1,000, and calculated the correlation between every pair. The highest one was between variables 25 and 602.
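The experiment was done in Excel, but it can be approximated in a few lines of Python. What follows is only a sketch under assumptions: values are drawn uniformly, the coefficient is Pearson's, the seed is arbitrary, and the function names are hypothetical. Because the numbers are random, the strongest pair and its value will differ from the Excel run described above.

```python
import math
import random


def standardize(values):
    """Center a sequence and scale it to unit length.

    After this step, the Pearson correlation of two sequences
    is simply their dot product.
    """
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def strongest_random_correlation(n_variables=1000, n_observations=10, seed=0):
    """Compare every pair of unrelated random variables.

    Returns (r, (i, j), n_pairs), where r is the correlation that is
    largest in absolute value and (i, j) are 0-based variable indices.
    """
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    variables = [
        standardize([rng.uniform(-1000, 1000) for _ in range(n_observations)])
        for _ in range(n_variables)
    ]
    best_r, best_pair, n_pairs = 0.0, None, 0
    for i in range(n_variables):
        for j in range(i + 1, n_variables):
            r = sum(a * b for a, b in zip(variables[i], variables[j]))
            n_pairs += 1
            if abs(r) > abs(best_r):
                best_r, best_pair = r, (i, j)
    return best_r, best_pair, n_pairs


if __name__ == "__main__":
    r, (i, j), n_pairs = strongest_random_correlation()
    print(f"{n_pairs} pairs compared; strongest r = {r:.2f} "
          f"between variables {i} and {j}")
```

With the default 1,000 variables the loop visits 1,000 × 999 / 2 = 499,500 distinct pairs, even though no variable has anything to do with any other.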
Although the correlation between variables 25 and 602 is high, in this case it is nothing more than a statistical coincidence: two variables with no relation to each other that happen to look dependent. Any single coincidence of this kind is unlikely, but comparing the 1,000 variables pair by pair meant making 499,500 comparisons, which proved to be enough for it to occur. If, when presenting the result, we ‘ignore’ the total number of variables compared, we could fall into the mistake of stating that “the strong correlation (0.98) between variables 25 and 602 indicates a dependency between them”. Let’s move on to a more contextualized example:
In the USA, the number of Sociology doctorates awarded between 1999 and 2009 was observed to have a high correlation (0.81) with the number of annual deaths caused by anticoagulants (source: https://tylervigen.com/view_correlation?id=1279, accessed on 21-04-2021). Are physicians being replaced by doctors of Sociology in the treatment of patients on anticoagulants? Are health resources being diverted from this treatment to fund research in Sociology? Are Sociology theses hampering work with anticoagulants in the medical field?
|Sociology PhDs awarded (USA, National Science Foundation)|
|Deaths caused by anticoagulants (USA, Centers for Disease Control and Prevention)|17|39|39|27|44|46|29|42|47|52|78|
Here too there is a strong correlation between the two variables, but it is nothing more than a statistical coincidence produced by the large number of comparisons made, which were omitted when the result was presented.
Thus the value of a correlation, although it seems to be a strong argument, means nothing by itself. It may even seem that what a correlation value means is just a matter of opinion, so everyone would have their own and that’s it … but no.
One of the factors that determine the validity of a correlation is stating the hypothesis before running the test. If we suspect that two events depend on each other, such as the number of COVID-19 cases in a city and the purchase of alcohol gel (hand sanitizer) per inhabitant, we can gather data on both events and then calculate how similarly the two factors vary. If we obtain a high correlation (close to 1), we can suppose that concern over the rise in COVID-19 cases in a city increases the consumption of alcohol gel. If we instead obtain a strongly negative correlation (close to -1), we can suppose that lower consumption of alcohol gel in a city increases the number of COVID-19 cases. Both suppositions derive from the initial hypothesis that these two variables are dependent on each other.
That said, Statistics applied to scientific research takes many directions and meanings across fields. For example, we speak of values close to 1 or -1, but what counts as close? Is 0.9 close? Is 0.8? Is 0.7? Is there a clear dividing line between what is highly correlated and what is not? If 0.8 is close, is 0.79 close as well?
These are questions whose answers cannot be set by universal rules. So much so that undergraduate and graduate programs commonly offer statistics courses tailored to the demands of each field: “Statistics for CourseName”. Among the many questions and intricacies of this area, it is desirable that peers agree on the values and concepts accepted as sufficient. For example, is a survey of 10 participants a lot or a little? The answer is: it depends. It depends on how representative they are, how many variables we are considering for each subject, the intentions of the study and the generality we are looking for, among other factors that prevent us from calling any number too large or too small. If we want to represent a country with more than 100 million inhabitants, 1,000 participants from the same region may not be representative enough, while 1,000 participants from 50 different regions may provide the desired representation.
That said, there are many questions you need to ask (and understand why you should ask them) before stating something based on a statistical test. Therefore, when we have a sufficiently large database, it is advisable to define some hypotheses to investigate before calculating how each variable in it correlates with the others. Otherwise, we will almost certainly find variables that are strongly correlated but have no real relationship with each other. This brings the risk of reading in meanings that match our personal beliefs and turning them into ‘data-based’ fake news, for example that doctoral research in Sociology hindered treatments with anticoagulants. Having statistical data that corroborates this hypothesis is not enough to sustain it.
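To give a rough sense of why an unplanned search through many variables turns up strong correlations, the sketch below estimates by simulation how often two unrelated variables with 10 observations each reach a given correlation strength, and scales that rate to the 499,500 pairs of a 1,000-variable comparison. The 0.8 threshold, the uniform distribution, the number of trials, the seed and the function names are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def estimate_spurious_rate(threshold=0.8, n_observations=10, trials=20000, seed=0):
    """Estimate the share of unrelated pairs with |r| >= threshold."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    hits = 0
    for _ in range(trials):
        x = [rng.uniform(-1000, 1000) for _ in range(n_observations)]
        y = [rng.uniform(-1000, 1000) for _ in range(n_observations)]
        if abs(pearson(x, y)) >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rate = estimate_spurious_rate()
    n_pairs = math.comb(1000, 2)  # 499,500 pairwise comparisons
    print(f"Estimated share of unrelated pairs with |r| >= 0.8: {rate:.4f}")
    print(f"Expected number of such pairs among {n_pairs}: {rate * n_pairs:.0f}")
```

Any rate above roughly 1 in 500,000 already means at least one such pair is to be expected among 499,500 comparisons, which is why the hypotheses should come before the search.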
Although it seems “simple”, there is a whole universe within statistics and mathematical research on tests and their results, each test suited to very particular data profiles and each, in its own way, allowing the best interpretations to be extracted. Reducing all this to the simplistic claim that a correlation value implies dependence is a dangerous mistake, as is using other universal rules to infer such relationships. It can lead, for example, to misinterpreted results that are rooted in personal beliefs and keep being reaffirmed as true without serious scientific grounds, in some cases even after experts have shown them to be flawed.
Talking to statisticians, mathematicians or other professionals who work with these tests is not trivial, because their particular way of handling terms and concepts commonly accepted among peers can make dialogue with outsiders difficult. Even a simple statement about ‘having a correlation’ can prompt several more technical questions: what the probability distribution of the data is, whether the test was parametric or non-parametric, whether the data represent a population or a sample, what the significance is, what the variance and standard deviation of the answers are, and how reliable the collection instrument is, among many other initial questions needed to discuss this subject even a little.