Data science is the field that uses formulas, algorithms, and well-researched techniques to extract information from raw or unstructured data. Though it grew out of statistics and mathematics, data science has developed into an independent discipline. The field offers career opportunities such as data scientist and data analyst, so understanding its concepts is essential.
Database meaning: A database collects raw or unstructured data and stores it for future use. It comprises the data and the system used to manage it. Organizations protect their databases with up-to-date security features and technologies.
Data sets are ranked to determine their order. In statistics, ranking is a data transformation in which numerical values are replaced by their ranks once the data is sorted in ascending or descending order. When the data is sorted in ascending order, rank 1 is assigned to the smallest value. On the other hand, when the data is sorted in descending order, rank 1 is assigned to the largest value.
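The ranking transformation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name rank_values and the sample numbers are made up for this example, and the function assumes that no value repeats.

```python
def rank_values(values, descending=False):
    """Replace each value by its 1-based rank.

    Assumption for this sketch: all values are distinct (no ties).
    """
    ordered = sorted(values, reverse=descending)
    # Map each value to its position in the sorted order.
    position = {value: index + 1 for index, value in enumerate(ordered)}
    return [position[value] for value in values]


data = [12, 7, 19, 3]  # hypothetical sample data
ascending_ranks = rank_values(data)                    # rank 1 = smallest value
descending_ranks = rank_values(data, descending=True)  # rank 1 = largest value
print(ascending_ranks)
print(descending_ranks)
```

In ascending order the value 3 receives rank 1, while in descending order the value 19 receives rank 1.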
Correlation Coefficients
To measure the strength of the relationship between two variables, we use correlation coefficients. The two most popular types of correlation coefficients are Pearson’s Correlation Coefficient and Spearman’s Correlation Coefficient. Pearson’s Correlation Coefficient is probably the first correlation coefficient taught to students who have started learning statistics.
It is used so often that when someone refers to the correlation coefficient without further qualification, they usually mean Pearson’s Correlation Coefficient. It is generally denoted by r.
What is Pearson’s Correlation?
Pearson’s Correlation is the most widely used method for measuring the relationship between two data sets. It captures the linear relationship between variables. In simpler terms, it indicates how well a straight line drawn on a graph represents the relationship between the variables.
The formula used to determine Pearson’s Correlation Coefficient is given below:
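$$r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$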
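As a rough sketch, Pearson’s coefficient can be computed directly from the sums that appear in its formula: the numerator is n Σxy minus Σx Σy, and the denominator is the square root of the product of n Σx² minus (Σx)² and n Σy² minus (Σy)². The function name pearson_r and the sample data are assumptions made for this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed from raw sums (illustrative sketch)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # r is undefined when either variable is constant.
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / denominator


# Hypothetical data: y is exactly 2 * x, so the points lie on a straight line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(pearson_r(x, y))
```

Because the sample points lie exactly on an upward-sloping line, the result is 1 (up to floating-point rounding). Reversing y would give -1.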
Pearson’s Correlation Coefficient can be defined for a whole population or for a sample drawn from it. The formulas for the Population Correlation Coefficient and the Sample Correlation Coefficient are given below:
Population Correlation Coefficient
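The Population Correlation Coefficient ρxy is given below:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
where σxy is the population covariance of x and y, and σx and σy are their population standard deviations.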
Sample Correlation Coefficient
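The Sample Correlation Coefficient rxy is given below:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
where sxy is the sample covariance of x and y, and sx and sy are their sample standard deviations.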
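Both coefficients are the ratio of a covariance to the product of two standard deviations. The sketch below computes each version with Python’s standard statistics module, using stdev for the sample standard deviation and pstdev for the population one. The helper covariance and the data are hypothetical and exist only for this example.

```python
from statistics import mean, pstdev, stdev


def covariance(x, y, sample=True):
    """Covariance of x and y; divides by n - 1 for a sample, n for a population."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (n - 1 if sample else n)


# Hypothetical data used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Sample version: sample covariance over sample standard deviations.
r_sample = covariance(x, y, sample=True) / (stdev(x) * stdev(y))

# Population version: population covariance over population standard deviations.
rho_population = covariance(x, y, sample=False) / (pstdev(x) * pstdev(y))

print(r_sample, rho_population)
```

When both versions are computed on the same data, the n - 1 (or n) divisors cancel between the numerator and the denominator, so the two numbers agree. The distinction matters for what the number describes: a whole population or an estimate from a sample.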
Spearman’s Rank Correlation: Ranks are Given
Two variables have a monotonic relationship when an increase in one is consistently accompanied by an increase (or consistently by a decrease) in the other, even if the relationship is not a straight line. Pearson’s Correlation Coefficient gives the direction and strength of the linear association between the variables you are interested in, so it can understate a monotonic relationship that is not linear. Spearman’s Rank Correlation, however, indicates the extent to which two variables move together monotonically.
Before evaluating Spearman’s Rank Correlation, we need to rank the observations. This is essential because we have to identify whether increasing the value of one variable has a similar effect on the other, i.e., whether it rises or not. A comparison between the two variables is made at each level before assigning ranks. For instance, the values can be ranked from 1 to n (n is the number of observations), with rank 1 assigned to the highest value, so that the set is ordered from highest to lowest. In case the same value repeats, the arithmetic mean of the ranks the tied values would occupy is assigned to each of them. For example, X represents the selling price values of a product.
X = (22, 34, 24, 30, 28, 24)
The descending order of the set, i.e., from highest to lowest, will be as follows:
34, 30, 28, 24, 24, 22 and their corresponding ranks will be as follows:
34 -> 1, 30 -> 2, 28 -> 3, and so on.
However, the two values of 24 occupy the fourth and fifth positions and cannot be given different ranks. Therefore, as per Spearman’s Rank Correlation, the average rank (4+5)/2 = 4.5 is given to each of them. The remaining values keep the ranks they had before. The final ranks are listed below:
34 -> 1, 30 -> 2, 28 -> 3, 24 -> 4.5, 24 -> 4.5, and 22 -> 6.
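The tie-handling rule can be sketched in Python as follows. The function name average_ranks is an assumption made for this example. It ranks from highest to lowest by default and gives tied values the mean of the positions they occupy.

```python
def average_ranks(values, descending=True):
    """Rank values, giving tied values the average of their positions."""
    # Indices of the values, ordered from highest to lowest by default.
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=descending
    )
    ranks = [0.0] * len(values)

    start = 0
    while start < len(order):
        # Extend the group while the next value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions are 1-based, so the group occupies start + 1 .. end + 1.
        shared_rank = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


X = [22, 34, 24, 30, 28, 24]  # selling prices from the example in the text
ranks_X = average_ranks(X)
print(ranks_X)
```

The ranks are returned in the original order of X, so both occurrences of 24 receive 4.5, while 34 receives 1 and 22 receives 6.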
Spearman Rank Correlation Formula
Spearman Rank Correlation is represented by either ρ (rho) or r. The formula for Spearman Rank Correlation, written here with the symbol rR, is given below:
$$r_R = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
Where,
di is the difference between the ranks of the ith observation in the two variables, and n is the number of observations in the set. The value of the Spearman Rank Coefficient can vary between -1 and +1.
If the value of ρ is -1, it means that there is a perfect negative association between the ranks.
If the value of ρ is 0, it means that there is no association between the ranks.
If the value of ρ is 1, it means that there is a perfect positive association between the ranks.
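A minimal sketch of the rank-difference formula, assuming the ranks of both variables are already given. The function name spearman_from_ranks and the rank lists are hypothetical. Note that this simplified formula is exact only when there are no tied ranks. With ties it is an approximation, and a common alternative is to compute Pearson’s coefficient on the ranks instead.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's coefficient from two lists of ranks (assumes no ties)."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("rank lists must have the same length (at least 2)")
    # Sum of squared differences between paired ranks.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical ranks for five observations.
same_order = spearman_from_ranks([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
reversed_order = spearman_from_ranks([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
partly_shuffled = spearman_from_ranks([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(same_order, reversed_order, partly_shuffled)
```

Identical rankings give 1, fully reversed rankings give -1, and the partly shuffled ranking falls in between, which matches the interpretation of the two extreme values.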