Exploratory data analysis (EDA), a term coined by the American mathematician John Tukey in the 1970s, is the process of analysing data sets in order to summarise their essential properties. Using summary statistics and graphical representations, analysts look for trends, detect anomalies, check assumptions, and test hypotheses.
Types of Statistics in Maths
The two primary branches of statistics are descriptive statistics, which describes the properties of sample and population data, and inferential statistics, which uses those properties to test hypotheses and draw conclusions.
- Understanding statistics: statistics is used in practically every field, including the sciences, business, the humanities, government, and industry. It is a branch of applied mathematics that arose from applying mathematical techniques such as calculus and linear algebra to probability theory.
- Descriptive statistics: descriptive statistics are primarily concerned with the central tendency, variability, and distribution of the sample data. Central tendency is an estimate of the typical value of a sample or population, and is measured by statistics such as the mean, median, and mode. Variability describes how much the members of a sample or population differ from one another on the attribute being examined, and is measured by statistics such as the range, variance, and standard deviation.
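The measures named above can be computed with a few lines of code. The following is a minimal sketch using Python's standard `statistics` module; the `scores` list and the `describe` helper are made up for illustration and are not part of any particular data set or library.

```python
import statistics

# Hypothetical sample: exam scores for ten students (made-up values).
scores = [62, 70, 70, 75, 78, 81, 85, 88, 90, 95]

def describe(values):
    """Return central-tendency and variability measures for a sample."""
    return {
        # Central tendency: the "typical" value.
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Variability: how spread out the values are.
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }

summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Note that `statistics.variance` and `statistics.stdev` treat the values as a sample. If the list held an entire population, `statistics.pvariance` and `statistics.pstdev` would be the matching choices.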
Definition of Exploratory Data Analysis
Exploratory data analysis is a way of evaluating datasets to summarise their essential properties, generally using visual approaches. It is used before the modelling work begins, to examine what the data can tell us. It’s not simple to deduce the essential qualities of the data from a column of numbers or an entire spreadsheet, and deriving insights from raw data can be tedious and overwhelming. Exploratory data analysis techniques were developed to help with exactly this task.
Exploratory data analysis is essential before moving on to machine learning or statistical modelling because it offers the context required to construct a suitable model for the issue at hand and to evaluate the model’s output appropriately. Data scientists use exploratory data analysis to ensure that the findings they provide are valid, correctly interpreted, and relevant to the intended business setting.
Different Types of Exploratory Data Analysis
Exploratory data analysis techniques fall into two broad groups: graphical and quantitative (non-graphical). Quantitative methods compute summary statistics, whilst graphical methods summarise the data in a diagrammatic or visual manner. Each group can be applied to a single variable (univariate) or to several variables at once (multivariate), which gives the four types below.
- Univariate non-graphical: among the four types of data analysis, this is the most basic. The data being analysed consists of just one variable. The major goal of this analysis is to characterise the data and look for patterns in it.
- Univariate graphical: the graphical approach, unlike the non-graphical method, offers a fuller picture of the data. Histograms, stem and leaf plots, and box plots are the three major techniques for this sort of analysis. The histogram shows the count of instances in each range of values. The stem and leaf plot shows the shape of the distribution as well as the data values themselves. The box plot shows the minimum, first quartile, median, third quartile, and maximum values visually.
- Multivariate non-graphical: this kind of exploratory data analysis uses cross-tabulation or summary statistics to show the relationship between two or more variables.
- Multivariate graphical: this form of exploratory data analysis shows graphically how two or more variables are related. A common example is a grouped bar chart, in which each group represents a level of one variable and each bar within a group represents a level of another variable.
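Three of the techniques listed above (the five values behind a box plot, the counts behind a histogram, and a cross-tabulation) can be sketched without any plotting library. The example below is an illustration only: the `records` data and the helper names `five_number_summary`, `histogram_counts`, and `cross_tab` are invented for this sketch, and the quartiles depend on the interpolation method chosen (here the `"inclusive"` method of `statistics.quantiles`).

```python
import statistics
from collections import Counter

# Hypothetical records: (department, passed_training, hours_of_training).
records = [
    ("sales", "yes", 12), ("sales", "no", 4), ("sales", "yes", 9),
    ("support", "yes", 15), ("support", "yes", 11), ("support", "no", 3),
    ("engineering", "no", 6), ("engineering", "yes", 14),
]
hours = [row[2] for row in records]

def five_number_summary(values):
    """Univariate: minimum, Q1, median, Q3, maximum (what a box plot draws)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def histogram_counts(values, bin_width):
    """Univariate: count the values in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def cross_tab(rows, row_index, col_index):
    """Multivariate: count co-occurrences of two categorical fields."""
    return Counter((row[row_index], row[col_index]) for row in rows)

print("Five-number summary:", five_number_summary(hours))

# A text-only histogram: one '#' per observation in the bin.
for start, count in histogram_counts(hours, bin_width=5).items():
    print(f"{start:>2} to {start + 4:<2} {'#' * count}")

# Cross-tabulation of department against training outcome.
for (department, passed), count in sorted(cross_tab(records, 0, 1).items()):
    print(f"{department:<12} passed={passed:<3} count={count}")
```

In practice, a plotting library would draw the histogram and box plot directly. The point of the sketch is that each chart is built from simple counts and quantiles.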
Features of Exploratory Data Analysis
The main goals and uses of exploratory data analysis are the following:
- Allow for unexpected data findings.
- Form hypotheses about the causes of the things you’ve observed.
- Assess the assumptions that will be used to make statistical inferences.
- Assist in the selection of suitable statistical tools and approaches.
- Provide a foundation for further data collection through surveys or experiments.
- Apply k-means clustering to your data. In this unsupervised learning method, each data point is assigned to one of k clusters, namely the cluster whose centre is nearest to it. K-means clustering is used in market segmentation, image compression, and pattern recognition.
- EDA findings can inform predictive models such as linear regression, which use the data to anticipate outcomes.
- EDA is also used to compute summary statistics, identify associations between variables, and understand how the fields in the data interact with each other, through univariate, bivariate, and multivariate visualisation.
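To make the k-means item above concrete, here is a minimal sketch of the standard assign-then-update loop in plain Python. It is one simple variant rather than a production implementation: the `k_means` function, the `customers` data, and the choice to pass starting centres explicitly (instead of picking them at random) are all assumptions made for this example.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Cluster points by alternating assignment and update steps.

    Assumes max_iterations >= 1 and that every point has the same dimension.
    """
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (annual spend in thousands, visits per month).
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
             (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Start from two of the data points so the run is repeatable.
centroids, clusters = k_means(customers, [customers[0], customers[3]])
for centre, members in zip(centroids, clusters):
    print(f"centre={centre} size={len(members)}")
```

With this toy data the two groups are well separated, so the loop settles after the first update. On real data the result depends on the starting centres and on k, which is why k-means is usually run several times with different starting points.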
Conclusion
Exploratory data analysis is the crucial process of performing initial investigations on data, using summary statistics and graphical representations, in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.