Data cleansing was not necessary when we first began our data science journey and worked with a first data set such as the iris data set, but real-world data sets are far from ideal. Collected data usually has a number of flaws that need to be fixed before any model can be fitted to it. Handling the data incorrectly can introduce bias and produce unreliable results. This is where exploratory data analysis comes in.
Exploratory data analysis involves a number of steps, including identifying all the variables and their data types, performing univariate and bivariate analyses, handling missing values, and handling outliers. This exploratory step should never be skipped when building a model. Outlier detection is one of its most crucial phases. Outliers are extreme values that are not consistent with the other data points. They may have found their way into the dataset as a result of mistakes, such as data entry errors. There are several approaches to handling outliers, and the appropriate one depends on the dataset.
The description above already hints at what constitutes an outlier. In simple terms, an outlier is a value that lies outside the range of the other values in the dataset. Outliers are observations that differ markedly from the rest of the data, such as a sudden many-fold increase or decrease. For instance, a patient’s body temperature was recorded in a hospital as 988 degrees Fahrenheit, which is obviously erroneous. A decimal point was probably dropped, so the intended value was 98.8 rather than 988.
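One common way to turn "outside the range of the other values" into a concrete check is the interquartile range (IQR) rule. The sketch below is only an illustration, not a method prescribed by this article: the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are all assumptions made for the example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles.

    k=1.5 is a conventional default, not a universal threshold.
    """
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical body temperature readings, with one mistyped value (988).
temps = [98.1, 98.4, 98.6, 98.8, 99.0, 98.7, 988.0, 98.5]
print(iqr_outliers(temps))
```

With these made-up readings, only the 988 entry falls outside the computed bounds. On small or heavily skewed samples the rule can flag legitimate values, so the result is best treated as a list of candidates to inspect.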
Another example is weighing high school pupils and finding a record with a value of 1234, which is extremely implausible. It may be a data entry mistake. An outlier is not always an incorrect entry, however; in some situations it is the genuine outcome of an experiment. The data scientist has to make this determination. What counts as an outlier varies from case to case and depends on the business problem. Before labeling a data point as an outlier, it is always preferable to consult the relevant business stakeholders. Outliers need extra care so that they do not distort the model results.
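Because an extreme value is not automatically wrong, one cautious option is to flag suspicious records for review instead of deleting them. The following is a minimal sketch under stated assumptions: the helper `flag_implausible`, the field name `weight_kg`, the unit, and the 30 to 200 range are hypothetical, and real limits would come from the stakeholders who know the data.

```python
def flag_implausible(records, field, low, high):
    """Split records into (plausible, flagged) without discarding any.

    A record is flagged when the field is missing or outside [low, high].
    """
    plausible, flagged = [], []
    for rec in records:
        value = rec.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(rec)      # keep for manual review
        else:
            plausible.append(rec)
    return plausible, flagged


# Hypothetical pupil records; the limits below are illustrative assumptions.
pupils = [
    {"id": 1, "weight_kg": 54.0},
    {"id": 2, "weight_kg": 1234.0},
    {"id": 3, "weight_kg": 61.5},
]
ok, review = flag_implausible(pupils, "weight_kg", low=30, high=200)
print([r["id"] for r in review])
```

Keeping the flagged records in a separate list leaves the decision to correct, keep, or drop each one with the people who understand the business context.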