To a large extent, the success or failure of software development projects can be predicted by the accuracy of the effort estimation. One of the crucial factors that affect estimation accuracy is the presence of outliers: data points that appear to be inconsistent with the rest of the dataset. Eliminating outliers can improve data quality. The authors of this paper performed a systematic analysis, using a general experimental procedure, to evaluate the extent to which eliminating outliers led to higher effort estimation accuracy.
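As a hedged illustration of outlier elimination, the sketch below applies a simple interquartile-range (IQR) filter, one generic way to flag points inconsistent with the rest of a dataset. This is not one of the methods evaluated in the paper; the effort values are invented for the example.

```python
# Generic IQR outlier filter (illustrative only; not a method from the paper).
from statistics import quantiles

def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical project efforts (person-hours); 900 is an obvious outlier.
efforts = [120, 135, 140, 150, 155, 160, 900]
print(remove_outliers_iqr(efforts))  # → [120, 135, 140, 150, 155, 160]
```

Whether such a filter improves estimation accuracy on a given dataset is exactly the empirical question the paper investigates.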
Five outlier elimination methods were used, including least trimmed squares and k-means clustering. Two of the most popular software estimation methods were applied: least squares regression and estimation by analogy. Empirical experiments were conducted on five industrial datasets. Accuracy was assessed using several criteria, including the mean magnitude of relative error (MMRE) and the median magnitude of relative error (MdMRE).
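The two accuracy criteria named above are standard in effort estimation: each project's magnitude of relative error is MRE = |actual − predicted| / actual, and MMRE and MdMRE are the mean and median of these values. A minimal sketch, with invented effort values not taken from the paper's datasets:

```python
# MMRE / MdMRE: standard effort-estimation accuracy criteria.
from statistics import mean, median

def mre(actual, predicted):
    """Magnitude of relative error for a single project."""
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Mean magnitude of relative error across projects."""
    return mean(mre(a, p) for a, p in zip(actuals, predictions))

def mdmre(actuals, predictions):
    """Median magnitude of relative error across projects."""
    return median(mre(a, p) for a, p in zip(actuals, predictions))

# Hypothetical actual vs. predicted efforts (person-hours).
actual = [100, 250, 400]
pred = [120, 200, 380]
print(round(mmre(actual, pred), 3))   # mean of [0.2, 0.2, 0.05] → 0.15
print(round(mdmre(actual, pred), 3))  # median of [0.2, 0.2, 0.05] → 0.2
```

Lower values indicate better accuracy on both criteria; the median is less sensitive to a few badly estimated projects than the mean.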
The experimental results were not consistent. For several of the datasets, eliminating outliers did not increase accuracy and in some cases actually decreased it. Improvements were observed only on the Stock_NDV dataset. The authors plan to continue their investigation with more focus on the types of outliers.
The intended audience includes academics and practitioners working on software effort estimation methods.
Readers interested in this topic can find additional information in related papers [1,2].