Lessons learned from data mining projects in industry are presented as seven principles and a dozen tips. No questionnaires or interview surveys were administered; the lessons are drawn from the authors’ own extensive experience and their discussions with other researchers and practitioners.
The classic view of the data mining process is shown to be overly simplistic. In applied data mining, the preferred approach is said to involve three phases. In the scout phase, several methods are used to quickly explore a range of hypotheses and present results to users, who in this setting are the people building products for a company. Once user goals have been established, the project turns to a more careful experimental approach in the survey phase. Finally, the build phase integrates the learned models into products. These phases are said to take weeks, months, and years, respectively. Early and continuous feedback from users is deemed critical for success.
One principle highlights the need to avoid bad learning: the stability of any learned trend must be assessed by repeating the analysis across multiple sub-samples of the data. Another principle emphasizes the utility of several inductive technologies. For example, discretization algorithms can automatically reveal irrelevant variables. Of the dozen tips, perhaps none is more important than the advice to use well-supported tools, such as R, Weka, and MATLAB.
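The stability principle can be illustrated with a short sketch. The paper under review does not prescribe a particular procedure, so the following is only one plausible reading: it assumes the "trend" is a Pearson correlation between two numeric variables, and the function names (`pearson`, `trend_stability`), the sub-sample fraction, and the number of repeats are all hypothetical choices made for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no trend
    return cov / (var_x * var_y) ** 0.5


def trend_stability(rows, n_subsamples=20, fraction=0.66, seed=0):
    """Re-estimate a trend on random sub-samples of (x, y) rows.

    Returns (median estimate, spread of estimates, share of
    sub-samples whose sign agrees with the median estimate).
    The defaults are illustrative assumptions, not values from the paper.
    """
    k = int(len(rows) * fraction)
    if k < 2:
        raise ValueError("need at least two rows per sub-sample")
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_subsamples):
        sample = rng.sample(rows, k)  # sampling without replacement
        xs, ys = zip(*sample)
        estimates.append(pearson(xs, ys))
    median_r = statistics.median(estimates)
    same_sign = sum(1 for r in estimates if r * median_r > 0)
    return median_r, statistics.pstdev(estimates), same_sign / n_subsamples


# Synthetic data for demonstration only: a noisy linear relationship.
demo_rng = random.Random(42)
demo_rows = [(x, 2 * x + demo_rng.gauss(0, 5)) for x in range(200)]
print(trend_stability(demo_rows))
```

A trend whose sign flips across sub-samples, or whose estimates vary widely, would be a candidate for the "bad learning" the principle warns against; what counts as too much variation is a judgment the sketch leaves to the analyst.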
Though several issues, such as data cleaning and validation, could have been usefully discussed in much greater detail, this paper is recommended to those undertaking data mining projects in industry or academia.