The aim of data mining is to make sense of large amounts of mostly unsupervised data, in some domain. The above statement defining the aims of data mining (DM) is intuitive and easy to understand. The users of data mining are often domain experts who not only own the data but also collect the data themselves. We assume that data owners have some understanding of the data and the processes that generated the data. Businesses arc the largest group of data mining users, since they routinely collect massive amounts of data and have a vested interest in making sense of the data. Their goal is to make their companies more competitive and profitable. Data owners desire not only to better understand their data but also lo gain new knowledge about the domain (present in their data) for the purpose of solving problems in novel, possibly better ways.
In the above definition, the first key term is “make sense”, which has different meanings depending on the user's experience. In order to make sense we envision that this new knowledge should exhibit a series of essential attributes: it should be understandable, valid, novel, and useful. Probably the most important requirement is that the discovered new knowledge needs to be understandable to data owners who want to use it to some advantage. The most convenient outcome by far would be knowledge or a model of the data that can be described in easy-to-understand terms, say via production rules such as:
IF abnormality (obstruction) in coronary arteries THEN coronary artery disease
In the example, the input data may be images of the heart and accompanying arteries. If the images are diagnosed by cardiologists is being normal or abnormal (with obstructed arteries), then such data are known as learning/training data. Some data mining techniques generate models of the data in terms of production rules, and cardiologists may then analyze these and either accept or reject them (in case the rules do not agree with their domain knowledge). Note, however, that cardiologists may not have used, or even known, some of the rules generated by data mining techniques, even if the rules are correct (as determined by cardiologists after deeper examination), or as shown by a data miner to be performing well on new unseen data, known as test data.