DIFFUSE - Machine Learning
Introduction
Machine Learning (ML) usually
refers to systems performing tasks associated with
Artificial Intelligence (AI). Such tasks involve
recognition, diagnosis, planning, prediction, etc. [Mitchell
1997] defines Machine Learning as follows: "A computer
program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E". [Witten and Frank 2000] state:
"things learn when they change their behavior in a way
that makes them perform better in the future".
Machine Learning can be viewed as a general inductive process
that automatically builds a model by learning the inherent
structure of a dataset depending on the characteristics of
the data instances. Over the past decade, Machine Learning
has evolved from a field of laboratory demonstrations to a
field of significant commercial value. Machine Learning
techniques have been very successful in the areas of Data
Mining, speech and voice recognition, text recognition,
face recognition etc.
A good model should be descriptive (describe the training
data), predictive (generalise well for unseen test data)
and explanatory (provide a plausible description).
Terminology
The following terms are widely
used in the field of Machine Learning:
- Instance: An instance is an example in the training
data. An instance is described by a number of
attributes. One attribute can be a class label.
- Attribute/Feature: An attribute is an aspect of an
instance (e.g. temperature, humidity). Attributes are
often called features in Machine Learning. A special
attribute is the class label that defines the class
this instance belongs to (required for supervised
learning).
- Classification: A classifier is able to
categorise/classify given instances (test data) with
respect to pre-determined/learned classification
rules.
- Training/Learning: A classifier learns the
classification rules based upon a given set of
instances (training data). Please note that some
algorithms have no training phase but do all the work
when classifying an instance (so-called lazy learning
algorithms).
- Clustering: The process of dividing an unlabelled
data set (no class information given) into clusters
that contain similar instances.
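To make these terms concrete, the following minimal Python
sketch (the attribute names are illustrative) shows how an
instance, its attributes and its class label might be
represented:

instance = {
    "outlook": "sunny",    # attribute/feature
    "temperature": 85,     # attribute/feature (numeric)
    "humidity": 85,        # attribute/feature (numeric)
    "windy": False,        # attribute/feature (boolean)
    "play": "no",          # class label (needed for supervised learning only)
}

# A training set is simply a collection of such instances.
training_data = [instance]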
Supervised vs. Unsupervised Learning
The
basic notion of supervised learning is that of the
classifier. A teacher helps to construct a classification
model by defining classes and providing positive and
negative examples of instances belonging to these classes.
The system then determines the commonalities and
differences between the various classes in order to
generate classification rules for unknown instances. The
resulting rules assign a class label to an instance based
on the values of the instance’s attributes.
In unsupervised learning there are no classes defined
a priori. The algorithm itself divides the instances into
different classes and finds descriptions for these classes.
This process is often also referred to as clustering. The
resulting rules will be a summary of some properties of the
instances in the database: which classes are present and
what differentiates them. However, this will only be what
the algorithm has found. There may be many other ways of
dividing the instances into classes and of describing each
class.
Semi-supervised learning is a combination of supervised and
unsupervised learning. In this approach a user-defined
amount of supervision is imposed on the algorithm. The
advantage is that only part of the data must be labelled
a priori, requiring less teaching effort, while the
classification accuracy is usually higher than for
purely unsupervised approaches.
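The supervised/unsupervised contrast can be illustrated with
a short sketch, here in Python using the scikit-learn
library (the data values and parameters are illustrative,
not part of the original example):

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[85, 85], [80, 90], [83, 86], [70, 96]]  # instances (temperature, humidity)
y = ["no", "no", "yes", "yes"]                # class labels from the "teacher"

# Supervised: the learner receives instances AND their class labels.
classifier = DecisionTreeClassifier().fit(X, y)
print(classifier.predict([[75, 80]]))

# Unsupervised: the learner receives the instances only and must
# divide them into clusters of similar instances on its own.
clusterer = KMeans(n_clusters=2, n_init=10).fit(X)
print(clusterer.labels_)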
Feature Selection
Feature selection involves
selecting a subset of the available feature set to be used
in learning and classification. The effectiveness of a
system depends on the number and the types of features. It
is important to minimise the number of features to increase
the performance of the learning and future classifications,
in terms of processing time and memory required. It is also
important to remove irrelevant features to increase the
classification accuracy, as some Machine Learning
algorithms cannot cope well with irrelevant features.
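As an illustration, scikit-learn offers simple filter-style
feature selection; the following sketch (using the bundled
iris data set as a stand-in) keeps only the two features
that share the most information with the class:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)      # 150 instances, 4 features, 3 classes

# Score each feature against the class and keep the best k of them.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support())          # boolean mask over the 4 features
print(X_reduced.shape)                 # (150, 2)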
Example
An example that is often used to
illustrate Machine Learning is to determine whether an
unspecified sport can be played based on the current
weather conditions. Consider the following data (taken from
Weka 3: Data Mining Software in Java), which includes the
outlook, the temperature, the humidity, whether it is windy
or not, and whether playing is possible (this is the
class attribute).
Table 1: Weather data

outlook  | temperature | humidity | windy | play
---------+-------------+----------+-------+-----
sunny    | 85          | 85       | false | no
sunny    | 80          | 90       | true  | no
overcast | 83          | 86       | false | yes
rainy    | 70          | 96       | false | yes
rainy    | 68          | 80       | false | yes
rainy    | 65          | 70       | true  | no
overcast | 64          | 65       | true  | yes
sunny    | 72          | 95       | false | no
sunny    | 69          | 70       | false | yes
rainy    | 75          | 80       | false | yes
sunny    | 75          | 70       | true  | yes
overcast | 72          | 90       | true  | yes
overcast | 81          | 75       | false | yes
rainy    | 71          | 91       | true  | no
The task for the Machine Learning algorithm is to find the
rules inherent in this data set. Applying the C4.5 tree
induction algorithm (named J48 in Weka), we get the
following decision tree. This decision tree is a perfect
classifier for our
training data in the table. No data instance is
misclassified and therefore the accuracy of the classifier
on the training data is 100%.
Figure 1: Decision tree for weather data
Each node of the tree represents an attribute/feature. The
branches below each node divide the attribute value space
using the test associated with each branch. The leaves of
the tree contain the class and the number of instances.
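A comparable tree can be induced outside of Weka as well.
The following Python sketch uses scikit-learn's
DecisionTreeClassifier with the entropy criterion; note that
scikit-learn implements CART rather than C4.5, so the
resulting tree may differ in detail from Figure 1, but it is
driven by the same information-gain idea and also classifies
the training data perfectly:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The weather data from Table 1.
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                "overcast", "rainy"],
    "temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "humidity":    [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy": [False, True, False, False, False, True, True, False, False,
              False, True, True, False, True],
    "play":  ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes",
              "yes", "yes", "yes", "yes", "no"],
})

# scikit-learn trees require numeric inputs, so one-hot encode 'outlook'.
X = pd.get_dummies(data.drop(columns="play"))
y = data["play"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
print(tree.score(X, y))   # 1.0, i.e. 100% accuracy on the training data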
Tree-based algorithms such as C4.5 already perform some
kind of feature selection. The order in which the
attributes appear in the tree (from top to bottom) depends
on their ability to divide the feature space most quickly
into sets of instances that can be accurately mapped to a
single class. C4.5 uses a metric called information gain to
determine the best sequence of attributes and the optimal
splits of the attribute value space. Attributes with low
information gain appear at the bottom of the tree or are
not part of the tree at all. For instance, the temperature
attribute is not present in the above tree because it does
not have a strong correlation with the class.
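As a concrete illustration, the information gain of the
outlook attribute can be computed by hand; the sketch below
follows the standard entropy-based definition used by C4.5:

from math import log2

def entropy(pos, neg):
    """Entropy (in bits) of a set with pos 'yes' and neg 'no' instances."""
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

# The whole data set contains 9 'yes' and 5 'no' instances.
base = entropy(9, 5)                          # ~0.940 bits

# Splitting on outlook: sunny (2 yes/3 no), overcast (4/0), rainy (3/2).
subsets = [(2, 3), (4, 0), (3, 2)]
remainder = sum((p + n) / 14 * entropy(p, n) for p, n in subsets)

print(base - remainder)                       # information gain, ~0.247 bits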
The decision tree can be easily translated into the
following set of classification rules:
if outlook=sunny and humidity<=75
then play=yes
if outlook=sunny and humidity>75 then play=no
if outlook=overcast then play=yes
if outlook=rainy and windy=true then play=no
if outlook=rainy and windy=false then play=yes
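Translated literally into code, these rules become a simple
classification function (a Python sketch; the attribute
names follow Table 1):

def play(outlook, humidity, windy):
    """Classify a weather instance using the rules induced above."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"

print(play("sunny", 85, False))   # -> "no" (the first instance in Table 1)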
References
The following references provide
good introductions to the topic of Machine Learning:
- Ian H. Witten, Eibe Frank, "Data Mining: Practical
Machine Learning Tools and Techniques (Second Edition)",
Morgan Kaufmann, June 2005.
- Tom M. Mitchell, "Machine Learning", McGraw-Hill
Education (ISE Editions), December 1997.
- Nils J. Nilsson, "Introduction to Machine Learning", 1996.