DIFFUSE - Machine Learning
Introduction
Machine Learning (ML) usually
refers to systems performing tasks associated with
Artificial Intelligence (AI). Such tasks involve
recognition, diagnosis, planning, prediction, etc. [Mitchell
1997] defines Machine Learning as follows: "A computer
program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E". [Witten and Frank 2000] state:
"things learn when they change their behavior in a way
that makes them perform better in the future".
Machine Learning can be viewed as a general inductive process
that automatically builds a model by learning the inherent
structure of a dataset depending on the characteristics of
the data instances. Over the past decade, Machine Learning
has evolved from a field of laboratory demonstrations to a
field of significant commercial value. Machine Learning
techniques have been very successful in the areas of Data
Mining, speech and voice recognition, text recognition,
face recognition etc.
A good model should be descriptive (describe the training
data), predictive (generalise well for unseen test data)
and explanatory (provide a plausible description).
Terminology
The following terms are widely
used in the field of Machine Learning:
- Instance: An instance is an example in the training
data. An instance is described by a number of
attributes. One attribute can be a class label.
- Attribute/Feature: An attribute is an aspect of an
instance (e.g. temperature, humidity). Attributes are
often called features in Machine Learning. A special
attribute is the class label that defines the class
this instance belongs to (required for supervised
learning).
- Classification: A classifier is able to
categorise/classify given instances (test data) with
respect to pre-determined/learned classification
rules.
- Training/Learning: A classifier learns the
classification rules based upon a given set of
instances (training data). Please note that some
algorithms have no training phase but do all the work
when classifying an instance (so-called lazy learning
algorithms).
- Clustering: The process of dividing an unlabelled
data set (no class information given) into clusters
that contain similar instances.
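To make these terms concrete, the following minimal Python
sketch (the attribute names are illustrative) shows how an
instance, its attributes and its class label might be
represented:

instance = {
    "outlook": "sunny",    # attribute/feature
    "temperature": 85,     # attribute/feature (numeric)
    "humidity": 85,        # attribute/feature (numeric)
    "windy": False,        # attribute/feature (boolean)
    "play": "no",          # class label (needed for supervised learning only)
}

# A training set is simply a collection of such instances.
training_data = [instance]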
Supervised vs. Unsupervised Learning
The
basic notion of supervised learning is that of the
classifier. A teacher helps to construct a classification
model by defining classes and providing positive and
negative examples of instances belonging to these classes.
The system then determines the commonalities and
differences between the various classes in order to
generate classification rules for unknown instances. The
resulting rules assign a class label to an instance based
on the values of the instance’s attributes.
In unsupervised learning there are no classes defined
a priori. The algorithm itself divides the instances into
different classes and finds descriptions for these classes.
This process is often also referred to as clustering. The
resulting rules will be a summary of some properties of the
instances in the database: which classes are present and
what differentiates them. However, this will only be what
the algorithm has found. There may be many other ways of
dividing the instances into classes and of describing each
class.
Semi-supervised learning is a combination of supervised and
unsupervised learning. In this approach a user-defined
amount of supervision is imposed on the algorithm. The
advantage is that only part of the data must be labelled
a priori, requiring less teaching effort, while the
classification accuracy is usually higher than for
purely unsupervised approaches.
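The supervised/unsupervised contrast can be illustrated with
a short sketch, here in Python using the scikit-learn
library (the data values and parameters are illustrative,
not part of the original example):

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[85, 85], [80, 90], [83, 86], [70, 96]]  # instances (temperature, humidity)
y = ["no", "no", "yes", "yes"]                # class labels from the "teacher"

# Supervised: the learner receives instances AND their class labels.
classifier = DecisionTreeClassifier().fit(X, y)
print(classifier.predict([[75, 80]]))

# Unsupervised: the learner receives the instances only and must
# divide them into clusters of similar instances on its own.
clusterer = KMeans(n_clusters=2, n_init=10).fit(X)
print(clusterer.labels_)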
Feature Selection
Feature selection involves
selecting a subset of the available feature set to be used
in learning and classification. The effectiveness of a
system depends on the number and the types of features. It
is important to minimise the number of features to increase
the performance of the learning and future classifications,
in terms of processing time and memory required. It is also
important to remove irrelevant features to increase the
classification accuracy, as some Machine Learning
algorithms cannot cope well with irrelevant features.
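As an illustration, scikit-learn offers simple filter-style
feature selection; the following sketch (using the bundled
iris data set as a stand-in) keeps only the two features
that share the most information with the class:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)      # 150 instances, 4 features, 3 classes

# Score each feature against the class and keep the best k of them.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support())          # boolean mask over the 4 features
print(X_reduced.shape)                 # (150, 2)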
Example
An example that is often used to
illustrate Machine Learning is to determine whether an
unspecified sport can be played based on the current
weather conditions. Consider the following data (taken from
Weka 3: Data Mining Software in Java), which includes the
outlook, the temperature, the humidity, whether it is windy
or not, and whether playing is possible (this is the
class attribute).
Table 1: Weather data

outlook  | temperature | humidity | windy | play
---------+-------------+----------+-------+-----
sunny    | 85          | 85       | false | no
sunny    | 80          | 90       | true  | no
overcast | 83          | 86       | false | yes
rainy    | 70          | 96       | false | yes
rainy    | 68          | 80       | false | yes
rainy    | 65          | 70       | true  | no
overcast | 64          | 65       | true  | yes
sunny    | 72          | 95       | false | no
sunny    | 69          | 70       | false | yes
rainy    | 75          | 80       | false | yes
sunny    | 75          | 70       | true  | yes
overcast | 72          | 90       | true  | yes
overcast | 81          | 75       | false | yes
rainy    | 71          | 91       | true  | no
The task for the Machine Learning algorithm is to find the
rules inherent in this data set. Applying the C4.5 tree
induction algorithm (named J48 in Weka), we get the
following decision tree. This decision tree is a perfect
classifier for our
training data in the table. No data instance is
misclassified and therefore the accuracy of the classifier
on the training data is 100%.
Figure 1: Decision tree for weather data
Each node of the tree represents an attribute/feature. The
branches below each node divide the attribute value space
using the test associated with each branch. The leaves of
the tree contain the class and the number of instances.
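A comparable tree can be induced outside of Weka as well.
The following Python sketch uses scikit-learn's
DecisionTreeClassifier with the entropy criterion; note that
scikit-learn implements CART rather than C4.5, so the
resulting tree may differ in detail from Figure 1, but it is
driven by the same information-gain idea and also classifies
the training data perfectly:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The weather data from Table 1.
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                "overcast", "rainy"],
    "temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "humidity":    [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy": [False, True, False, False, False, True, True, False, False,
              False, True, True, False, True],
    "play":  ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes",
              "yes", "yes", "yes", "yes", "no"],
})

# scikit-learn trees require numeric inputs, so one-hot encode 'outlook'.
X = pd.get_dummies(data.drop(columns="play"))
y = data["play"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
print(tree.score(X, y))   # 1.0, i.e. 100% accuracy on the training data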
Tree-based algorithms such as C4.5 already perform some
kind of feature selection. The order in which the
attributes appear in the tree (from top to bottom) depends
on their ability to divide the feature space most quickly
into sets of instances that can be accurately mapped to a
single class. C4.5 uses a metric called information gain to
determine the best sequence of attributes and the optimal
splits of the attribute value space. Attributes with low
information gain appear at the bottom of the tree or are
not part of the tree at all. For instance, the temperature
attribute is not present in the above tree because it does
not have a strong correlation with the class.
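As a concrete illustration, the information gain of the
outlook attribute can be computed by hand; the sketch below
follows the standard entropy-based definition used by C4.5:

from math import log2

def entropy(pos, neg):
    """Entropy (in bits) of a set with pos 'yes' and neg 'no' instances."""
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

# The whole data set contains 9 'yes' and 5 'no' instances.
base = entropy(9, 5)                          # ~0.940 bits

# Splitting on outlook: sunny (2 yes/3 no), overcast (4/0), rainy (3/2).
subsets = [(2, 3), (4, 0), (3, 2)]
remainder = sum((p + n) / 14 * entropy(p, n) for p, n in subsets)

print(base - remainder)                       # information gain, ~0.247 bits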
The decision tree can be easily translated into the
following set of classification rules:
if outlook=sunny and humidity<=75
then play=yes
if outlook=sunny and humidity>75 then play=no
if outlook=overcast then play=yes
if outlook=rainy and windy=true then play=no
if outlook=rainy and windy=false then play=yes
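Translated literally into code, these rules become a simple
classification function (a Python sketch; the attribute
names follow Table 1):

def play(outlook, humidity, windy):
    """Classify a weather instance using the rules induced above."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"

print(play("sunny", 85, False))   # -> "no" (the first instance in Table 1)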
References
The following references provide
good introductions to the topic of Machine Learning:
- Ian H. Witten, Eibe Frank, "Data Mining: Practical
Machine Learning Tools and Techniques (Second Edition)",
Morgan Kaufmann, June 2005.
- Tom M. Mitchell, "Machine Learning", McGraw-Hill
Education (ISE Editions), December 1997.
- Nils J. Nilsson, "Introduction to Machine Learning", 1996.