DSTC - Machine Learning
Introduction
Machine Learning (ML) usually refers to systems performing tasks
associated with Artificial Intelligence (AI). Such tasks involve
recognition, diagnosis, planning, prediction etc. [Mitchell 1997]
defines Machine Learning as follows: "A computer program is said to
learn from experiment E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E". [Witten and Frank 2000]
state: "things learn when they change their behavior in a way that
makes them perform better in the future".
Machine Learning can be viewed as general inductive process that
automatically builds a model by learning the inherent structure of a
dataset depending on the characteristics of the data instances. Over
the past decade, Machine Learning has evolved from a field of
laboratory demonstrations to a field of significant commercial value.
Machine Learning techniques have been very successful in the areas of
Data Mining, speech and voice recognition, text recognition, face
recognition etc.
A good model should be descriptive (describe the training data),
predictive (generalise well for unseen test data) and explanatory
(provide a plausible description).
Terminology
The following terms are widely used in the field of Machine Learning:
-
Instance: An instance is an example in the
training data. An instance is described by a number of attributes. One
attribute can be a class label.
-
Attribute/Feature: An attribute is an aspect
of an instance (e.g. temperature, humidity). Attributes are often
called features in Machine Learning. A special attribute is the class
label that defines the class this instance belongs to (required for
supervised learning).
-
Classification: A classifier is able to
categorise/classify given instances (test data) with respect to
pre-determined/learned classification rules.
-
Training/Learning: A classifier learns the
classification rules based upon a given set of instances (training
data). Please not that some algorithms have no training phase but do
all the work when classifying an instance (so called lazy learning
algorithms).
-
Clustering: The process of dividing an
unlabeled data set (no class information given) into clusters that
contain similar instances.
Supervised vs. Unsupervised Learning
The basic notion of supervised learning is that of the classifier. A
teacher helps to construct a classification model by defining classes
and providing positive and negative examples of instances belonging to
these classes. The system then determines the commonalities and
differences between the various classes in order to generate
classification rules for unknown instances. The resulting rules assign
a class label to an instance based on the values of the
instance’s attributes.
In unsupervised learning there are no classes defined a-priori. The
algorithm itself divides the instances into different classes and finds
descriptions for these classes. This process is often also referred to
as clustering. The resulting rules will be a summary of some properties
of the instances in the database: which classes are present and what
differentiates them. However, this will only be what the algorithm has
found. There may be many other ways of dividing the instances into
classes and of describing each class.
Semi-supervised learning is a combination of supervised and
unsupervised learning. In this approach a user defined amount of
supervision is imposed on the algorithm. The advantage is that only
part of the data must have been classified a-priori, requiring less
teaching effort, but the classification accuracy is usually higher than
for purely unsupervised approaches.
Feature Selection
Feature selection involves selecting a subset of the available feature
set to be used in learning and classification. The effectiveness of a
system depends on the number and the types of features. It is important
to minimise the number features to increase the performance of the
learning and future classifications, in terms of processing time and
memory required. It is also important to remove irrelevant features to
increase the classification accuracy, as some Machine Learning
algorithms cannot cope well with irrelevant features.
Example
An example that is often used to illustrate
Machine Learning is to determine whether an unspecified sport can be
played based on the current weather condition. Consider the following
data (taken from Weka
3: Data Mining Software in Java) that includes the outlook,
the temperature, the humidity, whether it is windy or not and the
information whether playing is possible (this is the class attribute).
|
sunny
|
85
|
85
|
false
|
no
|
|
sunny
|
80
|
90
|
true
|
no
|
|
overcast
|
83
|
86
|
false
|
yes
|
|
rainy
|
70
|
96
|
false
|
yes
|
|
rainy
|
68
|
80
|
false
|
yes
|
|
rainy
|
65
|
70
|
true
|
no
|
|
overcast
|
64
|
65
|
true
|
yes
|
|
sunny
|
72
|
95
|
false
|
no
|
|
sunny
|
69
|
70
|
false
|
yes
|
|
rainy
|
75
|
80
|
false
|
yes
|
|
sunny
|
75
|
70
|
true
|
yes
|
|
overcast
|
72
|
90
|
true
|
yes
|
|
overcast
|
81
|
75
|
false
|
yes
|
|
rainy
|
71
|
91
|
true
|
no
|
The task for the Machine Learning algorithm is to find the rules
inherent in this data set. Applying the C4.5
(named J48 in Weka)
tree induction algorithm we get the following decision tree. This
decision tree is a perfect classifier for our training data in the
table. No data instance is misclassified and therefore the accuracy of
the classifier on the training data is 100%.

Each node of the tree resembles an attribute/feature. The branches
below each node divide the attribute value space using the test
associated with each branch. The leaves of the tree contain the class
and the number of instances. Tree-based algorithms such as C4.5 already
perform some kind of feature selection. The order in which the
attributes appear in the tree (from top to bottom) depends on their
ability divide to the feature space most quickly into sets of instances
that can be accurately mapped to a single class. C4.5 uses a metric
called information gain to determine the best sequence of attributes
and the optimal splits of the attribute value space. Attributes with
low information gain appear in the bottom of the tree or are not part
of the tree at all. For instance, the temperature attribute is not
present in the above tree because it does not have a strong correlation
with the class.
The decision tree can be easily translated into the following set of
classification rules:
if outlook=sunny and
humidity<=75 then play=yes
if outlook=sunny and humidity>75 then play=no
if outlook=overcast then play=yes
if outlook=rainy and windy=true then play=no
if outlook=rainy and windy=false then play=yes
References
The following references provide nice introductions into the topic of
Machine Learning:
- Ian H. Witten, Eibe
Frank, "Data
Mining: Practical Machine Learning Tools and Techniques (Second
Edition)", Morgan Kaufmann, June 2005.
- Tom M. Mitchell,
“Machine
Learning”, McGraw-Hill Education (ISE Editions), December
1997.
- Nils J. Nilsson, "Introduction
to Machine Learning", 1996