netAI - Introduction
Motivation
There is a growing need for accurate and timely classification of
network traffic flows. The ability to dynamically identify and
classify flows according to their network applications is highly
beneficial for:
- Trend analyses (estimating the size and origins of capacity
demand trends for network planning)
- Adaptive, network-based marking of traffic requiring specific
QoS without direct client-application or end-host involvement
- Dynamic access control (detect forbidden applications, Denial
of Service (DoS) attacks or other unwanted traffic)
- Lawful Interception (enabling minimally invasive warrants and
wire-taps based on statistical summaries of traffic details)
- Intrusion detection (detect suspicious activities related to
security breaches)
Current popular methods of classifying network applications include
TCP/UDP port-based identification, and payload-based
identification. The latter can be further divided into protocol
decoding and signature-based identification. With protocol decoding,
the classifier actually decodes the application protocol, whereas
signature-based methods search for application-specific byte
sequences in the packet payload.
Port-based classification systems are moderately accurate at best
and are likely to become less effective as more applications move
away from fixed default ports. For
example, many users of peer-to-peer file-sharing applications
deliberately choose to run their applications on non-default port
numbers to avoid detection [1]. Payload-based classification relies
on specific application data, making it difficult to detect a wide
range of applications or stay up to date with new applications. In
addition, the process of creating rules for signature-based
classification must often be done by hand, which can be very time
consuming.
Machine learning (ML) techniques [2] provide a promising
alternative by classifying flows based on statistical features that
are independent of the application protocol (payload). The features
used in this study are flow characteristics such as packet length and
inter-arrival times. This approach does not require packet payload,
and the classifier can be trained automatically, provided a
representative training dataset can be obtained (see [3]).
Current Classification Techniques
Port Numbers
The oldest and still most common technique is based on the
inspection of known port numbers. While some applications use
symmetric ports (all communicating peers use the same port number),
many client-server applications such as the web are port asymmetric
(only the server is using the well-known port whereas clients use
dynamic ports). We therefore take the port that identifies an
application to be either the shared port number or the server port. The
Internet Assigned Numbers Authority (IANA) [4] assigns the
well-known ports from 0-1023 and registers port numbers in the
range from 1024-49151.
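A minimal sketch of port-based identification follows. It is an
illustration under stated assumptions, not the method of any particular
tool: the table WELL_KNOWN_PORTS is a tiny hand-picked excerpt, the
function names are hypothetical, and trying the lower port first is
only a heuristic for finding the server port of a port-asymmetric
application.

```python
# Illustrative excerpt only; a real system would load the full IANA registry.
WELL_KNOWN_PORTS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("tcp", 53): "dns",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def port_range(port):
    """Return the name of the IANA range a port number falls into."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


def classify_by_port(proto, port_a, port_b):
    """Guess a flow's application from its two port numbers.

    The lower port is tried first (heuristic: the server usually holds
    the lower, well-known port). Returns None if neither port is known.
    """
    for port in sorted((port_a, port_b)):
        app = WELL_KNOWN_PORTS.get((proto, port))
        if app is not None:
            return app
    return None
```

For example, classify_by_port("tcp", 51234, 80) yields "http", while a
flow between two unlisted ports yields None and stays unclassified.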
However, many applications have no IANA-assigned or registered ports
and simply use default ports that are ‘well known’ only by convention.
Often these ports overlap with IANA ports and an unambiguous
identification is no longer possible. A port database [5] that
lists not only the IANA ports but also ports reported by users for
different applications shows that many applications have
overlapping ports in the IANA registered port range. As more and
more applications emerge, this overlap will increase since the port
number range is not likely to increase.
Even applications with well-known or registered ports can end up
using different port numbers when users attempt to hide their
existence or bypass port-based filters, or when multiple servers
are sharing a single IP address (host). Furthermore, some
applications (e.g. passive FTP or video/voice communication) use
dynamic ports that cannot be known in advance.
Protocol Decoding
Protocol decoding involves stateful reconstruction of session and application
information from packet content (e.g. [6]). This technique avoids
reliance on fixed port numbers and provides very accurate and
reliable application identification, but imposes significant
complexity and processing load on the traffic identification
device. It must be kept up-to-date with extensive knowledge of
application semantics and network-level syntax, and must be
powerful enough to perform concurrent analysis of a potentially
large number of flows.
This approach can be difficult or impossible when dealing with
proprietary protocols or encrypted traffic. Another problem is that
direct analysis of session and application layer content may
represent an explicit breach of organisational privacy policies or
violation of relevant privacy legislation.
This method currently provides the highest accuracy and
reliability for mapping network traffic to the corresponding
applications. Its major problems are the processing performance
required (especially given ever-increasing network bandwidths) and
the effort required to implement the protocol definitions and keep
them up to date. In our opinion this method is only feasible for a
few applications, where the incentives to provide reliable
classification are very high.
Signature-based Approaches
To overcome the inefficiencies of protocol decoding some
researchers have proposed to use signature-based methods. These
methods search for specific application-characteristic patterns in
the payload of the packets. The advantage of this approach is that
it is more effective than pure port-based classification and more
efficient than protocol decoding.
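The pattern search described above can be sketched as follows. This is
a simplified illustration, not a production rule set: the SIGNATURES
table and the function name are hypothetical, and the three patterns
(the SSH identification string prefix, the BitTorrent handshake header
and the HTTP version token) are chosen only because they are widely
documented.

```python
# Hypothetical signature table:
# (application, byte pattern, pattern must appear at start of payload)
SIGNATURES = [
    ("ssh", b"SSH-", True),
    ("bittorrent", b"\x13BitTorrent protocol", True),
    ("http", b"HTTP/1.", False),
]


def classify_by_signature(payload):
    """Return the first application whose signature matches, else None."""
    for app, pattern, anchored in SIGNATURES:
        if anchored:
            if payload.startswith(pattern):
                return app
        elif pattern in payload:
            return app
    return None
```

Unlike protocol decoding, no session state is kept: each payload is
checked against fixed byte patterns, which is cheaper but easier to
evade.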
On the other hand, signature-based methods are less accurate than
protocol decoding. They are also still protocol dependent, as
signatures are application-specific and must be developed from a
protocol specification or through reverse engineering. Developing
signatures for some binary protocols without a public specification
has proved difficult. There are also a number of ways to defeat
simple signature-based detection (e.g. [7]).
Overall, signature-based methods provide a very good trade-off
between resource efficiency and classification performance.
Machine Learning Approach
Figure 1 and Figure 2 visualize a machine learning based
classification architecture. Training input data can be taken from
previously captured traffic traces (or possibly from live
capturing). Packets are then grouped into flows based on IP
addresses, TCP or UDP ports and protocol, and the flow
characteristics (features) are computed. The flow data used for
training each class must be representative of the particular
network application. For supervised learning algorithms the flow
data needs to be labelled with class labels corresponding to the
network applications prior to training. For large data traces it is
necessary to limit the number of flows passed to the learning
algorithm by sampling flows before training.
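The flow grouping and feature computation step can be sketched as
below. This is a minimal illustration under assumptions: the packet
tuple layout, the function names and the choice of features (packet
count, mean packet length, mean inter-arrival time) are hypothetical
and are not the netAI feature set. Both directions of a conversation
are mapped to the same flow.

```python
from collections import defaultdict
from statistics import mean


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Build a direction-independent key so both directions share a flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)


def flow_features(packets):
    """Group packets into flows and compute per-flow features.

    Each packet is a tuple:
    (timestamp, src_ip, src_port, dst_ip, dst_port, proto, length)
    """
    flows = defaultdict(list)
    for ts, src_ip, src_port, dst_ip, dst_port, proto, length in packets:
        key = flow_key(src_ip, src_port, dst_ip, dst_port, proto)
        flows[key].append((ts, length))

    features = {}
    for key, pkts in flows.items():
        pkts.sort()  # order packets by timestamp
        times = [ts for ts, _ in pkts]
        lengths = [length for _, length in pkts]
        gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
        features[key] = {
            "packets": len(pkts),
            "mean_length": mean(lengths),
            # A single-packet flow has no inter-arrival time; use 0.0.
            "mean_iat": mean(gaps) if gaps else 0.0,
        }
    return features
```

The resulting feature dictionaries can then be labelled and, for large
traces, sampled before being passed to a learning algorithm.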
The flow characteristics and a set of algorithm parameters are
then used to build a classification model (see Figure 1). The
algorithm parameters range from very simple to very complex and
depend on the ML algorithm used. For some algorithms no parameters
may be needed. Once the classifier has been trained, new flows can
be classified based on their statistical attributes (see Figure 2).
New flows are taken from live network capture or from trace files.
Again, sampling can be used to classify only a fraction of the
overall flow data, for example if the classifier's processing speed
is insufficient. The results of the classification process can be
used to map network traffic to different QoS classes, or for other
tasks such as trend analysis.
Figure 1: Machine Learning Approach - Training
Figure 2: Machine Learning Approach - Classification
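The two phases shown in the figures can be sketched with a very simple
supervised learner. The nearest-centroid classifier below is chosen
only for brevity and is not the algorithm used by netAI. The function
names, the max_flows sampling parameter and the two-element feature
vectors (mean packet length, mean inter-arrival time) are assumptions
made for illustration.

```python
import math
import random
from collections import defaultdict


def train(labelled_flows, max_flows=None, seed=0):
    """Build a nearest-centroid model from (feature_vector, label) pairs.

    If max_flows is given and exceeded, a random sample of that size is
    used, mirroring the sampling step before training.
    """
    flows = list(labelled_flows)
    if max_flows is not None and len(flows) > max_flows:
        flows = random.Random(seed).sample(flows, max_flows)

    by_class = defaultdict(list)
    for vector, label in flows:
        by_class[label].append(vector)

    # The model is one centroid (per-feature mean) per application class.
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in by_class.items()
    }


def classify(model, vector):
    """Return the class whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], vector))
```

In practice the features would need to be normalised first, since here
packet length (in bytes) dominates the distance over inter-arrival
time (in seconds).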
Related Work
This section briefly describes some previously published related
work from the network research area. The list is probably
incomplete and we are happy to include more related work if readers
send us emails with citation(s).
There have been several proposals for the use of ML or statistical
clustering techniques to separate network applications based on
traffic statistics. In [8] the authors use nearest neighbour (NN)
and linear discriminant analysis (LDA) to map different
applications to different QoS classes. The Expectation Maximization
(EM) algorithm was used in [9] to cluster flows into different
application types. The authors of [10] have used correlation-based
feature selection and a Naive Bayes classifier to differentiate
between different application types. The authors of [11] use
principal component analysis (PCA) and density estimation to
classify traffic into different applications. We have proposed an
approach for identifying different network applications based on
greedy forward feature search and EM in [3]. The authors of [12]
have developed a method that characterises host behaviour on
different levels to classify traffic into different application
types.
References
[1] Thomas Karagiannis, Andre Broido, Nevil Brownlee, kc claffy,
“Is P2P dying or just hiding?”, in Proceedings of
Globecom 2004, November/December 2004.
[2] Tom M. Mitchell, “Machine Learning”, McGraw-Hill
Education (ISE Editions), December 1997.
[3] S. Zander, T.T.T. Nguyen, G. Armitage, “Automated
Traffic Classification and Application Identification using Machine
Learning”, IEEE 30th Conference on Local Computer Networks
(LCN 2005), Sydney, Australia, 15-17 November 2005.
[4] IANA Port Numbers, http://www.iana.org/assignments/port-numbers
(as of January 2006).
[5] Ports database, http://www.portsdb.org/ (as of
January 2006).
[6] Cisco IOS Documentation, “Network-Based Application
Recognition and Distributed Network-Based Application
Recognition”,
http://www.cisco.com/univercd/cc/td/doc/product/software/ios122/122newft/122t/122t8/dtnbarad.htm
(as of January 2006).
[7] Put to the test: http://www.networkworld.com/news/2002/0415idsevad.html?net
(as of January 2006).
[8] M. Roughan, S. Sen, O. Spatscheck, N. Duffield,
“Class-of-Service Mapping for QoS: A statistical
signature-based approach to IP traffic classification”, ACM
SIGCOMM Internet Measurement Workshop, Sicily, Italy, 2004.
[9] A. McGregor, M. Hall, P. Lorier, J. Brunskill, “Flow
Clustering Using Machine Learning Techniques”, Passive &
Active Measurement Workshop 2004 (PAM 2004), France, April 19-20,
2004.
[10] A. W. Moore and D. Zuev, “Internet Traffic Classification
Using Bayesian Analysis Techniques”, ACM SIGMETRICS, Banff,
Canada, June 2005.
[11] T. Dunnigan, G. Ostrouchov, “Flow Characterization for
Intrusion Detection”, Oak Ridge National Laboratory,
Technical Report, http://www.csm.ornl.gov/~ost/id/tm.ps,
November 2000.
[12] T. Karagiannis, K. Papagiannaki, and M. Faloutsos,
“BLINC: Multilevel Traffic Classification in the Dark”,
ACM Sigcomm, Philadelphia, PA, August 2005.