netAI - Introduction

Motivation

There is a growing need for accurate and timely classification of network traffic flows. The ability to dynamically identify and classify flows according to their network applications is highly beneficial for:

Trend analyses (estimating the size and origins of capacity demand trends for network planning)
Adaptive, network-based marking of traffic requiring specific QoS without direct client-application or end-host involvement
Dynamic access control (detect forbidden applications, Denial of Service (DoS) attacks or other unwanted traffic)
Lawful Interception (enabling minimally invasive warrants and wire-taps based on statistical summaries of traffic details)
Intrusion detection (detect suspicious activities related to security breaches)

Popular methods currently used to classify network applications include TCP/UDP port-based identification and payload-based identification. The latter can be further divided into protocol decoding and signature-based identification. With protocol decoding, the classifier decodes the application protocol itself, whereas signature-based methods search for application-specific byte sequences in the packet payload.

Port-based classification systems are moderately accurate at best and will become less effective in the near future. For example, many users of peer-to-peer file-sharing applications deliberately choose to run their applications on non-default port numbers to avoid detection [1]. Payload-based classification relies on specific application data, making it difficult to detect a wide range of applications or stay up to date with new applications. In addition, the process of creating rules for signature-based classification must often be done by hand, which can be very time consuming.

Machine learning (ML) techniques [2] provide a promising alternative by classifying flows based on statistical features that are independent of the application protocol (payload). The features used in this study are flow characteristics such as packet length and inter-arrival times. This approach does not require the packet payload, and the classifier can be trained automatically, assuming a representative training dataset can be obtained (see [3]).

Current Classification Techniques

Port Numbers

The oldest and still most common technique is based on the inspection of known port numbers. While some applications use symmetric ports (all communicating peers use the same port number), many client-server applications such as the web are port asymmetric (only the server uses the well-known port, whereas clients use dynamic ports). We therefore use the term port number, or server port in the asymmetric case, for the port that identifies an application. The Internet Assigned Numbers Authority (IANA) [4] assigns the well-known ports from 0-1023 and registers port numbers in the range from 1024-49151.

However, many applications have no IANA-assigned or registered ports and only use ‘well known’ default ports. Often these ports overlap with IANA ports, so that unambiguous identification is no longer possible. A port database [5] that lists not only the IANA ports but also ports reported by users for different applications shows that many applications have overlapping ports in the IANA registered port range. As more and more applications emerge, this overlap will increase, since the port number range is not likely to increase.

Even applications with well-known or registered ports can end up using different port numbers when users attempt to hide their existence or bypass port-based filters, or when multiple servers share a single IP address (host). Furthermore, some applications (e.g. passive FTP or video/voice communication) use dynamic ports that cannot be known in advance.
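The port-based approach described above can be sketched in a few lines of Python. This is only an illustration: the table WELL_KNOWN_APPS is a hypothetical, deliberately tiny excerpt rather than the IANA registry, and the function names are our own. The range boundaries follow the IANA ranges given above, with everything above 49151 treated as dynamic.

```python
# Hypothetical, deliberately tiny port table for illustration only;
# a real classifier would load the full IANA registry [4].
WELL_KNOWN_APPS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("tcp", 53): "dns",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def port_range(port):
    """Return the name of the IANA range a port number falls into."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port number: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


def classify_by_port(proto, src_port, dst_port):
    """Guess the application from whichever endpoint uses a listed port.

    Both ports are checked because, for port-asymmetric applications,
    only the server side uses the identifying port.
    """
    for port in (dst_port, src_port):
        app = WELL_KNOWN_APPS.get((proto, port))
        if app is not None:
            return app
    return "unknown"


print(classify_by_port("tcp", 51515, 80))    # server port 80 is listed
print(classify_by_port("tcp", 51515, 6881))  # no listed port on either side
```

The second call shows the weakness discussed above: an application running on an unlisted or non-default port is simply reported as unknown.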

Protocol Decoding

Protocol decoding involves stateful reconstruction of session and application information from packet content (e.g. [6]). This technique avoids reliance on fixed port numbers and provides very accurate and reliable application identification, but imposes significant complexity and processing load on the traffic identification device. The device must be kept up to date with extensive knowledge of application semantics and network-level syntax, and must be powerful enough to perform concurrent analysis of a potentially large number of flows.

This approach can be difficult or impossible when dealing with proprietary protocols or encrypted traffic. Another problem is that direct analysis of session and application layer content may represent an explicit breach of organisational privacy policies or violation of relevant privacy legislation.

This method currently provides the highest accuracy and reliability for mapping network traffic to the corresponding applications. Its major problems are the processing performance required (especially considering ever-increasing network bandwidths) and the effort required to implement the protocol definitions and keep them up to date. In our opinion this method seems feasible only for a few applications, where the incentives to provide reliable classification are very high.

Signature-based Approaches

To overcome the inefficiencies of protocol decoding, some researchers have proposed using signature-based methods. These methods search for specific application-characteristic patterns in the payload of the packets. The advantage of this approach is that it is more effective than pure port-based classification and more efficient than protocol decoding.

On the other hand, signature-based methods are less accurate than protocol decoding. They are also still protocol dependent, as signatures are application-specific and must be developed from a protocol specification or through reverse engineering. In the past, developing signatures for some unspecified binary protocols has proved difficult. There are also a number of ways to defeat simple signature-based detection (e.g. [7]). Overall, signature-based methods provide a very good trade-off between resource efficiency and classification performance.
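As a rough illustration of signature matching, the following Python sketch searches the first bytes of a payload for application-specific byte sequences. The SIGNATURES table, the max_bytes limit and the function name are assumptions made for this example. The patterns themselves are simplified: the BitTorrent handshake prefix, the SSH version banner and the HTTP version token are well-known protocol markers, but production signature sets are considerably more elaborate.

```python
# Each entry: (application, byte pattern, anchored at payload start?).
# Simplified, illustrative signatures; not a complete or robust rule set.
SIGNATURES = [
    ("bittorrent", b"\x13BitTorrent protocol", True),
    ("ssh", b"SSH-", True),
    ("http", b"HTTP/1.", True),    # response status line
    ("http", b" HTTP/1.", False),  # request line, e.g. "GET / HTTP/1.1"
]


def classify_by_signature(payload, max_bytes=64):
    """Return the first application whose signature matches the payload.

    Only the first max_bytes bytes are inspected, which keeps the cost
    per packet bounded (an assumption of this sketch).
    """
    head = payload[:max_bytes]
    for app, pattern, anchored in SIGNATURES:
        if anchored:
            if head.startswith(pattern):
                return app
        elif pattern in head:
            return app
    return "unknown"


print(classify_by_signature(b"SSH-2.0-OpenSSH_9.0\r\n"))
print(classify_by_signature(b"GET /index.html HTTP/1.1\r\nHost: example\r\n"))
```

Because the match depends entirely on payload bytes, encrypting or obfuscating the payload defeats this kind of rule, which is one of the limitations noted above.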

Machine Learning Approach

Figure 1 and Figure 2 visualise a machine learning based classification architecture. Training input data can be taken from previously captured traffic traces (or possibly from live capturing). Packets are then grouped into flows based on IP addresses, TCP or UDP ports and protocol, and the flow characteristics (features) are computed. The flow data used for training each class must be representative of the particular network application. For supervised learning algorithms, the flow data needs to be labelled with class labels corresponding to the network applications prior to training. For large data traces it is necessary to limit the number of flows passed to the learning algorithm by sampling flows before training.
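The grouping and feature computation step can be sketched as follows. This is a minimal illustration, not the netAI implementation: the packet record layout (a dict with ts, src_ip, src_port, dst_ip, dst_port, proto and length) is hypothetical, and treating both directions of a conversation as one flow is an assumption of this example. The two features computed are the ones named earlier, packet length and inter-arrival time.

```python
from collections import defaultdict
from statistics import mean


def flow_key(pkt):
    """Build a direction-independent flow key from a packet record.

    Sorting the two endpoints means packets in both directions map to
    the same flow (an assumption of this sketch).
    """
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    lo, hi = sorted((a, b))
    return (pkt["proto"], lo, hi)


def flow_features(packets):
    """Group packets into flows and compute simple per-flow features."""
    flows = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        flows[flow_key(pkt)].append(pkt)

    features = {}
    for key, pkts in flows.items():
        lengths = [p["length"] for p in pkts]
        # Inter-arrival times between consecutive packets of the flow.
        gaps = [later["ts"] - earlier["ts"]
                for earlier, later in zip(pkts, pkts[1:])]
        features[key] = {
            "packets": len(pkts),
            "mean_length": mean(lengths),
            "mean_iat": mean(gaps) if gaps else 0.0,
        }
    return features
```

For supervised learning, each resulting feature record would additionally be paired with a class label before being passed to the learning algorithm.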

The flow characteristics and a set of algorithm parameters are then used to build a classification model (see Figure 1). The algorithm parameters range from very simple to very complex and depend on the ML algorithm used. For some algorithms no parameters may be needed. Once the classifier has been trained, new flows can be classified based on their statistical attributes (see Figure 2). New flows are taken from live network capture or from trace files. Again, sampling can be used to classify only a fraction of the overall flow data, for example if the classification performance is insufficient. The results of the classification process can be used to map network traffic to different QoS classes, or for other tasks such as trend analysis.
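The two phases, training and classification, can be illustrated with a small supervised learner. The sketch below uses a Gaussian Naive Bayes classifier written in plain Python; the choice of algorithm is ours and made purely for illustration, since the architecture above does not prescribe a particular ML algorithm. The feature vectors (mean packet length in bytes, mean inter-arrival time in seconds), the class labels "bulk" and "interactive", and the min_var parameter are all hypothetical.

```python
import math
from collections import defaultdict


def train(samples, min_var=1e-6):
    """Build a Gaussian Naive Bayes model from (features, label) pairs.

    min_var is the only algorithm parameter here; it keeps variances
    away from zero for features that are constant within a class.
    """
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)

    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):  # one column per feature
            mu = sum(column) / len(column)
            var = sum((x - mu) ** 2 for x in column) / len(column)
            stats.append((mu, max(var, min_var)))
        log_prior = math.log(len(rows) / len(samples))
        model[label] = (log_prior, stats)
    return model


def classify(model, features):
    """Return the label with the highest log posterior for one flow."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, stats) in model.items():
        score = log_prior
        for x, (mu, var) in zip(features, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (x - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training flows: (mean length, mean inter-arrival).
training = [
    ((1400, 0.010), "bulk"), ((1350, 0.020), "bulk"), ((1450, 0.015), "bulk"),
    ((80, 0.5), "interactive"), ((120, 0.8), "interactive"),
    ((100, 0.6), "interactive"),
]
model = train(training)
print(classify(model, (1300, 0.03)))
print(classify(model, (90, 0.7)))
```

With this toy data the first flow is assigned to "bulk" and the second to "interactive". In practice, the quality of the result depends on how representative the labelled training flows are, as noted above.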

Figure 1: Machine Learning Approach - Training

Figure 2: Machine Learning Approach - Classification

Related Work

This section briefly describes some previously published related work from the network research area. The list is probably incomplete and we are happy to include more related work if readers send us emails with citation(s).

There have been several proposals for the use of ML or statistical clustering techniques to separate network applications based on traffic statistics. In [8] the authors use nearest neighbour (NN) and linear discriminant analysis (LDA) to map different applications to different QoS classes. The Expectation Maximization (EM) algorithm was used in [9] to cluster flows into different application types. The authors of [10] have used correlation-based feature selection and a Naive Bayes classifier to differentiate between different application types. The authors of [11] use principal component analysis (PCA) and density estimation to classify traffic into different applications. We have proposed an approach for identifying different network applications based on greedy forward feature search and EM in [3]. The authors of [12] have developed a method that characterises host behaviour on different levels to classify traffic into different application types.

References

[1] Thomas Karagiannis, Andre Broido, Nevil Brownlee, kc claffy, “Is P2P dying or just hiding?”, In Proceedings of Globecom 2004, November/December 2004.
[2] Tom M. Mitchell, “Machine Learning”, McGraw-Hill Education (ISE Editions), December 1997.
[3] S. Zander, T.T.T. Nguyen, G. Armitage, “Automated Traffic Classification and Application Identification using Machine Learning”, IEEE 30th Conference on Local Computer Networks (LCN 2005), Sydney, Australia, 15-17 November 2005.
[4] IANA Port Numbers, http://www.iana.org/assignments/port-numbers (as of January 2006).
[5] Ports database, http://www.portsdb.org/ (as of January 2006).
[6] Cisco IOS Documentation, “Network-Based Application Recognition and Distributed Network-Based Application Recognition”, http://www.cisco.com/univercd/cc/td/doc/product/software/ios122/122newft/122t/122t8/dtnbarad.htm (as of January 2006).
[7] Put to the test: http://www.networkworld.com/news/2002/0415idsevad.html?net (as of January 2006).
[8] M. Roughan, S. Sen, O. Spatscheck, N. Duffield, “Class-of-Service Mapping for QoS: A statistical signature-based approach to IP traffic classification”, ACM SIGCOMM Internet Measurement Workshop, Sicily, Italy, 2004.
[9] A. McGregor, M. Hall, P. Lorier, J. Brunskill, “Flow Clustering Using Machine Learning Techniques”, Passive & Active Measurement Workshop 2004 (PAM 2004), France, April 19-20, 2004.
[10] A. W. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques”, ACM SIGMETRICS, Banff, Canada, June 2005.
[11] T. Dunnigan, G. Ostrouchov, “Flow Characterization for Intrusion Detection”, Oak Ridge National Laboratory, Technical Report, http://www.csm.ornl.gov/~ost/id/tm.ps, November 2000.
[12] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel Traffic Classification in the Dark”, ACM SIGCOMM, Philadelphia, PA, August 2005.
