Anonymisation Features
One of the key difficulties in capturing and analysing network data is
that of privacy. This leads to a Catch-22 situation - in order to capture
unbiased data, we need the participants involved in generating that data
to behave as they normally would. However, if the data generation
participants know that their online activity is being monitored, this could
affect their typical behaviour. While it is not proper to capture raw network
traffic traces without informing the users, it is possible to overcome privacy
concerns through anonymisation of captured data.
By randomising the values of user-identifiable information, we can protect the
privacy of network users. While we are throwing away some data, it is worth
remembering that we usually care more about the type of Internet services
accessed, their frequency and the times of day than we do about the actual
server accessed or the actual content downloaded (the fact that an image of
100kB was downloaded is more important than knowing that the image was called
"my_dodgy_image.jpg").
Netsniff provides a real-time anonymisation facility. As packets are read and
parsed for protocol information, netsniff can anonymise any identifying
information, thus protecting the identity of users and their online activity.
Further, this anonymisation is consistent, meaning we can still correlate data
sets (such as matching a DNS lookup with later HTTP transactions) or determine
the number of emails sent to a particular (anonymised) email address.
How To Invoke Anonymisation?
Anonymisation is controlled by three command line arguments (-a, -m and -k).
The most important of these is the -a argument, which actually enables
anonymisation of collected data; the remaining two flags tell netsniff how to
anonymise its captured information. The -k option specifies a file from which
to read a key to pre-initialise the anonymisation algorithms, while the -m
option specifies which of three algorithms (cryptopan, nullip or tcpdpriv)
will be used to anonymise IP addresses.
IP Address Anonymisation Modes
Netsniff offers a number of different anonymisation modes which can be
specified on the command line. Each option causes anonymisation to be performed
in a slightly differing manner.
CryptoPan Anonymisation
The CryptoPan library is used with either a specified key or a default key
of "0" to anonymise IP addresses.
Advantages:
- IP addresses are anonymised consistently across separate execution
runs of the netsniff application.

Disadvantages:
- If the key used is available, it is possible to reverse the IP
address anonymisation and retrieve the original IP addresses, thus
destroying the desired privacy.
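To illustrate what a keyed, prefix-preserving mapping means in practice, here is a minimal Python sketch. It is not the CryptoPan library and will not reproduce its output: the real scheme is built on a block cipher, whereas this sketch substitutes HMAC-SHA256 from the Python standard library as the keyed function. All function names are hypothetical. It shows both properties listed above: the mapping is repeatable for a fixed key, and anyone holding the key can reverse it.

```python
import hashlib
import hmac
import ipaddress


def _pad_bit(key: bytes, prefix: int, length: int) -> int:
    """Keyed pseudo-random bit derived from the first `length` address bits."""
    msg = bytes([length]) + prefix.to_bytes(4, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[0] & 1


def anonymise_ip(addr: str, key: bytes) -> str:
    """Flip each bit according to a keyed function of the bits before it."""
    a = int(ipaddress.IPv4Address(addr))
    out = 0
    for i in range(32):
        prefix = a >> (32 - i)          # the i leading bits (0 when i == 0)
        bit = (a >> (31 - i)) & 1       # bit i, counted from the most significant
        out = (out << 1) | (bit ^ _pad_bit(key, prefix, i))
    return str(ipaddress.IPv4Address(out))


def deanonymise_ip(anon: str, key: bytes) -> str:
    """Reverse the mapping bit by bit; only possible with the key."""
    b = int(ipaddress.IPv4Address(anon))
    a = 0
    for i in range(32):
        # `a` already holds the i recovered leading bits of the original.
        bit = ((b >> (31 - i)) & 1) ^ _pad_bit(key, a, i)
        a = (a << 1) | bit
    return str(ipaddress.IPv4Address(a))


def shared_prefix_len(p: str, q: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(p)) ^ int(ipaddress.IPv4Address(q))
    return 32 - diff.bit_length()


key = b"0"  # mirrors the default key mentioned above; a real key should be secret
x = anonymise_ip("192.168.1.10", key)
y = anonymise_ip("192.168.1.77", key)
print(x, y, shared_prefix_len(x, y))
print(deanonymise_ip(x, key))
```

Because each output bit depends only on the key and the preceding input bits, two addresses that share a leading prefix map to outputs that share a prefix of the same length.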
NullIP Anonymisation
All IP addresses are replaced with the IP Address 0.0.0.0.
Advantages:
- Pure anonymisation is achieved: since all IP addresses will map
to the same IP address, there is no way to reverse the process
and retrieve the original information.

Disadvantages:
- It is impossible to determine the number of different servers or
IP devices accessed.
- It is impossible to differentiate IP addresses on one side of the
device running netsniff from those on the other side (client vs. Internet).
- Since all IP addresses will map to the same IP address, there is
no longer the ability to correlate IP addresses across different
applications or data sessions.
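The trade-off can be seen in a few lines of Python. This is only an illustration of the behaviour described above, with a hypothetical function name and made-up example addresses; it is not netsniff's code.

```python
def nullip_anonymise(addr: str) -> str:
    """Replace any IP address with 0.0.0.0, discarding it entirely."""
    return "0.0.0.0"


# Example addresses chosen for illustration only.
clients = ["192.168.1.10", "192.168.1.77"]
servers = ["203.0.113.5"]

anonymised = {nullip_anonymise(a) for a in clients + servers}
print(len(anonymised))  # three distinct hosts collapse into one value
```

Once every host collapses to the same value, counting distinct servers or separating clients from servers is no longer possible.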
Tcpdpriv Style Anonymisation
Netsniff implements an anonymisation scheme similar to that used by the -A50
option on tcpdpriv.
This algorithm is a prefix-preserving one; a thorough analysis of
prefix-preserving IP address anonymisation is presented here.
The approach is table based, keeping a lookup table in memory which is built
up gradually as IP addresses are anonymised. This is defined nicely
here as:
Suppose that we have a set of <raw, anonymised> binding pairs
of IP addresses. To anonymise an IP address $a = a_1 a_2 \ldots a_n$ we first
find the pair <x, y>, with $x = x_1 x_2 \ldots x_n$ and
$y = y_1 y_2 \ldots y_n$, with the longest prefix match $k$ on $a$ and $x$.
$a$ is anonymised to $b = b_1 b_2 \ldots b_n$, where
$b_1 b_2 \ldots b_k = y_1 y_2 \ldots y_k$ and
$b_{k+1} b_{k+2} \ldots b_n = \mathrm{rand}(0 \ldots 2^{n-k} - 1)$.
In netsniff rand() is an alternating series of 0 and 1 bits. Finally the new
anonymised address is added to the binding table.
As described here, this is problematic: since the anonymisation depends on the
traffic sniffed and the order in which IP addresses are seen, it is
inconsistent over multiple netsniff sessions. In terms of security, the
referenced analysis also shows that this algorithm is as robust as is possible
for prefix-preserving IP address anonymisation.
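The table-based scheme and its order dependence can be sketched in Python as follows. This is an illustrative reading of the definition quoted above, not netsniff's source. Two points are assumptions: the alternating fill is taken to start with a 0 bit, and the first bit after the matched prefix is set to the complement of the corresponding bit of y, so that two different raw addresses can never receive the same anonymised address. The class and function names are hypothetical.

```python
import ipaddress

BITS = 32


def to_int(addr: str) -> int:
    return int(ipaddress.IPv4Address(addr))


def common_prefix_len(a: int, x: int) -> int:
    """Number of leading bits shared by two 32-bit values."""
    return BITS - (a ^ x).bit_length()


def alternating_bits(width: int) -> int:
    """Deterministic stand-in for rand(): the pattern 0101... (assumed start)."""
    value = 0
    for i in range(width):
        value = (value << 1) | (i & 1)
    return value


class BindingTable:
    """In-memory table of <raw, anonymised> pairs, built up as traffic is seen."""

    def __init__(self) -> None:
        self.pairs = {}  # raw address (int) -> anonymised address (int)

    def anonymise(self, addr: str) -> str:
        a = to_int(addr)
        if a not in self.pairs:
            self.pairs[a] = self._new_binding(a)
        return str(ipaddress.IPv4Address(self.pairs[a]))

    def _new_binding(self, a: int) -> int:
        if not self.pairs:
            return alternating_bits(BITS)
        # Find the pair <x, y> whose raw address x shares the longest prefix with a.
        x = max(self.pairs, key=lambda raw: common_prefix_len(a, raw))
        y = self.pairs[x]
        k = common_prefix_len(a, x)
        tail = BITS - k - 1                       # bits left after the flipped one
        keep = (y >> (BITS - k)) << (BITS - k)    # copy the first k bits of y
        flipped = ((y >> tail) & 1) ^ 1           # assumption: complement bit k+1
        return keep | (flipped << tail) | alternating_bits(tail)


first_run = BindingTable()
second_run = BindingTable()
for addr in ["10.0.0.1", "192.168.1.1"]:
    first_run.anonymise(addr)
for addr in ["192.168.1.1", "10.0.0.1"]:
    second_run.anonymise(addr)
# Same addresses, different arrival order, different mapping.
print(first_run.anonymise("10.0.0.1"), second_run.anonymise("10.0.0.1"))
```

The two tables see the same addresses in a different order and assign them different anonymised values, which is the cross-session inconsistency noted above.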
Advantages:
- Correlation is possible between multiple networked applications
accessing the same server or information about the same server.
- Security and reverse address mapping are as robust as possible for
this type of anonymisation algorithm.
- Network locality and subnet information are not lost, due to the
prefix-preserving nature of the algorithm.

Disadvantages:
- The mapping of IP addresses is not consistent across multiple execution
runs of netsniff.
- If some address mappings are known, it is possible to retrieve some
other partial address mappings; see here for a possible avenue of attack.
String Anonymisation
All other anonymised fields are anonymised using a String Anonymiser. This is
initialised with a key specified using the -k command line option or a
default random key. String anonymisation is used in output of the ARP, POP3,
SMTP, FTP and DNS protocols. The algorithm uses a secure hash function.
As with IP address anonymisation, string anonymisation is consistent given
the same input string. If a key file is specified on the command line (-k)
then it also remains consistent across multiple execution runs of netsniff.
This consistency allows correlation of interesting data: e.g. while a
destination email address will be anonymised, it is still possible to count
the number of different emails sent to the same address, since that email
address will be consistently anonymised.
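A keyed hash gives exactly this kind of consistency. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in; the text above only states that a secure hash function is used, so the choice of HMAC-SHA256, the function names and the key handling are all assumptions made for illustration.

```python
import collections
import hashlib
import hmac
import secrets


def load_key(path=None) -> bytes:
    """Read a key from a file, or fall back to a fresh random key (assumed behaviour)."""
    if path is None:
        return secrets.token_bytes(32)
    with open(path, "rb") as handle:
        return handle.read()


def anonymise_string(value: str, key: bytes) -> str:
    """Map a string to a fixed-length token that is stable for a given key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = load_key()  # random key: consistent within this run only
recipients = ["alice@example.org", "bob@example.org", "alice@example.org"]
counts = collections.Counter(anonymise_string(r, key) for r in recipients)

# The addresses are hidden, but repeated recipients can still be counted.
for token, seen in counts.most_common():
    print(token[:12], seen)
```

With a random key the tokens change on every run; supplying the same key each time keeps them stable across runs, matching the behaviour described for the -k option.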