Name:     pkthisto
Version:  0.1.2, October 28th 2001
Author:   gj_armitage@yahoo.com

Copyright (c) 2001, Grenville Armitage

1. Summary:

A packet traffic analysis program specifically designed for generating inter-packet arrival time histograms, packet length histograms, and packet rate plots for UDP/IP traffic. I originally wrote this to assist me in analysing online game traffic in and out of QuakeIII Arena servers.

pkthisto can take input from real time capture (where Berkeley packet filter, BPF, is supported by the OS), tcpdump tracefiles, or NAI Sniffer tracefiles.

Although pkthisto was originally developed under MS Visual C++ 6.0 in a Win32 environment, it migrated to FreeBSD4.3 for the real time capture mode and development continued in that environment using KDevelop 1.4. Real time capture is currently not available under Win32. (See section 6 of this README for details on compiling under MS Visual C++.)

pkthisto is released under the GNU General Public License, Version 2, 1991.

pkthisto compiles 'out of the box' under FreeBSD4.3 (and Win32 with lesser functionality). Standard C libraries are sufficient; there are no additional packages to install (modulo the possible need to recompile your kernel for BPF support). pkthisto does not come with any additional run-time libraries encumbered by other licenses.

2. Installation:

The current development environment for pkthisto is FreeBSD4.3 with KDevelop 1.4, an X11-based C/C++ development tool. This distribution contains tools to create an appropriate makefile, with which you can generate a running executable. (I have not verified whether pkthisto can be compiled under anything other than FreeBSD4.3 or Win32, but I'd be interested in hearing experiences.)

The following installation steps apply to *nix environments. (See section 6 for compiling under Win32.)

The basic distribution is a gzipped tarfile named pkthisto-0.1.2.tar.gz, which creates a subdirectory ./pkthisto-0.1.2 when gunzipped/untar'ed.
Once the tarfile is unpacked, perform the following steps:

    > cd ./pkthisto-0.1.2
    > ./configure
    > cd pkthisto
    > make

"./configure" will spend a minute or so inspecting your system, compiler settings, etc. and generating appropriate makefiles. Once this has completed successfully, move into the source subdirectory and run "make" to actually compile pkthisto.

You can then either copy pkthisto to somewhere more convenient in your path, or use "make install" to automatically copy pkthisto into /usr/local/bin.

(An alternate installation location can be specified during the configuration stage. If you wish to install into //bin then execute "./configure --prefix=//" instead of "./configure" before compiling. "make install" will then copy pkthisto to //bin/pkthisto.)

Executing "make clean" in ./pkthisto-0.1.2/pkthisto will subsequently remove all intermediate object files.

KDevelop 1.4's pkthisto.kdevprj file is also supplied, in case it helps you do further development of pkthisto.

3. Using pkthisto

Currently pkthisto takes a stream of IP packets and automatically identifies each unique UDP/IP flow. For every active flow, pkthisto creates histograms of inter-packet arrival times and IP packet lengths. These histograms can be dumped to disk in stages, or dumped all at once at the end.

pkthisto can analyse traffic from one of three sources:

  - Real time packet capture through a local Ethernet interface if your OS supports the BPF driver (e.g. FreeBSD4.3)
  - Raw tracefile generated by tcpdump (any platform)
  - Raw tracefile generated by NAI Sniffer (any platform, so long as it is in ".enc" rather than ".cap" format)

Before starting pkthisto you'll need to create an appropriate configuration file. Configuration options are described in the file ./pkthisto-0.1.2/conf-demo.txt

If your config file is named "conf.txt", start pkthisto with:

    pkthisto -c conf.txt

(pkthisto will default to looking for configuration file 'conf_ph.txt' if the '-c' option is not supplied.)

If you require real time packet capture, pkthisto needs to run with sufficient privileges to open your BPF/Ethernet driver (and optionally, enough privileges to establish promiscuous mode operation on the BPF driver, unless the "no_promiscuous" configuration file option has been set). If you are reading from a pre-existing tracefile, you only require enough privileges to read the tracefile and write new files to the current directory.

pkthisto supports a 'checkpoint' facility whereby it dumps the current flows and their histograms to disk at regular intervals. This is primarily useful during real time capture mode: it keeps a record that is likely to survive a crash of pkthisto or the machine on which it is running, and reduces the amount of memory pkthisto keeps allocated at any given time. Checkpoint intervals are set in the configuration file with the "checkpoint_interval" option. Checkpoints may also be forced during real time capture mode by sending the running process a SIGUSR1 signal ("kill -USR1 ").

When reading from disk, pkthisto concludes at the end of the tracefile and writes a final checkpoint to disk. When doing real time capture, pkthisto can be configured to conclude after a certain number of packets have been seen (use the "total_pkts" configuration file option). During real time capture, pkthisto also concludes gracefully (doing a final checkpoint before exiting) when it receives a SIGTERM signal ("kill -TERM ").

4. Output files

pkthisto generates output across three levels of files. The top level summary of pkthisto's activities is collected in a file called:

    gtout0.txt

The second level of files reflects a summary of each flow. Filenames take one of three forms:

    gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt

(where "xxx" is a decimal integer value uniquely identifying each flow.) The latter two forms occur when pkthisto has been informed of a specific IP address and UDP port that is a game server host.
Servers are specified with a "specific_server" configuration file option, and multiple servers may be tracked. Thus, "gtout80FS1.txt" represents the 80th flow seen by pkthisto, a flow carrying UDP/IP packets coming *from* the 1st server pkthisto knows about. Conversely "gtout1035TS1.txt" is the 1035th flow, representing UDP/IP packets going *to* the 1st server pkthisto knows about. The form "gtoutxxx.txt" occurs for flows that do not appear to be going to, or coming from, a known server.

The third level of files holds the actual histograms and running statistics themselves. Part of the file name is derived from the second level files that summarize each flow. The prefix identifies the file's contents:

    Length histograms start with "LH-"
    Cumulative length distributions start with "CLH-"
    Inter-arrival histograms start with "IH-"
    Cumulative inter-arrival distributions start with "CIH-"
    Estimated bit rate plots start with "RATE-"
    Estimated packet rate plots start with "PPS-"

If requested by the user (with the "dump_sizes" option) the following files will also be created:

    Plots of the lowest 5% of packet sizes start with "SIZEL-"
    Plots of the median of packet sizes start with "SIZEM-"
    Plots of the upper 95% of packet sizes start with "SIZEU-"
    Plots of the size ratio start with "SIZER-"

The size related output files are optional because they are easily derived from the information in the LH-* and CLH-* files anyway.

Every histogram covers only a finite number of packets, so that during post-analysis we can get a reasonable sense of how the histograms (distributions) changed over time. By default, up to 2000 packets make up each histogram, but this can be modified by the "max_pkts_per_histo" configuration option.

Thus, histogram and cumulative distribution files (which are derived from the histograms) will contain separate sequences of data representing individual histograms in chronological order. Each sequence is preceded by a short tag containing an integer histogram number.
Use this histogram number to find information about each histogram within the flow's associated gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt file.

4.1 Format of gtout0.txt

The top level file gtout0.txt is updated with a summary of pkthisto's ongoing activities each time a checkpoint occurs. When pkthisto starts, it logs key operating parameters to gtout0.txt, and then begins gathering traffic.

Checkpoints are marked by a line of the form:

    **Checkpoint NN, [XX,YY]

where NN is an integer representing the number of checkpoints so far and XX is the number of unique UDP/IP flows seen so far. The line also carries a timestamp identifying when the checkpoint occurred, in ASCII format (for example, "Wed Sep 19 09:52:12 2001"). YY represents the number of flows that have been released from memory because they never saw enough packets over the previous two checkpoint periods to be considered relevant.

Following the "**Checkpoint" line is a sequence of entries summarizing each active flow, with the form:

    : srcaddr:port -> dstaddr:port: NNN elements, YYY min

where the leading field is the flow's filename (e.g. "gtout890FS1.txt"), srcaddr:port and dstaddr:port are the IP address (in w.x.y.z notation) and UDP port for the flow's source and destination respectively, NNN is the number of packets seen in the flow so far, and YYY is a floating point number representing the number of minutes the flow has been active since it was first detected.

A flow is only mentioned during a checkpoint if it has been active during the period of time since the previous checkpoint.

A checkpoint ends with a line of the form:

    **EndCheckpoint NN [XX dumped, YY stored]

where NN is the checkpoint number, XX is the number of flows dumped, and YY represents the number of 'flows' currently stored in memory (idle flows, or flows that haven't yet seen enough packets to warrant dumping to disk).
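
For post-analysis it can be handy to pull the per-flow entries and the "**EndCheckpoint" lines out of gtout0.txt with a script. The following is only a minimal Python sketch, not part of pkthisto: the function names are hypothetical, and the exact spacing of the lines is an assumption reconstructed from the description above, so the patterns are deliberately tolerant of extra whitespace.

```python
import re

# Assumed layout: "<file>: a.b.c.d:port -> a.b.c.d:port: NNN elements, YYY min"
# Anything after "min" (such as ", started ..." in the final checkpoint) is ignored.
FLOW_RE = re.compile(
    r"^(?P<file>\S+):\s+"
    r"(?P<src>\d+\.\d+\.\d+\.\d+):(?P<sport>\d+)\s*->\s*"
    r"(?P<dst>\d+\.\d+\.\d+\.\d+):(?P<dport>\d+):\s*"
    r"(?P<pkts>\d+)\s+elements,\s*(?P<minutes>\d+(?:\.\d+)?)\s+min"
)

# Assumed layout: "**EndCheckpoint NN [XX dumped, YY stored]"
END_RE = re.compile(
    r"^\*\*EndCheckpoint\s+(?P<n>\d+)\s*"
    r"\[\s*(?P<dumped>\d+)\s+dumped,\s*(?P<stored>\d+)\s+stored\s*\]"
)


def parse_flow_line(line):
    """Return a dict describing one flow summary entry, or None if no match."""
    m = FLOW_RE.match(line.strip())
    if m is None:
        return None
    return {
        "file": m["file"],
        "src": (m["src"], int(m["sport"])),
        "dst": (m["dst"], int(m["dport"])),
        "packets": int(m["pkts"]),
        "minutes": float(m["minutes"]),
    }


def parse_end_line(line):
    """Return a dict for an '**EndCheckpoint' line, or None if no match."""
    m = END_RE.match(line.strip())
    if m is None:
        return None
    return {
        "checkpoint": int(m["n"]),
        "dumped": int(m["dumped"]),
        "stored": int(m["stored"]),
    }
```

Feeding every line of gtout0.txt through both functions and keeping the non-None results gives a simple per-checkpoint list of active flows. Lines that match neither pattern (startup parameters, "**Checkpoint" markers) are simply skipped.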

The final checkpoint differs in that "**Checkpoint" is replaced with "**FinalCheckpoint", summary information on *all* flows seen during the pkthisto run is dumped, and the checkpoint is followed by the line "**Checkpointing complete." In addition, the summary lines have ", started " appended to the "YYY min" field.

Note that when doing real time capture, the date and time fields reflect the local host's internal clock. When reading from a stored tracefile, date and time are taken from the timestamps embedded in the file.

4.2 Format of gtoutxxx.txt, gtoutxxxFSd.txt, and gtoutxxxTSd.txt

Each flow has one of these files, which begins with:

    Src: srcaddr:port
    Dst: dstaddr:port
    Start: 

The source and destination IP/UDP information defines the end points of the flow. The start field identifies the time at which the flow was first seen (to a resolution of one second, with accuracy dependent on the system clock).

The rest of the file is a sequence of summaries for each histogram, updated each time a checkpoint occurs. Checkpoints begin with the line:

    Checkpoint: NN start

and end with the line:

    Checkpoint: NN end

where NN is an integer number representing the checkpoint (and matches the "Checkpoint NN" in gtout0.txt).

Between the start and end lines is a sequence of summaries, one for each histogram stored during the time since the previous checkpoint.
They are of the following form (all on one line):

    Histo:HHH: BB - EE min, avg LL bytes, RR kbps, PP pps, l/m/h XX/YY/ZZ, MMM pkts, QQ err

where HHH is a positive integer uniquely identifying the histogram, BB and EE are the timestamps of the first and last packets making up the histogram (relative to the flow's "Start:" time), LL is the average IP packet length during this interval, RR is the total kbits of IP packets divided by the length of the interval in seconds, PP is the number of packets per second, XX/YY/ZZ represent the low 5%, median, and upper 95% packet sizes, MMM is the number of packets in the interval, and QQ is the number of packets that were longer than 800 bytes (considered outside the range of the length histograms, since they are almost certainly unrelated to real time game play traffic).

Note that HHH counts histograms relative to the flow to which the histograms apply. It only increments, and may sometimes increment by more than one between histograms and checkpoints. The reason is that sometimes a flow goes idle after a period of being active, and the histogram being created at the time the flow went idle may see only a handful of packets. During operation, pkthisto reclaims memory from histograms that will never be dumped to disk. When the flow again becomes sufficiently active, histograms will be dumped to disk. However, the internal histogram identifier HHH will have incremented a number of times, reflecting the number of histograms that were begun and then released due to the flow being too idle.

The minimum number of packets that must be seen to make a valid histogram defaults to 100, and can be changed with the "min_pkts_in_flow" option.

The "flow_max_milliseconds" option specifies how many milliseconds can elapse between packets of the same flow before the flow is considered to have become idle. The default is 500 ms.
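
As an illustration of how the per-histogram summary lines might be consumed during post-analysis, here is a minimal Python sketch. It is not part of pkthisto: the function names and dictionary keys are hypothetical, and the pattern assumes the "Histo:" line is laid out exactly as described in this section, with plain decimal numbers in every field.

```python
import re

# A plain non-negative decimal number, e.g. "12" or "3.75" (an assumption
# about how pkthisto prints the numeric fields).
NUM = r"\d+(?:\.\d+)?"

HISTO_RE = re.compile(
    rf"^Histo:(?P<id>\d+):\s*(?P<begin>{NUM})\s*-\s*(?P<end>{NUM})\s*min,\s*"
    rf"avg\s+(?P<avg_len>{NUM})\s*bytes,\s*(?P<kbps>{NUM})\s*kbps,\s*"
    rf"(?P<pps>{NUM})\s*pps,\s*"
    rf"l/m/h\s+(?P<low>{NUM})/(?P<median>{NUM})/(?P<high>{NUM}),\s*"
    rf"(?P<pkts>\d+)\s*pkts,\s*(?P<err>\d+)\s*err"
)


def parse_histo_line(line):
    """Return a dict for one 'Histo:' summary line, or None if no match."""
    m = HISTO_RE.match(line.strip())
    if m is None:
        return None
    return {
        "id": int(m["id"]),                # HHH
        "begin_min": float(m["begin"]),    # BB, minutes since the flow's Start
        "end_min": float(m["end"]),        # EE, minutes since the flow's Start
        "avg_len": float(m["avg_len"]),    # LL, bytes
        "kbps": float(m["kbps"]),          # RR
        "pps": float(m["pps"]),            # PP
        "sizes": (float(m["low"]), float(m["median"]), float(m["high"])),
        "pkts": int(m["pkts"]),            # MMM
        "err": int(m["err"]),              # QQ, packets longer than 800 bytes
    }


def duration_seconds(histo):
    """Length of the histogram's interval in seconds (BB and EE are in minutes)."""
    return (histo["end_min"] - histo["begin_min"]) * 60.0
```

Because HHH can skip values when idle histograms are released, a script should treat the "id" field as an opaque label rather than as a consecutive index.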

4.3 Format of LH-, CLH-, IH-, and CIH- files

Histograms generated by pkthisto can take one of two forms: pure ASCII (as sequences of "X Y" data points that can be fed to xgraph or copied directly into spreadsheets) or compressed ASCII (lines of encoded text, where the X axis is implied and the Y axis data points are compressed into two-digit base64 values).

The compressed ASCII format provides substantial disk space savings during long term, real time traffic traces.

By default pkthisto generates pure ASCII output. However, the configuration file option "compressed_output" forces pkthisto to generate compressed ASCII output instead. The LH-, CLH-, IH-, and CIH- files are affected by this choice.

Details of these two formats can be found in the file ./pkthisto/HistoFormats.txt

4.4 Format of remaining RATE- and PPS- files

These files contain "X Y" pairs where the X value is the time axis (represented as an integer number of seconds since 1/1/1970) and Y is either the bit rate (in kbits per second, measured as the number of IP packet bits per time interval) or actual packets per second.

One "X Y" pair is generated per histogram. X represents the start time of the histogram.

4.5 Special files relating to server traffic

When one or more ipaddr:port pairs are specified as probable servers, pkthisto also tracks the aggregate flows to and from each specified server. These flows are known as -All2Serv-NN and -Serv2All-NN, where NN identifies which one of the specified servers is meant (starting at 1).

All the per-flow files discussed in sections 4.1 to 4.4 have siblings for the aggregate server flows, for example:

    gtout-All2Serv-1.txt, gtout-Serv2All-1.txt, LH-gtout-All2Serv-1.txt, CLH-gtout-All2Serv-1.txt, ...etc...

Flows TO the server are logged as coming from 0.0.0.0:0, while flows FROM the server are logged as going to 0.0.0.0:0.

5. What pkthisto ignores

During real-time capture pkthisto installs BPF filter code to ignore Ethernet frames that do not carry UDP/IP packets.
TCP/IP flows are ignored. When reading from existing tracefiles, pkthisto simply assumes all Ethernet frames have been pre-filtered to contain only UDP/IP packets.

Any flow that never sees more than 100 (or min_pkts_in_flow) packets within 800 (or flow_max_milliseconds) milliseconds will be forgotten. It will never be mentioned in gtout0.txt and never have histograms dumped during checkpoints. (This should filter out transient 'flows' due to game launchers such as GameSpy3D semi-regularly probing a server.)

6. Compiling for Win32 platforms

Although pkthisto was originally developed under MS Visual C++ 6.0 in a Win32 environment, it was ported to FreeBSD4.3 for the real time capture mode and development continued a number of steps from there.

Where I discovered differences between Visual C++ 6.0 in a Win32 environment and KDevelop 1.4/gcc in a FreeBSD4.3 environment, I've used conditional compilation directives. The flag WIN32 should be set for Win32-compatible code, and unset for FreeBSD4.3 (or equivalent) environments.

For Win32 you will need to specifically link against ws2_32.lib (add it under Project->Settings->Linker if you're using MS Visual C++) to bring in inet_ntoa() functions.

I've supplied sample Visual C++ 6.0 project/workspace files ./pkthisto-0.1.2/pkthisto.dsp and ./pkthisto-0.1.2/pkthisto.dsw

If you have Visual C++, you should be able to use WinZip (or similar) to unpack/untar the pkthisto distribution, then go into the ./pkthisto-0.1.2 folder and double-click on pkthisto.dsw to start Visual C++. Tell Visual C++ to "build" and a Win32 version of pkthisto should be built. (At least, it worked for me on a Windows 2000 system. No promises it'll work on every Win32 platform, although I imagine it should.)

The pkthisto executable must be run from a console window (or from within Visual C++).

NOTE: pkthisto does not currently provide support for real time capture in a Win32 environment. Perhaps later.
Tcpdump and NAI Sniffer tracefile analysis are supported.

7. Bugs, things TODO, Conclusions

No doubt there are bugs, and it would be lovely to use a more compressed output format to save disk space on longer real time captures. And naturally, this README file is not complete.

Future releases of pkthisto may extend real time capture support beyond just BPF devices, and include Win32 environments.

The ultimate source of information is, of course, the source code. Fortunately it is a relatively small program.

Enjoy!
gj_armitage@yahoo.com