Name: pkthisto
Version: 0.3.4, March 31 2005
Version: 0.3.3, July 16th 2003
Version: 0.2, October 10th 2002
Version: 0.1.2, October 28th 2001
Author: gj_armitage@yahoo.com
Copyright (c) 2001-2003 Grenville Armitage
Contributor: mpozzobon@swin.edu.au

1. Summary:

A packet traffic analysis program specifically designed for generating
inter-packet arrival time histograms, packet length histograms, and
packet rate plots for UDP/IP traffic. I originally wrote this to assist
me in analysing online game traffic in and out of QuakeIII Arena
servers.

pkthisto can take input from real time capture (where Berkeley packet
filter, BPF, is supported by the OS), tcpdump tracefiles, or NAI
Sniffer tracefiles.

As of version 0.2, pkthisto can also handle the IP packet addressing
used by XBox "system link" games (either captured in real-time or
post-processed from tcpdump files).

As of version 0.3, pkthisto records the average TTL (time to live)
value of packets coming from game clients.

pkthisto is released under the GNU General Public License, Version 2,
1991.

Development of pkthisto began under MS Visual C++ 6.0 in a Win32
environment, but soon migrated to FreeBSD4.3/KDevelop1.4 for the real
time capture mode. Development currently occurs under
FreeBSD4.8/KDevelop2.1.5. Real time capture is currently not available
under Win32. (See README-Win32.txt for details on compiling under MS
Visual C++.)

pkthisto compiles 'out of the box' under FreeBSD4.8. Standard C
libraries are sufficient; there are no additional packages to install
(aside from the possible need to recompile your kernel for BPF
support). pkthisto does not come with any additional run-time libraries
encumbered by other licenses.

The release history is at the end of this README.

2. Installation:

The current development environment for pkthisto is FreeBSD 5.3. This
distribution contains tools to create an appropriate makefile, with
which you can generate a running executable.
The following installation steps apply to most *nix environments.

The basic distribution is a gzipped tarfile named
pkthisto-0.3.4.tar.gz, which creates a subdirectory ./pkthisto-0.3.4
when gunzipped/untar'ed. Once the tarfile is unpacked, perform the
following steps:

    > cd ./pkthisto-0.3.4
    > ./configure
    > make

"./configure" will spend a minute or so inspecting your system,
compiler settings, etc. and generating appropriate makefiles. Once this
has completed successfully, you run "make" (or gmake) to actually build
pkthisto.

[Note: The earliest versions of pkthisto used "make", more recently
"gmake" was required, but now either version of "make" is acceptable.]

You can then either copy pkthisto to somewhere more convenient in your
path, or use "make install" to automatically copy pkthisto into
/usr/local/bin.

(An alternate installation location can be specified during the
configuration stage. If you wish to install into //bin then execute
"./configure --prefix=//" instead of "./configure" before compiling.
"gmake install" will then copy pkthisto to //bin/pkthisto.)

Executing "make clean" in ./pkthisto-0.3.4 will subsequently remove all
intermediate object files.

If you desire more verbose (ugly) output during the build or install,
you can instead execute "make VERBOSE=2". Setting the value of VERBOSE
to "0" will cause "make" to run silently. The default value of VERBOSE
is "1".

2.1 make vs gmake

The BSD Make application is not able to correctly regenerate Makefiles
during the run. The Makefile itself will be regenerated, but the
original Makefile will then be used to "make" the project.

This is only a problem if you make a change to "configure.in" or any of
the "Makefile.in" files. Modification of these files results in either
"configure" or the "Makefiles" needing to be regenerated. The Makefile
will detect you are using BSD make, regenerate the Makefile, and then
give a warning that the project is being remade with the original
Makefile instead.
This is only a problem when perfecting the build environment, not when
working on the source code.

GNU-Make (the default make on Linux systems) properly regenerates the
Makefiles and configure if necessary and reloads the new Makefiles
prior to running "gmake".

2.2 FreeBSD 4.X vs FreeBSD 5.3 vs (Something Else)

BSD Make and the gcc compiler on FreeBSD 4.X do not seem to want to
link the application. Using gmake on FreeBSD 4.X is fine, as is a newer
version of FreeBSD. There are no known problems on other platforms.

3. Using pkthisto

Currently pkthisto takes a stream of IP packets and automatically
identifies each unique UDP/IP flow. For every active flow, pkthisto
creates histograms of inter-packet arrival times and IP packet lengths.
These histograms can be dumped to disk in stages, or dumped all at once
at the end.

pkthisto can analyse traffic from one of three sources:

 - Real time packet capture through a local Ethernet interface if
   your OS supports the BPF driver
 - Raw tracefile generated by tcpdump (any platform)
 - Raw tracefile generated by NAI Sniffer (any platform, so long
   as it is in ".enc" rather than ".cap" format)

pkthisto can also isolate XBox system link game traffic into
pseudo-flows. (Because system link games use a fixed IP address
"0.0.0.1" for source and destination, pkthisto creates pseudo-addresses
from the lower 4 bytes of the source and destination MAC addresses.
This allows each box in a system link game to be differentiated.) XBox
traffic can be processed either in real-time capture mode, or from a
tcpdump tracefile.

Before starting pkthisto you'll need to create an appropriate
configuration file. Configuration options are described in the file
./pkthisto/conf-demo.txt

If your config file is named "conf.txt", start pkthisto with:

    pkthisto -c conf.txt

(pkthisto will default to looking for configuration file 'conf_ph.txt'
if the '-c' option is not supplied.)
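The pseudo-address idea described above for XBox system link traffic
can be illustrated with a short Python sketch. This is not pkthisto's
own code: the function name, the colon-separated MAC input, and the
dotted-quad rendering of the result are all assumptions made for
illustration, and "lower 4 bytes" is taken to mean the last four bytes
of the MAC address as written.

```python
def pseudo_address(mac: str) -> str:
    """Build a dotted-quad pseudo-address from the lower 4 bytes of a MAC.

    Hypothetical helper: pkthisto's real rendering may differ.
    """
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    # Assumption: "lower 4 bytes" means the last four bytes as written.
    return ".".join(str(o) for o in octets[2:])


if __name__ == "__main__":
    # Two boxes that both claim IP 0.0.0.1 remain distinguishable.
    print(pseudo_address("00:50:f2:1a:2b:3c"))
    print(pseudo_address("00:50:f2:1a:2b:3d"))
```

Because every box in a system link game reports the same IP address,
any per-flow key has to come from the link layer; the last four MAC
bytes are enough to tell the boxes apart in practice.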

If you require real time packet capture, pkthisto needs to run with
sufficient privileges to open your BPF/Ethernet driver (and optionally,
enough privileges to establish promiscuous mode operation on the BPF
driver, unless the "no_promiscuous" configuration file option has been
set). If you are reading from a pre-existing tracefile, you only
require enough privileges to read the tracefile and write new files to
the current directory.

pkthisto supports a 'checkpoint' facility whereby it dumps to disk the
current flows and their histograms at regular intervals. This is
primarily useful during real time capture mode: it keeps a record that
is likely to survive a crash of pkthisto or the machine on which it is
running, and reduces the amount of memory pkthisto keeps allocated at
any given time. Checkpoint intervals are set in the configuration file
with the "checkpoint_interval" option. Checkpoints may also be forced
during real time capture mode by sending the running process a SIGUSR1
signal ("kill -USR1 ").

When reading from disk, pkthisto concludes at the end of the tracefile
and writes a final checkpoint to disk. When doing real time capture,
pkthisto can be configured to conclude after a certain number of
packets have been seen (use the "total_pkts" configuration file
option). During real time capture, pkthisto also concludes gracefully
(doing a final checkpoint before exiting) when it receives a SIGTERM
signal ("kill -TERM ").

4. Output files

pkthisto generates output across three levels of files. The top level
summary of pkthisto's activities is collected in a file called:

    gtout0.txt

The second level of files reflects a summary of each flow. Filenames
take one of three forms:

    gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt

(where "xxx" is a decimal integer value uniquely identifying each
flow.) The latter two forms occur when pkthisto has been informed of a
specific IP address and UDP port that is a game server host.
Servers are specified with a "specific_server" configuration file
option, and multiple servers may be tracked. Thus, "gtout80FS1.txt"
represents the 80th flow seen by pkthisto, a flow carrying UDP/IP
packets coming *from* the 1st server pkthisto knows about. Conversely,
"gtout1035TS1.txt" is the 1035th flow, representing UDP/IP packets
going *to* the 1st server pkthisto knows about. The form "gtoutxxx.txt"
occurs for flows that do not appear to be going to, or coming from, a
known server.

The third level of files contains the actual histograms and running
statistics themselves. Part of the file name is derived from the second
level files that summarize each flow. The prefix identifies the file's
contents:

    Length histograms start with "LH-"
    Cumulative Length distributions start with "CLH-"
    Inter-arrival histograms start with "IH-"
    Cumulative inter-arrival distributions start with "CIH"
    Estimated bit rate plots start with "RATE-"
    Estimated packet rate plots start with "PPS-"

If requested by the user (with the "dump_sizes" option) the following
files will also be created:

    Plot of lowest 5% of packet sizes start with "SIZEL-"
    Plot of median of packet sizes start with "SIZEM-"
    Plot of upper 95% of packet sizes start with "SIZEU-"
    Plot of size ratio start with "SIZER-"

The size related output files are optional because they are easily
derived from the information in the LH-* and CLH-* files anyway.

Every histogram covers only a finite number of packets, so that during
post-analysis we can get a reasonable sense of how the histograms
(distributions) changed over time. By default, up to 2000 packets make
up each histogram, but this can be modified by the "max_pkts_per_histo"
configuration option.

Thus, histogram and cumulative distribution files (which are derived
from the histograms) will contain separate sequences of data
representing individual histograms in chronological order. Each
sequence is preceded by a short tag containing an integer histogram
number.
Use this histogram number to find information about each histogram
within the flow's associated gtoutxxx.txt, gtoutxxxFSd.txt, or
gtoutxxxTSd.txt file.

Finally, pkthisto generates a list of the average TTL (time to live) of
packets coming from each IP address that is not a known server. This
list is saved in the file "TTL-gtout0.txt".

Histograms of TTL associated with each flow are stored in files
starting with "TH-" and "CTH-" (cf. the associated "LH-" and "CLH-"
files). Typically the histograms for any one flow will have only one
data point. The aggregate client to server flow will show multiple
points.

4.1 Format of gtout0.txt

The top level file gtout0.txt is updated with a summary of pkthisto's
ongoing activities each time a checkpoint occurs.

When pkthisto starts, it logs key operating parameters to gtout0.txt,
and then begins gathering traffic. Checkpoints are marked by a line of
the form:

    **Checkpoint NN, [XX,YY]

where NN is an integer representing the number of checkpoints so far,
XX is the number of unique UDP/IP flows seen so far, and identifies
when the checkpoint occurred, in ASCII format (for example, "Wed Sep 19
09:52:12 2001"). YY represents the number of flows that have been
released from memory because they never saw enough packets over the
previous two checkpoint periods to be considered relevant.

Following the "**Checkpoint" line is a sequence of entries summarizing
each active flow, with the form:

    : srcaddr:port -> dstaddr:port: NNN elements, YYY min

where is the flow's filename (e.g. "gtout890FS1.txt"), srcaddr:port and
dstaddr:port are the IP address (in w.x.y.z notation) and UDP port for
the flow's source and destination respectively, NNN is the number of
packets seen in the flow so far, and YYY is a floating point number
representing the number of minutes the flow has been active since it
was first detected.

A flow is only mentioned during a checkpoint if it has been active
during the period of time since the previous checkpoint.
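As an illustration of how the per-flow summary entries might be
post-processed, here is a minimal Python sketch. It assumes each entry
begins with the flow's filename followed by a colon, as the description
above suggests; that leading field, the sample addresses, and all
function and field names are assumptions for illustration rather than
part of pkthisto.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "<filename>: src:port -> dst:port: NNN elements, YYY min"
FLOW_LINE = re.compile(
    r"^(?P<file>gtout(?P<flow>\d+)(?:(?P<kind>[FT])S(?P<server>\d+))?\.txt):\s+"
    r"(?P<src>[\d.]+):(?P<sport>\d+)\s+->\s+"
    r"(?P<dst>[\d.]+):(?P<dport>\d+):\s+"
    r"(?P<pkts>\d+) elements,\s+(?P<mins>[\d.]+) min"
)


class FlowSummary(NamedTuple):
    filename: str
    flow_id: int
    direction: Optional[str]  # "from" server, "to" server, or None
    server: Optional[int]     # index of the known server, if any
    src: str
    dst: str
    packets: int
    minutes: float


def parse_flow_line(line: str) -> Optional[FlowSummary]:
    """Return the parsed entry, or None if the line is not a flow entry."""
    m = FLOW_LINE.match(line.strip())
    if m is None:
        return None
    kind = m.group("kind")
    return FlowSummary(
        filename=m.group("file"),
        flow_id=int(m.group("flow")),
        direction={"F": "from", "T": "to"}.get(kind),
        server=int(m.group("server")) if kind else None,
        src=f"{m.group('src')}:{m.group('sport')}",
        dst=f"{m.group('dst')}:{m.group('dport')}",
        packets=int(m.group("pkts")),
        minutes=float(m.group("mins")),
    )


if __name__ == "__main__":
    sample = "gtout890FS1.txt: 10.0.0.5:27960 -> 10.0.0.9:27005: 1500 elements, 12.5 min"
    print(parse_flow_line(sample))
```

The pattern is not anchored at the end of the line, so extra text
appended after the "min" field is simply ignored.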

A checkpoint ends with a line of the form:

    **EndCheckpoint NN [XX dumped, YY stored, BB bytes]

where NN is the checkpoint number. The other values are details of
limited value to regular users. (XX is the number of flows dumped, YY
represents the number of 'flows' currently stored in memory [including
idle flows, or flows that haven't yet seen enough packets to warrant
dumping to disk], and BB represents the memory consumption of
structures associated with each flow. The meaning of these numbers may
change from one version of pkthisto to the next.)

The final checkpoint differs in that "**Checkpoint" is replaced with
"**FinalCheckpoint", summary information on *all* flows seen during the
pkthisto run is dumped, and this is followed by the line
"**Checkpointing complete." In addition, the summary lines have
", started " appended to the "YYY min" field. The last line of a final
checkpoint begins with "Histograms were created for NN active flows"
followed by version-specific stats in parentheses.

Note that when doing real time capture, the fields represent the local
host's internal date and time clock. When reading from a stored
tracefile, date&time is taken from the timestamps embedded in the file.

4.2 Format of gtoutxxx.txt, gtoutxxxFSd.txt, and gtoutxxxTSd.txt

Each flow has one of these files, which begin with:

    Src: srcaddr:port Dst: dstaddr:port Start:

The source and destination IP/UDP information defines the end points of
the flow; the start field identifies the time at which the flow was
first seen (to a resolution of one second, and accuracy dependent on
the system clock).

The rest of the file is a sequence of summaries for each histogram,
updated each time a checkpoint occurs. Checkpoints begin with the line:

    Checkpoint: NN start

and end with the line

    Checkpoint: NN end

where NN is an integer number representing the checkpoint (and matches
the "Checkpoint NN" in gtout0.txt).

Between the start and end fields is a sequence of summaries for each
histogram stored during the time since the previous checkpoint. They
are of the following form (all on one line):

    Histo:HHH: BB - EE min, avg LL bytes, RR kbps, PP pps, l/m/h XX/YY/ZZ, MMM pkts, QQ err

where HHH is a positive integer uniquely identifying the histogram, BB
and EE are the timestamps of the first and last packets making up the
histogram (relative to the flow's "Start:" time), LL is the average IP
packet length during this interval, RR is the total kbits of IP packets
divided by the length of the interval in seconds, PP is the number of
packets per second, XX/YY/ZZ represent the low 5%, median, and upper
95% packet sizes, MMM is the number of packets in the interval, and QQ
is the number of packets that were longer than the largest histogram
bucket. (There are 800 buckets, with each bucket 1 byte wide by
default. The bucket width can be increased to cover non-game traffic
that has packets longer than 800 bytes.)

Note that HHH counts histograms relative to the flow to which the
histograms apply. It only increments, and may sometimes increment by
more than one between histograms and checkpoints. The reason is that
sometimes a flow goes idle after a period of being active, and the
histogram being created at the time the flow went idle may see only a
handful of packets. During operation, pkthisto reclaims memory from
histograms that will never be dumped to disk. When the flow again
becomes sufficiently active, histograms will be dumped to disk.
However, the internal histogram identifier HHH will have incremented a
number of times, reflecting the number of histograms that were begun
and then released due to the flow being too idle.

The minimum number of packets that must be seen to make a valid
histogram defaults to 100, and can be changed with the
"min_pkts_in_flow" option.
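The "Histo:" summary line described above lends itself to a simple
parser. The Python sketch below is illustrative only: the regular
expression follows the field order given in this README, but the exact
spacing and number formatting that pkthisto emits are assumptions, as
are the function names and the sample values. It also shows the
relationship implied by the field definitions: since RR is total kbits
over the interval and PP is packets per second, RR should be roughly
LL x 8 x PP / 1000 (taking one kbit as 1000 bits, another assumption).

```python
import re
from typing import Dict, Optional

HISTO_LINE = re.compile(
    r"Histo:(?P<id>\d+):\s*(?P<begin>[\d.]+)\s*-\s*(?P<end>[\d.]+) min,\s*"
    r"avg (?P<avg_len>[\d.]+) bytes,\s*(?P<kbps>[\d.]+) kbps,\s*"
    r"(?P<pps>[\d.]+) pps,\s*"
    r"l/m/h (?P<low>[\d.]+)/(?P<med>[\d.]+)/(?P<high>[\d.]+),\s*"
    r"(?P<pkts>\d+) pkts,\s*(?P<err>\d+) err"
)


def parse_histo_line(line: str) -> Optional[Dict[str, float]]:
    """Parse one histogram summary line into a dict of numeric fields."""
    m = HISTO_LINE.search(line)
    if m is None:
        return None
    fields = {name: float(value) for name, value in m.groupdict().items()}
    # Identifier and counters are whole numbers.
    for name in ("id", "pkts", "err"):
        fields[name] = int(fields[name])
    return fields


def estimated_kbps(avg_len_bytes: float, pps: float) -> float:
    """Bit rate implied by average packet length and packet rate."""
    return avg_len_bytes * 8 * pps / 1000.0


if __name__ == "__main__":
    # Hypothetical values, not taken from a real pkthisto run.
    sample = ("Histo:3: 1.50 - 3.17 min, avg 80.0 bytes, 12.8 kbps, "
              "20.0 pps, l/m/h 60/78/110, 2000 pkts, 0 err")
    h = parse_histo_line(sample)
    print(h)
    print(estimated_kbps(h["avg_len"], h["pps"]))
```

Comparing the reported kbps against the estimate is a quick sanity
check when post-processing many flow files.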
The "flow_max_milliseconds" option specifies how many milliseconds can
elapse between packets of the same flow before the flow is considered
to have become idle. The default is 500 ms.

4.3 Format of LH-, CLH-, IH-, and CIH- files

Histograms generated by pkthisto can take one of two forms: pure ASCII
(as sequences of "X Y" data points that can be fed to xgraph or copied
directly into spreadsheets) or compressed ASCII (lines of encoded text,
where the X axis is implied and the Y axis data points are compressed
into two-digit base64 values).

The compressed ASCII format provides substantial disk space savings
during long term, real time traffic traces. Compressed or pure ASCII
mode is selected in the configuration file (see
./pkthisto/conf-demo.txt).

By default pkthisto generates pure ASCII output. The configuration file
option "compressed_output" causes compressed ASCII output instead. The
LH-, CLH-, IH-, and CIH- files are affected by this choice.

Details of these two formats can be found in the file
./pkthisto/HistoFormats.txt

Note that by default, CIH and CLH files do not include data points
where Y < 0.5% or Y > 99.5%. This is to save disk space when tracking
traffic whose cumulative curves asymptote slowly up from 0% or towards
100%. The user can modify the upper and lower bounds in the
configuration file. (See ./pkthisto/conf-demo.txt)

Histogram files are linear. The "-C

 - Added "file_is_tcpdump_xbox" and "realtime_capture_xbox" config
   options to monitor Xbox System-Link IP packets
 - Added configurable upper and lower bounds for cumulative histograms
 - Minor edits to README, HistoFormats.txt
 - Deprecated Win32 compatibility, created README-Win32.txt
 - Noted the requirement to use "gmake" rather than "make"

0.1.3 November 20th 2001
(Internal release, cleanups and minor functionality)

 - Added "total_flows" config option.
 - Improved internal memory usage in histogram storage, and fixed a
   memory reference bug.
 - Added "len_histo_width" config option.

0.1.2 October 28th 2001
(Bug fixes)

 - Real time capture mode was not exiting after specified number of
   packets captured. Now it does.
 - Added new "**EndCheckpoint" token to checkpoint file, final
   "** Checkpointing" is now "**Checkpointing".

0.1 and 0.1.1 September 28th 2001
(First release)

gj_armitage@yahoo.com