IPCR Tool version 0.1
---------------------
$Id: README.txt 1037 2015-08-13 03:52:17Z szander $


OVERVIEW
--------

IPCR Tool is a command line tool that uses data about observed actively used
IP addresses to estimate the population of actively used IP addresses. It is
a stripped-down cleaned-up version of the tool used to obtain the results in
[1]. IPCR is an R script and you must install R to run it.

The motivator for IPCR Tool is that at the time it was written (2014) almost
all IPv4 addresses were allocated, yet it was unclear how many of the
allocated addresses were actively used. Only knowing the number of actively
used IP addresses allows one to make predictions about the actual pressure to
transition to IPv6. For a more detailed introduction please read [1].

IPCR Tool builds a log-linear model based on IP address capture data. With IP
address capture data we refer to multiple samples of unique IP addresses that
are known to have been actively used. The samples can be collected from
server logs, active probing or passive traffic capturing. The log-linear
model can then be used to estimate the number of active but unseen IP
addresses. IPCR supports log-linear Poisson models as well as log-linear
Truncated Poisson models (since we have an upper bound for the actively used
addresses).

For details about the technique please refer to [1] and the references in
[1]. This README only focuses on how to use the tool.


INSTALLATION
------------

In order to run IPCR Tool you must install R (https://www.r-project.org/)
either from source or from a package for your OS. Then you must install the
following R packages:

- Rcapture
- MuMIn
- gamlss
- gamlss.tr

The R manual describes how to install R packages.

IPCR is a single R script and does not require installation. Just copy it to
a directory of your choice.

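As a quick check of the prerequisites listed above, the following sketch
reports which of the four packages R can already load. It assumes that the
Rscript front end (shipped with R) is on the PATH; the status messages are
illustrative only and are not produced by IPCR Tool. Missing packages can
normally be installed from an R session with install.packages().

```shell
# Packages listed as prerequisites in the INSTALLATION section.
pkgs="Rcapture MuMIn gamlss gamlss.tr"

if command -v Rscript >/dev/null 2>&1; then
    for p in $pkgs; do
        # requireNamespace() returns TRUE only if the package can be loaded;
        # the result is turned into the exit status of the R process.
        if Rscript -e "quit(save = 'no', status = if (requireNamespace('$p', quietly = TRUE)) 0L else 1L)" >/dev/null 2>&1; then
            echo "$p: installed"
        else
            echo "$p: missing, try install.packages(\"$p\") in an R session"
        fi
    done
    r_status="checked"
else
    echo "Rscript not found; install R first" >&2
    r_status="no-R"
fi
```
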
RUNNING
-------

IPCR Tool can be run from the command line as follows:

> R CMD BATCH --vanilla ipcr_tool.R || less +G ipcr_tool.Rout

Parameters are given as environment variable assignments placed in front of
this command; they are explained below. By default log information is printed
out to a file ipcr_tool.Rout, but this file name can be changed (see R
manual).

Apart from the log file, when successfully completed IPCR Tool generates an
output file that looks as follows:

--- IPCR output ---
"Stratum_name" "Total" "Stratum_Total" "Observed" "LL_Best" "LL_Ind" "LL_Sat" "SC" "Pingable" "LL_Best_Stderr" "LL_Best_Inf" "LL_Best_Sup" "CV_Pingable" "CV_Ground_Truth" "LL_Best_Trunc"
"1" "all" 2752461762 2752461762 842027706 1173372271 955510597 1013028326 0 541487438 1280413 1166619853 1180261603 0 0 1173393473
-------------------

The first line contains the field names, the following lines contain the
results with one line per stratum. The fields are:

Stratum_name:     name of stratum (or all if only one stratum)
Total:            total number of IPs (from delegated information)
Stratum_Total:    total number of IPs for the stratum
Observed:         number of observed IPs (from capture data)
LL_Best:          estimated number of IPs (best log-linear model)
LL_Ind:           estimated number of IPs (independent log-linear model)
LL_Sat:           estimated number of IPs (saturated log-linear model)
SC:               always 0
Pingable:         number of pingable IPs (data source named "ping")
LL_Best_Stderr:   standard error for best log-linear model
LL_Best_Inf:      lower value of confidence interval
LL_Best_Sup:      upper value of confidence interval
CV_Pingable:      pingable IPs if cross-validation is used, otherwise 0
CV_Ground_Truth:  ground truth if cross-validation is used, otherwise 0
LL_Best_Trunc:    estimate for log-linear model with truncated Poisson

Note that the first field in each data line is the line number (generated by
R's write.table).


PARAMETERS
----------

IPCR Tool accepts a number of parameters, which are simply shell environment
variables.
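Because the parameters are plain environment variables, a POSIX shell such as
Bash offers two ways to pass them: as a prefix on a single command, or via
export for the rest of the shell session. The sketch below illustrates both.
Here "sh -c" is only a stand-in for the R command and simply prints what the
child process sees; how ipcr_tool.R reads the variables internally is not
shown.

```shell
# 'sh -c ...' is a stand-in for 'R CMD BATCH --vanilla ipcr_tool.R';
# it only prints the values the child process receives.
child='echo "UNIT=${UNIT:-unset} USE_IC=${USE_IC:-unset}"'

# Make sure the calling shell starts without these variables.
unset UNIT USE_IC

# 1) Prefix form: the variables are set for this one command only.
prefix_out=$(UNIT=1000 USE_IC=BIC sh -c "$child")
echo "$prefix_out"

# The calling shell itself is unchanged afterwards.
after_out=$(sh -c "$child")
echo "$after_out"

# 2) Export form: the variables are set for every later command
#    started from this shell session.
export UNIT=1 USE_IC=AIC
export_out=$(sh -c "$child")
echo "$export_out"
```

The prefix form, which the examples in this README use, keeps each run
self-contained, so settings from one run cannot leak into the next.
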
In the following we describe the most commonly used parameters.

FNAMES=" ... "
  List of names of the files that contain the capture histories. Each file
  must have the following format. With N sources the first N columns contain
  the capture histories. The (N+1)th column contains the capture count for
  each history. The file must have 2^N - 1 lines, since the capture history
  with all 0s is missing and is what the tool estimates. An example for two
  sources is:

  0, 1, 1234
  1, 0, 423
  1, 1, 23

FNAMESFILE=""
  Name of file that contains list of capture file names (one per line).

DELEGATED=""
  File name of file with delegated information. The file must have the
  following format with columns separated by "|". The first column specifies
  the RIR, the second column is a 2-letter country code, the third column
  must be "ipv4", the fourth column is the first address of an allocated IP
  range, the fifth column is the number of IPs, the sixth column is the date
  the block was allocated, the seventh column must be "allocated", the eighth
  column is a classification into the type of industry, the last column is
  the number of routed IP addresses. An example of the format is:

  apnic|AU|ipv4|1.0.0.0|256|20110811|allocated|U|256
  apnic|CN|ipv4|1.0.1.0|256|20110414|allocated|U|0
  [...]

  The first 7 columns are obtained from the delegated information, which can
  be downloaded e.g. with

  wget http://bgp.potaroo.net:/stats/nro/delegated

  Column 8 can contain anything if stratification according to industry type
  is not needed. The last column must be generated based on routing data and
  is not needed if the total should be based on allocated addresses (this
  requires a little modification in the code).

STRATUM="" (default: 0)
  Stratum we are interested in. This is an index starting from 1 that relates
  to the columns in the delegated file, e.g. if we want to stratify by
  country then STRATUM=2. Setting this to 0 means no stratification.

DS_EST="" (default: "")
  Treat the IPs observed by the given data source as unknown and estimate
  them (index starts with 1). Use this for cross-validation.

LEAVE_OUT=" ... " (default: "")
  Leave out the data sources specified. A source is specified by the numeric
  index relating to the column in the capture table (index starts with 1).

FREQ_SAMPLE_RATE="" (default: 1.0)
  Specify the sample rate if the capture table contains counts for sampled
  data.

MIN_IP_PER_STRATUM="" (default: 1000)
  Specify the minimum number of IPs needed. If the capture table has fewer
  IPs, it is ignored.

ADAPT_UNIT="0|1" (default: 1)
  Set to 1 to use the capture count adaptation described in [1]. Set to 0
  otherwise.

UNIT="" (default: 1000)
  Set to the initial divider (if ADAPT_UNIT=1) or divider for the capture
  counts.

OPREFIX=""
  Prefix for the result file name.

MAX_SIZE="" (default: "")
  Specify the number of routed IPs, if we want to estimate a specific network
  (mainly used with truncated Poisson). Only set this if you estimate for a
  single IP range.

DO_SUBNETS="0|1" (default: 0)
  Set to 0 if we estimate IPs, set to 1 if we estimate /24 subnets. For
  subnets the counters in the delegated file need to be divided by 256.

MAX_SOURCE_INTERACT="" (default: N)
  Maximum number of source interactions to consider when building the model.

USE_IC="AIC|BIC" (default: AIC)
  Specify which information criterion (IC) to use.

MIN_DELTA_IC="" (default: 7)
  Specify the minimum IC difference needed for a better model.

ALPHA="" (default: 0.0000001)
  Specify alpha for the confidence interval (CI) estimation.

SAME_MODEL_TRUNCPOIS="0|1" (default: 0)
  If set to 1 the truncated Poisson estimation uses the model built for
  Poisson. If set to 0 different models are built for Poisson and truncated
  Poisson.

READ_MODEL="" (default: "")
  If specified reads the model from the file name.

WRITE_MODEL="" (default: "")
  If specified writes the best model to the file name.

SNAMES=" ... "
  This parameter specifies the list of source names.
  If it is not specified, IPCR Tool will try to extract the names from the
  capture history file name, which may fail if the naming of the file does
  not follow the expected convention. The names can be chosen freely, with
  one exception: IPCR Tool reports the number of pingable addresses as the
  number of addresses seen by the source with the name "ping".


EXAMPLE USAGE
-------------

The examples here assume you execute the tool under the Bash shell. If you
use a different shell, look at the shell's man page to find out how to
specify environment variables.

The following is an example to estimate the total used IPv4 space:

> SAME_MODEL_TRUNCPOIS=1 ADAPT_UNIT=1 UNIT=1000 MAX_SRC_INTERACT=4 \
  MIN_DELTA_IC=7 USE_IC=BIC \
  FNAMES="wiki_web_ping_tping_mlab_spam_game_swin_netfl_capture_strat.txt" \
  SNAMES="wiki web ping tping mlab spam game swin netfl" LEAVE_OUT="" \
  DELEGATED=delegated_ipv4_routed_integrated OPREFIX=test \
  R CMD BATCH --vanilla ipcr_tool.R || less +G ipcr_tool.Rout

The following is an example for estimating the used IPv4 addresses for a
specific subnet:

> MAX_SIZE=65536 ALPHA=0.01 MIN_IP_PER_STRATUM=1 SAME_MODEL_TRUNCPOIS=0 \
  ADAPT_UNIT=1 UNIT=1 USE_IC=BIC MAX_SRC_INTERACT=4 \
  FNAMES="subnetX_wiki_spam_mlab_web_game_tping_iping_swin_netfl_capture_data.txt" \
  SNAMES="wiki spam mlab web game tping iping swin netfl" LEAVE_OUT="" \
  DELEGATED=delegated_ipv4_routed_integrated OPREFIX=subnetX_test \
  R CMD BATCH --vanilla ipcr_tool.R || less +G ipcr_tool.Rout


COPYRIGHT
---------

Copyright (c) 2015 Centre for Advanced Internet Architectures,
Swinburne University of Technology.

Author: Sebastian Zander (szander@swin.edu.au)

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License version 2 as published by the
Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA


ACKNOWLEDGEMENTS
----------------

This tool has been made possible in part by an Australian Research Council
(ARC) Linkage Project grant (project LP110100240) with APNIC Pty Ltd as
partner organisation, for a project titled "Tools and models for measuring
and predicting growth in internet addressing and routing complexity".


REFERENCES
----------

[1] S. Zander, L. L. H. Andrew, G. Armitage, "Capturing Ghosts: Predicting
    the Used IPv4 Space by Inferring Unobserved Addresses," in Internet
    Measurement Conference (IMC), November 2014.


CONTACT
-------

If you have any questions or want to report any bugs please contact
Sebastian Zander (szander@swin.edu.au).

Centre for Advanced Internet Architectures
Swinburne University of Technology
Melbourne, Australia
CRICOS number 00111D
http://www.caia.swin.edu.au