Date: Wed, 17 Aug 1994 21:20:58 -0700 From: Hal To: cypherpunks@toad.com Subject: Statistics on remail message sizes A couple of weeks ago Eric asked for statistical information on remailer message sizes. I put in a size-counter a week ago (just piping each message into wc >> remail2/SIZE.REMAIL) or so, and here are some results. They show 645 messages logged, a sample of what the logs look like, the average size of a message in characters (counting the header) of about 15K, and a histogram of message sizes rounded to the nearest 1000. Note that the histogram is pretty irregular, possibly being affected by repeated sending of certain messages. jobe% wc remail2/SIZE.REMAIL 645 1935 16125 remail2/SIZE.REMAIL jobe% tail remail2/SIZE.REMAIL 58 189 3225 16 90 850 18 121 1016 14 90 896 23 140 1350 653 803 41937 710 860 45666 710 860 45642 20 96 901 28 146 1344 jobe% awk '{sum=sum+$3} END{print sum/NR}' < remail2/SIZE.REMAIL 14794.4 jobe% < remail2/SIZE.REMAIL awk '{print int(($3+500)/1000)*1000}' | sort -n | uniq -c 229 1000 82 2000 50 3000 21 4000 3 5000 45 6000 9 7000 1 8000 1 9000 3 10000 2 11000 1 12000 2 13000 5 14000 3 16000 3 17000 2 18000 1 19000 2 21000 3 23000 1 24000 2 25000 2 26000 2 27000 1 28000 1 30000 1 31000 1 32000 39 34000 37 35000 1 37000 2 38000 2 42000 2 46000 1 48000 1 49000 1 50000 1 51000 1 55000 9 59000 69 60000 I did one other test, which was to see which message sizes were repeated the most. The first number shows the number of lines which have messages of exactly the second number of bytes: jobe% < remail2/SIZE.REMAIL awk '{print }' | sort -n | uniq -c | sort -nr | sed 20q > times2 40 896 40 1350 20 5797 14 1344 11 33845 11 1242 10 892 9 33992 9 1248 8 1753 7 33975 5 1765 5 1757 5 1236 4 901 4 1749 4 1251 3 59725 3 59668 3 5945 It is clear that there is a lot of repetition, probably standard ping messages and the like. This should give enough info to discard the highly repeated sets from the histogram above in order to get a possibly more representative set of numbers. Hal