Counting network flows
The design of efficient algorithms to count the number of active flows in high-speed networks has recently attracted the interest of the network measurement community (Estan et al., 2003; Fusy and Giroire, 2007; Kim and O’Hallaron., 2003). Counting the number of flows in real-time is particularly relevant to network operators and administrators for network management and security tasks.
However, the naive solution of tracking flows using a hash table is becoming unfeasible in high-speed links. First, it requires several memory accesses per packet with the overhead of creating new flow entries and handling collisions. Second, this solution uses large amounts of memory, since it requires storing all flow identifiers. The number of concurrent flows present in high speed networks is very large, and can be well over one million in current backbone links (Fang and Peterson, 1999). Therefore, hash tables must be stored in DRAM, which has an access time greater than current packet interarrival times. For example, access times of standard DRAM are in the order of tens of nanoseconds, while packet interarrival times can be up to 32 ns and 8 ns in OC-192 (10 Gb/s) and OC-768 (40 Gb/s) links respectively. Thus, flow counting algorithms must be able to process each packet in very few nanoseconds to be suitable for high-speed links.
Counting distinct items
The problem of counting flows in network traffic is equivalent to that of counting distinct elements within a multiset, where the distinct elements to be counted are the flow identifiers. This problem has received attention from the database research community. Many algorithms have been proposed that can obtain approximate distinct element counts over a large dataset by using little memory, requiring one single pass over the data, and with extremely low computational complexity. This problem was first explored in database research, where these algorithms were used for database query optimization (Flajolet, 2004).
The proposal by Whang et al. (1990), called linear-time probabilistic counting, is based on pseudo-random hashing. The ratio of buckets that would have stayed empty on a hypothetical hash table is used to estimate the number of unique elements in the original dataset. This algorithm was popularized under the name direct bitmaps in the networking community by Estan et al. (2003). An algorithm called Loglog counting of similar efficiency to multi-resolution bitmaps is proposed in (Durand and Flajolet, 2003). Giroire’s proposal (2009) is based on estimating cardinalities from the minimum of the hashed values. These and other proposed algorithms are compared in (Metwally et al., 2008). A remarkable conclusion from this study is that, while direct bitmaps are theoretically slightly inferior to its competitors, they are the most practical choice, given their overall good accuracy, extreme speed and ease of implementation.
Direct bitmaps
Like the naive algorithm of using a hash table to track flows, direct bitmaps are based on hashing, but do not create one table entry per flow. Instead, they just record whether a position in the hash table would be used or not (hence the name bitmap). The algorithm uses a pseudo-random hash function (e.g., (Carter and Wegman, 1979)) to evenly distribute the positions corresponding to each flow identifier.
One could think of extrapolating the number of ones in the bitmap as the number of flows in the original dataset. However, this would not be correct, since there is a non-negligible probability that several flow identifiers collide (i.e., hash to the same bitmap position), given the reduced size of the bitmap. While the bitmap cannot be used to extract the exact count of flows, instead, an estimate can be obtained from the bitmap size (b) and the number of zeros in the bitmap (z) as follows:
The principal advantage of this technique is that, with a small amount of memory (especially when compared to the naive algorithm), the number of flows can be estimated with high accuracy. A second important characteristic of this approach is that the accuracy can be arbitrarily increased or reduced by the bitmap size. To correctly dimension the bitmap, the appropriate size must be chosen so that the expected amount of unique elements can be estimated within the desired error bounds, as explained in (Whang et al., 1990).
To give an idea of how little memory this algorithm requires, it suffices to state that this technique can count 10M elements by using around 20KB with an accuracy of 1%.
Estan et al. (2003) present several variants of this basic algorithm that are algorithmically more complex but offer an increased space efficiency, most notably including multi-resolution bitmaps.
The sliding window model
Usually, an analyst is not interested on calculating a metric across the whole stream’s lifetime. The latest data of the stream is often more relevant than the older data. Several models exist to capture this notion of interest decay. The most relevant is the sliding window model, where data older than a certain age is completely irrelevant. This and other interest decay functions are analyzed in (Cohen and Strauss, 2003).
Interest decay functions, including the sliding window model, introduce an extra layer of complexity to algorithms, in that they must take into account the aging of information, and expire it accordingly. For example, when tracking the number of flows over a sliding window using a hash table, each entry would additionally require an expiration timestamp, and the algorithm should discount old entries when calculating its result.
Several efficient specialized algorithms to extract statistics over a sliding window have been proposed. Datar et al. (2002) analyze the basic problem of counting the number of ones within the last N elements of an arbitrary stream composed of zeros and ones. They present an algorithm based on a technique called Exponential Histograms that can provide an approximate solution, and prove its optimality in terms of memory usage. They then proceed to solve, using their basic counting algorithm as a building block, similar problems such as calculating sums, averages, maximum and minimum values and distinct counts. For the distinct count problem, they propose adapting probabilistic counting techniques to the sliding window model by storing timestamps instead of bits in each bitmap position, an approach that (Kim and O’Hallaron, 2003) also takes.
Fusy and Giroire (2007) present in turn an adaptation of the proposal of Giroire (2009) to the sliding window model. Their algorithm tracks the minimum of the hashes of each flow identifier over a sliding window to estimate the flow count.
Timestamp vector
Direct bitmaps do not incorporate a sense of time and thus must be periodically reset between measurement intervals. Periodically querying and resetting a direct bitmap would provide flow counts over consecutive, non-overlapping windows. While there is value to this application of the technique, it imposes the restriction that queries must be aligned with bitmap resets. In contrast, in the sliding window model, queries can arrive at any time. This requirement implies that old information must be removed as newer arrives, in order to continuously maintain the data structures to be able to provide an estimate at any time.
A straightforward solution to adapt the direct bitmaps to the sliding window model is the Timestamp Vector algorithm (Kim and O’Hallaron, 2003). Instead of a bitmap, a vector of timestamps is now used. When a packet hashes to a particular position, its timestamp is set to the timestamp of that packet. Using this vector, when a query for a time window of w time units arrives at time t, the number of flows can be estimated by using (1), where in this case z corresponds to the number of positions with timestamp less than t – w, and b to the number of positions in the vector. Note that this requires a full traversal of the vector for each query.
However, the timestamp vector increases the memory requirements of the original bitmap, since each vector position now holds a high resolution timestamp, usually of 64 bits. Relative timestamps can be used to reduce this overhead as described in (Sanjuàs-Cuxart et al., 2009).
Countdown vector
The main difficulty when adapting the direct bitmap to the sliding window model is to remove old information from the data structure as time advances. The Countdown vector algorithm (CDV) presented in (Sanjuàs-Cuxart et al., 2009) also builds upon the direct bitmaps technique. The algorithm, however, obtains significant cost reductions both in terms of memory and CPU by introducing an extra approximation in the mechanism in charge of the expiration of old information.
Assume the following ideal algorithm which would precisely calculate z in a sliding window of w time units. As discussed, b (vector size) and z (number of zeros in it) suffice to estimate the number of flows in the traffic using (1). A vector of b positions could be used, with the values set to w time units every time a packet hashed to the corresponding position. Every time unit, all the positions of the vector with non-zero value would be decremented by one. The count of positions with counter value zero would correspond to z in order to estimate the number of flows seen in the time window.
In order to obtain a perfect resolution, this scheme would require defining the time unit to the maximum resolution of the system clock. This ideal scheme would therefore be very costly in terms of both memory and CPU. First, a high resolution counter would have be stored in each vector position, thus increasing the overall memory required to store the vector. Second, all of the counters would need to be updated for each time unit. These additional costs make this technique infeasible as described, especially when it would achieve results equivalent to the TSV.
The CDV is very similar to the ideal algorithm just described but, instead, using small integer counters in each position. Using small values requires less memory and calls for a low counter decrement frequency for counters to reach zero after w time units, in exchange of introducing small inaccuracies. The key to the effectiveness of the CDV is that surprisingly low values suffice to achieve good accuracy to estimate network flow counts (Sanjuàs-Cuxart et al., 2009).
References
-
[journals-jcss-CarterW79] bibtexL. Carter and M. N. Wegman, "Universal Classes of Hash Functions.," J. Comput. Syst. Sci., vol. 18, iss. 2, pp. 143-154, 1979.
@article{journals/jcss/CarterW79, added-at = {2011-07-05T00:00:00.000+0200},
author = {Carter, Larry and Wegman, Mark N.},
biburl = {http://www.bibsonomy.org/bibtex/2699bdae3e1016d857e5c2120c826b48f/dblp},
ee = {http://dx.doi.org/10.1016/0022-0000(79)90044-8},
interhash = {20e4f9c5ba054a0bba7d3045aeab3c4f},
intrahash = {699bdae3e1016d857e5c2120c826b48f},
journal = {J. Comput. Syst. Sci.},
keywords = {dblp},
number = 2, pages = {143-154},
timestamp = {2011-07-05T00:00:00.000+0200},
title = {Universal Classes of Hash Functions.},
url = {http://dblp.uni-trier.de/db/journals/jcss/jcss18.html#CarterW79},
volume = 18, year = 1979 }
-
[conf-pods-CohenS03] bibtexE. Cohen and M. Strauss, "Maintaining time-decaying stream aggregates.," in PODS, 2005-05-20 2003, pp. 223-233.
@inproceedings{conf/pods/CohenS03, added-at = {2005-05-20T00:00:00.000+0200},
author = {Cohen, Edith and Strauss, Martin},
biburl = {http://www.bibsonomy.org/bibtex/27de4522e5d46e20946b9555a87a44bba/dblp},
booktitle = {PODS},
crossref = {conf/pods/2003},
date = {2005-05-20},
description = {dblp},
ee = {http://doi.acm.org/10.1145/773153.773175},
interhash = {11729da6a26f9181ef5d9bc0c336ea99},
intrahash = {7de4522e5d46e20946b9555a87a44bba},
isbn = {1-58113-670-6},
keywords = {dblp},
pages = {223-233},
publisher = {ACM},
timestamp = {2005-05-20T00:00:00.000+0200},
title = {Maintaining time-decaying stream aggregates.},
url = {http://dblp.uni-trier.de/db/conf/pods/pods2003.html#CohenS03},
year = 2003 }
-
[journals-siamcomp-DatarGIM02] bibtexM. Datar, A. Gionis, P. Indyk, and R. Motwani, "Maintaining Stream Statistics over Sliding Windows.," SIAM J. Comput., vol. 31, iss. 6, pp. 1794-1813, 2002.
@article{journals/siamcomp/DatarGIM02, added-at = {2011-09-12T00:00:00.000+0200},
author = {Datar, Mayur and Gionis, Aristides and Indyk, Piotr and Motwani, Rajeev},
biburl = {http://www.bibsonomy.org/bibtex/2e46686e593ae6a72e018c58f9532cbee/dblp},
ee = {http://dx.doi.org/10.1137/S0097539701398363},
interhash = {dadb83a56e419de1015fc5c0e8073d8d},
intrahash = {e46686e593ae6a72e018c58f9532cbee},
journal = {SIAM J. Comput.},
keywords = {dblp},
number = 6, pages = {1794-1813},
timestamp = {2011-09-12T00:00:00.000+0200},
title = {Maintaining Stream Statistics over Sliding Windows.},
url = {http://dblp.uni-trier.de/db/journals/siamcomp/siamcomp31.html#DatarGIM02},
volume = 31, year = 2002 }
-
[conf-esa-DurandF03] bibtexM. Durand and P. Flajolet, "Loglog Counting of Large Cardinalities (Extended Abstract).," in ESA, 2003, pp. 605-617.
@inproceedings{conf/esa/DurandF03, added-at = {2011-07-05T00:00:00.000+0200},
author = {Durand, Marianne and Flajolet, Philippe},
biburl = {http://www.bibsonomy.org/bibtex/2f4cd91e3db4a43f80812d21db6460096/dblp},
booktitle = {ESA},
crossref = {conf/esa/2003},
editor = {Battista, Giuseppe Di and Zwick, Uri},
ee = {http://dx.doi.org/10.1007/978-3-540-39658-1_55},
interhash = {a51183a3c316219530f2d6708341a4ed},
intrahash = {f4cd91e3db4a43f80812d21db6460096},
isbn = {3-540-20064-9},
keywords = {dblp},
pages = {605-617},
publisher = {Springer},
series = {Lecture Notes in Computer Science},
timestamp = {2011-07-05T00:00:00.000+0200},
title = {Loglog Counting of Large Cardinalities (Extended Abstract).},
url = {http://dblp.uni-trier.de/db/conf/esa/esa2003.html#DurandF03},
volume = 2832, year = 2003 }
-
[conf-imc-EstanVF03] bibtexC. Estan, G. Varghese, and M. Fisk, "Bitmap algorithms for counting active flows on high speed links.," in Internet Measurement Comference, 2006-02-15 2003, pp. 153-166.
@inproceedings{conf/imc/EstanVF03, added-at = {2006-02-15T00:00:00.000+0100},
author = {Estan, Cristian and Varghese, George and Fisk, Mike},
biburl = {http://www.bibsonomy.org/bibtex/255cd6e36a3edd506892634da262968ec/dblp},
booktitle = {Internet Measurement Comference},
crossref = {conf/imc/2003},
date = {2006-02-15},
description = {dblp},
ee = {http://doi.acm.org/10.1145/948205.948225},
interhash = {db8097ef626bf06c225928a59e0a02c7},
intrahash = {55cd6e36a3edd506892634da262968ec},
isbn = {1-58113-773-7},
keywords = {dblp},
pages = {153-166},
publisher = {ACM},
timestamp = {2006-02-15T00:00:00.000+0100},
title = {Bitmap algorithms for counting active flows on high speed links.},
url = {http://dblp.uni-trier.de/db/conf/imc/imc2003.html#EstanVF03},
year = 2003 }
- Fang, W. and Peterson, L. Inter-AS traffic patterns and their implications. In Proc. of IEEE Globecom, 1999.
-
[conf-asian-Flajolet04] bibtexP. Flajolet, "Counting by Coin Tossings.," in ASIAN, 2004-12-13 2004, pp. 1-12.
@inproceedings{conf/asian/Flajolet04, added-at = {2004-12-13T00:00:00.000+0100},
author = {Flajolet, Philippe},
biburl = {http://www.bibsonomy.org/bibtex/284cfd381bcfc424f34d086fbcd3f9794/dblp},
booktitle = {ASIAN},
crossref = {conf/asian/2004},
date = {2004-12-13},
description = {dblp},
editor = {Maher, Michael J.},
ee = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=3321&spage=1},
interhash = {146b70ad3fb642d5ca7fc00a9492963b},
intrahash = {84cfd381bcfc424f34d086fbcd3f9794},
isbn = {3-540-24087-X},
keywords = {dblp},
pages = {1-12},
publisher = {Springer},
series = {Lecture Notes in Computer Science},
timestamp = {2004-12-13T00:00:00.000+0100},
title = {Counting by Coin Tossings.},
url = {http://dblp.uni-trier.de/db/conf/asian/asian2004.html#Flajolet04},
volume = 3321, year = 2004 }
- Fusy, E. and Giroire, F. Estimating the number of Active Flows in a Data Stream over a Sliding Window. Proc. of SIAM Workshop on Analytic Algorithmics and Combinatorics, 2007.
-
[journals-dam-Giroire09] bibtexF. Giroire, "Order statistics and estimating cardinalities of massive data sets.," Discrete Applied Mathematics, vol. 157, iss. 2, pp. 406-427, 2009.
@article{journals/dam/Giroire09, added-at = {2009-01-20T00:00:00.000+0100},
author = {Giroire, Frédéric},
biburl = {http://www.bibsonomy.org/bibtex/2bda31a9f274ff5fbc1d684d1d2cecd8e/dblp},
date = {2009-01-20},
description = {dblp},
ee = {http://dx.doi.org/10.1016/j.dam.2008.06.020},
interhash = {a1b8b2e85bfc70f398087675a2dde20d},
intrahash = {bda31a9f274ff5fbc1d684d1d2cecd8e},
journal = {Discrete Applied Mathematics},
keywords = {dblp},
number = 2, pages = {406-427},
timestamp = {2009-01-20T00:00:00.000+0100},
title = {Order statistics and estimating cardinalities of massive data sets.},
url = {http://dblp.uni-trier.de/db/journals/dam/dam157.html#Giroire09},
volume = 157, year = 2009 }
- Kim, H.A. and O’Hallaron, D.R. Counting network flows in real time. In Proc. of IEEE Globecom, 2003.
-
[conf-edbt-MetwallyAA08] bibtexA. Metwally, D. Agrawal, and A. E. Abbadi, "Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic.," in EDBT, 2008-04-01 2008, pp. 618-629.
@inproceedings{conf/edbt/MetwallyAA08, added-at = {2008-04-01T00:00:00.000+0200},
author = {Metwally, Ahmed and Agrawal, Divyakant and Abbadi, Amr El},
biburl = {http://www.bibsonomy.org/bibtex/24ee03c6a2746beac02d3e996b3a99230/dblp},
booktitle = {EDBT},
crossref = {conf/edbt/2008},
date = {2008-04-01},
description = {dblp},
editor = {Kemper, Alfons},
ee = {http://doi.acm.org/10.1145/1353343.1353418},
interhash = {5579cef998b931391f5bb1030b53e017},
intrahash = {4ee03c6a2746beac02d3e996b3a99230},
isbn = {978-1-59593-926-5},
keywords = {dblp},
pages = {618-629},
publisher = {ACM},
series = {ACM International Conference Proceeding Series},
timestamp = {2008-04-01T00:00:00.000+0200},
title = {Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic.},
url = {http://dblp.uni-trier.de/db/conf/edbt/edbt2008.html#MetwallyAA08},
volume = 261, year = 2008 }
- Sanjuàs-Cuxart, J., Barlet-Ros, P. Solé-Pareta, J. Counting flows over sliding windows in high speed networks. In Proc. of IFIP-TC6 Networking, 2009.
-
[journals-tods-WhangVT90] bibtexK. Whang, B. V. T. Zanden, and H. M. Taylor, "A Linear-Time Probabilistic Counting Algorithm for Database Applications.," ACM Trans. Database Syst., vol. 15, iss. 2, pp. 208-229, 1990.
@article{journals/tods/WhangVT90, added-at = {2012-01-25T00:00:00.000+0100},
author = {Whang, Kyu-Young and Zanden, Brad T. Vander and Taylor, Howard M.},
biburl = {http://www.bibsonomy.org/bibtex/237d65a23f005cde093f4d627a451a775/dblp},
cdrom = {TODS15/P208.PDF},
cite = {conf/sigmod/YuLSO79},
ee = {http://doi.acm.org/10.1145/78922.78925},
interhash = {6205d502485f819a900adcca9485362c},
intrahash = {37d65a23f005cde093f4d627a451a775},
journal = {ACM Trans. Database Syst.},
keywords = {dblp},
number = 2, pages = {208-229},
timestamp = {2012-01-25T00:00:00.000+0100},
title = {A Linear-Time Probabilistic Counting Algorithm for Database Applications.},
url = {http://dblp.uni-trier.de/db/journals/tods/tods15.html#WhangVT90},
volume = 15, year = 1990 }
