Data streaming algorithms for traffic analysis

The most challenging aspect of traffic analysis algorithms is that they have to operate under heavy resource constraints. Due to the ever-increasing bandwidth of network links and, especially, of backbone network links, they have little time to process each incoming packet. Also due to the massive data rates, buffering for long periods of time is impractical. Therefore, network traffic has to be processed in strict real time. Failure to do so results in buffer overflow, which in turn leads to packet losses that impact severely on the measurement accuracy.

The problem of data stream processing has recently attracted much attention in the database community. The stream processing model assumes, instead of the traditional static data set, that data flows continuously from a source. The system, thus, cannot control the input data rates. In addition, data streams are potentially unbounded in size, which impedes storing the whole dataset. Therefore, algorithms can perform only one single pass on the data.

Examples of data streams are the network traffic, information from sensors and financial data, their key characteristic being the large volume and data rates. Interesting introductions to the field of data stream research can be found in [Muthukrishnan, 2005] and [Babcock et al., 2002]. A more informal discussion of the motivations behind this research area can be found here.

The data streaming community has provided very efficient algorithms that can be used for traffic analysis. For example, algorithms for counting unique items in a data stream can be used for counting network flows in high-speed links, while techniques to identify frequent items can be applied to detect heavy hitters or elephant flows.

Traffic analysis algorithms

References

  • [conf-pods-BabcockBDMW02] bibtex
    B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and Issues in Data Stream Systems.," in PODS, 2006-03-31 2002, pp. 1-16.
    @inproceedings{conf/pods/BabcockBDMW02,
      author = {Brian Babcock and Shivnath Babu and Mayur Datar and Rajeev Motwani and Jennifer Widom},
      booktitle = {PODS},
      crossref = {conf/pods/2002},
      editor = {Lucian Popa},
      interhash = {80a3fd2acfc5bec5b6af3f3e28901604},
      intrahash = {47ce53a26d706f07040f0d3600e735f0},
      pages = {1-16},
      publisher = {ACM},
      title = {Models and Issues in Data Stream Systems.},
      url = {http://dblp.uni-trier.de/db/conf/pods/pods02.html#BabcockBDMW02},
      year = 2002, keywords = {dblp},
      ee = {http://www.acm.org/sigs/sigmod/pods/proc02/papers/001-BabcockBDMW.pdf},
      added-at = {2006-03-31T00:00:00.000+0200},
      description = {dblp},
      biburl = {http://www.bibsonomy.org/bibtex/247ce53a26d706f07040f0d3600e735f0/dblp},
      date = {2006-03-31}
    }
  • [muthu] bibtex
    S. Muthukrishnan, Data streams: algorithms and applications, Now Publishers, 2005.
    @book{muthu,
      author = {S. Muthukrishnan},
      interhash = {a365da8fa1ad76fc705c326614e82138},
      intrahash = {e316fa9da3ef8c54ad6c2953b5e5f332},
      publisher = {Now Publishers},
      title = {Data streams: algorithms and applications},
      year = 2005, keywords = {imported},
      added-at = {2008-04-30T12:59:47.000+0200},
      description = {KDubiq Blueprint},
      biburl = {http://www.bibsonomy.org/bibtex/2e316fa9da3ef8c54ad6c2953b5e5f332/kdubiq}
    }