Traffic Modelling

A large body of literature has developed concepts and techniques for modeling Internet traffic, especially in terms of its statistical properties (e.g., heavy tails, self-similarity). For example, heavy-tailed distributions appear in the sizes of files stored on web servers [124], data files transferred through the Internet [294], and files stored in general-purpose Unix filesystems, suggesting the prevalence and importance of these distributions. Internet traffic also exhibits self-similarity. In a pioneering work, Leland et al. showed that LAN traffic is self-similar in nature [243]. Evidence of self-similarity was also found in WAN traffic [296]: Paxson and Floyd demonstrated that self-similar processes capture the statistical characteristics of WAN packet arrivals more accurately than Poisson arrival processes, which are quite limited in their burstiness, especially when multiplexed to a high degree. Self-similar traffic exhibits no natural length for its “bursts”; its bursts appear across a wide range of time scales [243]. The relation between self-similarity and heavy-tailed behavior in wired LAN and WAN traffic was analyzed by Willinger et al. [355]. On the other hand, Poisson processes can be used to model the arrival of user sessions (e.g., telnet connections and ftp control connections). However, modeling packet arrivals within telnet connections by a Poisson process may result in inaccurate delay characteristics, since packet arrivals are strongly affected by network dynamics and protocol behavior.
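
To make the connection between heavy tails and self-similarity concrete, the following sketch (with illustrative parameters, not values fitted to any of the cited traces) superposes ON/OFF sources whose period lengths are Pareto distributed, in the spirit of the construction analyzed by Willinger et al. [355]. With a tail exponent between 1 and 2, the aggregate load remains bursty across time scales, whereas light-tailed periods smooth out, Poisson-like:

```python
import random
import statistics

def pareto(alpha, xmin=1.0):
    """One draw from a Pareto(alpha) distribution via inverse-CDF."""
    return xmin / (random.random() ** (1.0 / alpha))

def onoff_source(alpha, horizon):
    """Yield the (start, end) ON intervals of a single ON/OFF source
    whose ON and OFF period lengths are Pareto(alpha) distributed."""
    t, on = 0.0, random.random() < 0.5
    while t < horizon:
        d = pareto(alpha)
        if on:
            yield (t, min(t + d, horizon))
        t += d
        on = not on

def aggregate(n_sources, alpha, horizon, bin_size=1.0):
    """Superpose n_sources ON/OFF sources and count total ON time
    per bin. With 1 < alpha < 2 the aggregate stays bursty at every
    time scale (asymptotically self-similar); larger alpha smooths out."""
    nbins = int(horizon / bin_size)
    load = [0.0] * nbins
    for _ in range(n_sources):
        for s, e in onoff_source(alpha, horizon):
            for b in range(int(s / bin_size), min(int(e / bin_size) + 1, nbins)):
                lo, hi = b * bin_size, (b + 1) * bin_size
                load[b] += max(0.0, min(e, hi) - max(s, lo))
    return load

if __name__ == "__main__":
    bursty = aggregate(n_sources=50, alpha=1.4, horizon=1000.0)
    smooth = aggregate(n_sources=50, alpha=3.0, horizon=1000.0)
    print("alpha=1.4 stdev:", statistics.stdev(bursty))
    print("alpha=3.0 stdev:", statistics.stdev(smooth))
```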

Web traffic also exhibits self-similarity. Crovella and Bestavros showed evidence of this and attempted to explain it in terms of file-system characteristics (e.g., the distribution of web file sizes, user preferences in file transfer, effects of caching), user behavior (e.g., the “think time” when accessing a web page), and the aggregation of many such flows in a LAN [123]. The majority of web flows in wired networks are below 10 KB, while a small percentage of very large flows accounts for 90% of the total traffic. They employed power laws to describe web flow sizes. We also observed similar phenomena in the campus-wide wireless traffic. A nice discussion of the use of power-law and lognormal distributions in other fields can be found in [266].
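
This “mice and elephants” effect is easy to reproduce. Assuming, purely for illustration, Pareto-distributed flow sizes with a tail exponent near 1, the following sketch estimates what fraction of the largest flows carries 90% of the bytes:

```python
import random

def pareto_sample(n, alpha=1.2, xmin=1.0):
    """n draws from a Pareto(alpha) distribution (inverse-CDF method)."""
    return [xmin / (random.random() ** (1.0 / alpha)) for _ in range(n)]

# Illustrative parameters, not fitted to any real trace.
sizes = sorted(pareto_sample(100_000, alpha=1.2), reverse=True)
total = sum(sizes)

# How many of the largest flows are needed to cover 90% of the bytes?
carried, count = 0.0, 0
for s in sizes:
    carried += s
    count += 1
    if carried >= 0.9 * total:
        break
print(f"{count / len(sizes):.2%} of flows carry 90% of the bytes")
```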

Peer-to-peer applications evolve rapidly, dominating the traffic mix in several cases. As recent studies have indicated, peer-to-peer and web traffic differ significantly: for example, whereas web clients may download a popular web page multiple times, the immutability of Kazaa’s multimedia objects leads clients to fetch each object at most once [168]. However, the growing number of peer-to-peer applications, the differences in their communication patterns, and the difficulty of classifying them accurately make modeling peer-to-peer traffic challenging.

Two general approaches to traffic generation are packet-level replay and source-level generation. Packet-level replay is an exact reproduction of a collected trace in terms of packet arrival times, sizes, sources and destinations, and content types. To analyze a system under various traffic conditions, researchers need to employ appropriate packet-level traces that exhibit the required conditions. However, collecting the appropriate empirical data is a non-trivial task. Specifically, reproducing the intended packet arrival process can be complex due to the arbitrary delays introduced at the various network components by interrupts, service mechanisms, and scheduling processes. Moreover, closed-loop (feedback) characteristics reflect the reactions of a flow’s source and destination to network conditions, triggering further changes (e.g., TCP’s congestion avoidance mechanism); packet-level replays cannot capture such feedback-loop behavior.
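
A minimal sketch of open-loop packet-level replay makes the limitation visible: the recorded inter-arrival times are preserved exactly, but the sender never reacts to network feedback. The trace and send function below are placeholders, not a real capture or injection API:

```python
import time

def replay(trace, send):
    """Replay (timestamp, packet) records, preserving the recorded
    inter-arrival times. This is open-loop: packets are emitted on
    schedule regardless of how the network reacts, so feedback effects
    such as TCP congestion avoidance are not reproduced."""
    prev = None
    for ts, pkt in trace:
        if prev is not None:
            time.sleep(max(0.0, ts - prev))
        send(pkt)
        prev = ts

# Toy usage: 'trace' and 'send' stand in for a captured trace and an
# injection function (e.g., a raw-socket writer).
trace = [(0.00, b"pkt0"), (0.12, b"pkt1"), (0.15, b"pkt2")]
replay(trace, send=lambda p: print(time.monotonic(), p))
```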

Adopting a different approach, source-level generation models the sources of traffic (e.g., the applications running on the source and destination hosts). These sources are used as building blocks, along with the various network components that can be modeled or simulated, allowing the analysis of a system under various conditions. The generation of packet-level data can be based on statistical properties that characterize the empirical data, thus ensuring that the synthetic data are “realistic enough”. However, it is important to note that the realism of a trace depends tightly on the system to be studied. Selecting statistical properties that are general enough, yet tunable to express different traffic conditions and profiles, is a non-trivial task and depends on the characteristics of the system under study. The source-level approach, advocated by Paxson and Floyd [149], allows the underlying network, protocol, and application layers to specify and control the packet arrival process. The infinite source model is one of the simplest and most popular source-level models. It has no parameters and is used to model very large network flows. However, it models real traffic poorly, since the majority of Internet flows are relatively light, bidirectional, and composed of small packets [107, 211, 150]. An enlightening discussion of these approaches is included in [178].
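
The following sketch illustrates the source-level idea under simple, illustrative assumptions (Poisson session arrivals, heavy-tailed flow sizes; all names and parameter values are hypothetical): the model specifies sessions and the flows within them, while packet-level timing is left to the simulated transport and network rather than dictated by the trace:

```python
import random

def poisson_arrivals(rate, horizon):
    """Session start times from a homogeneous Poisson process
    (exponential inter-arrival times)."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)

def make_session(start):
    """One user session: a handful of flows with heavy-tailed sizes.
    Segmentation, pacing, and retransmission are left to the simulated
    transport, not specified by the source model."""
    n_flows = 1 + int(random.expovariate(0.5))
    flows = []
    for _ in range(n_flows):
        size = 1000.0 / (random.random() ** (1.0 / 1.2))  # Pareto bytes
        flows.append({"start": start + random.uniform(0, 30), "bytes": size})
    return flows

sessions = [make_session(t) for t in poisson_arrivals(rate=0.1, horizon=3600)]
flows = [f for s in sessions for f in s]
print(len(sessions), "sessions,", len(flows), "flows generated")
```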

While there is rich literature on traffic characterization in wired networks (e.g., [354, 80, 120, 102, 278]), there is significantly less work of the same depth for WLANs. Hierarchical approaches to modeling wireless demand and its spatial and temporal phenomena have received little attention from our community. Meng et al. [261] used the available Dartmouth traces, which include syslog messages and tcpdump data from 31 APs in five buildings. They proposed a two-tier (Weibull regression) model for the arrival of flows at APs and a Weibull model for flow residence times, and they also observed high spatial similarity within the same building. The authors also studied the modeling of flow sizes and suggested that a lognormal model provides the best approximation. Kim et al. [223] clustered APs based on their peak hours and analyzed the distribution of arrivals for each cluster, using the aggregate client arrivals and departures at APs. Similar clusters based on registration patterns were also reported by Jain et al. in their modeling study of user registration at APs [201].
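
As an illustration of the model-selection step underlying such studies, the sketch below fits lognormal, Weibull, and Pareto candidates to a set of flow sizes (synthetic here, in place of trace data) and compares them with a Kolmogorov-Smirnov statistic:

```python
import numpy as np
from scipy import stats

# Hypothetical flow sizes (bytes); in practice these come from a trace.
rng = np.random.default_rng(0)
flow_sizes = rng.lognormal(mean=8.0, sigma=2.0, size=5000)

# Fit candidate models and compare goodness of fit with a K-S test.
for name, dist in [("lognormal", stats.lognorm),
                   ("weibull", stats.weibull_min),
                   ("pareto", stats.pareto)]:
    params = dist.fit(flow_sizes)
    ks = stats.kstest(flow_sizes, dist.name, args=params)
    print(f"{name:10s} KS statistic = {ks.statistic:.4f}")
```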

Papadopouli et al. [179, 217] followed a novel methodology for modeling wireless access and traffic demand from a multilevel perspective. In particular, they modeled the arrival and size of sessions and flows at various spatio-temporal scales and explored their statistical properties, dependencies, and inter-relations. Time-varying Poisson processes provide a suitable tool for modeling the arrival processes of clients at APs. They validated these results by modeling the visit arrival rates at different time intervals and APs. In addition, they proposed a clustering of the APs based on their visit arrivals and the functionality of the area in which they are located. The models have been validated using empirical data from different time periods (an entire week in April 2005 and another in April 2006), different time scales (week, day, hour), different spatial scales (AP, group of APs located within the same building, set of APs located within buildings of the same functionality, and the entire wireless infrastructure), and various workload conditions (with respect to application mixes and amount of traffic load). The BiPareto distribution provides a good model for the flow sizes of the Dartmouth trace, collected from its campus-wide wireless infrastructure. They generated synthetic traces based on these models and showed that the traces result in performance very close to that obtained when empirical traces are used as input. Furthermore, synthetic traces based on popular models frequently employed in simulations exhibit large deviations from the empirical traces. The trade-offs between accuracy and scalability of these models were also evaluated using statistics-based and systems-based benchmarks.
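
A sketch of such a synthetic-trace generator appears below: flow arrivals are drawn from a time-varying Poisson process via Lewis-Shedler thinning, and flow sizes from a BiPareto distribution by numerically inverting its CCDF. The BiPareto parameterization shown is one common form and should be treated as an assumption, and all parameter values are illustrative rather than fitted values from [179, 217]:

```python
import math
import random

def bipareto_ccdf(x, a, b, c, k):
    """CCDF of a BiPareto(a, b, c, k): Pareto-like with exponent a near
    the minimum k, crossing over to exponent b beyond scale c.
    (One common parameterization; an assumption here.)"""
    return (x / k) ** (-a) * ((x + c) / (k + c)) ** (a - b)

def bipareto_sample(a, b, c, k):
    """Draw one sample by numerically inverting the CCDF (bisection)."""
    u = random.random()
    lo, hi = k, k
    while bipareto_ccdf(hi, a, b, c, k) > u:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if bipareto_ccdf(mid, a, b, c, k) > u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def thinned_poisson(rate_fn, rate_max, horizon):
    """Arrival times of a time-varying Poisson process with intensity
    rate_fn(t) <= rate_max, via Lewis-Shedler thinning."""
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(rate_max)
        if t >= horizon:
            return arrivals
        if random.random() < rate_fn(t) / rate_max:
            arrivals.append(t)

# Toy diurnal intensity (busier mid-day); parameters are illustrative.
day = 86400.0
rate = lambda t: 0.02 + 0.08 * max(0.0, math.sin(math.pi * (t % day) / day))
arrivals = thinned_poisson(rate, rate_max=0.1, horizon=day)
sizes = [bipareto_sample(a=0.9, b=1.5, c=5e4, k=100.0) for _ in arrivals]
if sizes:
    print(len(arrivals), "flows; median size", sorted(sizes)[len(sizes) // 2])
```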

References