For software upgrades of stateful network elements, fast recovery from downtime and strict avoidance of service disruption are two critical principles for guaranteeing performance and enhancing scalability. On the one hand, the recovery speed of a network element after an upgrade is mainly determined by its module integration time. While Click’s flexible design has satisfied many of the demands of rapid prototyping, its internal architecture has not kept pace with the needs of software upgrades. A traditional Click-driven network element upgrade is time-consuming because the software integration of the upgrade does not adhere strictly to the service context. As a result, a series of functionality-neutral modules are redundantly shipped along with the essential modules.
As shown in the above figure, we integrate varying numbers of new modules into an upgraded network element in a Click-native manner and observe the integration latency. The latency is 3.806s when one new module is integrated, and increases to 5.652s when five are integrated. Moreover, if a software upgrade starts a new network element from scratch with a module count comparable to that of commercial network elements (about 45 modules), the latency reaches 14.459s. These results are far from satisfactory. For the most common range of newly integrated modules, i.e. 1-5, an ideal integration should complete with sub-millisecond latency rather than the latency of several seconds measured here. Therefore, explicitly integrating only the essential modules in a service-context-aware manner may be the breakthrough for cutting upgrade overhead and speeding up recovery.
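To make the measured gap concrete, the following minimal sketch fits a straight line through the two latencies quoted above for 1 and 5 new modules. The linear model and the function names `marginal_latency` and `fixed_cost` are assumptions made for illustration only; the text does not claim that latency grows linearly with the number of modules.

```python
# Integration latencies quoted in the text: {number of new modules: seconds}.
MEASURED = {1: 3.806, 5: 5.652}


def marginal_latency(data, n_lo, n_hi):
    """Average extra latency per additional module between two measurements."""
    return (data[n_hi] - data[n_lo]) / (n_hi - n_lo)


def fixed_cost(data, n_lo, n_hi):
    """Latency extrapolated to zero new modules (assumes linear growth)."""
    return data[n_lo] - marginal_latency(data, n_lo, n_hi) * n_lo


per_module = marginal_latency(MEASURED, 1, 5)  # roughly 0.46 s per extra module
base = fixed_cost(MEASURED, 1, 5)              # roughly 3.34 s paid in any case

print(f"per-module cost: {per_module:.4f} s, fixed cost: {base:.4f} s")
```

Under this assumed model, most of the latency is a fixed cost that is paid even when a single module is added, which is consistent with the observation that functionality-neutral modules are shipped redundantly.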
On the other hand, whether service disruption can be avoided is determined by whether the service states of stateful network elements can be correctly reconstructed after upgrades. Unfortunately, since Click has no native mechanism to reconstruct lost network element states, stateful functionalities may be unable to process packets correctly after upgrades, leading to service disruption.
As shown in the above figure, a 2Gb file is transmitted through a NAT network element whose software upgrade happens when the file has been partially transmitted. The upgraded NAT network element performs port mapping according to the service states of the file transmission session. If these states are lost after the upgrade, the transmission restarts from the beginning of the file, wasting all progress made before the upgrade. When no upgrade happens during the file transmission (x-axis coordinate is 0%), the total transmission time is 43.218s. When the upgrade happens with 40% of the file transmitted, the time becomes 63.599s. Note that this figure assumes the duration of the upgrade is short enough to be negligible.
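The effect of state loss on port mapping can be illustrated with a toy model. The sketch below is not Click code and not the NAT element used in the experiment; the class `ToyNat`, its methods, the starting port 40000, and the addresses are all hypothetical. It only shows that an upgraded instance started without the old mapping table can no longer associate return traffic with the ongoing session, whereas an instance started with the exported table can.

```python
from typing import Dict, Optional, Tuple

Flow = Tuple[str, int]  # (internal IP address, internal port)


class ToyNat:
    """Hypothetical NAT that maps internal flows to external ports."""

    def __init__(self, first_port: int = 40000,
                 state: Optional[Dict[Flow, int]] = None) -> None:
        # Start from a restored mapping table if one is supplied.
        self.mappings: Dict[Flow, int] = dict(state) if state else {}
        self.next_port = max(self.mappings.values(), default=first_port - 1) + 1

    def outbound(self, flow: Flow) -> int:
        """Return the external port for a flow, allocating one if needed."""
        if flow not in self.mappings:
            self.mappings[flow] = self.next_port
            self.next_port += 1
        return self.mappings[flow]

    def inbound(self, ext_port: int) -> Optional[Flow]:
        """Return the internal flow for an external port, or None if unknown."""
        for flow, port in self.mappings.items():
            if port == ext_port:
                return flow
        return None

    def export_state(self) -> Dict[Flow, int]:
        """Snapshot of the mapping table, taken before an upgrade."""
        return dict(self.mappings)


old = ToyNat()
session = ("10.0.0.2", 5000)
ext_port = old.outbound(session)           # mapping created before the upgrade

cold = ToyNat()                            # upgraded element, state lost
warm = ToyNat(state=old.export_state())    # upgraded element, state restored

print(cold.inbound(ext_port))              # None: return traffic cannot be mapped
print(warm.inbound(ext_port))              # ('10.0.0.2', 5000)
```

In the cold case the file transfer session has no valid mapping and must be re-established, which corresponds to the restart from the beginning of the file described above.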
As shown in the above figure, let us observe how the total transmission time changes with the duration of the upgrade when the upgrade happens with 50% of the file transmitted. When the upgrade delay is 1s, the total transmission time is 68.810s. When the upgrade delay is 5s, the time becomes 77.837s. In addition, there are two sudden changes in the figure, at x-axis coordinates 6-7 and 12-13, which are caused by TCP retransmission idleness. This further means that the wasted time includes not only the time spent transmitting the first 50% of the file, but also the upgrade downtime and the time needed to reconstruct the file transmission session. In short, state loss clearly causes service disruption and damages performance, and the current Click modularity still poses challenges for maintaining network state during software upgrades. We cannot expect the operators who deploy the network elements to handle every such problem; thus, a state synchronization scheme enforced inside the integrated modules is essential.
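The step-like behaviour attributed to TCP retransmission idleness can be sketched with a simple exponential backoff model. This is only one possible reading, under stated assumptions: the sender doubles its retransmission timeout after every failed attempt, and the initial timeout of 0.2s is an assumed value chosen for illustration, not a number taken from the measurements. The function name `first_retransmit_after` is hypothetical.

```python
def first_retransmit_after(downtime: float,
                           initial_rto: float = 0.2,
                           max_rto: float = 120.0) -> float:
    """Time (s) of the first retransmission at or after the downtime ends.

    Retransmissions are scheduled with a timeout that doubles after each
    attempt, so the sender stays idle between attempts.
    """
    elapsed, rto = 0.0, initial_rto
    while True:
        elapsed += rto              # wait one timeout, then retransmit
        if elapsed >= downtime:     # the element is back up at this attempt
            return elapsed
        rto = min(rto * 2, max_rto)


for downtime in (5, 6, 7, 12, 13):
    resume = first_retransmit_after(downtime)
    print(f"downtime {downtime:>2} s -> sender resumes at about {resume:.1f} s")
```

With the assumed 0.2s initial timeout, the retransmission attempts fall at about 0.2, 0.6, 1.4, 3.0, 6.2, 12.6 and 25.4 seconds, so the resume time jumps when the downtime grows from 6s to 7s and again from 12s to 13s. A different initial timeout would shift these steps.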
All in all, to solve these problems and satisfy the practical requirements of stateful network element upgrades, we present CLICK-UP, an effort towards software upgrades of Click-driven stateful network elements.