Layered Encapsulation of Congestion Notification

BT

B54/77, Adastral Park Martlesham Heath Ipswich IP5 3RE UK +44 1473 645196 bob.briscoe@bt.com http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Area: Transport. Working group: Transport Area Working Group.

Keywords: Congestion Control and Management, Congestion Notification, Information Security, Tunnelling, Protocol, ECN, IPsec

This document redefines how the explicit congestion notification (ECN) field of the outer IP header of a tunnel should be constructed. It brings all IP in IP tunnels (v4 or v6) into line with the way IPsec tunnels now construct the ECN field, ensuring that the outer header reveals any congestion experienced so far on the path. It specifies the default ECN tunnelling behaviour for any Diffserv per-hop behaviour (PHB), but also gives general principles to guide the design of alternate congestion marking behaviours for specific PHBs and for lower layer congestion notification schemes.

This document redefines how the explicit congestion notification (ECN) field of the outer IP header of a tunnel should be constructed. It brings all IP in IP tunnels (v4 or v6) into line with the way IPsec tunnels now construct the ECN field, ensuring that the outer header reveals any congestion experienced so far on the path. Although this memo focuses on IP in IP tunnelling it also gives generalised advice for any encapsulation by lower layer headers. ECN allows a congested resource to notify the onset of congestion without having to drop packets, by explicitly marking a proportion of packets with the congestion experienced (CE) codepoint. Congestion notification is unusual in that it propagates from the physical layer upwards to the transport layer, because congestion is exhaustion of a physical resource. The transport layer can directly detect loss of a packet (or frame) by a lower layer. But if a lower layer marks a packet (or frame) to notify incipient congestion, this marking has to be explicitly copied up the layers at every header decapsulation. So, at each decapsulation of an outer (lower layer) header a congestion marking has to be arranged to propagate into the forwarded (upper layer) header. It must continue upwards until it reaches the destination transport, which should feed congestion notification back to the source transport. Note that often lower layer resources are arranged to be protected by higher layer buffers, so instead of blocking occurring at the lower layer, it occurs when the higher layer queue overflows. Thus, non-blocking link and physical layer technologies do not have to implement congestion notification, which can be introduced solely in IP layer active queue management (AQM). However, if we want to use congestion notification, we have to arrange for it to be explicitly copied up the layers when IP is tunnelled in IP (and if a particular link layer technology isn't protected from blocking by network layer queues). 
IPsec tunnel mode is a specific form of tunnelling that can hide the inner headers. Because the ECN field has to be mutable, it cannot be covered by IPsec encryption or authentication calculations. Therefore concern has been raised in the past that the ECN field could be used as a low bandwidth covert channel to communicate with someone on the unprotected public Internet even if an end-host is restricted to only communicate with the public Internet through an IPsec gateway. However, the recently updated version of IPsec chose not to block this covert channel, deciding that the threat could be managed given the channel bandwidth is so limited (ECN is a 2-bit field). An unfortunate sequence of standards actions leading up to this latest change in IPsec has left us with nearly the worst of all possible combinations of outcomes, despite the best endeavours of everyone concerned. Even though information about congestion experienced on the upstream path has various uses if it is revealed in the outer header of a tunnel, when ECN was standardised it was decided that all IP in IP tunnels should hide upstream congestion information simply to avoid the extra complexity of two different mechanisms for IPsec and non-IPsec tunnels. However, now that IPsec tunnels deliberately no longer hide this information, we are left in the perverse position where non-IPsec tunnels still hide congestion information unnecessarily. This document is designed to correct that anomaly. Specifically, RFC3168 says that, if a tunnel supports ECN (termed a 'full-functionality' ECN tunnel), the tunnel ingress must not copy a CE marking from the inner header into the outer header that it creates. Instead the tunnel ingress has to set the ECN field of the outer header to ECT(0) (i.e. codepoint 10). We term this 'resetting' a CE codepoint. However, RFC4301 reverses this, stating that the tunnel ingress must simply copy the ECN field from the inner to the outer header. 
The main purpose of this document is to carry over this new relaxed attitude to covert channels from IPsec to all IP in IP tunnels, so all tunnel ingress nodes consistently copy the ECN field. The rest of the document deals with the knock-on effects of this apparently minor change. It is organised as follows:

§5 of RFC3168 permits the Diffserv codepoint (DSCP) to 'switch in' different behaviours for marking the ECN field, just as it switches in different per-hop behaviours (PHBs) for scheduling. Therefore we cannot discuss only the ECN protocol that RFC3168 gives as a default; we also need to give guidance for possible different marking schemes. We therefore first lay out the design constraints when tunnelling congestion notification. We then resolve the tensions between these constraints to give general design principles on how a tunnel should process congestion notification; principles that could apply to any marking behaviour for any PHB, not just the default in RFC3168. In particular, we examine the underlying principles behind whether CE should be reset or copied into the outer header at the ingress to a tunnel, or indeed at the ingress of any layered encapsulation of headers with congestion notification fields.

The document then confirms the precise rules for the default ECN tunnelling behaviour based on the above design principles. These rules apply to all PHBs, unless stated otherwise in the specification of a PHB. There is no requirement for a PHB to state anything about ECN behaviour if the default behaviour is sufficient.

Extending the new IPsec tunnel ingress behaviour to all IP in IP tunnels causes one further knock-on effect, which is dealt with in the section on Backward Compatibility. If one end of an IPsec tunnel is compliant with RFC4301, assuming IKEv2 key management is used, the other end can be guaranteed to also be compliant. So there is no backward compatibility problem with IKEv2 RFC4301 IPsec tunnels.
But once we extend our scope to any IP in IP tunnel, we have to cater for the possibility that a tunnel ingress compliant with this specification is sending to an egress that doesn't even understand ECN (e.g. a legacy tunnel egress). If a tunnel ingress copied incoming ECN-capable headers into outer headers, then a legacy tunnel egress would discard any congestion markings added to the outer header within the tunnel. ECN-capable traffic sources would not see any congestion feedback and would instead continually ratchet up their share of the bandwidth without realising that cross-flows from other ECN sources were continually having to ratchet down.

The scope of this document is all IP in IP tunnelling, irrespective of whether IPv4 or IPv6 is used for either of the inner and outer headers. The document only concerns wire protocol processing at tunnel endpoints and makes no changes or recommendations concerning algorithms for congestion marking or congestion response.

The general design principles may also be useful when any datagram/packet/frame with a congestion notification capability is encapsulated by a connectionless outer header that might also support a congestion notification capability in the future, as discussed in §9.3 of RFC3168 (e.g. IP encapsulated in L2TP, GRE or PPTP). However, of course, the IETF does not have standards authority over every link or tunnel protocol, so this document focuses only on IP in IP. Separate work applies these principles to IP in MPLS and to MPLS in MPLS.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC2119.

Tunnel processing of a congestion notification field has to meet congestion control needs without creating new information security vulnerabilities (if information security is required).

Information security can be assured by using various end to end security solutions (including IPsec in transport mode), but a commonly used scenario involves the need to communicate between two physically protected domains across the public Internet. In this case there are certain management advantages to using IPsec in tunnel mode solely across the publicly accessible part of the path. The path followed by a packet then crosses security 'domains'; the ones protected by physical or other means before and after the tunnel and the one protected by an IPsec tunnel across the otherwise unprotected domain. We will use the scenario in the figure below, where endpoints 'A' and 'B' communicate through a tunnel with ingress 'I' and egress 'E' within physically protected edge domains across an unprotected internetwork where there may be 'men in the middle', 'M'.

   <-protected domain-><--domain--><-protected domain->
   +------------------+            +------------------+
   |                  |      M     |                  |
   |    A-------->I=========>==========>E-------->B   |
   |                  |            |                  |
   +------------------+            +------------------+
                  <----IPsec secured---->
                           tunnel

IPsec encryption is typically used to prevent 'M' seeing messages from 'A' to 'B'. IPsec authentication is used to prevent 'M' masquerading as the sender of messages from 'A' to 'B' or altering their contents. 'I' can also use IPsec tunnel mode to allow 'A' to communicate with 'B' while imposing encryption to prevent 'A' leaking information to 'M'. Or 'E' can insist that 'I' uses tunnel mode authentication to prevent 'M' communicating information to 'B'.

Mutable IP header fields such as the ECN field (as well as the TTL/Hop Limit and DS fields) cannot be included in the cryptographic calculations of IPsec. Therefore, if 'I' encrypts but copies these mutable fields into the outer header that is exposed across the tunnel, it will have allowed a covert channel from 'A' to 'M'. And if 'E' copies these fields from the outer header to the inner, even if it validates authentication from 'I', it will have allowed a covert channel from 'M' to 'B'.

ECN at the IP layer is designed to carry information about congestion from a congested resource to some downstream node that will feed the information back somehow to the point upstream of the congestion that can regulate the load on the congested resource. In terms of the above scenario, ECN is effectively intended to create an information channel from 'M' to 'B', for 'B' to forward to 'A'. Therefore the goals of IPsec and ECN are mutually incompatible.

With respect to the DS or ECN fields, §5.1.2 of RFC4301 says, "controls are provided to manage the bandwidth of this [covert] channel". Using the ECN processing rules of RFC4301, the channel bandwidth is two bits per datagram from 'A' to 'M' and one bit per datagram from 'M' to 'A', because 'E' limits the combinations it will copy.
In both cases the covert channel bandwidth is further reduced by noise from any real congestion marking. RFC4301 therefore implies that these covert channels are sufficiently limited to be considered a manageable threat. However, with respect to the larger (6-bit) DS field, the same section of RFC4301 says not copying is the default, but a configuration option can allow copying "to allow a local administrator to decide whether the covert channel provided by copying these bits outweighs the benefits of copying". Of course, an administrator considering copying of the DS field has to take into account that it could be concatenated with the 2-bit ECN field, giving a channel of 8 bits per datagram.

Congestion control requires that any congestion notification marked into packets by a resource will be able to traverse a feedback loop back to a node capable of controlling the load on that resource. To avoid ambiguity later rather than calling this node the data source we will call it the Load Regulator. This will allow us to deal with exceptional cases where load is not regulated by the data source, but usually the two will be synonymous. Note the term "a node capable of controlling the load" deliberately includes a source application that doesn't actually control the load but ought to (e.g. an application without congestion control that uses UDP).

   R--->I=========>M=========>E-------->B

We now consider a similar tunnelling scenario to the IPsec one just described, but without the different security domains, so we can just focus on ensuring the control loop and management monitoring can work. If we want resources in the tunnel to be able to explicitly notify congestion and the feedback loop is from 'B' to 'A', it will certainly be necessary for 'E' to copy any CE marking from the outer header to the inner header for onward transmission to 'B', otherwise congestion notification from resources like 'M' cannot be fed back to the Load Regulator ('A'). But it doesn't seem necessary for 'I' to copy CE markings from the inner to the outer header. For instance, if resource 'R' is congested, it can send congestion information to 'B' using the congestion field in the inner header without 'I' copying the congestion field into the outer header and 'E' copying it back to the inner header. 'E' can then write any additional congestion marking introduced across the tunnel into the congestion field of the inner header.

Indeed, this arrangement can be extended to multi-level congestion marking (such as that proposed for PCN) as long as all the marks have unambiguously ranked values. For instance, if a hypothetical multi-level marking scheme for PCN had PCN-capable codepoints ranked 1, 2 and 3, then, if 'I' reset the outer congestion field to the lowest ranked value that is PCN-capable (1), 'E' would simply write the highest ranked of the inner and outer congestion markings into the forwarded header. For instance, if the inner marking on arrival at 'I' was 3 and 'I' reset the outer to 1, but 'M' subsequently set it to 2, then the header forwarded by 'E' would be max(3,2) = 3.

It might be useful for the tunnel egress to be able to tell whether congestion occurred across a tunnel or upstream of it.
If outer header congestion marking was reset at the tunnel ingress ('I'), by the end of a tunnel ('E') the outer headers would indicate congestion experienced across the tunnel ('I' to 'E'), while the inner header would indicate congestion upstream of 'I'. But the same information could be gleaned even if the tunnel ingress copied the inner to the outer headers. By the end of the tunnel ('E'), any packet with an extra mark in the outer header relative to the inner header would indicate congestion across the tunnel ('I' to 'E'), while the inner header would still indicate congestion upstream of ('I'). All this shows that 'E' can preserve the control loop irrespective of whether 'I' copies congestion notification into the outer header or resets it.
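The ranking argument above can be illustrated with a short Python sketch. It assumes the hypothetical three-level scheme from the text (PCN-capable codepoints ranked 1, 2 and 3); the function names and the `ingress_copied` flag are illustrative assumptions and are not taken from any specification.

```python
# Hypothetical multi-level marking with ranked codepoints 1 < 2 < 3,
# as in the example in the text. All names are illustrative assumptions.
LOWEST_RANK = 1  # value a resetting ingress writes into the outer header


def ingress_outer_marking(inner: int, copy: bool) -> int:
    """Outer marking created by 'I': copy the inner value or reset it."""
    return inner if copy else LOWEST_RANK


def egress_forwarded_marking(inner: int, outer: int) -> int:
    """'E' forwards the highest ranked of the inner and outer markings."""
    return max(inner, outer)


def marked_within_tunnel(inner: int, outer: int, ingress_copied: bool) -> bool:
    """True if the outer header shows a mark added between 'I' and 'E'."""
    # The outer header started at the inner value (copy) or at the
    # lowest rank (reset); anything above that was added in the tunnel.
    baseline = inner if ingress_copied else LOWEST_RANK
    return outer > baseline


# Example from the text: the inner marking arrives at 'I' as 3, 'I' resets
# the outer to 1, 'M' raises the outer to 2, and 'E' forwards max(3, 2).
outer_at_ingress = ingress_outer_marking(3, copy=False)
outer_at_egress = max(outer_at_ingress, 2)  # 'M' marks up to rank 2
forwarded = egress_forwarded_marking(3, outer_at_egress)
```

In this sketch the forwarded value does not depend on whether 'I' copies or resets; only the way 'E' interprets the outer header differs.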

As well as control, there are also management constraints. Specifically, a management system may monitor congestion markings in passing packets, perhaps at the border between networks as part of a service level agreement. For instance, monitors at the borders of autonomous systems may need to measure how much congestion has accumulated since the original source to determine between them how much of the congestion is contributed by each domain. Therefore it should be clear how far back in the path the congestion markings have accumulated from. In this document we term this the baseline of the congestion marking, i.e. the source of the layer that last reset rather than copied the congestion notification field when creating an outer header. Given some tunnels cross domain borders (e.g. consider that 'M' in the tunnelling scenario above is monitoring a border), it is therefore desirable for 'I' to copy congestion accumulated so far into the outer headers exposed across the tunnel.

The discussion of in-path load regulation at the end of this document covers various scenarios where the Load Regulator lies in-path, not at the source host as we would typically expect. It concludes that the baseline for congestion notification should be determined by where the Load Regulator function is, whether it is at the source host or within the path. Therefore every tunnel ingress should copy the ECN field into the outer header it creates unless it is also a Load Regulator, in which case it should reset any CE markings, which is an exception to the normal copying rule for a tunnel ingress.

The constraints from the three perspectives of security, control and management are somewhat in tension as to whether a tunnel ingress should copy congestion markings into the outer header it creates or reset them. From the control perspective either copying or resetting works. From the management perspective copying is preferable (with the exception of an in-path load regulator). From the security perspective resetting is preferable but copying is now considered acceptable given the bandwidth of a 2-bit covert channel can be managed.

Therefore an outer encapsulating header capable of carrying congestion markings SHOULD reflect accumulated congestion since the last interface designed to regulate load (the Load Regulator). This implies congestion notification SHOULD be copied into the outer header of each new encapsulating header that supports it, except at an in-path Load Regulator. An in-path Load Regulator knows its function is to regulate load, so if it also acts as the ingress to a tunnel, in every new outer header it creates it MUST reset any congestion marking.

The Load Regulator is the node to which congestion feedback should be returned by the next downstream node with a transport layer function (typically but not always the data receiver). The Load Regulator is not always (or even typically) the same thing as the node identified by the source address of the outermost exposed header. In general the addressing of the outermost encapsulation header says nothing about the identifiers of either the upstream or the downstream transport layer functions. As long as the transport functions know each other's addresses, they don't have to be identified in the network layer or in any link layer. It was only a convenience that a TCP receiver assumed that the address of the source transport is the same as the network layer source address of a packet it receives. More generally, the return transport address could be identified solely in the transport layer protocol.
For instance, a signalling protocol like RSVP breaks up a path into transport layer hops and informs each hop of the address of its transport layer neighbour without any need to identify these hops in the network layer. RSVP can be arranged so that these transport layer hops are bigger than the underlying network layer hops. The host identity protocol (HIP) architecture also supports the same principled separation (for mobility amongst other things), where the transport layer receiver identifies the transport layer sender using an identifier provided by the transport layer, which gets mapped to a network layer address below the transport layer.

Note that this principle deliberately doesn't require a packet header to reveal the origin address of the baseline that congestion notification has accumulated from. It is not necessary for the network and lower layers to know the address of the Load Regulator. Only the destination transport needs to know that. With congestion notification, the network and link layers only notify congestion forwards; they aren't involved in feeding it backwards. If they are, e.g. backward congestion notification (BCN) in Ethernet, that should be considered as a transport function added to the lower layer, which must sort out its own addressing. Indeed, this is one reason why ICMP source quench is now deprecated; when congestion occurs within a tunnel it is complex (particularly in the case of IPsec tunnels) to return the ICMP messages beyond the tunnel ingress back to the Load Regulator.

Similarly, if a management system is monitoring congestion and needs to know the baseline of congestion notification, the management system has to find this out from the transport; in general it cannot tell solely by looking at the network or link layer headers.

We have said that a tunnel ingress that is not a Load Regulator SHOULD (as opposed to MUST) copy incoming congestion notification into an outer encapsulating header that supports it.
In the case of 2-bit ECN, the IETF security area has deemed that the benefit always outweighs the risk. Therefore for 2-bit ECN we can and will say 'MUST' in the default ECN tunnelling rules. But in this section, where we are setting down general design principles, we leave it as a 'SHOULD'. This allows for future multi-bit congestion notification fields where the risk from the covert channel created by copying congestion notification might outweigh the congestion control benefit of copying.

The following ECN tunnel processing rules are the default for a packet with any DSCP. If required, different ECN processing rules MAY be defined for the appropriate Diffserv PHB using the general design principles given earlier.

When a tunnel ingress creates an encapsulating IP header, the 2-bit ECN field of the inner IP header MUST be copied into the outer IP header, for all types of IP in IP tunnel (except if the tunnel ingress is in compatibility mode, as described under backward compatibility below). If the tunnel ingress is also a Load Regulator, it MUST instead reset the outer header to ECT(0).

To decapsulate the inner header at the tunnel egress, the outgoing inner header MUST be calculated from the combination of the incoming inner and outer headers, setting the outgoing ECN field to the codepoints displayed in the body of the following table.

   +--------------+-----------------------------------------------+
   | Incoming     |             Incoming Outer Header             |
   | Inner Header +-----------+-----------+-----------+-----------+
   |              |  Not-ECT  |  ECT(0)   |  ECT(1)   |    CE     |
   +--------------+-----------+-----------+-----------+-----------+
   | Not-ECT      |  Not-ECT  | drop(!!!) | drop(!!!) | drop(!!!) |
   | ECT(0)       |  ECT(0)   |  ECT(0)   |  ECT(0)   |    CE     |
   | ECT(1)       |  ECT(1)   |  ECT(1)   |  ECT(1)   |    CE     |
   | CE           |    CE     |  CE(!!!)  |  CE(!!!)  |    CE     |
   +--------------+-----------+-----------+-----------+-----------+
   |              |                Outgoing Header                |
   +--------------+-----------------------------------------------+

The exclamation marks '(!!!)' in the table indicate that this combination of inner and outer headers should not be possible if only legal transitions have taken place. So, the decapsulator should drop or mark the ECN field as the table specifies, but it MAY also raise an appropriate alarm. It MUST NOT raise an alarm so often that the illegal combinations would amplify into a flood of alarm messages.
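The default rules above are compact enough to express as a lookup. The following Python sketch is one possible reading of them, not a reference implementation: the `ECN` enum, the function names and the `(outgoing, anomalous)` return shape are assumptions made for illustration. Only the codepoint values (ECT(0) is 10, as stated earlier, with the remaining values as assigned in RFC3168) and the table contents come from the specification.

```python
from enum import Enum
from typing import Optional, Tuple


class ECN(Enum):
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def encapsulate(inner: ECN, is_load_regulator: bool = False,
                compatibility_mode: bool = False) -> ECN:
    """Outer ECN field created by a tunnel ingress under the default rules."""
    if compatibility_mode:
        return ECN.NOT_ECT   # egress not known to understand ECN
    if is_load_regulator:
        return ECN.ECT_0     # in-path Load Regulator resets, as the text states
    return inner             # normal case: copy inner to outer


def decapsulate(inner: ECN, outer: ECN) -> Tuple[Optional[ECN], bool]:
    """Return (outgoing ECN field, anomalous); None means drop the packet.

    'anomalous' is True for the combinations marked '(!!!)' in the table,
    where the egress MAY raise a (rate-limited) alarm.
    """
    if inner is ECN.NOT_ECT:
        if outer is ECN.NOT_ECT:
            return ECN.NOT_ECT, False
        return None, True                 # drop(!!!)
    if inner is ECN.CE:
        # Inner CE always wins; an ECT outer should not be possible.
        return ECN.CE, outer in (ECN.ECT_0, ECN.ECT_1)
    # Inner is ECT(0) or ECT(1): preserve it unless the outer carries CE.
    if outer is ECN.CE:
        return ECN.CE, False
    return inner, False
```

For example, a packet that enters a normal-mode tunnel as ECT(0) and is marked CE inside the tunnel leaves the egress as CE: `decapsulate(ECN.ECT_0, ECN.CE)` returns `(ECN.CE, False)`.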

A legacy tunnel egress may not know how to process an ECN field, so it will most likely simply disregard all outer headers. Therefore, unless a compliant tunnel ingress has established that the tunnel egress understands ECN processing, it MUST only send packets with the ECN field set to Not-ECT in the outer header. Otherwise, if ECN-capable outer headers were sent towards a legacy egress, it would dangerously remove information about congestion experienced within the tunnel.

A tunnel ingress may establish whether its tunnel egress will understand ECN processing by configuration or by negotiation. Note that a tunnel ingress that has used IKEv2 key management can guarantee that the tunnel egress is also RFC4301-compliant and therefore need not negotiate ECN capabilities.

To be compliant with this specification a tunnel ingress that does not know the egress ECN capability (e.g. by configuration) MUST implement a 'normal' mode and a 'compatibility' mode, and it MUST initiate each negotiated tunnel in compatibility mode. On the other hand, a compliant tunnel egress MUST merely implement the one decapsulation behaviour given in the default ECN tunnelling rules, which we term 'full-functionality' mode.

Before switching to normal mode, a compliant tunnel ingress that does not know the egress ECN capability (e.g. by configuration) MUST negotiate with the tunnel egress to establish whether the egress is in full-functionality mode. If the egress is in full-functionality mode, the ingress puts itself into normal mode. In normal mode the ingress follows the default encapsulation rule (i.e. it copies the inner ECN field into the outer header). If the egress is not in full-functionality mode or doesn't understand the question, the tunnel ingress MUST remain in compatibility mode. A tunnel ingress in compatibility mode MUST set all outer headers to Not-ECT.
The decapsulation rules for the egress of the tunnel have been defined in such a way that congestion control will still work safely if any of the earlier versions of ECN processing are used unilaterally at the encapsulating ingress of the tunnel. If a tunnel ingress tries to negotiate to use limited functionality mode or full functionality mode, a decapsulating tunnel egress compliant with this specification MUST agree to the request, even though its behaviour will be the same in both cases.

For 'forward compatibility', a compliant tunnel egress MUST raise a warning about any requests to enter modes it doesn't recognise, but it can continue operating. If no ECN-related mode is requested, no error or warning need be raised, as the egress behaviour is compatible with all the legacy ingress behaviours that don't negotiate capabilities.

Note that if a compliant node is the ingress for multiple tunnels, a mode setting will need to be stored for each tunnel ingress. However, if a node is the egress for multiple tunnels, none of the tunnels will need to store a mode setting, because a compliant egress can only be in one mode.
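The ingress side of this mode handling can be pictured as a small piece of per-tunnel state. The Python sketch below is illustrative only: this document does not define a negotiation protocol, so the `egress_reply` string, the class and method names, and the use of plain strings for ECN codepoints are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class IngressMode(Enum):
    COMPATIBILITY = "compatibility"
    NORMAL = "normal"


class TunnelIngress:
    """Stores one ECN mode per tunnel, as a compliant ingress has to."""

    def __init__(self):
        self._modes = {}  # tunnel id -> IngressMode

    def open_tunnel(self, tunnel_id: str,
                    egress_known_ecn_capable: bool = False) -> None:
        # Capability known in advance (e.g. by configuration or IKEv2):
        # no negotiation needed. Otherwise start in compatibility mode.
        self._modes[tunnel_id] = (
            IngressMode.NORMAL if egress_known_ecn_capable
            else IngressMode.COMPATIBILITY
        )

    def negotiation_result(self, tunnel_id: str,
                           egress_reply: Optional[str]) -> None:
        # Hypothetical reply token; anything else, including no answer,
        # leaves the ingress in compatibility mode.
        if egress_reply == "full-functionality":
            self._modes[tunnel_id] = IngressMode.NORMAL
        else:
            self._modes[tunnel_id] = IngressMode.COMPATIBILITY

    def outer_ecn(self, tunnel_id: str, inner_ecn: str) -> str:
        # Normal mode copies the inner field; compatibility mode sends Not-ECT.
        if self._modes[tunnel_id] is IngressMode.NORMAL:
            return inner_ecn
        return "Not-ECT"
```

The sketch keeps no corresponding state for the egress role, which matches the observation above that a compliant egress has only one behaviour.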

The rule that a tunnel ingress MUST copy any ECN field into the outer header is a change to RFC3168 (unless it is a Load Regulator as well, in which case there is no change). The rules for calculating the outgoing ECN field on decapsulation at a tunnel egress are in line with the full functionality mode of ECN in RFC3168 and with RFC4301, except that neither identified the need to raise an alarm if the inner header was CE but the outer header was ECT. The rules for how a tunnel establishes whether the egress has full functionality ECN capabilities are an update to RFC3168. For all the typical cases, RFC4301 is not updated by the ECN capability check in this specification, because a typical RFC4301 tunnel ingress will have already established that it is talking to an RFC4301 tunnel egress (e.g. if it uses IKEv2). However, there may be some corner cases (e.g. manual keying) where an RFC4301 tunnel ingress talks with an egress with limited functionality ECN handling. For such corner cases, the requirement to use compatibility mode in this specification updates RFC4301. The optional ECN Tunnel field in the IPsec security association database (SAD) and the optional ECN Tunnel Security Association Attribute defined in RFC3168 are no longer needed. The security association (SA) has no policy on ECN usage, because all RFC4301 tunnels now support ECN without any policy choice. RFC3168 defines a (required) limited functionality mode and an (optional) full functionality mode for a tunnel, but RFC4301 doesn't need modes. In this specification only the ingress might need two modes, unlike the modes of RFC3168 that were properties of the pair of tunnel endpoints after negotiation. All these ECN processing rules update RFC2003 on IP in IP tunnelling.

This memo includes no request to IANA.

The earlier discussion of design constraints covers the security constraints imposed on ECN tunnel processing. The Design Principles trade off security (covert channels) against congestion monitoring and control. In fact, ensuring congestion markings are not lost is itself another aspect of security, because if we allowed congestion notification to be lost, any attempt to enforce a response to congestion would be much harder.

We keep the behaviour defined in both RFC3168 and RFC4301 where, if the inner and outer headers carry contradictory ECT values, the inner header is preserved for onward forwarding. However, in writing this document we noticed this behaviour would hide illegal suppression of congestion notification from the detection mechanism designed for this attack. One reason two ECT codepoints were defined was to enable the source to detect if a CE marking had been applied then subsequently removed. The source could detect this by weaving a pseudo-random sequence of ECT(0) and ECT(1) values into a stream of packets. With the rules as they stand in RFC3168 and RFC4301, within a tunnel a CE marking could be added and subsequently removed by a non-compliant node without detection, because the evidence of such misbehaviour is removed by the decapsulator.

We could have specified that an outer header value of ECT should overwrite a contradictory ECT value in the inner header to close this loophole. But we chose not to for two reasons: i) we wanted to avoid any changes to IPsec tunnelling behaviour; ii) allowing ECT values in the outer header to override the inner header would have increased the bandwidth of the covert channel through the egress gateway from 1 to 1.5 bits per datagram, potentially threatening to upset the consensus established in the security area that the bandwidth of this covert channel can now be safely managed.

This document updates the tunnelling treatment of RFC3168 ECN for all IP in IP tunnels to bring it into line with the new behaviour in the IPsec architecture of RFC4301. At the tunnel egress, header decapsulation for the default ECN marking behaviour is broadly unchanged except that one exceptional case has been catered for. At the ingress, for all forms of IP in IP tunnel, encapsulation has been brought into line with the new IPsec rules in RFC4301 which copy rather than reset CE markings when creating outer headers. Previously, upstream congestion information was not revealed in the outer header, which limited the scope of some management monitoring techniques and prevented certain active queue management algorithms from taking account of upstream congestion markings. The change ensures all IP in IP tunnels reflect the more relaxed attitude to revealing congestion information in the new IPsec architecture, which now deems that the threat from 2-bit covert channels can be managed without disabling ECN. Also, this document defines more generic principles to guide the design of alternate forms of tunnel processing of congestion notification, if required for specific Diffserv PHBs (such as will be required for the PCN working group) or for other lower layer encapsulating protocols that might support congestion notification in the future (e.g. MPLS).

Thanks to David Black, Bruce Davie, Toby Moncaster and Gabriele Corliano for their careful review comments.

Comments and questions are encouraged and very welcome. They can be addressed to the IETF Transport Area working group mailing list <tsvwg@ietf.org>, and/or to the authors.

In the traditional Internet architecture one tends to think of the source host as the Load Regulator for a path. It is generally not desirable or practical for a node part way along the path to regulate the load. However, various reasonable proposals for in-path load regulation have been made from time to time (e.g. fair queuing, traffic engineering). Also the IETF has recently chartered a working group to standardise admission control across a part of a path using pre-congestion notification (PCN), which involves in-path load regulation. This is of particular relevance here because it involves congestion notification with an in-path Load Regulator and it can involve tunnelling.

We will use the more complex scenario in the figure below to tease out all the issues that arise when combining congestion notification and tunnelling with various possible in-path load regulation schemes. In this case 'I1' and 'E2' break up the path into three separate congestion control loops. The feedback for these loops is shown going right to left across the top of the figure. The 'V's are arrow heads representing the direction of feedback, not letters. But there are also two tunnels within the middle control loop: 'I1' to 'E1' and 'I2' to 'E2'. The two tunnels might be VPNs, perhaps over two MPLS core networks. 'M' is a congestion monitoring point, perhaps between two border routers where the same tunnel continues unbroken across the border.

R--->I1===========>E1----->I2=========>==========>E2------->B

The question is, should the congestion markings in the outer exposed headers of a tunnel represent congestion only since the tunnel ingress or over the whole upstream path from the source of the inner header (whatever that may mean)? Or put another way, should 'I1' and 'I2' copy or reset CE markings?

The answer is that the baseline of congestion marking should be the nearest upstream interface designed to regulate traffic load: the Load Regulator. In this scenario, 'A', 'I1' and 'E2' are all Load Regulators. We have shown the feedback loops returning to each of these nodes so that they can regulate the load causing the congestion notification. So the baseline for congestion markings exposed to M should be 'I1' (the Load Regulator), not 'I2'. That is, 'I2' SHOULD copy any CE marking into the outer header it creates, while 'I1' is an exception because it is an in-path load regulator, so it should reset the ECN field in the outer header it creates.

The following further examples illustrate how this answer might be applied:

Preemption marking is currently defined for PCN so that the rate of unmarked packets at the end of a path of multiple bottlenecks determines the maximum sustainable aggregate bit rate over that path. To produce the correct marking by the end, each congested node must only consider packets to be eligible for marking if they have not already been marked by any previous bottleneck along a path that may span multiple tunnels (including MPLS encapsulations etc.). This scheme only results in the correct marking rate if the markings accumulated so far along the path are copied into the outer exposed header of each tunnel or encapsulation. Consider that 'I1' and 'E2' in the complex scenario above are edge gateways of a PCN region. Admission control based on PCN measurements is a form of load regulation, so 'I1' regulates the load on the PCN region.
Therefore 'I1' should be the baseline of congestion marking for both tunnels within the scope of its feedback loop. So 'I2' should follow the normal rules and copy congestion marking into the outer tunnel header, while 'I1' is an exception because it is also a load regulator, so it should reset CE markings in the outer header.

It has been suggested that feedback of ECN accumulated across an MPLS domain could cause the ingress to trigger re-routing to mitigate congestion. This case is more like the simple scenario, with a feedback loop across the MPLS domain ('E' back to 'I'). The baseline for congestion exposed in outer headers in this case will be the tunnel ingress, which should therefore reset the ECN field in the outer headers it creates. But the reason it should act as the baseline is that it is an in-path load regulator (re-routing around congestion is a load regulation function), not just that it is a tunnel ingress.

The PWE3 working group of the IETF is considering the problem of how and whether an aggregate private wire emulation should respond to congestion. Although the study is still at the requirements stage, some (controversial) solution proposals include in-path load regulation at the ingress to the tunnel, which could lead to tunnel arrangements of similar complexity to that of the complex scenario above.

These are not contrived scenarios; they could be a lot worse. For instance, a host may create a tunnel for IPsec which is placed inside a tunnel for Mobile IP over a remote part of its path. And around all this we may have MPLS labels being pushed and popped as packets pass across different core networks. Similarly, it is possible that subnets could be built from link technology (e.g. Ethernet switches), so that adding and removing link headers could involve congestion notification in future link headers, with all the same issues as with IP in IP tunnels.

The reason we introduced the concept of a Load Regulator was to allow for in-path load regulation.
In the traditional Internet architecture one tends to think of a host and a Load Regulator as synonymous, but when considering tunnelling, even the definition of a host is too fuzzy, whereas a Load Regulator is a clearly defined function. Similarly, the concept of an innermost header is too fuzzy for anyone to be able to say (wrongly) that the source address of the innermost header should be the baseline. Which is the innermost header when multiple encapsulations may be in use? Where do we stop? If we say the original source in the above IPsec-Mobile IP case is the host, how do we know it isn't tunnelling an encrypted packet stream on behalf of another host in a p2p network?

The reason there has been so much confusion over the question of whether a tunnel ingress should copy or reset CE markings is that we have become used to thinking that only hosts regulate load. The end to end design principle advises that this is a good idea, but it also advises that it is only a guiding principle intended to make the designer think very carefully before breaking it. We do have proposals where load regulation functions sit within a network path for good, if sometimes controversial, reasons, e.g. PCN edge admission control gateways or traffic engineering functions at domain borders to re-route around congestion.
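The copy-or-reset rule argued for in this section can be sketched as a per-ingress decision that depends only on whether the ingress is also a Load Regulator. The sketch below is illustrative: the role table and function name are hypothetical, the node roles are taken from the complex scenario discussed above, and "reset" is assumed to mean that an inner CE is written as ECT(0) in the outer header.

```python
# The four codepoints of the two-bit ECN field in the IP header.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

# Hypothetical role table for the two tunnel ingresses of the complex
# scenario: 'I1' is an in-path Load Regulator, 'I2' is a plain ingress.
IS_LOAD_REGULATOR = {"I1": True, "I2": False}


def outer_ecn(ingress, inner_ecn):
    """Choose the outer ECN field when `ingress` encapsulates a packet.

    A Load Regulator is the baseline for congestion marking, so it resets
    CE (assumed here to mean CE -> ECT(0)); any other ingress copies the
    inner field so that upstream congestion stays visible.
    """
    if IS_LOAD_REGULATOR[ingress] and inner_ecn == CE:
        return ECT_0
    return inner_ecn


# A packet already marked CE arrives at each ingress in turn.
for node in ("I1", "I2"):
    print(node, format(outer_ecn(node, CE), "02b"))
# prints:
# I1 10
# I2 11
```

The point of the design is that the decision is keyed on the load regulation function, not on the node being a tunnel ingress or a host.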