Difference between revisions of "QoS"

From Hack Sphere Labs Wiki
Jump to: navigation, search
(CoDel)
(Implementation (pfSense))
Line 77: Line 77:
  
 
*http://pfsensesetup.com/queue-configuration-in-pfsense-21/
 
*http://pfsensesetup.com/queue-configuration-in-pfsense-21/
 +
*https://forum.pfsense.org/index.php?topic=62133.0
  
 
=HFSC Details=
 
=HFSC Details=

Revision as of 14:38, 4 February 2015

General Notes

  • With queueing we determine the way in which data is SENT. It is important to realise that we can only shape data that we transmit.
  • If your link is truly full and you want to make sure that no single session can dominate your outgoing bandwidth, use Stochastical Fairness Queueing.
  • Queueing Discipline (qdisc) - An algorithm that manages the queue of a device, either incoming (ingress) or outgoing (egress).

Any QoS algorithm seems to do one of two things. It can work with the upstream buffer changing the TCP flow....and that is it, just looking at the buffer. Another method is to slow everything down and let QoS manage the max of the connection. This makes it so the downstream buffers are never fully used.

Bufferbloat

The flow of packets slows down while travelling through a network link between a fast and a slow network, especially at the start of a TCP session, when there is a sudden burst of packets and link to the slower network may not be able to process the burst quickly enough. Buffers exist to ease this problem by giving the fast network a place to push packets, to be read by the slower network as fast as it can.[1] In other words, buffers act like shock absorbers to convert bursty arrivals into smooth, steady departures. However, a buffer has a finite size, and it can hold only a specific maximum number of packets. The ideal buffer is sized so it can handle a sudden burst of communication and match the speed of that burst to the speed of the slower network. Ideally, the "shock absorbing" situation is characterized by a temporary delay for packets in the buffer during the transmission burst, after which the delay rapidly disappears and the network reaches a balance in offering and handling packets.[1]

The TCP congestion avoidance algorithm relies on packet drops to determine the available bandwidth. It speeds up the data transfer until packets start to drop, and then slows down the transmission rate. Ideally it keeps speeding up and slowing down the transmission rate, until it finds an equilibrium to the speed of the link. However, for this to work the packet drops must occur in a timely manner, so that the algorithm can select a suitable transfer speed. With a large buffer that has been filled, the packets will arrive at their destination, but with a higher latency. The packet is not dropped, so TCP does not slow down once the uplink has been saturated, further filling the buffer. Newly arriving packets are dropped only when the buffer is fully saturated. TCP may even decide that the path of the connection has changed, and again go into the more aggressive search for a new operating point.[8][9]

In the problematic situation, packets queued in a network buffer are only dropped if the buffer is full. Having a big and constantly full buffer which causes increased transmission delays and reduced interactivity, especially when looking at two or more simultaneous transmissions over the same channel, is called bufferbloat. Available channel bandwidth can also end up being unused, as some fast destinations may not be reached due to buffers clogged with data awaiting delivery to slow destinations, what is caused by contention between simultaneous transmissions competing for some space in an already full buffer.

General Stochastical Fairness Queueing

Stochastic Fairness Queueing (SFQ) is a simple implementation of the fair queueing algorithms family. It's less accurate than others, but it also requires less calculations while being almost perfectly fair.

The key word in SFQ is conversation (or flow), which mostly corresponds to a TCP session or a UDP stream. Traffic is divided into a pretty large number of FIFO queues, one for each conversation. Traffic is then sent in a round robin fashion, giving each session the chance to send data in turn.

This leads to very fair behaviour and disallows any single conversation from drowning out the rest. SFQ is called 'Stochastic' because it doesn't really allocate a queue for each session, it has an algorithm which divides traffic over a limited number of queues using a hashing algorithm.

Because of the hash, multiple sessions might end up in the same bucket, which would halve each session's chance of sending a packet, thus halving the effective speed available. To prevent this situation from becoming noticeable, SFQ changes its hashing algorithm quite often so that any two colliding sessions will only do so for a small number of seconds.

It is important to note that SFQ is only useful in case your actual outgoing interface is really full! If it isn't then there will be no queue on your linux machine and hence no effect. Later on we will describe how to combine SFQ with other qdiscs to get a best-of-both worlds situation.

Specifically, setting SFQ on the ethernet interface heading to your cable modem or DSL router is pointless without further shaping!

MORE COPYPASTA: 9.2.3.1. Parameters & usage [1]

The SFQ is pretty much self tuning:

perturb

   Reconfigure hashing once this many seconds. If unset, hash will never be reconfigured. Not recommended. 10 seconds is probably a good value.

quantum

   Amount of bytes a stream is allowed to dequeue before the next queue gets a turn. Defaults to 1 maximum sized packet (MTU-sized). Do not set below the MTU!

limit

   The total number of packets that will be queued by this SFQ (after that it starts dropping them). 

General Hierarchical Token Bucket

Martin Devera (<devik>) rightly realised that CBQ is complex and does not seem optimized for many typical situations. His Hierarchical approach is well suited for setups where you have a fixed amount of bandwidth which you want to divide for different purposes, giving each purpose a guaranteed bandwidth, with the possibility of specifying how much bandwidth can be borrowed.

HTB works just like CBQ but does not resort to idle time calculations to shape. Instead, it is a classful Token Bucket Filter - hence the name. It has only a few parameters, which are well documented on his site.

pfSense

The pfSense shaper tries to support all the shapers with one web form. This means that sometimes you will see lements that don't apply, such as bandwidth settings for CODEL and PRIQ. Just ignore them.[2]

Schedulers Available:

  • HFSC
  • CBQ
  • FAIRQ
  • CODELQ
  • PRIQ

ALTQ

options         ALTQ
options         ALTQ_CBQ
options         ALTQ_RED
options         ALTQ_RIO
options         ALTQ_HFSC
options         ALTQ_PRIQ
options         ALTQ_NOPCC
options         ALTQ_FAIRQ

Implementation (pfSense)

https://forum.pfsense.org/index.php?topic=62041.0 There can be no integration for CODEL in the wizard since it still does not make sense to me. If you put codel as the scheduling policy on the interface, i would recommend on WAN only, you have nothing to tweak apart the bandwidth value. There are no sub queues or anything like that in this variant and nothing to be configured on rules.

HFSC Details

If you put ack packets in a high bw queue, they will confirm with the remote system that data was received.

You can give certain services priority and keep speed and latency low.

You can serve x amount of data out quickly while slowing long term objects. You decide you want to serve out data quickly in the beginning of the connection and slow down after a few seconds. This is called a nonlinear service curve (NLSC or just SC).[3]

bandwidth

  • parent queue is max bandwidth on entire interface
  • child is percent or hard number that cannot exceed parent queue

priority level specifies the order in which a service is to occur relative to other queues and is used in CBQ and PRIQ, but not HFSC. Priority is does _not_ define an amount of bandwidth, but the order in which packets are buffered before being set out of the interface. Default (1)[3]

qlimit: the amount of packets to buffer and queue when the amount of available bandwidth has been exceeded. This value is 50 packets by default. When the total amount of upload bandwidth has been reached on the outgoing interface or higher queues are taking up all of the bandwidth then no more data can be sent. The qlimit will put the packets the queue can not send out into slots in memory in the order that they arrive. When bandwidth is available the qlimit slots will be emptied in the order they arrived; first in, first out (FIFO). If the qlimit reaches the maximum value of qlimit, the packets will be dropped.[3]

Look at qlimit slots as "emergency use only," but as a better alternative to dropping the packets out right. Understand dropping packets is the proper way TCP knows it needs to reduce bandwidth; so dropping packets are not bad. The problem is TCP Tahoe or Reno methods will slow down the connection too severely and it takes a while to ramp back up after a dropped packet. A small qlimit buffer helps smooth out the connection, but "buffer bloat" works against TCP's congestion control. Also, do not think that setting the qlimit really high will solve the problem of bandwidth starvation and packet drops. What you want to do is setup a queue with the proper bandwidth boundaries so that packets only go into the qlimit slots for a short time (no more than a second), if ever.[3]

Calculating qlimit: If the qlimit is too large then you will run into a common issue called buffer bloat. Search on Google for "buffer bloat" for more information. A good idea is to set the qlimit to the amount of packets you want to buffer (not drop) in no more then a given amount of time. Take the total amount of upload bandwidth you have for your connection. Lets say that is 25 megabit upload speed. Now decide how much time you are willing to buffer packets before they get sent out. Lets say we will buffer 0.5 seconds which is quite long. So, 25 megabit divided by 8 is 3.125 megabytes per second. The average maximum segment size is 1460 bytes. 3.125 MB/sec divided by 0.001460 MB is 2140.41 packets per second. Now, we decided that we want to queue 0.5 seconds which is 2140.41 packets per second time 0.5 seconds which is 1070 packets. Thus, we set the qlimit at 1070. 1070 packets at a MSS of 1460 bytes is a 1.562 megabyte buffer. This is just a rough model, but you get the idea. We prefer to set our buffer a little high so that network spikes get buffered for 0.5 to one(1) second and then sent out. This method smooths out upload spikes, but does add some buffer bloat to our external network connection. In _our_tests on _our_ network a larger buffer worked better in the real world then the default qlimit of 50 packets set by OpenBSD. Do your own tests and make an informed decision.

realtime: the amount of bandwidth that is guaranteed to the queue no matter what any other queue needs. Realtime can be set from 0% to 80% of total connection bandwidth. Lets say you want to make sure that your web server gets 25KB/sec of bandwidth no matter what. Setting the realtime value will give the web server queue the bandwidth it needs even if other queues want to share its bandwidth.

upperlimit: the amount of bandwidth the queue can _never_ exceed. For example, say you want to setup a new mail server and you want to make sure that the server never takes up more than 50% of your available bandwidth. Or lets say you have a p2p user you need the limit. Using the upperlimit value will keep them from abusing the connection.

linkshare (m2): this value has the exact same use as "bandwidth" above. If you decide to use both "bandwidth" and "linkshare" in the same rule, pf (OpenBSD) will override the bandwidth directive and use "linkshare m2". This may cause more confusion than it is worth especially if you have two different settings in each. For this reason we are not going to use linkshare in our rules. The only reason you may want to use linkshare _instead of_ bandwidth is if you want to enable a nonlinear service curve.

nonlinear service curve (NLSC or just SC): The directives realtime, upperlimit and linkshare can all take advantage of a NLSC. In our example below we will use this option on our "web" queue. The format for service curve specifications is (m1, d, m2). m2 controls the bandwidth assigned to the queue. m1 and d are optional and can be used to control the initial bandwidth assignment. For the first d milliseconds the queue gets the bandwidth given as m1, after wards the value given in m2.

default: the default queue. As data connections or rules which are not specifically put into any other queue will be put into the default queue rule. This directive must be in only one rule. You can _not_ have two(2) default directives in any two(2) rules.

ecn: In ALTQ, ECN (Explicit Congestion Notification) works in conjunction with RED (Random early detection). ECN allows end-to-end notification of network congestion without dropping packets.

ECN is an optional feature which is used when both endpoints support it and are willing to use it. OpenBSD has ecn disabled by default and Ubuntu has it turned on only if the remote system asks for it first. Traditionally, TCP/IP networks signal congestion by dropping packets. When ECN is successfully negotiated, an ECN-aware router may set a mark in the IP header instead of dropping a packet in order to signal impending congestion. The receiver of the packet echoes the congestion indication to the sender, which must react as though a packet was dropped. ALTQ's version of RED is similar to Weighted RED (WRED) and RED In/Out (RIO) which provide early detection when used with ECN. The end result is a more stable TCP connection over congested networks.

Be very careful when enabling ECN on your machines. Remember that any router or ECN enabled device can notify both the client and server to slow the connection down. If a machine in your path is configured to send ECN when their congestion is low then your connections speed will suffer greatly. For example, telling clients to slow their connections when the link is 90% saturated would be reasonable. The connection would have a 10% safety buffer instead of dropping packets. Some routers are configured incorrectly and will send ECN when they are only 10%-50% utilized. This means your throughput speeds will be painfully low even though there is plenty of base bandwidth available. Truthfully, we do not use ECN or RED due to the ability of routers, misconfigured or not, to abuse congestion notification.

CBQ Details

Research 1

Let us first define some basic terms in CBQ. In CBQ, every class has variables idle and avgidle and parameter maxidle used in computing the limit status for the class, and the parameter offtime used in determining how long to restrict throughput for overlimit classes.[4]


  1. Idle: The variable idle is the difference between the desired time and the measured actual time between the most recent packet transmissions for the last two packets sent from this class. When the connection is sending more than its allocated bandwidth, then idle is negative. When the connection is sending perfectly at its alloted rate, then idle is zero.
  1. avgidle: The variable avgidle is the average of idle, and it computed using an exponential weigted moving average (EWMA). When the avgidle is zero or lower, then the class is overlimit (the class has been exceeding its allocated bandwidth in a recent short time interval).
  1. maxidle: The parameter maxidle gives an upper bound for avgidle. Thus maxidle limits the credit given to a class that has recently been under its allocation.
  1. offtime: The parameter offtime gives the time interval that a overlimit must wait before sending another packet. This parameter determines the steady-state burst size for a class when the class is running over its limit.
  1. minidle: The minidle parameter gives a (negative) lower bound for avgidle. Thus, a negative minidle lets the scheduler remember that a class has recently used more than its allocated bandwidth.

There are three types of classes, namely leaf classes (such as a video class) that have directly assigned connections; nonleaf classes used for link-sharing; and the root class that represents the entire output link.

http://qos.ittc.ku.edu/howto/node43.html

Research 2

Moving to this reference: http://www.lartc.org/lartc.html#LARTC.QDISC.CLASSFUL

Quote:

9.5.4. The famous CBQ qdisc [5]

As said before, CBQ is the most complex qdisc available, the most hyped, the least understood, and probably the trickiest one to get right. This is not because the authors are evil or incompetent, far from it, it's just that the CBQ algorithm isn't all that precise and doesn't really match the way Linux works.

Besides being classful, CBQ is also a shaper and it is in that aspect that it really doesn't work very well. It should work like this. If you try to shape a 10mbit/s connection to 1mbit/s, the link should be idle 90% of the time. If it isn't, we need to throttle so that it IS idle 90% of the time.

This is pretty hard to measure, so CBQ instead derives the idle time from the number of microseconds that elapse between requests from the hardware layer for more data. Combined, this can be used to approximate how full or empty the link is.

This is rather tortuous and doesn't always arrive at proper results. For example, what if the actual link speed of an interface that is not really able to transmit the full 100mbit/s of data, perhaps because of a badly implemented driver? A PCMCIA network card will also never achieve 100mbit/s because of the way the bus is designed - again, how do we calculate the idle time?

It gets even worse if we consider not-quite-real network devices like PPP over Ethernet or PPTP over TCP/IP. The effective bandwidth in that case is probably determined by the efficiency of pipes to userspace - which is huge.

People who have done measurements discover that CBQ is not always very accurate and sometimes completely misses the mark.

In many circumstances however it works well. With the documentation provided here, you should be able to configure it to work well in most cases.

9.5.4.1. CBQ shaping in detail

As said before, CBQ works by making sure that the link is idle just long enough to bring down the real bandwidth to the configured rate. To do so, it calculates the time that should pass between average packets.

During operations, the effective idletime is measured using an exponential weighted moving average (EWMA), which considers recent packets to be exponentially more important than past ones. The UNIX loadaverage is calculated in the same way.

The calculated idle time is subtracted from the EWMA measured one, the resulting number is called 'avgidle'. A perfectly loaded link has an avgidle of zero: packets arrive exactly once every calculated interval.

An overloaded link has a negative avgidle and if it gets too negative, CBQ shuts down for a while and is then 'overlimit'.

Conversely, an idle link might amass a huge avgidle, which would then allow infinite bandwidths after a few hours of silence. To prevent this, avgidle is capped at maxidle.

If overlimit, in theory, the CBQ could throttle itself for exactly the amount of time that was calculated to pass between packets, and then pass one packet, and throttle again. But see the 'minburst' parameter below.

These are parameters you can specify in order to configure shaping:

avpkt

   Average size of a packet, measured in bytes. Needed for calculating maxidle, which is derived from maxburst, which is specified in packets.

bandwidth

   The physical bandwidth of your device, needed for idle time calculations.

cell

   The time a packet takes to be transmitted over a device may grow in steps, based on the packet size. An 800 and an 806 size packet may take just as long to send, for example - this sets the granularity. Most often set to '8'. Must be an integral power of two.

maxburst

   This number of packets is used to calculate maxidle so that when avgidle is at maxidle, this number of average packets can be burst before avgidle drops to 0. Set it higher to be more tolerant of bursts. You can't set maxidle directly, only via this parameter.

minburst

   As mentioned before, CBQ needs to throttle in case of overlimit. The ideal solution is to do so for exactly the calculated idle time, and pass 1 packet. For Unix kernels, however, it is generally hard to schedule events shorter than 10ms, so it is better to throttle for a longer period, and then pass minburst packets in one go, and then sleep minburst times longer.
   The time to wait is called the offtime. Higher values of minburst lead to more accurate shaping in the long term, but to bigger bursts at millisecond timescales.

minidle

   If avgidle is below 0, we are overlimits and need to wait until avgidle will be big enough to send one packet. To prevent a sudden burst from shutting down the link for a prolonged period of time, avgidle is reset to minidle if it gets too low.
   Minidle is specified in negative microseconds, so 10 means that avgidle is capped at -10us.

mpu

   Minimum packet size - needed because even a zero size packet is padded to 64 bytes on ethernet, and so takes a certain time to transmit. CBQ needs to know this to accurately calculate the idle time.

rate

   Desired rate of traffic leaving this qdisc - this is the 'speed knob'!

Internally, CBQ has a lot of fine tuning. For example, classes which are known not to have data enqueued to them aren't queried. Overlimit classes are penalized by lowering their effective priority. All very smart & complicated. 9.5.4.2. CBQ classful behaviour

Besides shaping, using the aforementioned idletime approximations, CBQ also acts like the PRIO queue in the sense that classes can have differing priorities and that lower priority numbers will be polled before the higher priority ones.

Each time a packet is requested by the hardware layer to be sent out to the network, a weighted round robin process ('WRR') starts, beginning with the lower-numbered priority classes.

These are then grouped and queried if they have data available. If so, it is returned. After a class has been allowed to dequeue a number of bytes, the next class within that priority is tried.

The following parameters control the WRR process:

allot

   When the outer CBQ is asked for a packet to send out on the interface, it will try all inner qdiscs (in the classes) in turn, in order of the 'priority' parameter. Each time a class gets its turn, it can only send out a limited amount of data. 'Allot' is the base unit of this amount. See the 'weight' parameter for more information.

prio

   The CBQ can also act like the PRIO device. Inner classes with higher priority are tried first and as long as they have traffic, other classes are not polled for traffic.

weight

   Weight helps in the Weighted Round Robin process. Each class gets a chance to send in turn. If you have classes with significantly more bandwidth than other classes, it makes sense to allow them to send more data in one round than the others.
   A CBQ adds up all weights under a class, and normalizes them, so you can use arbitrary numbers: only the ratios are important. People have been using 'rate/10' as a rule of thumb and it appears to work well. The renormalized weight is multiplied by the 'allot' parameter to determine how much data can be sent in one round. 

Please note that all classes within an CBQ hierarchy need to share the same major number!


The Gist

Reading the information from resource 2 you get the gist that CPQ is a standard type QoS and that "CBQ is merely the oldest kid on the block - and also the most complex one. It may not always do what you want. This may come as something of a shock to many who fell for the 'sendmail effect', which teaches us that any complex technology which doesn't come with documentation must be the best available."[5]

It seems to work off the concept of idle and the fact that packets should be spread out more if they need to be slowed down. It uses averages to figure out if your interfaces are meeting the limits and adjusts the flow to your settings.

The info doc states that CBQ has a lot of knobs to tune and it certainly seems so. The doc also talk about the type of links that it will work on correctly. Quote:

This is pretty hard to measure, so CBQ instead derives the idle time from the number of microseconds that elapse between requests from the hardware layer for more data. Combined, this can be used to approximate how full or empty the link is.

This is rather tortuous and doesn't always arrive at proper results. For example, what if the actual link speed of an interface that is not really able to transmit the full 100mbit/s of data, perhaps because of a badly implemented driver? A PCMCIA network card will also never achieve 100mbit/s because of the way the bus is designed - again, how do we calculate the idle time?

It gets even worse if we consider not-quite-real network devices like PPP over Ethernet or PPTP over TCP/IP. The effective bandwidth in that case is probably determined by the efficiency of pipes to userspace - which is huge.

End Quote.

ALTQ_FAIRQ or just FAIRQ

Fair Queuing For ALTQ[6]

"I have a question for the PF/ALTQ masters out there," Matthew Dillon began on the DragonFlyBSD kernel mailing list, having recently switched from using a Cisco router to a DragonFlySD server running PF. "I am trying to configure PF in a manner similar to what Cisco's fair-queue algorithm does. Cisco's algorithm basically hashes TCP and UDP traffic based on the port/IP pairs, creating a bunch of lists of backlogged packets and then schedules the packets at the head of each list." He went on to explain that he was unsuccessfully trying to configure the same thing with PF, "neither CBQ nor HFSC seem to work well. I can separate certain types of traffic but the real problem is when there are multiple TCP connections that are essentially classified the same, and one is hogging the outgoing bandwidth. So the question is, is there a PF solution for that or do I need to write a new ALTQ mechanic to implement fair queueing?"

Not finding a solution, he followed with a series of patches implementing what he needed. He explained the resulting logic noting, "unless something comes up I am going to commit this to DragonFly on Friday and call it done. I would be pleased if other projects picked up some or all of the work":

"The queues are scanned from highest priority to lowest priority; if the packet bandwidth on the queue does not exceed the bandwidth parameter and a packet is available, a packet will be chosen fro that queue; if a packet is available but the queue has exceeded the specified bandwidth, the next lower priority queue is scanned (and so forth); if NO lower priority queues either have packets or are all over the bandwidth limit, then a packet will be taken from the highest priority queue with a packet ready; packet rate can exceed the queue bandwidth specification (but will not exceed the interface bandwidth specification, of course), but under full saturation the average bandwidth for any given queue will be limited to the specified value."

CoDel

http://en.wikipedia.org/wiki/CoDel

CoDel distinguishes between two "types" of a queue (or rather, the effects produced by a queue) Good and bad queues:

Good queue

  • Defined as a queue that exhibits no buffer bloat, i.e. catches and handles communications bursts with no more than a temporary increase in queue delay and which maximizes utilization of the network link.

Bad queue

  • Defined as a queue that exhibits buffer bloat, i.e. where a communications burst has caused the buffer to fill up and stay filled, resulting in low utilization and a constantly high buffer delay.


The CoDel algorithm

Based on Jacobson's notion from 2006, CoDel was developed to manage queues under control of the minimum delay experienced by packets in the running buffer window. The goal is to keep this minimum delay below 5 milliseconds. If the minimum delay rises to too high a value, packets are dropped from the window until the delay drops below the maximum level. [7]

Nichols and Jacobson cite several advantages to using nothing more than this metric:

  • CoDel is parameterless. One of the weaknesses in the RED algorithm (according to Jacobson) is that it is too difficult to configure (and too difficult to configure correctly, especially in an environment with dynamic link rates). CoDel has no parameters to set at all.
  • CoDel treats good queue and bad queue differently. A good queue has low delays by nature, so the management algorithm can ignore it, while a bad queue is susceptible to management intervention in the form of dropping packets.
  • CoDel works off of a parameter that is determined completely locally, so it is independent of round-trip delays, link rates, traffic loads and other factors that cannot be controlled or predicted by the local buffer.
  • The local minimum delay can only be determined when a packet leaves the buffer, so no extra delay is needed to run the queue to collect statistics to manage the queue.
  • CoDel adapts to dynamically changing link rates with no negative impact on utilization.
  • CoDel can be implemented relatively simply and therefore can span the spectrum from low-end home routers to high-end routing solutions.

CoDel does nothing to manage the buffer if the minimum delay for the buffer window is below the maximum allowed value. It also does nothing if the buffer is relatively empty (if there are fewer than one MTU's worth of bytes in the buffer). If these conditions do not hold, then CoDel drops packets probabilistically.

Some talk about CoDel: http://www.dd-wrt.com/phpBB2/viewtopic.php?p=752578

Answer about limiting upstream vs downstream[8]:

The downstream buffers are in the hardware at your provider so it is not possible to implement codel for downstream on your side, this must be done by your provider.

You can only limit connections via QoS to not use your full bandwich to prevent filling the buffers. But this slow down the maximal speed but you get lower pings.

More info: https://tools.ietf.org/html/draft-aqm-codel-00

PRIQ

[9]Priority Queueing (PRIQ) assigns multiple queues to a network interface with each queue being given a priority level. A queue with a higher priority is always processed ahead of a queue with a lower priority. If two or more queues are assigned the same priority then those queues are processed in a round-robin fashion.

The queueing structure in PRIQ is flat -- you cannot define queues within queues. The root queue is defined, which sets the total amount of bandwidth that is available, and then sub queues are defined under the root. Consider the following example:

   Root Queue (2Mbps)
       Queue A (priority 1) 
       Queue B (priority 2) 
       Queue C (priority 3) 

The root queue is defined as having 2Mbps of bandwidth available to it and three subqueues are defined. The queue with the highest priority (the highest priority number) is served first. Once all the packets in that queue are processed, or if the queue is found to be empty, PRIQ moves onto the queue with the next highest priority. Within a given queue, packets are processed in a First In First Out (FIFO) manner.

It is important to note that when using PRIQ you must plan your queues very carefully. Because PRIQ always processes a higher priority queue before a lower priority one, it's possible for a high priority queue to cause packets in a lower priority queue to be delayed or dropped if the high priority queue is receiving a constant stream of packets.

Random Early Detection

Random Early Detection (RED)[10] is a congestion avoidance algorithm. Its job is to avoid network congestion by making sure that the queue doesn't become full. It does this by continually calculating the average length (size) of the queue and comparing it to two thresholds, a minimum threshold and a maximum threshold. If the average queue size is below the minimum threshold then no packets will be dropped. If the average is above the maximum threshold then all newly arriving packets will be dropped. If the average is between the threshold values then packets are dropped based on a probability calculated from the average queue size. In other words, as the average queue size approaches the maximum threshold, more and more packets are dropped. When dropping packets, RED randomly chooses which connections to drop packets from. Connections using larger amounts of bandwidth have a higher probability of having their packets dropped.

RED is useful because it avoids a situation known as global synchronization and it is able to accommodate bursts of traffic. Global synchronization refers to a loss of total throughput due to packets being dropped from several connections at the same time. For example, if congestion occurs at a router carrying traffic for 10 FTP connections and packets from all (or most) of these connections are dropped (as is the case with FIFO queueing), overall throughput will drop sharply. This isn't an ideal situation because it causes all of the FTP connections to reduce their throughput and also means that the network is no longer being used to its maximum potential. RED avoids this by randomly choosing which connections to drop packets from instead of choosing all of them. Connections using large amounts of bandwidth have a higher chance of their packets being dropped. In this way, high bandwidth connections will be throttled back, congestion will be avoided, and sharp losses of overall throughput will not occur. In addition, RED is able to handle bursts of traffic because it starts to drop packets before the queue becomes full. When a burst of traffic comes through there will be enough space in the queue to hold the new packets.

RED should only be used when the transport protocol is capable of responding to congestion indicators from the network. In most cases this means RED should be used to queue TCP traffic and not UDP or ICMP traffic.

For a more detailed look at the theory behind RED, please see References on RED.

Explicit Congestion Notification

Explicit Congestion Notification (ECN)[11] works in conjunction with RED to notify two hosts communicating over the network of any congestion along the communication path. It does this by enabling RED to set a flag in the packet header instead of dropping the packet. Assuming the sending host has support for ECN, it can then read this flag and throttle back its network traffic accordingly.

For more information on ECN, please refer to RFC 3168.

MISC

References

  1. http://www.lartc.org/lartc.html#LARTC.SFQ
  2. https://forum.pfsense.org/index.php?topic=80939.0
  3. 3.0 3.1 3.2 3.3 https://calomel.org/pf_hfsc.html
  4. http://qos.ittc.ku.edu/howto/node43.html
  5. 5.0 5.1 http://www.lartc.org/lartc.html#LARTC.QDISC.CLASSFUL
  6. https://forum.pfsense.org/index.php?topic=42690.0
  7. http://en.wikipedia.org/wiki/CoDel
  8. http://forum.ipfire.org/viewtopic.php?t=7826
  9. http://www.openbsd.org/faq/pf/queueing.html
  10. http://www.openbsd.org/faq/pf/queueing.html
  11. http://www.openbsd.org/faq/pf/queueing.html