QoS

pfSense

HFSC burst is broken in pfSense. "It's a kernel issue with dummynet in pf."

HFSC General Information

If you put ACK packets in a high-bandwidth queue, they leave the interface quickly and promptly confirm to the remote system that data was received, so the sender keeps transmitting at full speed.

You can give certain services priority to keep their speed up and their latency low.

You can serve a set amount of data out quickly while slowing down long-running transfers. For example, you can decide to serve data quickly at the beginning of a connection and slow it down after a few seconds. This is called a nonlinear service curve (NLSC, or just SC).[1]

bandwidth

  • the parent queue's bandwidth is the maximum bandwidth of the entire interface
  • a child queue's bandwidth is a percentage or a hard number and can never exceed the parent queue's bandwidth
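
As a quick illustration of how those two rules interact, here is a minimal sketch (plain Python, not pf syntax; the numbers are made up) that resolves child specifications given as a percentage or a hard number against the parent and confirms none exceeds it.

    def resolve_child(parent_kbit, spec):
        """spec is either a percentage string like '25%' or an absolute value in Kbit."""
        if isinstance(spec, str) and spec.endswith("%"):
            return parent_kbit * float(spec[:-1]) / 100.0
        return float(spec)

    parent = 25000                    # 25 Mbit interface, expressed in Kbit
    children = ["25%", "50%", 5000]   # mixed percentages and hard numbers

    resolved = [resolve_child(parent, c) for c in children]
    assert all(c <= parent for c in resolved), "a child exceeds the parent queue"
    print(resolved)                   # [6250.0, 12500.0, 5000.0]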

priority: the priority level specifies the order in which a service is handled relative to other queues; it is used in CBQ and PRIQ, but not HFSC. Priority does _not_ define an amount of bandwidth, only the order in which packets are buffered before being sent out of the interface. Default is 1.[1]

qlimit: the number of packets to buffer and queue when the available bandwidth has been exceeded. The default is 50 packets. When the total upload bandwidth on the outgoing interface has been reached, or higher queues are taking up all of the bandwidth, no more data can be sent. The qlimit puts the packets the queue cannot send into memory slots in the order they arrive. When bandwidth becomes available, the qlimit slots are emptied in the order they arrived: first in, first out (FIFO). If the queue fills to its qlimit, further packets are dropped.[1]

Look at qlimit slots as "emergency use only," a better alternative to dropping packets outright. Understand that dropping packets is the proper way TCP knows it needs to reduce bandwidth, so dropping packets is not inherently bad. The problem is that the TCP Tahoe and Reno methods slow the connection down too severely, and it takes a while to ramp back up after a dropped packet. A small qlimit buffer helps smooth out the connection, but "buffer bloat" works against TCP's congestion control. Also, do not think that setting the qlimit really high will solve the problem of bandwidth starvation and packet drops. What you want to do is set up a queue with the proper bandwidth boundaries so that packets only go into the qlimit slots for a short time (no more than a second), if ever.[1]
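
A toy model of that behaviour (illustration only, not ALTQ code; the class and packet names are made up): packets that cannot be sent are held in a FIFO buffer of qlimit slots, and once the buffer is full new arrivals are dropped.

    from collections import deque

    class ShapedQueue:
        def __init__(self, qlimit=50):        # OpenBSD's default qlimit is 50
            self.qlimit = qlimit
            self.slots = deque()
            self.dropped = 0

        def enqueue(self, pkt):
            if len(self.slots) >= self.qlimit:
                self.dropped += 1              # buffer full: drop the new packet
            else:
                self.slots.append(pkt)

        def dequeue(self):                     # emptied first in, first out
            return self.slots.popleft() if self.slots else None

    q = ShapedQueue(qlimit=3)
    for pkt in ["p1", "p2", "p3", "p4"]:
        q.enqueue(pkt)                         # "p4" finds the buffer full
    print(q.dequeue(), q.dropped)              # p1 1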

Calculating qlimit: If the qlimit is too large you will run into a common issue called buffer bloat. Search on Google for "buffer bloat" for more information. A good idea is to set the qlimit to the number of packets you want to buffer (not drop) in no more than a given amount of time. Take the total upload bandwidth you have for your connection; let's say that is 25 megabit upload speed. Now decide how much time you are willing to buffer packets before they get sent out; let's say we will buffer for 0.5 seconds, which is quite long. 25 megabit divided by 8 is 3.125 megabytes per second. The average maximum segment size is 1460 bytes, and 3.125 MB/sec divided by 0.001460 MB is 2140.41 packets per second. We decided that we want to queue for 0.5 seconds, and 2140.41 packets per second times 0.5 seconds is about 1070 packets, so we set the qlimit at 1070. 1070 packets at an MSS of 1460 bytes is a 1.562 megabyte buffer. This is just a rough model, but you get the idea. We prefer to set our buffer a little high so that network spikes get buffered for 0.5 to one second and then sent out. This method smooths out upload spikes, but does add some buffer bloat to our external network connection. In _our_ tests on _our_ network a larger buffer worked better in the real world than the default qlimit of 50 packets set by OpenBSD. Do your own tests and make an informed decision.
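
The same arithmetic as a small script (the numbers mirror the 25 megabit / 0.5 second example above):

    upload_mbit = 25          # total upload bandwidth, megabits per second
    buffer_secs = 0.5         # how long we are willing to buffer packets
    mss_bytes   = 1460        # average maximum segment size

    bytes_per_sec   = upload_mbit / 8 * 1_000_000   # 3,125,000 bytes/sec
    packets_per_sec = bytes_per_sec / mss_bytes     # ~2140.41 packets/sec
    qlimit          = round(packets_per_sec * buffer_secs)
    buffer_mb       = qlimit * mss_bytes / 1_000_000

    print(qlimit, round(buffer_mb, 3))              # 1070 1.562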

realtime: the amount of bandwidth that is guaranteed to the queue no matter what any other queue needs. Realtime can be set from 0% to 80% of total connection bandwidth. Let's say you want to make sure that your web server gets 25KB/sec of bandwidth no matter what. Setting the realtime value will give the web server queue the bandwidth it needs even if other queues want to share its bandwidth.

upperlimit: the amount of bandwidth the queue can _never_ exceed. For example, say you want to set up a new mail server and make sure it never takes up more than 50% of your available bandwidth, or say you have a p2p user you need to limit. Using the upperlimit value will keep them from abusing the connection.

linkshare (m2): this value has the exact same use as "bandwidth" above. If you use both "bandwidth" and "linkshare" in the same rule, pf (OpenBSD) will override the bandwidth directive and use "linkshare m2". This may cause more confusion than it is worth, especially if the two have different settings, so we are not going to use linkshare in our rules. The only reason you might want to use linkshare _instead of_ bandwidth is if you want to enable a nonlinear service curve.

nonlinear service curve (NLSC or just SC): the directives realtime, upperlimit and linkshare can all take advantage of an NLSC. In our example below we will use this option on our "web" queue. The format of a service curve specification is (m1, d, m2). m2 controls the bandwidth assigned to the queue. m1 and d are optional and can be used to control the initial bandwidth assignment: for the first d milliseconds the queue gets the bandwidth given as m1, and afterwards the value given in m2.
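
A short sketch of what the (m1, d, m2) curve means for the rate a backlogged queue is given over time (illustration of the semantics only, not HFSC scheduler code; the 5 Mbit / 3000 ms / 1 Mbit numbers are made up):

    def sc_rate(t_ms, m1_kbit, d_ms, m2_kbit):
        """Bandwidth (Kbit/s) assigned t_ms after the queue becomes backlogged."""
        return m1_kbit if t_ms < d_ms else m2_kbit

    # a "web" queue that bursts at 5 Mbit for the first 3000 ms, then settles at 1 Mbit
    for t in (0, 1500, 3000, 10000):
        print(t, sc_rate(t, m1_kbit=5000, d_ms=3000, m2_kbit=1000))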

default: the default queue. Data connections or traffic matched by rules that do not specify another queue are put into the default queue. This directive must appear in exactly one rule; you can _not_ have two(2) default directives in any two(2) rules.

ecn: In ALTQ, ECN (Explicit Congestion Notification) works in conjunction with RED (Random early detection). ECN allows end-to-end notification of network congestion without dropping packets.

ECN is an optional feature which is used when both endpoints support it and are willing to use it. OpenBSD has ECN disabled by default, and Ubuntu turns it on only if the remote system asks for it first. Traditionally, TCP/IP networks signal congestion by dropping packets. When ECN is successfully negotiated, an ECN-aware router may set a mark in the IP header instead of dropping a packet in order to signal impending congestion. The receiver of the packet echoes the congestion indication to the sender, which must react as though a packet was dropped. ALTQ's version of RED is similar to Weighted RED (WRED) and RED In/Out (RIO), which provide early detection when used with ECN. The end result is a more stable TCP connection over congested networks.
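
A simplified sketch of that marking decision (a toy model, not ALTQ's actual RED implementation; the threshold values are made up): between the two RED thresholds an ECN-capable packet gets marked instead of dropped.

    import random

    def red_ecn_action(avg_queue, min_th, max_th, pkt_is_ect):
        """Return 'enqueue', 'mark', or 'drop' for one arriving packet."""
        if avg_queue < min_th:
            return "enqueue"                            # no congestion building
        if avg_queue >= max_th:
            return "drop"                               # hard congestion: drop
        p = (avg_queue - min_th) / (max_th - min_th)    # early-detection probability
        if random.random() < p:
            return "mark" if pkt_is_ect else "drop"     # ECN spares the packet
        return "enqueue"

    print(red_ecn_action(avg_queue=30, min_th=20, max_th=40, pkt_is_ect=True))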

Be very careful when enabling ECN on your machines. Remember that any router or ECN-enabled device can notify both the client and the server to slow the connection down. If a machine in your path is configured to send ECN marks when its congestion is low, then your connection's speed will suffer greatly. For example, telling clients to slow their connections when the link is 90% saturated would be reasonable; the connection would have a 10% safety buffer instead of dropping packets. Some routers are configured incorrectly and will send ECN marks when they are only 10%-50% utilized. This means your throughput will be painfully low even though there is plenty of base bandwidth available. Truthfully, we do not use ECN or RED due to the ability of routers, misconfigured or not, to abuse congestion notification.

CBQ

Research 1

Let us first define some basic terms in CBQ. In CBQ, every class has the variables idle and avgidle and the parameter maxidle, used in computing the limit status for the class, and the parameter offtime, used in determining how long to restrict throughput for overlimit classes.[2]


  1. idle: The variable idle is the measured actual time between the most recent two packet transmissions from this class minus the desired inter-packet time at the class's allocated rate. When the connection is sending more than its allocated bandwidth, idle is negative. When the connection is sending exactly at its allotted rate, idle is zero.
  2. avgidle: The variable avgidle is the average of idle, computed using an exponentially weighted moving average (EWMA). When avgidle is zero or lower, the class is overlimit (the class has been exceeding its allocated bandwidth in a recent short time interval). A short sketch of this bookkeeping follows the list.
  3. maxidle: The parameter maxidle gives an upper bound for avgidle. Thus maxidle limits the credit given to a class that has recently been under its allocation.
  4. offtime: The parameter offtime gives the time interval that an overlimit class must wait before sending another packet. This parameter determines the steady-state burst size for a class when the class is running over its limit.
  5. minidle: The minidle parameter gives a (negative) lower bound for avgidle. Thus, a negative minidle lets the scheduler remember that a class has recently used more than its allocated bandwidth.
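
A simplified sketch of this bookkeeping (not the actual scheduler code; the gap values and EWMA weight are made up): idle is the measured inter-packet gap minus the desired one, avgidle is its EWMA clamped between minidle and maxidle, and the class is overlimit when avgidle drops to zero or below.

    def update_avgidle(avgidle, desired_gap, measured_gap, maxidle, minidle, w=1/16):
        idle = measured_gap - desired_gap              # negative when sending too fast
        avgidle = (1 - w) * avgidle + w * idle         # exponential weighted moving average
        avgidle = max(minidle, min(avgidle, maxidle))  # clamp to [minidle, maxidle]
        return avgidle, avgidle <= 0                   # (new avgidle, overlimit?)

    # a class sending twice as fast as allowed: measured gap is half the desired gap
    avgidle = 0.0
    for _ in range(5):
        avgidle, overlimit = update_avgidle(avgidle, desired_gap=100, measured_gap=50,
                                            maxidle=500, minidle=-200)
    print(round(avgidle, 1), overlimit)                # avgidle drifts negative -> True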

There are three types of classes, namely leaf classes (such as a video class) that have directly assigned connections; nonleaf classes used for link-sharing; and the root class that represents the entire output link.


Research 2

Moving to this reference: http://www.lartc.org/lartc.html#LARTC.QDISC.CLASSFUL

Quote:

9.5.4. The famous CBQ qdisc [3]

As said before, CBQ is the most complex qdisc available, the most hyped, the least understood, and probably the trickiest one to get right. This is not because the authors are evil or incompetent, far from it, it's just that the CBQ algorithm isn't all that precise and doesn't really match the way Linux works.

Besides being classful, CBQ is also a shaper and it is in that aspect that it really doesn't work very well. It should work like this. If you try to shape a 10mbit/s connection to 1mbit/s, the link should be idle 90% of the time. If it isn't, we need to throttle so that it IS idle 90% of the time.

This is pretty hard to measure, so CBQ instead derives the idle time from the number of microseconds that elapse between requests from the hardware layer for more data. Combined, this can be used to approximate how full or empty the link is.

This is rather tortuous and doesn't always arrive at proper results. For example, what is the actual link speed of an interface that is not really able to transmit the full 100mbit/s of data, perhaps because of a badly implemented driver? A PCMCIA network card will also never achieve 100mbit/s because of the way the bus is designed - again, how do we calculate the idle time?

It gets even worse if we consider not-quite-real network devices like PPP over Ethernet or PPTP over TCP/IP. The effective bandwidth in that case is probably determined by the efficiency of pipes to userspace - which is huge.

People who have done measurements discover that CBQ is not always very accurate and sometimes completely misses the mark.

In many circumstances however it works well. With the documentation provided here, you should be able to configure it to work well in most cases.

9.5.4.1. CBQ shaping in detail

As said before, CBQ works by making sure that the link is idle just long enough to bring down the real bandwidth to the configured rate. To do so, it calculates the time that should pass between average packets.

During operations, the effective idletime is measured using an exponential weighted moving average (EWMA), which considers recent packets to be exponentially more important than past ones. The UNIX loadaverage is calculated in the same way.

The calculated idle time is subtracted from the EWMA measured one, the resulting number is called 'avgidle'. A perfectly loaded link has an avgidle of zero: packets arrive exactly once every calculated interval.

An overloaded link has a negative avgidle and if it gets too negative, CBQ shuts down for a while and is then 'overlimit'.

Conversely, an idle link might amass a huge avgidle, which would then allow infinite bandwidths after a few hours of silence. To prevent this, avgidle is capped at maxidle.

If overlimit, in theory, the CBQ could throttle itself for exactly the amount of time that was calculated to pass between packets, and then pass one packet, and throttle again. But see the 'minburst' parameter below.

These are parameters you can specify in order to configure shaping:

avpkt

   Average size of a packet, measured in bytes. Needed for calculating maxidle, which is derived from maxburst, which is specified in packets.

bandwidth

   The physical bandwidth of your device, needed for idle time calculations.

cell

   The time a packet takes to be transmitted over a device may grow in steps, based on the packet size. An 800 and an 806 size packet may take just as long to send, for example - this sets the granularity. Most often set to '8'. Must be an integral power of two.

maxburst

   This number of packets is used to calculate maxidle so that when avgidle is at maxidle, this number of average packets can be burst before avgidle drops to 0. Set it higher to be more tolerant of bursts. You can't set maxidle directly, only via this parameter.

minburst

   As mentioned before, CBQ needs to throttle in case of overlimit. The ideal solution is to do so for exactly the calculated idle time, and pass 1 packet. For Unix kernels, however, it is generally hard to schedule events shorter than 10ms, so it is better to throttle for a longer period, and then pass minburst packets in one go, and then sleep minburst times longer.
   The time to wait is called the offtime. Higher values of minburst lead to more accurate shaping in the long term, but to bigger bursts at millisecond timescales.

minidle

   If avgidle is below 0, we are overlimits and need to wait until avgidle will be big enough to send one packet. To prevent a sudden burst from shutting down the link for a prolonged period of time, avgidle is reset to minidle if it gets too low.
   Minidle is specified in negative microseconds, so 10 means that avgidle is capped at -10us.

mpu

   Minimum packet size - needed because even a zero size packet is padded to 64 bytes on ethernet, and so takes a certain time to transmit. CBQ needs to know this to accurately calculate the idle time.

rate

   Desired rate of traffic leaving this qdisc - this is the 'speed knob'!
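
To make the idle-time idea above concrete, here is a rough sketch of the quantities avpkt, rate, bandwidth and mpu feed into (a simplification, not the kernel's fixed-point arithmetic; the example interface numbers are made up): avpkt/rate gives the desired gap between average packets, avpkt/bandwidth gives how long the wire is actually busy, and the difference is the idle time CBQ tries to enforce.

    def cbq_times_us(avpkt_bytes, rate_bps, bandwidth_bps, mpu_bytes=64):
        size = max(avpkt_bytes, mpu_bytes)                  # mpu: minimum accounted size
        desired_gap = size * 8 / rate_bps * 1_000_000       # us between packets at 'rate'
        wire_time   = size * 8 / bandwidth_bps * 1_000_000  # us the link is actually busy
        return desired_gap, wire_time, desired_gap - wire_time

    # shaping a 100 Mbit interface down to 1 Mbit with 1000-byte average packets
    gap, busy, idle = cbq_times_us(avpkt_bytes=1000, rate_bps=1_000_000,
                                   bandwidth_bps=100_000_000)
    print(round(gap), round(busy), round(idle))             # 8000 80 7920 (microseconds)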

Internally, CBQ has a lot of fine tuning. For example, classes which are known not to have data enqueued to them aren't queried. Overlimit classes are penalized by lowering their effective priority. All very smart & complicated.

9.5.4.2. CBQ classful behaviour

Besides shaping, using the aforementioned idletime approximations, CBQ also acts like the PRIO queue in the sense that classes can have differing priorities and that lower priority numbers will be polled before the higher priority ones.

Each time a packet is requested by the hardware layer to be sent out to the network, a weighted round robin process ('WRR') starts, beginning with the lower-numbered priority classes.

These are then grouped and queried if they have data available. If so, it is returned. After a class has been allowed to dequeue a number of bytes, the next class within that priority is tried.

The following parameters control the WRR process:

allot

   When the outer CBQ is asked for a packet to send out on the interface, it will try all inner qdiscs (in the classes) in turn, in order of the 'priority' parameter. Each time a class gets its turn, it can only send out a limited amount of data. 'Allot' is the base unit of this amount. See the 'weight' parameter for more information.

prio

   The CBQ can also act like the PRIO device. Inner classes with higher priority are tried first and as long as they have traffic, other classes are not polled for traffic.

weight

   Weight helps in the Weighted Round Robin process. Each class gets a chance to send in turn. If you have classes with significantly more bandwidth than other classes, it makes sense to allow them to send more data in one round than the others.
   A CBQ adds up all weights under a class, and normalizes them, so you can use arbitrary numbers: only the ratios are important. People have been using 'rate/10' as a rule of thumb and it appears to work well. The renormalized weight is multiplied by the 'allot' parameter to determine how much data can be sent in one round. 
Please note that all classes within a CBQ hierarchy need to share the same major number!
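
As a closing illustration of the weight/allot arithmetic quoted above (a simplified sketch, not tc's actual normalization; the class names, rates and allot value are made up): weights under one parent only matter as ratios, and each class may dequeue roughly allot times its normalized weight per WRR round.

    def bytes_per_round(allot, weights):
        """weights: {class_name: weight}; bytes each class may send in one WRR round."""
        total = sum(weights.values())
        return {name: round(allot * w / total) for name, w in weights.items()}

    # using the 'rate/10' rule of thumb for classes rated 6, 3 and 1 Mbit
    weights = {"web": 600_000, "mail": 300_000, "bulk": 100_000}
    print(bytes_per_round(allot=1514, weights=weights))
    # {'web': 908, 'mail': 454, 'bulk': 151} -- only the ratios matter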

References

  1. https://calomel.org/pf_hfsc.html
  2. http://qos.ittc.ku.edu/howto/node43.html
  3. http://www.lartc.org/lartc.html#LARTC.QDISC.CLASSFUL