Our front-end Varnish servers handle a lot of traffic, and we currently need an iptables NAT rule that redirects port 80 to port 81, where the Varnish daemon listens. As a result, we end up using the iptables conntrack feature.
Our infrastructure uses DSR (direct server return), which is why we have this iptables NAT running: https://kemptechnologies.com/white-papers/what-is-direct-server-return/
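For reference, a port 80 to port 81 redirect like ours is usually a single rule in the nat table. The exact rule is not shown in this post, so the sketch below is an assumption: it uses the REDIRECT target in the PREROUTING chain, and the hypothetical helper redirect_rule only prints the command so it can be reviewed before being run as root.

```shell
# Hypothetical helper (not part of the original setup): prints an iptables
# command that redirects incoming TCP traffic from one local port to another.
# Assumption: the NAT is a REDIRECT rule in the nat table's PREROUTING chain.
redirect_rule() {
    local from_port=$1
    local to_port=$2
    echo "iptables -t nat -A PREROUTING -p tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}"
}

# Print the rule for our case: port 80 redirected to Varnish on port 81
redirect_rule 80 81
```

Any rule in the nat table makes the kernel track every connection, which is where conntrack comes in.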
This is our current bottleneck: the conntrack table fills up and slows connections down.
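A quick way to see how close the table is to full is to compare the current entry count with the configured maximum. This is a minimal sketch, not part of my check script. It assumes the counters live under /proc/sys/net/netfilter/ (on RedHat 5 they should be under /proc/sys/net/ipv4/netfilter/ as ip_conntrack_count and ip_conntrack_max), and the function name conntrack_usage is made up for this example.

```shell
# Hypothetical helper: prints how full the conntrack table is.
# Both paths can be overridden, e.g. for the older ip_conntrack_* files.
conntrack_usage() {
    local count_file=${1:-/proc/sys/net/netfilter/nf_conntrack_count}
    local max_file=${2:-/proc/sys/net/netfilter/nf_conntrack_max}

    if [ ! -r "$count_file" ] || [ ! -r "$max_file" ]; then
        echo "conntrack counters not readable" >&2
        return 3
    fi

    local count max
    count=$(cat "$count_file")
    max=$(cat "$max_file")

    if [ "$max" -le 0 ]; then
        echo "invalid conntrack maximum: $max" >&2
        return 3
    fi

    # Integer percentage of the table currently in use
    echo "count=$count max=$max used=$(( count * 100 / max ))%"
}

conntrack_usage || echo "conntrack does not appear to be loaded on this host"
```

When used gets close to 100%, new connections start being dropped.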
I wrote a simple script to detect when this behaviour is about to affect us.
The biggest problem is when the load balancer's health check starts to fail: the Varnish server is taken offline and receives no traffic until the health check passes again.
Here is some really good information on the topic:
http://www.iptables.info/en/connection-state.html
An even better explanation:
http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html
An interesting command is slabtop, which shows kernel slab cache usage, including the cache that holds the conntrack entries.
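slabtop reads its numbers from /proc/slabinfo, so the conntrack cache can also be pulled out directly. A rough sketch, assuming the slabinfo 2.x column layout (name, active objects, total objects, object size) and a cache name containing "conntrack"; the helper name conntrack_slab is made up for this example.

```shell
# Hypothetical helper: lists slab caches whose name contains "conntrack",
# with the number of active objects and an approximate size in bytes
# (active objects multiplied by the object size).
conntrack_slab() {
    local slabinfo=${1:-/proc/slabinfo}

    if [ ! -r "$slabinfo" ]; then
        echo "cannot read $slabinfo (root is usually required)" >&2
        return 3
    fi

    awk '$1 ~ /conntrack/ {
        printf "%s active_objs=%d approx_bytes=%d\n", $1, $2, $2 * $4
    }' "$slabinfo"
}

conntrack_slab || true
```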
The output is like this:
Checking all connections (the message is in Portuguese: "There are 46 connections in UNREPLIED state, total"):
# ./check_conntrack_unreplied -w 1000 -c 2000 -t total
OK - Existem 46 conexoes em estado UNREPLIED total | unreplied_total=46
Checking only the load balancer traffic ("There are 0 connections in UNREPLIED state from the balancer"):
# ./check_conntrack_unreplied -w 1000 -c 2000 -t bl
OK - Existem 0 conexoes em estado UNREPLIED do balanceador | unreplied_bl=0
The script has been tested on RedHat/CentOS 5.x and 6.x.
So here is the script:
#!/bin/bash
# Get the current number of UNREPLIED entries in the iptables conntrack table
# Version 20
# By Felipe Ferreira October 2012

# Exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Default arguments
maxconwarn=10
maxconcrit=30
type="total"
HELP="Usage: $0 [-w|--warning N] [-c|--critical N] [-t total|bl]"

# EDIT HERE:
# In my case the load balancer health checks always come from .240 to .253
# Get your regex here: http://www.analyticsmarket.com/freetools/ipregex
BALANCER_NET='192\.168\.34\.(24[0-9]|25[0-3])'

# This file's location changed from RedHat 5 to RedHat 6,
# so use whichever one exists
if [ -f /proc/net/nf_conntrack ]; then
    CONNTRACK_FILE="/proc/net/nf_conntrack"
else
    CONNTRACK_FILE="/proc/net/ip_conntrack"
fi

# Verify that the file was found
if ! [ -f "$CONNTRACK_FILE" ]; then
    echo "ERROR CONNTRACK FILE NOT FOUND"
    exit $STATE_UNKNOWN
fi

# Check arguments, print help
if [ $# -lt 4 ]; then
    echo "$HELP"
    exit $STATE_UNKNOWN
fi

# Get arguments
while test -n "$1"; do
    case "$1" in
        -h|--help)
            echo "$HELP"
            exit $STATE_OK
            ;;
        -w|--warning)
            maxconwarn=$2
            shift
            ;;
        -c|--critical)
            maxconcrit=$2
            shift
            ;;
        -t)
            type=$2
            shift
            ;;
    esac
    shift
done

# Count either all connections or only the balancer's health checks.
# The messages are in Portuguese:
# "There are N connections in UNREPLIED state (total | from the balancer)"
if [ "$type" = "total" ]; then
    CON=$(grep -c UNREPLIED "$CONNTRACK_FILE")
    MSG="Existem $CON conexoes em estado UNREPLIED total | unreplied_total=$CON"
elif [ "$type" = "bl" ]; then
    CON=$(grep -E "$BALANCER_NET" "$CONNTRACK_FILE" | grep -c UNREPLIED)
    MSG="Existem $CON conexoes em estado UNREPLIED do balanceador | unreplied_bl=$CON"
else
    echo "Usage: $0 [-w N] [-c N] -t (total|bl)"
    exit $STATE_UNKNOWN
fi

# Compare against the thresholds
if [ "$CON" -ge "$maxconcrit" ]; then
    OUTPUT="CRITICAL - $MSG"
    exitstatus=$STATE_CRITICAL
elif [ "$CON" -ge "$maxconwarn" ]; then
    OUTPUT="WARNING - $MSG"
    exitstatus=$STATE_WARNING
else
    OUTPUT="OK - $MSG"
    exitstatus=$STATE_OK
fi

echo "$OUTPUT"
exit $exitstatus
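Before deploying, it is worth checking that a BALANCER_NET pattern matches exactly the addresses you expect, since a small regex mistake silently changes what the "bl" check counts. A minimal sketch, assuming the health checks come from 192.168.34.240 to 192.168.34.253; the helper matches_balancer and the destination address 10.0.0.1 are made up for this example.

```shell
# Assumed range: last octet 240 to 253 (adjust to your balancers)
BALANCER_NET='192\.168\.34\.(24[0-9]|25[0-3])'

# Hypothetical helper: succeeds if the given conntrack-style line
# contains an address from the balancer range
matches_balancer() {
    printf '%s\n' "$1" | grep -Eq "$BALANCER_NET"
}

# Try addresses just outside and just inside the range
for ip in 192.168.34.239 192.168.34.240 192.168.34.253 192.168.34.254; do
    if matches_balancer "src=$ip dst=10.0.0.1"; then
        echo "$ip: balancer"
    else
        echo "$ip: other"
    fi
done
```

Only .240 and .253 should be reported as balancer addresses here.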