Linux IP Virtual Server for Netfilter and Linux 2.4

TODO:

- redesign LVS to work in setups with multiple default routes (this
  requires changes in the kernels, calling ip_route_input with
  different arguments). The end goal: one routing call in each
  direction (as before) but correct routing in the in->out direction.
  The problems:

  * fwmark virtual services and the need to work at PRE_ROUTING.
    Solution: hook at PRE_ROUTING after the filter and create the
    connections there (after QoS and fwmark setup). Hook at
    PRE_ROUTING, listen for traffic for established connections and
    call ip_route_input with the right arguments (possibly in the
    routing chain). Goal: always pass one filter chain in each
    direction (FORWARD). The fwmark is used only for connection setup
    and is then ignored.

  * hash the NAT connections twice in the same table (at PRE_ROUTING
    we can see both requests and replies); compare with cp->vaddr to
    detect the right direction

- help from Netfilter to redesign the kernel hooks:

  * ROUTING hook (used from netfilter's NAT, LVS-DR and in->out
    LVS-NAT)
  * fixed ip_route_input to do source routing with the masquerade
    address as source (lsrc argument)
  * more control over what to walk in the netfilter hooks?

- different timeouts for each virtual server (more control over the
  connection timeouts)

- allow LVS to be used as NAT router/balancer for outgoing traffic

CURRENT STATE:

Running variants:

1. only LVS - the fastest
2. LVS + ipfw NAT
3. LVS + iptables NAT

Where LVS is placed (a registration sketch follows this section):

LOCAL_IN:100				ip_vs_in
FORWARD:99				ip_vs_forward_icmp
FORWARD:100				ip_vs_out
POST_ROUTING:NF_IP_PRI_NAT_SRC-1	ip_vs_post_routing

The chains:

The out->in LVS packets (for any forwarding method) walk:

pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING

	LOCAL_IN
		ip_vs_in -> ip_route_output/dst cache ->
		mark skb->nfcache with special bit value ->
		ip_send -> POST_ROUTING
	POST_ROUTING
		ip_vs_post_routing - check skb->nfcache and exit from
		the chain if our bit is set

The in->out LVS packets (for LVS/NAT) walk:

pre_routing -> FORWARD -> POST_ROUTING

	FORWARD (check for related ICMP):
		ip_vs_forward_icmp -> local delivery ->
		mark skb->nfcache -> POST_ROUTING
	FORWARD
		ip_vs_out -> NAT -> mark skb->nfcache -> NF_ACCEPT
	POST_ROUTING
		ip_vs_post_routing - check skb->nfcache and exit from
		the chain if our bit is set

Why LVS is placed there:

- LVS creates connections after the packets are marked, i.e. after
  PRE_ROUTING:MANGLE:-150 or PRE_ROUTING:FILTER:0. LVS can use the
  skb->nfmark as a virtual service ID.

- LVS must be after PRE_ROUTING:FILTER+1 (net/sched/sch_ingress.c) for
  QoS setups. This way the incoming traffic can be policed before it
  reaches LVS.

- LVS creates connections after the input routing because the routing
  can decide to deliver locally packets that are marked, or other
  packets specified with routing rules. Transparent proxying handled
  from the netfilter NAT code is not always a good solution.

- LVS needs to forward packets without looking at the IP header
  (direct routing method), so calling ip_route_input with arguments
  taken from the IP header only is not useful for LVS.

- LVS is after any firewall rules in LOCAL_IN and FORWARD.


*** Requirements for the PRE_ROUTING chain ***

Sorry, we can't waste time here. The netfilter connection tracking can
mangle packets here, and at this point we don't know whether a packet
is for one of our virtual services (a new connection) or for an
existing connection (which requires a lookup in the LVS connection
table). We are sure that we can't decide whether to create new
connections at this place, but a lookup for existing connections is
possible under some conditions: the packets must be defragmented, etc.
There are many nice modules in this chain that can feed LVS with
(possibly modified) packets.
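
As a concrete illustration of the placement table above, here is a
minimal registration sketch assuming the Linux 2.4 netfilter API. Only
the hook points and priorities are taken from this document; the
handler bodies are elided:

	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>

	/* the four handlers, all with the nf_hookfn signature:
	 * unsigned int fn(unsigned int hooknum, struct sk_buff **skb_p,
	 *		   const struct net_device *in,
	 *		   const struct net_device *out,
	 *		   int (*okfn)(struct sk_buff *));
	 * (bodies elided) */
	static nf_hookfn ip_vs_in, ip_vs_forward_icmp, ip_vs_out,
			 ip_vs_post_routing;

	/* after LOCAL_IN:FILTER=0, i.e. after the firewall rules */
	static struct nf_hook_ops ip_vs_in_ops = {
		{ NULL, NULL }, ip_vs_in, PF_INET, NF_IP_LOCAL_IN, 100
	};
	/* related ICMP, just before ip_vs_out */
	static struct nf_hook_ops ip_vs_forward_icmp_ops = {
		{ NULL, NULL }, ip_vs_forward_icmp, PF_INET,
		NF_IP_FORWARD, 99
	};
	/* in->out LVS/NAT packets, after FORWARD:FILTER=0 */
	static struct nf_hook_ops ip_vs_out_ops = {
		{ NULL, NULL }, ip_vs_out, PF_INET, NF_IP_FORWARD, 100
	};
	/* first in POST_ROUTING, just before netfilter's NAT_SRC */
	static struct nf_hook_ops ip_vs_post_routing_ops = {
		{ NULL, NULL }, ip_vs_post_routing, PF_INET,
		NF_IP_POST_ROUTING, NF_IP_PRI_NAT_SRC - 1
	};

	/* module init then calls nf_register_hook() on each ops */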
*** Requirements for the LOCAL_IN chain ***

The conditions when the sk_buff comes:

- ip_local_deliver() defragments the packets (ip_defrag) for us
- the incoming sk_buff can be non-linear
- when the sk_buff comes in, only read access is guaranteed

What we do:

- packets generated locally are not considered because there is no
  known forwarding method that can establish connections initiated
  from the director.

- only TCP, UDP and the ICMP packets related to them are considered

- the protocol header must be present before doing any work based on
  fields from the IP or protocol header.

- we detect here packets for the virtual services or packets for
  existing connections, and then the transmitter function for the used
  forwarding method is called

- the NAT transmitter performs the following actions:

  We try to optimize for most of the traffic we see: the normal
  traffic that is not bound to any application helper, i.e. when the
  data part (payload) of the packets is not written or even not read
  at all. In that case we change the addresses and the ports in the IP
  and protocol headers, but we don't do any checksum checking for
  them. We perform an incremental checksum update after the packet is
  mangled and rely on the real server to perform the full check
  (headers and payload); a sketch of this incremental update follows
  this section.

  If the connection is bound to some application helper (FTP, for
  example) we always perform checksum checking, on the assumption that
  the data is usually changed, and on the additional assumption that
  the traffic using application helpers is low. To perform such a
  check, the whole payload must be present in the provided sk_buff.
  For this, we call functions that linearize the sk_buff data by
  assembling all its data fragments.

  Before the addresses and ports are changed we must have write access
  to the packet data (headers and payload). This guarantees that any
  other readers see the packet data unchanged. The copy-on-write is
  performed by the linearization function for packets consisting of
  many fragments. For all other packets we must copy the packet data
  (headers and payload) if it is used by someone else (the sk_buff was
  cloned). The packets not bound to application helpers need such
  write access only for the first fragment, because for them only the
  IP and protocol headers are changed, and we guarantee that these are
  in the first fragment. For the packets using application helpers the
  linearization is already done and we are sure that there is only one
  fragment. As a result, we need write access (copy if cloned) only
  for the first fragment. After the application helper is called to
  update the packet data, we perform a full checksum calculation.

- the DR transmitter performs the following actions:

  Nothing special; it may be the shortest function. The only action is
  to reroute the packet to the bound real server. If the packet is
  fragmented, ip_send_check() should be called to refresh the
  checksum.

- the TUN transmitter performs the following actions:

  Copies the packet if it is already referenced by someone else or
  when there is no space for the IPIP prefix header. The packet is
  rerouted to the real server. If the packet is fragmented,
  ip_send_check() should be called to refresh the checksum in the old
  IP header.

- if the packets must leave the box we send them to POST_ROUTING via
  ip_send and return NF_STOLEN. This means that we remove the packet
  from the LOCAL_IN chain before it reaches priority LAST-1. The
  LocalNode feature just returns NF_ACCEPT without mangling the
  packet.

In this chain, if a packet is for an LVS connection (even a newly
created one), LVS calls ip_route_output (or uses a destination cache),
marks the packet as LVS property (sets a bit in skb->nfcache) and
calls ip_send() to jump to the POST_ROUTING chain. There our
ip_vs_post_routing hook must call the okfn for the packets with our
special nfcache bit value (is skb->nfcache used after the routing
calls? We rely on the fact that it is not) and return NF_STOLEN.

One side effect: LVS can forward packets even when ip_forward=0, but
only for the DR and TUN methods. For these methods not even the TTL is
decremented, nor is the data checksum checked.
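
To make the incremental checksum update concrete, here is a minimal
sketch in the spirit of the IPVS helpers (RFC 1624 style folding).
Treat the helper name and the call site as illustrative, not as a
copy of the sources:

	#include <net/checksum.h>	/* csum_partial, csum_fold */

	/* Update a 16-bit checksum when a 32-bit field changes,
	 * without touching the payload.  The caller passes the
	 * complemented old value and the plain new value. */
	static inline __u16 ip_vs_check_diff(__u32 old, __u32 new,
					     __u16 oldsum)
	{
		__u32 diff[2] = { old, new };

		/* sum ~old + new into the complemented old checksum,
		 * then fold back to 16 bits and complement */
		return csum_fold(csum_partial((char *) diff,
					      sizeof(diff),
					      oldsum ^ 0xFFFF));
	}

	/* e.g. after rewriting the destination address of a packet
	 * (old_daddr/new_daddr are illustrative locals):
	 *
	 *	tcph->check = ip_vs_check_diff(~old_daddr, new_daddr,
	 *				       tcph->check);
	 */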
*** Requirements for the FORWARD chain ***

LVS first checks for ICMP packets related to TCP or UDP connections.
Such packets are handled as if they were received in the LOCAL_IN
chain - they are locally delivered. Used for transparent proxy setups.

LVS looks in this chain for in->out packets, but only for the LVS/NAT
method. In any case, new connections are not created here; the lookup
is for existing connections only.

In this chain the ip_vs_out function can be called from many places:

FORWARD:0 - the ipfw compat mode calls ip_vs_out between the forward
firewall and the masquerading. This way LVS can grab the outgoing
packets for its connections and keep them away from netfilter's NAT
code.

FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here
twice if the ipfw compat module is used, because ip_vs_out is called
once from FORWARD:0 (fw_in) and after that from priority 100, where
LVS always registers the ip_vs_out function. We detect this second
call by looking at the skb->nfcache bit value: if the bit is set we
return NF_ACCEPT. In fact, the second ip_vs_out call is avoided when
the first one returns NF_STOLEN after calling the okfn.

The actions we perform are the same as in the LOCAL_IN chain for the
NAT transmitter, with the exception that we must call ip_defrag(). The
other difference is that we have write access to the first fragment
(it is not referenced by anyone else) after ip_forward() calls
skb_cow().

*** Requirements for the POST_ROUTING chain ***

LVS marks the packets for debugging; they appear to come from
LOCAL_OUT, but that chain is not traversed. The LVS requirements for
the POST_ROUTING chain include the fragmentation code only. But even
the ICMP messages are generated and mangled ready for sending long
before the POST_ROUTING chain: ip_send() does not call ip_fragment()
for the LVS packets because LVS returns ICMP_FRAG_NEEDED when the mtu
is shorter. LVS makes MTU checks when accepting packets and selecting
the output device. So the ip_refrag POST_ROUTING hook is not used by
LVS.

The result is: LVS must hook POST_ROUTING first (maybe only after the
ipfw compat filter) and return NF_STOLEN for its packets (detected by
checking the special skb->nfcache bit value).
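
A minimal sketch of both sides of this nfcache handshake, assuming a
dedicated bit such as NFC_IPVS_PROPERTY (the bit name follows the
IPVS sources; the helper name and the rest are illustrative):

	#include <linux/skbuff.h>
	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>
	#include <net/ip.h>		/* ip_send() */
	#include <net/ip_vs.h>		/* NFC_IPVS_PROPERTY (assumed
					 * location of the bit) */

	/* LOCAL_IN/FORWARD side: the transmitter has already set
	 * skb->dst (ip_route_output or the dst cache); mark the skb
	 * as LVS property and jump to POST_ROUTING via ip_send().
	 * The hook returns this value, stealing the packet. */
	static int ip_vs_skb_send(struct sk_buff *skb)
	{
		skb->nfcache |= NFC_IPVS_PROPERTY;
		ip_send(skb);		/* enters POST_ROUTING */
		return NF_STOLEN;	/* removed from this chain */
	}

	/* POST_ROUTING side: steal our packets from the chain so the
	 * later hooks (netfilter's NAT_SRC, ip_refrag) never see
	 * them; everything else passes untouched. */
	static unsigned int
	ip_vs_post_routing(unsigned int hooknum, struct sk_buff **skb_p,
			   const struct net_device *in,
			   const struct net_device *out,
			   int (*okfn)(struct sk_buff *))
	{
		if (!((*skb_p)->nfcache & NFC_IPVS_PROPERTY))
			return NF_ACCEPT;	/* not ours */

		(*okfn)(*skb_p);	/* transmit the packet */
		return NF_STOLEN;	/* exit from the chain */
	}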
The Netfilter hooks:

Priorities:
	NF_IP_PRI_FIRST = INT_MIN,
	NF_IP_PRI_CONNTRACK = -200,
	NF_IP_PRI_MANGLE = -150,
	NF_IP_PRI_NAT_DST = -100,
	NF_IP_PRI_FILTER = 0,
	NF_IP_PRI_NAT_SRC = 100,
	NF_IP_PRI_LAST = INT_MAX,

PRE_ROUTING (ip_input.c:ip_rcv):
	CONNTRACK=-200,	ip_conntrack_core.c:ip_conntrack_in
	MANGLE=-150,	iptable_mangle.c:ipt_hook
	NAT_DST=-100,	ip_nat_standalone.c:ip_nat_fn
	FILTER=0,	ip_fw_compat.c:fw_in, defrag, firewall, demasq,
			redirect
	FILTER+1=1,	net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
	FILTER=0,	iptable_filter.c:ipt_hook
	LVS=100,	ip_vs_in
	LAST-1,		ip_fw_compat.c:fw_confirm
	CONNTRACK=LAST-1, ip_conntrack_standalone.c:ip_confirm

FORWARD (ip_forward.c:ip_forward):
	FILTER=0,	iptable_filter.c:ipt_hook
	FILTER=0,	ip_fw_compat.c:fw_in, firewall,
			LVS:check_for_ip_vs_out, masquerade
	LVS=99,		ip_vs_forward_icmp
	LVS=100,	ip_vs_out

LOCAL_OUT (ip_output.c):
	CONNTRACK=-200,	ip_conntrack_standalone.c:ip_conntrack_local
	MANGLE=-150,	iptable_mangle.c:ipt_local_out_hook
	NAT_DST=-100,	ip_nat_standalone.c:ip_nat_local_fn
	FILTER=0,	iptable_filter.c:ipt_local_out_hook

POST_ROUTING (ip_output.c:ip_finish_output):
	FILTER=0,	ip_fw_compat.c:fw_in, firewall, unredirect,
			mangle ICMP replies
	LVS=NAT_SRC-1,	ip_vs_post_routing
	NAT_SRC=100,	ip_nat_standalone.c:ip_nat_out
	CONNTRACK=LAST,	ip_conntrack_standalone.c:ip_refrag

CONNTRACK:	PRE_ROUTING, LOCAL_IN, LOCAL_OUT, POST_ROUTING
FILTER:		LOCAL_IN, FORWARD, LOCAL_OUT
MANGLE:		PRE_ROUTING, LOCAL_OUT
NAT:		PRE_ROUTING, LOCAL_OUT, POST_ROUTING
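
For reference, the priority values above as they appear in the 2.4
headers (the enum is quoted from include/linux/netfilter_ipv4.h; the
comments marking the LVS insertion points are taken from the tables in
this document):

	enum nf_ip_hook_priorities {
		NF_IP_PRI_FIRST = INT_MIN,
		NF_IP_PRI_CONNTRACK = -200,
		NF_IP_PRI_MANGLE = -150,
		NF_IP_PRI_NAT_DST = -100,
		NF_IP_PRI_FILTER = 0,
		/* LOCAL_IN: ip_vs_in at 100
		 * FORWARD:  ip_vs_forward_icmp at 99, ip_vs_out at 100
		 * POST_ROUTING: ip_vs_post_routing at NAT_SRC-1 */
		NF_IP_PRI_NAT_SRC = 100,
		NF_IP_PRI_LAST = INT_MAX,
	};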