Linux IP Virtual Server for Netfilter and Linux 2.4

TODO:

- redesign LVS to work in setups with multiple default routes (this
  requires changes in the kernels, calling ip_route_input with
  different arguments). The end goal: one routing call in each
  direction (as before) but correct routing in the in->out direction.
  The problems:

  * fwmark virtual services and the need to work at PRE_ROUTING.
    Solution: hook at PRE_ROUTING after the filter and create the
    connections there (after QoS and fwmark setup). Hook at
    PRE_ROUTING, listen for traffic for established connections and
    call ip_route_input with the right arguments (possibly in the
    routing chain). Goal: always pass one filter chain in each
    direction (FORWARD). The fwmark is used only for connection setup
    and is then ignored.

  * hash the NAT connections twice in the same table (at PRE_ROUTING
    we can see both requests and replies); compare with cp->vaddr to
    detect the right direction

- help from Netfilter to redesign the kernel hooks:

  * ROUTING hook (used from netfilter's NAT, LVS-DR and in->out
    LVS-NAT)
  * fixed ip_route_input to do source routing with the masquerade
    address as source (lsrc argument)
  * more control over what to walk in the netfilter hooks?

- different timeouts for each virtual server (more control over the
  connection timeouts)

- allow LVS to be used as NAT router/balancer for outgoing traffic

CURRENT STATE:

Running variants:

1. only LVS - the fastest
2. LVS + ipfw NAT
3. LVS + iptables NAT

Where LVS is placed (a registration sketch follows this section):

LOCAL_IN:100				ip_vs_in
FORWARD:99				ip_vs_forward_icmp
FORWARD:100				ip_vs_out
POST_ROUTING:NF_IP_PRI_NAT_SRC-1	ip_vs_post_routing

The chains:

The out->in LVS packets (for any forwarding method) walk:

pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING

	LOCAL_IN
		ip_vs_in -> ip_route_output/dst cache ->
		mark skb->nfcache with special bit value ->
		ip_send -> POST_ROUTING
	POST_ROUTING
		ip_vs_post_routing - check skb->nfcache and exit from
		the chain if our bit is set

The in->out LVS packets (for LVS/NAT) walk:

pre_routing -> FORWARD -> POST_ROUTING

	FORWARD (check for related ICMP):
		ip_vs_forward_icmp -> local delivery ->
		mark skb->nfcache -> POST_ROUTING
	FORWARD
		ip_vs_out -> NAT -> mark skb->nfcache -> NF_ACCEPT
	POST_ROUTING
		ip_vs_post_routing - check skb->nfcache and exit from
		the chain if our bit is set

Why LVS is placed there:

- LVS creates connections after the packets are marked, i.e. after
  PRE_ROUTING:MANGLE:-150 or PRE_ROUTING:FILTER:0. LVS can use the
  skb->nfmark as a virtual service ID.

- LVS must be after PRE_ROUTING:FILTER+1 (net/sched/sch_ingress.c) for
  QoS setups. This way the incoming traffic can be policed before it
  reaches LVS.

- LVS creates connections after the input routing because the routing
  can decide to deliver locally packets that are marked, or other
  packets specified with routing rules. Transparent proxying handled
  from the netfilter NAT code is not always a good solution.

- LVS needs to forward packets without looking at the IP header
  (direct routing method), so calling ip_route_input with arguments
  taken from the IP header only is not useful for LVS.

- LVS is after any firewall rules in LOCAL_IN and FORWARD.


*** Requirements for the PRE_ROUTING chain ***

Sorry, we can't waste time here. The netfilter connection tracking can
mangle packets here, and at this point we don't know whether a packet
is for one of our virtual services (a new connection) or for an
existing connection (which requires a lookup in the LVS connection
table). We are sure that we can't decide whether to create new
connections at this place, but a lookup for existing connections is
possible under some conditions: the packets must be defragmented, etc.
There are many nice modules in this chain that can feed LVS with
(possibly modified) packets.
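
As a concrete illustration of the placement table above, here is a
minimal registration sketch assuming the Linux 2.4 netfilter API. Only
the hook points and priorities are taken from this document; the
handler bodies are elided:

	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>

	/* the four handlers, all with the nf_hookfn signature:
	 * unsigned int fn(unsigned int hooknum, struct sk_buff **skb_p,
	 *		   const struct net_device *in,
	 *		   const struct net_device *out,
	 *		   int (*okfn)(struct sk_buff *));
	 * (bodies elided) */
	static nf_hookfn ip_vs_in, ip_vs_forward_icmp, ip_vs_out,
			 ip_vs_post_routing;

	/* after LOCAL_IN:FILTER=0, i.e. after the firewall rules */
	static struct nf_hook_ops ip_vs_in_ops = {
		{ NULL, NULL }, ip_vs_in, PF_INET, NF_IP_LOCAL_IN, 100
	};
	/* related ICMP, just before ip_vs_out */
	static struct nf_hook_ops ip_vs_forward_icmp_ops = {
		{ NULL, NULL }, ip_vs_forward_icmp, PF_INET,
		NF_IP_FORWARD, 99
	};
	/* in->out LVS/NAT packets, after FORWARD:FILTER=0 */
	static struct nf_hook_ops ip_vs_out_ops = {
		{ NULL, NULL }, ip_vs_out, PF_INET, NF_IP_FORWARD, 100
	};
	/* first in POST_ROUTING, just before netfilter's NAT_SRC */
	static struct nf_hook_ops ip_vs_post_routing_ops = {
		{ NULL, NULL }, ip_vs_post_routing, PF_INET,
		NF_IP_POST_ROUTING, NF_IP_PRI_NAT_SRC - 1
	};

	/* module init then calls nf_register_hook() on each ops */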
*** Requirements for the LOCAL_IN chain ***

The conditions when the sk_buff comes:

- ip_local_deliver() defragments the packets (ip_defrag) for us
- the incoming sk_buff can be non-linear
- when the sk_buff comes in, only read access is guaranteed

What we do:

- packets generated locally are not considered because there is no
  known forwarding method that can establish connections initiated
  from the director.

- only TCP, UDP and the ICMP packets related to them are considered

- the protocol header must be present before doing any work based on
  fields from the IP or protocol header.

- we detect here packets for the virtual services or packets for
  existing connections, and then the transmitter function for the used
  forwarding method is called

- the NAT transmitter performs the following actions:

  We try to optimize for most of the traffic we see: the normal
  traffic that is not bound to any application helper, i.e. when the
  data part (payload) of the packets is not written or even not read
  at all. In that case we change the addresses and the ports in the IP
  and protocol headers, but we don't do any checksum checking for
  them. We perform an incremental checksum update after the packet is
  mangled and rely on the real server to perform the full check
  (headers and payload); a sketch of this incremental update follows
  this section.

  If the connection is bound to some application helper (FTP, for
  example) we always perform checksum checking, on the assumption that
  the data is usually changed, and on the additional assumption that
  the traffic using application helpers is low. To perform such a
  check, the whole payload must be present in the provided sk_buff.
  For this, we call functions that linearize the sk_buff data by
  assembling all its data fragments.

  Before the addresses and ports are changed we must have write access
  to the packet data (headers and payload). This guarantees that any
  other readers see the packet data unchanged. The copy-on-write is
  performed by the linearization function for packets consisting of
  many fragments. For all other packets we must copy the packet data
  (headers and payload) if it is used by someone else (the sk_buff was
  cloned). The packets not bound to application helpers need such
  write access only for the first fragment, because for them only the
  IP and protocol headers are changed, and we guarantee that these are
  in the first fragment. For the packets using application helpers the
  linearization is already done and we are sure that there is only one
  fragment. As a result, we need write access (copy if cloned) only
  for the first fragment. After the application helper is called to
  update the packet data, we perform a full checksum calculation.

- the DR transmitter performs the following actions:

  Nothing special; it may be the shortest function. The only action is
  to reroute the packet to the bound real server. If the packet is
  fragmented, ip_send_check() should be called to refresh the
  checksum.

- the TUN transmitter performs the following actions:

  Copies the packet if it is already referenced by someone else or
  when there is no space for the IPIP prefix header. The packet is
  rerouted to the real server. If the packet is fragmented,
  ip_send_check() should be called to refresh the checksum in the old
  IP header.

- if the packets must leave the box we send them to POST_ROUTING via
  ip_send and return NF_STOLEN. This means that we remove the packet
  from the LOCAL_IN chain before it reaches priority LAST-1. The
  LocalNode feature just returns NF_ACCEPT without mangling the
  packet.

In this chain, if a packet is for an LVS connection (even a newly
created one), LVS calls ip_route_output (or uses a destination cache),
marks the packet as LVS property (sets a bit in skb->nfcache) and
calls ip_send() to jump to the POST_ROUTING chain. There our
ip_vs_post_routing hook must call the okfn for the packets with our
special nfcache bit value (is skb->nfcache used after the routing
calls? We rely on the fact that it is not) and return NF_STOLEN.

One side effect: LVS can forward packets even when ip_forward=0, but
only for the DR and TUN methods. For these methods not even the TTL is
decremented, nor is the data checksum checked.
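
To make the incremental checksum update concrete, here is a minimal
sketch in the spirit of the IPVS helpers (RFC 1624 style folding).
Treat the helper name and the call site as illustrative, not as a
copy of the sources:

	#include <net/checksum.h>	/* csum_partial, csum_fold */

	/* Update a 16-bit checksum when a 32-bit field changes,
	 * without touching the payload.  The caller passes the
	 * complemented old value and the plain new value. */
	static inline __u16 ip_vs_check_diff(__u32 old, __u32 new,
					     __u16 oldsum)
	{
		__u32 diff[2] = { old, new };

		/* sum ~old + new into the complemented old checksum,
		 * then fold back to 16 bits and complement */
		return csum_fold(csum_partial((char *) diff,
					      sizeof(diff),
					      oldsum ^ 0xFFFF));
	}

	/* e.g. after rewriting the destination address of a packet
	 * (old_daddr/new_daddr are illustrative locals):
	 *
	 *	tcph->check = ip_vs_check_diff(~old_daddr, new_daddr,
	 *				       tcph->check);
	 */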
*** Requirements for the FORWARD chain ***

LVS first checks for ICMP packets related to TCP or UDP connections.
Such packets are handled as if they were received in the LOCAL_IN
chain - they are locally delivered. Used for transparent proxy setups.

LVS looks in this chain for in->out packets, but only for the LVS/NAT
method. In any case, new connections are not created here; the lookup
is for existing connections only.

In this chain the ip_vs_out function can be called from many places:

FORWARD:0 - the ipfw compat mode calls ip_vs_out between the forward
firewall and the masquerading. This way LVS can grab the outgoing
packets for its connections and keep them away from netfilter's NAT
code.

FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here
twice if the ipfw compat module is used, because ip_vs_out is called
once from FORWARD:0 (fw_in) and after that from priority 100, where
LVS always registers the ip_vs_out function. We detect this second
call by looking at the skb->nfcache bit value: if the bit is set we
return NF_ACCEPT. In fact, the second ip_vs_out call is avoided when
the first one returns NF_STOLEN after calling the okfn.

The actions we perform are the same as in the LOCAL_IN chain for the
NAT transmitter, with the exception that we must call ip_defrag(). The
other difference is that we have write access to the first fragment
(it is not referenced by anyone else) after ip_forward() calls
skb_cow().

*** Requirements for the POST_ROUTING chain ***

LVS marks the packets for debugging; they appear to come from
LOCAL_OUT, but that chain is not traversed. The LVS requirements for
the POST_ROUTING chain include the fragmentation code only. But even
the ICMP messages are generated and mangled ready for sending long
before the POST_ROUTING chain: ip_send() does not call ip_fragment()
for the LVS packets because LVS returns ICMP_FRAG_NEEDED when the mtu
is shorter. LVS makes MTU checks when accepting packets and selecting
the output device. So the ip_refrag POST_ROUTING hook is not used by
LVS.

The result is: LVS must hook POST_ROUTING first (maybe only after the
ipfw compat filter) and return NF_STOLEN for its packets (detected by
checking the special skb->nfcache bit value).
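
A minimal sketch of both sides of this nfcache handshake, assuming a
dedicated bit such as NFC_IPVS_PROPERTY (the bit name follows the
IPVS sources; the helper name and the rest are illustrative):

	#include <linux/skbuff.h>
	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>
	#include <net/ip.h>		/* ip_send() */
	#include <net/ip_vs.h>		/* NFC_IPVS_PROPERTY (assumed
					 * location of the bit) */

	/* LOCAL_IN/FORWARD side: the transmitter has already set
	 * skb->dst (ip_route_output or the dst cache); mark the skb
	 * as LVS property and jump to POST_ROUTING via ip_send().
	 * The hook returns this value, stealing the packet. */
	static int ip_vs_skb_send(struct sk_buff *skb)
	{
		skb->nfcache |= NFC_IPVS_PROPERTY;
		ip_send(skb);		/* enters POST_ROUTING */
		return NF_STOLEN;	/* removed from this chain */
	}

	/* POST_ROUTING side: steal our packets from the chain so the
	 * later hooks (netfilter's NAT_SRC, ip_refrag) never see
	 * them; everything else passes untouched. */
	static unsigned int
	ip_vs_post_routing(unsigned int hooknum, struct sk_buff **skb_p,
			   const struct net_device *in,
			   const struct net_device *out,
			   int (*okfn)(struct sk_buff *))
	{
		if (!((*skb_p)->nfcache & NFC_IPVS_PROPERTY))
			return NF_ACCEPT;	/* not ours */

		(*okfn)(*skb_p);	/* transmit the packet */
		return NF_STOLEN;	/* exit from the chain */
	}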
The Netfilter hooks:

Priorities:
	NF_IP_PRI_FIRST = INT_MIN,
	NF_IP_PRI_CONNTRACK = -200,
	NF_IP_PRI_MANGLE = -150,
	NF_IP_PRI_NAT_DST = -100,
	NF_IP_PRI_FILTER = 0,
	NF_IP_PRI_NAT_SRC = 100,
	NF_IP_PRI_LAST = INT_MAX,

PRE_ROUTING (ip_input.c:ip_rcv):
	CONNTRACK=-200,	ip_conntrack_core.c:ip_conntrack_in
	MANGLE=-150,	iptable_mangle.c:ipt_hook
	NAT_DST=-100,	ip_nat_standalone.c:ip_nat_fn
	FILTER=0,	ip_fw_compat.c:fw_in, defrag, firewall, demasq,
			redirect
	FILTER+1=1,	net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
	FILTER=0,	iptable_filter.c:ipt_hook
	LVS=100,	ip_vs_in
	LAST-1,		ip_fw_compat.c:fw_confirm
	CONNTRACK=LAST-1, ip_conntrack_standalone.c:ip_confirm

FORWARD (ip_forward.c:ip_forward):
	FILTER=0,	iptable_filter.c:ipt_hook
	FILTER=0,	ip_fw_compat.c:fw_in, firewall,
			LVS:check_for_ip_vs_out, masquerade
	LVS=99,		ip_vs_forward_icmp
	LVS=100,	ip_vs_out

LOCAL_OUT (ip_output.c):
	CONNTRACK=-200,	ip_conntrack_standalone.c:ip_conntrack_local
	MANGLE=-150,	iptable_mangle.c:ipt_local_out_hook
	NAT_DST=-100,	ip_nat_standalone.c:ip_nat_local_fn
	FILTER=0,	iptable_filter.c:ipt_local_out_hook

POST_ROUTING (ip_output.c:ip_finish_output):
	FILTER=0,	ip_fw_compat.c:fw_in, firewall, unredirect,
			mangle ICMP replies
	LVS=NAT_SRC-1,	ip_vs_post_routing
	NAT_SRC=100,	ip_nat_standalone.c:ip_nat_out
	CONNTRACK=LAST,	ip_conntrack_standalone.c:ip_refrag

CONNTRACK:	PRE_ROUTING, LOCAL_IN, LOCAL_OUT, POST_ROUTING
FILTER:		LOCAL_IN, FORWARD, LOCAL_OUT
MANGLE:		PRE_ROUTING, LOCAL_OUT
NAT:		PRE_ROUTING, LOCAL_OUT, POST_ROUTING
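
For reference, the priority values above as they appear in the 2.4
headers (the enum is quoted from include/linux/netfilter_ipv4.h; the
comments marking the LVS insertion points are taken from the tables in
this document):

	enum nf_ip_hook_priorities {
		NF_IP_PRI_FIRST = INT_MIN,
		NF_IP_PRI_CONNTRACK = -200,
		NF_IP_PRI_MANGLE = -150,
		NF_IP_PRI_NAT_DST = -100,
		NF_IP_PRI_FILTER = 0,
		/* LOCAL_IN: ip_vs_in at 100
		 * FORWARD:  ip_vs_forward_icmp at 99, ip_vs_out at 100
		 * POST_ROUTING: ip_vs_post_routing at NAT_SRC-1 */
		NF_IP_PRI_NAT_SRC = 100,
		NF_IP_PRI_LAST = INT_MAX,
	};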