Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha.

Goals To help you implement your customized stack by identifying key points of the code structure To point out some tricks and optimizations that evolved after 4.3BSD and that are part of Linux TCP/IP code

TCP/IP source code /usr/src/linux/net/ All relative pathnames in this document are relative to /usr/src/linux/ cross-references all the Linux kernel code You can install and run it locally; I haven’t tried

The various layers (yawn…) IP TCP/UDP INET socket BSD socket Appletalk IPX (Physical) (Link)

Address families supported (include/linux/socket.h)
UNIX – Unix domain sockets
INET – TCP/IP
AX25 – Amateur radio
IPX – Novell IPX
APPLETALK – Appletalk
X25 – X.25
More; about 24 in all

Setting things up – socket side How the INET address family registers itself with the BSD socket layer

struct socket (BSD socket)
short type – SOCK_DGRAM, SOCK_STREAM
struct proto_ops *ops – TCP/UDP operations for this socket; bind, close, read, write etc.
struct inode *inode – the file inode associated with this socket
struct sock *sk – the INET socket associated with this socket

BSD socket INET socket? Operations to use? (How to create socket?) No connections

struct sock (INET socket)
struct socket *socket – associated BSD socket
struct sock *next, **pprev – socks are in linked lists
struct dst_entry *dst_cache – pointer to the route cache entry used by this socket
struct sk_buff_head *receive_queue – head of the receive queue
struct sk_buff_head *write_queue – head of the send queue

struct sock continued
__u32 daddr – foreign IP address
__u32 rcv_saddr – bound local IP address
__u16 dport – destination port
unsigned short num – local port
struct proto *prot – contains TCP/UDP specific operations (repetition with struct socket's ops field)
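A minimal user-space sketch of how the two structures point at each other. The toy_ names are hypothetical stand-ins, not the kernel definitions, and only a handful of the fields listed above are kept.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_sock_type { TOY_SOCK_STREAM = 1, TOY_SOCK_DGRAM = 2 };

struct toy_sock;                     /* INET socket, defined below */

/* Simplified stand-in for struct socket (the BSD socket). */
struct toy_socket {
    short type;                      /* TOY_SOCK_STREAM or TOY_SOCK_DGRAM */
    struct toy_sock *sk;             /* INET socket behind this BSD socket */
};

/* Simplified stand-in for struct sock (the INET socket). */
struct toy_sock {
    struct toy_socket *socket;       /* back pointer to the BSD socket */
    uint32_t daddr;                  /* foreign IP address */
    uint32_t rcv_saddr;              /* bound local IP address */
    uint16_t dport;                  /* destination (foreign) port */
    unsigned short num;              /* local port */
};

/* Link the two halves so that each can reach the other. */
static void toy_attach(struct toy_socket *sock, struct toy_sock *sk)
{
    sock->sk = sk;
    sk->socket = sock;
}

/* Does a segment from (saddr, sport) to local port lport belong to sk? */
static int toy_sock_matches(const struct toy_sock *sk, uint32_t saddr,
                            uint16_t sport, unsigned short lport)
{
    return sk->daddr == saddr && sk->dport == sport && sk->num == lport;
}
```

The matching helper is only an illustration of why the address and port fields live in the INET socket; the real lookup is more involved.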

INET socket Reaching transport layer? BSD socket? No connections

protocols vector Array of struct net_proto, which has name, say INET, UNIX, IPX, etc initialization function, say inet_proto_init This protocols array is static in net/protocols.c This file uses conditional compilation to include protocols as chosen in make config

inet_proto_init protocols vector is traversed at system init time, and each init function called Each of these protocol init functions registers itself with BSD sockets by giving its name and socket create function Where does the BSD socket layer store this information?

net_families BSD socket layer stores info for each registering protocol in this array This is an array of struct net_proto_family, which is
int family
int (*create)(struct socket *sock, int protocol)

BSD socket layer now has
INET – inet_create()
IPX – ipx_create()
UNIX – unix_create()

So in socket() call
BSD socket layer looks for specified address family, say INET
BSD socket layer calls create function for that family, say inet_create()
inet_create() does switch (BSD_socket->type)
case SOCK_DGRAM: fill BSD_socket->ops with UDP operations
case SOCK_STREAM: fill BSD_socket->ops with TCP operations
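The registration and dispatch steps above can be sketched in user space roughly as follows. Everything prefixed toy_ is a hypothetical stand-in: the table size, the family numbers and the one-field ops structure are assumptions made for illustration, not the kernel's definitions.

```c
#include <stddef.h>

#define TOY_NPROTO 8
enum { TOY_AF_UNIX = 1, TOY_AF_INET = 2, TOY_AF_IPX = 4 };
enum { TOY_SOCK_STREAM = 1, TOY_SOCK_DGRAM = 2 };

struct toy_proto_ops { const char *name; };   /* stands in for bind, read, ... */

struct toy_socket {
    short type;
    const struct toy_proto_ops *ops;
};

struct toy_net_proto_family {
    int family;
    int (*create)(struct toy_socket *sock, int protocol);
};

/* One slot per address family, filled in as families register. */
static const struct toy_net_proto_family *toy_net_families[TOY_NPROTO];

static const struct toy_proto_ops toy_tcp_ops = { "tcp" };
static const struct toy_proto_ops toy_udp_ops = { "udp" };

/* Called from a family's init function to announce its create function. */
static int toy_sock_register(const struct toy_net_proto_family *fam)
{
    if (fam->family < 0 || fam->family >= TOY_NPROTO)
        return -1;
    toy_net_families[fam->family] = fam;
    return 0;
}

/* The INET create function picks the operations from the socket type. */
static int toy_inet_create(struct toy_socket *sock, int protocol)
{
    (void)protocol;                  /* ignored in this sketch */
    switch (sock->type) {
    case TOY_SOCK_STREAM: sock->ops = &toy_tcp_ops; return 0;
    case TOY_SOCK_DGRAM:  sock->ops = &toy_udp_ops; return 0;
    default:              return -1;
    }
}

static const struct toy_net_proto_family toy_inet_family = {
    TOY_AF_INET, toy_inet_create
};

/* What the socket() call does: find the family, call its create function. */
static int toy_socket_call(struct toy_socket *sock, int family, int type,
                           int protocol)
{
    if (family < 0 || family >= TOY_NPROTO || !toy_net_families[family])
        return -1;                   /* family never registered */
    sock->type = (short)type;
    sock->ops = NULL;
    return toy_net_families[family]->create(sock, protocol);
}
```

A caller would first run toy_sock_register(&toy_inet_family), after which toy_socket_call() with TOY_AF_INET and TOY_SOCK_STREAM leaves the socket pointing at the TCP operations.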

Socket layer is satisfied BSD socket: AF_INET, SOCK_STREAM INET socket TCP’s proto_ops Write queue Receive queue Lots of other TCP data

Reaching sockets through file descriptors Per process file table > inode > BSD socket etc. Not describing here

Setting things up – device side How network interfaces come up and attach themselves to the stack

No connections Network interface card What is my name (since I don’t have a /dev file)? Give packets to whom?

struct device No device file for network devices Why? Design choice, probably because network devices “push” data Each interface is represented by a struct device All struct devices are chained and the chain head is called dev_base

struct device continued
char *name – say eth0
unsigned long base_addr – I/O address
unsigned int irq – IRQ number
struct device *next
int (*init)(struct device *dev)
int (*hard_start_xmit)(struct sk_buff *skb, struct device *dev) – transmission function

dev_base drivers/net/Space.c cleverly threads struct devices for all possible interfaces into a list starting at dev_base (static data structure declaration, no code execution yet) List includes limited number of devices of each type, i.e. eth0 to eth7 and no more possible

ethif_probe() For each of these 8 struct devices, names are eth0 to eth7 and init function is ethif_probe() During system init time the list of struct devices is traversed, and the init function called for each So ethif_probe() called for eth0; calls probe_list()

probe_list() probe_list() goes through a list of all ethernet devices the system has drivers for The probe function for each driver is called, and if success, assign proper function pointers from the driver code to this struct device (ethx) if failure, no more eth devices exist, remove this struct device from the list and return
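A rough user-space sketch of the traversal described above: a statically threaded device list whose init functions are called in turn, with devices unlinked when the probe fails. All toy_ names are hypothetical, the list has three slots instead of eight, and the "hardware" is just a counter pretending that two cards are installed.

```c
#include <stddef.h>

struct toy_device {
    const char *name;
    int (*init)(struct toy_device *dev);        /* 0 if hardware was found */
    int (*hard_start_xmit)(struct toy_device *dev);
    struct toy_device *next;
};

static int toy_cards_present = 2;   /* assumption: two cards are installed */
static int toy_cards_claimed;

/* Stand-in for a driver's transmit function. */
static int toy_xmit(struct toy_device *dev) { (void)dev; return 0; }

/* Stand-in for ethif_probe(): claim the next unclaimed card, if any. */
static int toy_ethif_probe(struct toy_device *dev)
{
    if (toy_cards_claimed >= toy_cards_present)
        return -1;                   /* no more eth devices exist */
    toy_cards_claimed++;
    dev->hard_start_xmit = toy_xmit; /* take function pointers from driver */
    return 0;
}

/* Static declarations only, as in Space.c: no code has run yet. */
static struct toy_device toy_eth2 = { "eth2", toy_ethif_probe, NULL, NULL };
static struct toy_device toy_eth1 = { "eth1", toy_ethif_probe, NULL, &toy_eth2 };
static struct toy_device toy_eth0 = { "eth0", toy_ethif_probe, NULL, &toy_eth1 };
static struct toy_device *toy_dev_base = &toy_eth0;

/* Walk the list at init time; unlink every device whose probe fails.
 * Returns the number of devices left on the list. */
static int toy_net_dev_init(void)
{
    struct toy_device **dp = &toy_dev_base;
    int kept = 0;

    while (*dp) {
        struct toy_device *dev = *dp;
        if (dev->init && dev->init(dev) != 0) {
            *dp = dev->next;         /* remove this struct from the list */
        } else {
            dp = &dev->next;
            kept++;
        }
    }
    return kept;
}
```

Walking with a pointer to the link (dp) rather than a pointer to the node lets a failed device be unlinked without tracking a separate "previous" pointer.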

After all devices in Space.c traversed through dev_base: lo0; eth0, 3Com card (functions from 3com driver); eth1, HP card (functions from HP driver) Give packets to whom?

Modularized driver Much simpler, because the driver’s probe is executed at module load time If it finds a device, it appends a struct device to the end of the dev_base list

backlog queue Very very distinct from socket listen backlog queue! Systemwide queue that interfaces immediately drop packets onto Device driver writers simply call netif_rx(), which does the actual queueing

Link layer is satisfied dev_base: lo0; eth0, 3Com card (functions from 3com driver); eth1, HP card (functions from HP driver) All interfaces feed the backlog queue

Setting things up – between link and network layers How packets reach the correct protocol stack

No connections backlog queue IP? ARP? IPX? BOOTP? Who takes packets off the backlog queue? Who gets these packets?

net_bh() Bottom-half handler for the network interrupt Executes when the network interrupt is not masked So the fast handler (actual ISR) is driver code that calls netif_rx() to queue the packet onto the backlog queue, and marks net_bh() for execution net_bh() takes packets off the backlog and passes them to the protocol specified in the ethernet header

ptype_base ptype_base is the head of a list of possible packet types the link layer may receive (IP, ARP, IPX, BOOTP, etc.) that the system can handle How is it built? For every protocol in the protocols vector, when its init function is called (inet_proto_init), it calls functions like ip_init(), tcp_init() and arp_init()

dev_add_pack completes the picture Those subprotocols interested in registering a packet type (IP, ARP), get their init functions (ip_init(), arp_init()) to call dev_add_pack(), specifying a handler function This adds the packet type to ptype_base So net_bh( ) hands off packets to the right protocol stack
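The hand-off from the backlog queue to the registered packet types can be sketched in user space as below. This is an illustration under simplifying assumptions: the toy_ names are hypothetical, the backlog limit of 4 is arbitrary, there is no locking or real interrupt context, and a packet carries only its protocol number.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_P_IP    0x0800
#define TOY_ETH_P_ARP   0x0806
#define TOY_BACKLOG_MAX 4           /* arbitrary limit for this sketch */

struct toy_skb {
    uint16_t protocol;              /* type field from the ethernet header */
    struct toy_skb *next;
};

struct toy_packet_type {
    uint16_t type;
    int (*func)(struct toy_skb *skb);   /* handler, e.g. the IP receive routine */
    struct toy_packet_type *next;
};

static struct toy_packet_type *toy_ptype_base;
static struct toy_skb *toy_backlog_head, *toy_backlog_tail;
static int toy_backlog_len;
static int toy_bh_marked;

/* A protocol's init function registers its packet type here. */
static void toy_dev_add_pack(struct toy_packet_type *pt)
{
    pt->next = toy_ptype_base;
    toy_ptype_base = pt;
}

/* Called by the driver's fast handler: queue the packet, defer the work. */
static int toy_netif_rx(struct toy_skb *skb)
{
    if (toy_backlog_len >= TOY_BACKLOG_MAX)
        return -1;                  /* backlog full, packet dropped */
    skb->next = NULL;
    if (toy_backlog_tail)
        toy_backlog_tail->next = skb;
    else
        toy_backlog_head = skb;
    toy_backlog_tail = skb;
    toy_backlog_len++;
    toy_bh_marked = 1;              /* plays the role of mark_bh(NET_BH) */
    return 0;
}

/* Bottom half: drain the backlog, give each packet to the matching type.
 * Returns how many packets found a handler. */
static int toy_net_bh(void)
{
    int delivered = 0;

    toy_bh_marked = 0;
    while (toy_backlog_head) {
        struct toy_skb *skb = toy_backlog_head;
        struct toy_packet_type *pt;

        toy_backlog_head = skb->next;
        if (!toy_backlog_head)
            toy_backlog_tail = NULL;
        toy_backlog_len--;

        for (pt = toy_ptype_base; pt; pt = pt->next) {
            if (pt->type == skb->protocol) {
                pt->func(skb);
                delivered++;
                break;
            }
        }
    }
    return delivered;
}

/* Two stand-in handlers that only count what they receive. */
static int toy_ip_count, toy_arp_count;
static int toy_ip_rcv(struct toy_skb *skb)  { (void)skb; return ++toy_ip_count; }
static int toy_arp_rcv(struct toy_skb *skb) { (void)skb; return ++toy_arp_count; }
```

Packets whose type nobody registered are simply taken off the queue and ignored in this sketch.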

Setting things up – between network and transport layers How packets reach the correct transport protocol

inet_protos An array of transport layer protocols in INET Built at the time of inet_proto_init() By calling inet_add_protocol() for every transport protocol Registers handlers for transport protocols
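A small sketch of the same idea one layer up: transport handlers registered in a table and looked up by the protocol number from the IP header. The toy_ names, the table size of 32 and the hashing by low bits are assumptions for illustration; the handlers just return their protocol number.

```c
#include <stddef.h>

#define TOY_MAX_INET_PROTOS 32      /* assumed table size, a power of two */
#define TOY_IPPROTO_TCP 6
#define TOY_IPPROTO_UDP 17

struct toy_inet_protocol {
    int (*handler)(const unsigned char *payload, size_t len);
    unsigned char protocol;         /* protocol field of the IP header */
    struct toy_inet_protocol *next; /* chain for entries sharing a slot */
};

static struct toy_inet_protocol *toy_inet_protos[TOY_MAX_INET_PROTOS];

/* Called once per transport protocol while the INET family initializes. */
static void toy_inet_add_protocol(struct toy_inet_protocol *p)
{
    unsigned hash = p->protocol & (TOY_MAX_INET_PROTOS - 1);

    p->next = toy_inet_protos[hash];
    toy_inet_protos[hash] = p;
}

/* Local delivery: find the handler for this protocol number and call it. */
static int toy_ip_local_deliver(unsigned char protocol,
                                const unsigned char *payload, size_t len)
{
    unsigned hash = protocol & (TOY_MAX_INET_PROTOS - 1);
    struct toy_inet_protocol *p;

    for (p = toy_inet_protos[hash]; p; p = p->next)
        if (p->protocol == protocol)
            return p->handler(payload, len);
    return -1;                      /* no transport protocol registered */
}

/* Stand-in handlers. */
static int toy_tcp_rcv(const unsigned char *payload, size_t len)
{
    (void)payload; (void)len;
    return TOY_IPPROTO_TCP;
}

static int toy_udp_rcv(const unsigned char *payload, size_t len)
{
    (void)payload; (void)len;
    return TOY_IPPROTO_UDP;
}
```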

Packet movement through stack Transmission and reception, queues, interrupts

struct sk_buff Each packet that arrives on the wire is encased in a buffer called sk_buff An sk_buff is just the data with a lot of additional information about the packet There is a one-to-one relationship between packets and sk_buffs, i.e. one packet, one buffer sk_buffs can be allocated in multiples of 16 bytes

struct sk_buff continued INET sock queues are queues of sk_buffs Data coming from the socket calls are copied into sk_buffs Data arriving from the network is copied into sk_buffs (Figure: sk_buff with its fields)

struct sk_buff continued
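As a rough illustration of "one packet, one contiguous buffer", here is a user-space sketch of a buffer with four pointers (head, data, tail, end), where headroom is reserved up front so that lower layers can prepend their headers without copying. The toy_ names are hypothetical and the structure keeps none of the real sk_buff's bookkeeping fields.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_skb {
    unsigned char *head;   /* start of the allocated area */
    unsigned char *data;   /* start of the valid packet bytes */
    unsigned char *tail;   /* end of the valid packet bytes */
    unsigned char *end;    /* end of the allocated area */
};

/* Allocate one contiguous buffer of the given size; initially empty. */
static struct toy_skb *toy_alloc_skb(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return skb;
}

static void toy_kfree_skb(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}

/* Leave len bytes of headroom; only valid while the buffer is empty. */
static void toy_skb_reserve(struct toy_skb *skb, size_t len)
{
    skb->data += len;
    skb->tail += len;
}

/* Append len bytes at the tail; returns where to write them, or NULL. */
static unsigned char *toy_skb_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < len)
        return NULL;
    skb->tail += len;
    return old_tail;
}

/* Prepend len bytes (a header) in the headroom; returns the new start. */
static unsigned char *toy_skb_push(struct toy_skb *skb, size_t len)
{
    if ((size_t)(skb->data - skb->head) < len)
        return NULL;
    skb->data -= len;
    return skb->data;
}

static size_t toy_skb_len(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

For example, a sender could reserve 54 bytes (14 for ethernet plus 20 each for IP and TCP without options), put the payload, and then let each layer push its own header in front.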

Queues backlog queue INET sock queues TCP has a number of queues for out-of-order, connection backlog, error packets (?)

Packet reception Packet received by hardware Receive interrupt generated Driver handler copies data from hardware into fresh sk_buff Calls netif_rx() to queue on backlog Schedules net_bh() with mark_bh(NET_BH) net_bh() executes the next time the scheduler is run or a system call returns or a slow interrupt handler returns

Packet reception continued net_bh() tries to send any pending packets, then dequeues packets from the backlog and passes them to correct handler, say ip_rcv() ip_rcv() may call ip_local_deliver() or ip_forward() ip_local_deliver() results in call to tcp_v4_rcv() through the inet_protos list tcp_v4_rcv() queues data at the correct socket’s queue

Packet reception continued When the socket’s owner reads, tcp_recvmsg() is invoked through BSD socket’s proto_ops If instead the socket’s owner had blocked on a read, that process will be woken using wake_up (wait queue)

Packet transmission Quite different for TCP and UDP in terms of copying of user data to kernel space TCP does its own checksumming, while IP does checksumming for UDP. Why? Next section. net_bh() again takes care of flushing out packets that have piled up at the device’s queue

Tricks and optimizations TCP/IP enhancements, most due to Van Jacobson, arrived after 4.3BSD

Checksum and copy

Checksum and copy continued Linux goes over every byte of data only once (if the packet does not get fragmented) Uses checksum_and_copy() TCP data from socket gets filled into MSS-sized segments by TCP, so checksum-copying happens here
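A minimal sketch of the single-pass idea: each byte is read once, stored at the destination and added into the Internet checksum in the same loop. The function names are hypothetical, not the kernel's routines, and the code assumes that every call except the last one for a given packet is given an even length, so that 16-bit words never straddle two calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst while adding them, as big-endian 16-bit
 * words, to the running sum. A trailing odd byte is padded with zero.
 * The 32-bit accumulator is enough for the packet sizes considered here. */
static uint32_t toy_csum_and_copy(const unsigned char *src, unsigned char *dst,
                                  size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                   /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return sum;
}

/* Fold the carries back in and take the one's complement. */
static uint16_t toy_csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Because the running sum is passed in and returned, user data can be copied segment by segment while the checksum accumulates across the calls.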

Checksum and copy continued INET Socket ( struct sock ) write_queue User Buffer ( ubuff ) sk_buff structure partially used sk_buff newly allocated sk_buff

Checksum and copy continued UDP, on the other hand, does not stuff anything into MSS-sized buffers, so there is no need to copy data from user space at UDP layer UDP passes data and a callback function to IP IP copies this data into an sk_buff, using the callback function, which is a checksum_and_copy function Large ping replies from a Linux host arrive in reverse order of fragments! Why?

Why UDP fragmentation happens in reverse order (figure: a UDP datagram, UDP header first, split into three fragments) The last fragment leaves first, the partial checksum for its data calculated and remembered The middle fragment leaves second, its checksum added to the partial checksum The fragment carrying the UDP header leaves last, so that the final checksum can be written into the UDP header
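The ordering argument can be sketched as follows: walk the datagram from its tail fragment to its head fragment, accumulating a partial checksum as each piece is "sent", so that the total is known only when the piece that carries the UDP header is reached. This is a simplified illustration: the toy_ names are hypothetical, the buffer stands for the whole datagram, the UDP pseudo-header is left out, and the fragment size is assumed to be even.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_FRAGS 16

struct toy_frag_plan {
    size_t offset[TOY_MAX_FRAGS];   /* datagram offsets, in sending order */
    size_t count;
    uint16_t checksum;              /* known only after the last piece */
};

/* Add len bytes to a running sum as big-endian 16-bit words. */
static uint32_t toy_sum_words(const unsigned char *p, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)
        sum += (uint32_t)p[i] << 8; /* odd byte at the very end */
    return sum;
}

/* Plan the fragments of a datagram from the tail towards the head.
 * frag_size must be even and nonzero. Returns 0 on success, -1 otherwise. */
static int toy_fragment_reverse(const unsigned char *dgram, size_t len,
                                size_t frag_size, struct toy_frag_plan *plan)
{
    size_t nfrags, i;
    uint32_t sum = 0;

    if (len == 0 || frag_size == 0 || (frag_size & 1))
        return -1;
    nfrags = (len + frag_size - 1) / frag_size;
    if (nfrags > TOY_MAX_FRAGS)
        return -1;

    plan->count = 0;
    for (i = nfrags; i-- > 0; ) {
        size_t off = i * frag_size;
        size_t n = (len - off < frag_size) ? len - off : frag_size;

        /* This piece "leaves" now; remember its share of the checksum. */
        sum = toy_sum_words(dgram + off, n, sum);
        plan->offset[plan->count++] = off;
    }

    /* The piece at offset 0 went last, so the total is complete here. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    plan->checksum = (uint16_t)~sum;
    return 0;
}
```

The one's-complement sum does not depend on the order of the pieces; the order matters only because the header, which must contain the finished checksum, sits in the first piece.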

Fixed size buffer, sk_buff mbufs were potentially very clumsy “There is exactly one, contiguous, packet per pbuf (none of that mbuf chain stupidity).” Van Jacobson Allocation of fixed size buffers at the transport layer implies knowledge of network and link layer header sizes Linux is not shy of such indiscretions

Incremental checksum updates At every hop, TTL changes (is decremented) But IP checksum covers the header, and therefore the TTL also So it needs to be recalculated at every hop Linux does this in one step, by adjusting the old checksum instead of summing the whole header again RFCs 1071, 1141, 1624 discuss both checksum-and-copy and this incremental checksum update
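A sketch of the incremental update in the general form given in RFC 1624, HC' = ~(~HC + ~m + m'), applied to the 16-bit header word that holds the TTL and protocol fields. The function names are hypothetical and this is not the kernel's own routine, which is specialized for the TTL case; the byte offsets assume a 20-byte IPv4 header without options.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1624, equation 3: update checksum hc when one 16-bit word of the
 * covered data changes from m_old to m_new. */
static uint16_t toy_csum_update(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
    uint32_t sum = (uint16_t)~hc;

    sum += (uint16_t)~m_old;
    sum += m_new;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Full one's-complement checksum over a header, for comparison. When the
 * checksum field holds the correct value, the result is 0. */
static uint16_t toy_ip_header_csum(const unsigned char *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement the TTL (byte 8) and patch the checksum (bytes 10 and 11)
 * without touching the rest of the header. */
static void toy_decrement_ttl(unsigned char *hdr)
{
    uint16_t hc = (uint16_t)((hdr[10] << 8) | hdr[11]);
    uint16_t m_old = (uint16_t)((hdr[8] << 8) | hdr[9]);
    uint16_t m_new;

    hdr[8]--;
    m_new = (uint16_t)((hdr[8] << 8) | hdr[9]);
    hc = toy_csum_update(hc, m_old, m_new);
    hdr[10] = (unsigned char)(hc >> 8);
    hdr[11] = (unsigned char)(hc & 0xff);
}
```

After toy_decrement_ttl(), running toy_ip_header_csum() over the 20 header bytes should again give 0, which is how a receiver would verify the header.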

Cached hardware headers Routes cache hardware headers for quick construction of outgoing packets.