Network Programming Series (2): A Comparison of Socket I/O Models

Principle 13: Understand deeply how your server works, from the hardware and the operating system up to the programming language you use. Optimizing the efficiency of I/O operations is the first task of a well-performing architecture.

Principle 15: If your design is an event-driven, non-blocking architecture, never block the threads or perform I/O operations in them. Once you do, the system will be as slow as a snail.

This article introduces and compares the common socket I/O models:

Comparison of the five I/O models

Synchronous Blocking I/O

By default, a socket uses the synchronous blocking I/O model: the process blocks while waiting for new connections or new data to arrive:

Blocking I/O model

There are two common implementations:

Single-Threaded Model

Single Thread

Pros:

  • Simple to implement.

Cons:

  • Connections are handled first-come-first-served (FCFS), which is inefficient: only one connection from the backlog is processed at a time, and the next one is not served until the peer calls close() and sends EOF. If a connection is never closed, the main thread stays blocked waiting for its next data to arrive, which causes:
    • Long response times and a poor client experience, because the main thread cannot handle newly arrived connections in time;
    • Low server CPU utilization, because most of the time is likely spent blocked waiting:

Single Thread blocking

Sample Code

https://github.com/qidawu/cpp-data-structure/blob/main/socket/tcpIterativeServer.c

Flowchart

The server workflow for synchronous blocking I/O with a single thread:

Blocking I/O workflow

Multi-Threaded Model

Also known as thread-based architecture, or the thread-per-request model.

Multi Thread

Pros:

  • Threads do not interfere with one another; a slow connection no longer blocks the rest

Cons:

  • Only suitable for scenarios with moderate concurrency; prefer a thread pool, because otherwise:
    • Thread creation and destruction are expensive
    • Each thread consumes memory (e.g. the JVM's -Xss stack size)
    • The number of threads has an upper limit (ulimit -u)
    • Frequent context switching between threads is costly
  • Each worker thread still blocks waiting for I/O, so CPU utilization remains low:

Multi Thread blocking

Synchronous Non-blocking I/O

There are several ways to switch a socket to the synchronous non-blocking I/O model (e.g. fcntl() with O_NONBLOCK). Once set, the program must busy-poll while waiting for new connections and new data to arrive:

Then all operations that would block will (usually) return with EAGAIN or EWOULDBLOCK (operation should be retried later); connect(2) will return EINPROGRESS error. The user can then wait for various events via poll(2) or select(2).

Nonblocking I/O model

Pros:

  • Solves BIO's blocking problem without having to spawn many threads

Cons:

  • Busy-waiting wastes CPU: the main thread must poll continuously to check whether a new connection has arrived (so it can accept()), or whether any established connection has new data (so it can recv()).

Sample Code

Flowchart

The server workflow for synchronous non-blocking I/O with a single thread:

Nonblocking I/O workflow

I/O Multiplexing

What is multiplexing?

In telecommunications and computer networking, multiplexing (sometimes contracted to muxing) is a method by which multiple analog or digital signals are combined into one signal over a shared medium. The aim is to share a scarce resource – a physical transmission medium.

A device that performs the multiplexing is called a multiplexer (MUX), and a device that performs the reverse process is called a demultiplexer (DEMUX or DMX).

In computing, I/O multiplexing can also be used to refer to the concept of processing multiple input/output events from a single event loop, with system calls like poll[1] and select (Unix).[2]

To reduce the CPU cost of busy polling, I/O multiplexing can be used to monitor multiple file descriptors at once:

I/O multiplexing API

Once one or more of the file descriptors become ready, the program is notified to perform the corresponding operations:

I/O multiplexing model


Pros:

  • Reduces the CPU cost of blind polling: instead of n system calls (one accept()/recv() per descriptor), a single I/O-multiplexing system call can check the readiness of many file descriptors at once.

The I/O multiplexing APIs available on each platform:

The classical functions for I/O multiplexing are select and poll. Due to several limitations, modern operating systems provide advanced (non-portable) variants to these:

A comparison of the I/O multiplexing APIs:

| API | Introduced | POSIX standard | Ease of use | Monitored events | Monitorable file descriptors | Performance |
|---|---|---|---|---|---|---|
| select() | 1980s | Yes | Low (the fd_set must be re-initialized before every select() call) | readfds / writefds / exceptfds | < 1024 | Low (linear scan) |
| poll() | 1997 | Yes | - | POLLIN / POLLOUT / POLLPRI / POLLERR / POLLHUP / POLLRDHUP | ≥ 1024 | Low (linear scan) |
| epoll | 2002 (Linux 2.5.44) | No (Linux-specific) | - | EPOLLIN / EPOLLOUT / EPOLLPRI / EPOLLERR / EPOLLHUP / EPOLLRDHUP | ≥ 1024 | High (red-black tree + doubly linked list) |

select()

https://man7.org/linux/man-pages/man2/select.2.html

synchronous I/O multiplexing

SYNOPSIS

```c
#include <sys/select.h>

// Return value:
//   On success: the number of file descriptors contained in the three returned descriptor sets
//   0 if the timeout expired before any file descriptors became ready
//   On error: -1, and errno is set to indicate the error
int select(int nfds,
           fd_set *readfds,
           fd_set *writefds,
           fd_set *exceptfds,
           struct timeval *timeout);
```

DESCRIPTION

⚠️WARNING: select() can monitor only file descriptors numbers that are less than FD_SETSIZE (1024)—an unreasonably low limit for many modern applications—and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.

select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become “ready” for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.

Four macros are provided to manipulate the sets:

```c
// clears a set
void FD_ZERO(fd_set *set);
// adds a given file descriptor to a set
void FD_SET(int fd, fd_set *set);
// removes a given file descriptor from a set
void FD_CLR(int fd, fd_set *set);
// tests whether a file descriptor is part of the set;
// this is useful after select() returns
int FD_ISSET(int fd, fd_set *set);
```

Sample Code

For sample code, see:

https://man7.org/linux/man-pages/man2/select.2.html#EXAMPLES

<Example: Nonblocking I/O and select() | IBM>

Flowchart

The server workflow:

select workflow

poll()

https://man7.org/linux/man-pages/man2/poll.2.html

wait for some event on a file descriptor

SYNOPSIS

```c
#include <poll.h>

int poll(struct pollfd *fds, nfds_t nfds, int timeout);
```

DESCRIPTION

poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The Linux-specific epoll(7) API performs a similar task, but offers features beyond those found in poll().

Sample Code

For sample code, see:

https://man7.org/linux/man-pages/man2/poll.2.html#EXAMPLES

<Using poll() instead of select() | IBM>

Flowchart

epoll

https://man7.org/linux/man-pages/man7/epoll.7.html

/proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)

This specifies a limit on the total number of file descriptors that a user can register across all epoll instances on the system.

API

epoll_create

https://man7.org/linux/man-pages/man2/epoll_create.2.html

open an epoll file descriptor

```c
#include <sys/epoll.h>

// On success, returns a file descriptor (a nonnegative integer).
// On error, returns -1, and errno is set to indicate the error.
int epoll_create(int size);
int epoll_create1(int flags);
```

Usage: create a new epoll instance and obtain a file descriptor referring to that instance.

```c
int epfd = epoll_create1(0);
```

epoll_ctl

https://man7.org/linux/man-pages/man2/epoll_ctl.2.html

control interface for an epoll file descriptor

```c
#include <sys/epoll.h>

// On success, returns 0.
// On error, returns -1, and errno is set to indicate the error.
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
```

The op argument supports three operations:

| op | Description |
|---|---|
| EPOLL_CTL_ADD | Add a file descriptor to the interest list |
| EPOLL_CTL_DEL | Remove a file descriptor from the interest list |
| EPOLL_CTL_MOD | Change the settings associated with a monitored file descriptor |

The event argument (see struct epoll_event) describes the file descriptor fd. Its events member combines two kinds of bits: event types and input flags.

Commonly used event types:

| Event type | Description |
|---|---|
| EPOLLIN | The associated file is available for read(2) operations. |
| EPOLLOUT | The associated file is available for write(2) operations. |
| EPOLLERR | Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed. epoll_wait(2) will always report for this event; it is not necessary to set it in events when calling epoll_ctl(). |
| EPOLLHUP | Hang up happened on the associated file descriptor. epoll_wait(2) will always wait for this event; it is not necessary to set it in events when calling epoll_ctl(). |

Commonly used input flags:

| Input flag | Description |
|---|---|
| EPOLLET | Requests edge-triggered notification for the associated file descriptor. The default behavior for epoll is level-triggered. See epoll(7) for more detailed information about edge-triggered and level-triggered notification. |

Usage: add items to the interest list of the epoll instance.

```c
struct epoll_event event;
event.data.fd = listenfd;
event.events = EPOLLIN | EPOLLET; // edge-triggered mode
int s = epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &event);
```

epoll_wait

https://man7.org/linux/man-pages/man2/epoll_wait.2.html

wait for an I/O event on an epoll file descriptor

```c
#include <sys/epoll.h>

// Return value:
//   On success: the number of file descriptors ready for the requested I/O operation,
//   or 0 if no file descriptor became ready during the requested timeout milliseconds.
//   On error: -1, and errno is set to indicate the error.
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```

The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available.

A call to epoll_wait() will block until either:

  • a file descriptor delivers an event;
  • the call is interrupted by a signal handler; or
  • the timeout expires.

Usage: fetch items from the ready list of the epoll instance.

```c
struct epoll_event events[MAXEVENTS];

/* The event loop */
while (1)
{
    int n, i;

    n = epoll_wait(epfd, events, MAXEVENTS, -1);
    for (i = 0; i < n; i++)
    {
        if (listenfd == events[i].data.fd) {
            /* We have a notification on the listening socket, which
               means one or more incoming connections. */
            while (1) {
                // accept() a new connection (loop until it returns EAGAIN)
                // epoll_ctl() to add it to the *interest* list
            }
        }
        else if (events[i].events & EPOLLIN) {
            /* We have data on the fd waiting to be read. Read and
               display it. We must read whatever data is available
               completely, as we are running in edge-triggered mode
               and won't get a notification again for the same
               data. */
            while (1) {
                // recv() (loop until it returns EAGAIN)
            }
        }
    }
}
```

Sample Code

For sample code, see: 《如何使用 epoll? 一个 C 语言实例》

Under the Hood

An epoll instance consists of two parts:

  • The interest list: the set of monitored file descriptors, added/removed/modified via epoll_ctl(); implemented as a red-black tree.
  • The ready list: the set of ready file descriptors, returned by epoll_wait(); implemented as a doubly linked list.

The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists:

  • The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.
  • The ready list: the set of file descriptors that are “ready” for I/O. The ready list is a subset of (or, more precisely, a set of references to) the file descriptors in the interest list. The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.

The eventpoll structure

This design significantly improves the system's CPU utilization when only a few of a large number of concurrent connections are active: when epoll_wait() fetches events, it does not have to traverse the whole monitored set (the interest list); it only traverses the file descriptors that kernel I/O events have asynchronously moved onto the ready list.


Flowchart

The server workflow:

epoll workflow

More abstractly, I/O multiplexing makes it possible to implement the single-threaded event loop model:

event loop

From the perspective of the client and the server, the overall interaction flow is as follows:

epoll workflow

Use Cases

Netty, and the frameworks and middleware built on it:

  • Spring WebFlux, Dubbo, Vert.x
  • Apache Kafka, RocketMQ, Elasticsearch, ZooKeeper

Also Redis, Nginx, Node.js, …

The strace command shows which system calls an application makes. For example, to see how the Redis server uses the epoll API:

epoll of redis

《Redis 是如何建立连接和处理命令的》 | Alibaba Tech

References

I/O multiplexing

Event loop

Reactor

http://www.kegel.com/c10k.html

I/O 多路复用系列

Linux 高性能服务 epoll 的本质,真的不简单 | 良许 Linux

从 Linux 源码角度看 epoll,透过现象看本质 | 良许 Linux

《Netty 权威指南》李林峰