NIO Part 10: The Main Ideas Behind Implementing a NIO Server
Java NIO: Non-blocking Server
- Non-blocking Server - GitHub Repository
- Non-blocking IO Pipelines
- Non-blocking vs. Blocking IO Pipelines
- Basic Non-blocking IO Pipeline Design
- Reading Partial Messages
- Storing Partial Messages
- Writing Partial Messages
- Putting it All Together
- Server Thread Model
Even if you understand how the Java NIO non-blocking features work (Selector, Channel, Buffer etc.), designing a non-blocking server is still hard. Non-blocking IO contains several challenges compared to blocking IO. This non-blocking server tutorial will discuss the major challenges of non-blocking servers, and describe some potential solutions for them.
Finding good information about designing non-blocking servers is hard. Therefore the solutions provided in this tutorial are based on my own work and ideas. If you have some alternative or even better ideas, I would be happy to hear about them! You can write a comment under the article or send me an email (see our About page), or catch me on Twitter.
The ideas described in this tutorial are designed around Java NIO. However, I believe that the ideas can be reused in other languages as long as they have some kind of Selector-like construct. As far as I know, such constructs are provided by the underlying OS, so there is a good chance that you can get access to this in other languages too.
Non-blocking Server - GitHub Repository
I have created a simple proof-of-concept of the ideas presented in this tutorial and put it in a GitHub repository for you to look at. Here is the GitHub repository:
https://github.com/jjenkov/java-nio-server
Non-blocking IO Pipelines
A non-blocking IO pipeline is a chain of components that process non-blocking IO. This includes both reading and writing IO in a non-blocking fashion.
Here is an illustration of a simplified non-blocking IO pipeline:
[Image: non-blocking-server-1.png]
A component uses a Selector to check when a Channel has data to read. Then the component reads the input data and generates some output based on the input. The output is written to a Channel again.
A non-blocking IO pipeline does not need to both read and write data. Some pipelines may only read data, and some pipelines may only write data.
Netty's pipeline is a well-known example of this kind of design.
The above diagram only shows a single component. A non-blocking IO pipeline may have more than one component that processes incoming data. The length of a non-blocking IO pipeline depends on what the pipeline needs to do.
A non-blocking IO pipeline may also be reading from multiple Channels at the same time. For instance, reading data from multiple SocketChannels.
The flow of control in the above diagram is also simplified. It is the component that initiates the reading of data from the Channel via the Selector. It is not the Channel that pushes the data into the Selector and from there into the component, even if that is what the above diagram suggests.
Non-blocking vs. Blocking IO Pipelines
The biggest difference between a non-blocking and a blocking IO pipeline is how data is read from the underlying Channel (socket or file).
IO pipelines typically read data from some stream (from a socket or file) and split that data into coherent messages. This is similar to breaking a stream of data into tokens for parsing using a tokenizer. Instead, you break the stream of data into bigger messages. I will call the component that breaks the stream into messages a Message Reader.
Here is an illustration of a Message Reader breaking a stream into messages:
[Image: non-blocking-server-2.png]
A blocking IO pipeline can use an InputStream-like interface where one byte at a time can be read from the underlying Channel, and where the InputStream-like interface blocks until there is data ready to read. This results in a blocking Message Reader implementation.
Using a blocking IO interface to a stream simplifies the implementation of a Message Reader a lot. A blocking Message Reader never has to handle situations where no data was read from the stream, or where only a partial message was read from the stream and message parsing needs to be resumed later.
Similarly, a blocking Message Writer (a component that writes messages to a stream) never has to handle the situation where only part of a message was written, and where message writing has to be resumed later.
Blocking IO Pipeline Drawbacks
While a blocking Message Reader is easier to implement, it has the unfortunate drawback of requiring a separate thread for each stream that needs to be split into messages. The reason this is necessary is that the IO interface of each stream blocks until there is some data to read from it. That means that a single thread cannot attempt to read from one stream, and if there is no data, read from another stream. As soon as a thread attempts to read data from a stream, the thread blocks until there is actually some data to read.
If the IO pipeline is part of a server which has to handle lots of concurrent connections, the server will need one thread per active ingoing connection. This may not be a problem if the server only has a few hundred concurrent connections at any time. But, if the server has millions of concurrent connections, this type of design does not scale so well. Each thread will take between 320K (32 bit JVM) and 1024K (64 bit JVM) memory for its stack. So, 1.000.000 threads will take 1 TB memory! And that is before the server has used any memory for processing the incoming messages (e.g. memory allocated for objects used during message processing).
(The exact stack size per thread depends on the -Xss JVM option or the JVM's default value.)
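To make the scaling problem concrete, here is a minimal sketch of the thread-per-connection blocking design described above (the class name and port are illustrative, not taken from the tutorial's repository):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of a thread-per-connection blocking server.
// Every accepted connection costs one thread and its stack memory.
public class BlockingEchoServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket serverSocket = new ServerSocket(9999)) {
            while (true) {
                Socket socket = serverSocket.accept();     // blocks until a client connects
                new Thread(() -> handle(socket)).start();  // one thread per connection
            }
        }
    }

    private static void handle(Socket socket) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream()))) {
            String line;
            // readLine() blocks this thread until a full message (line) arrives,
            // which is exactly why the thread cannot serve any other connection.
            while ((line = in.readLine()) != null) {
                socket.getOutputStream().write((line + "\n").getBytes());
            }
        } catch (Exception e) {
            // connection failed or closed; the thread simply ends in this sketch
        }
    }
}
```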
To keep the number of threads down, many servers use a design where the server keeps a pool of threads (e.g. 100) which reads messages from the inbound connections one at a time. The inbound connections are kept in a queue, and the threads process messages from each inbound connection in the sequence the inbound connections are put into the queue.
This design is illustrated here:
[Image: non-blocking-server-3.png]
However, this design requires that the inbound connections send data reasonably often. If the inbound connections may be inactive for longer periods, a high number of inactive connections may actually block all the threads in the thread pool. That means that the server becomes slow to respond or even unresponsive.
Some server designs try to mitigate this problem by having some elasticity in the number of threads in the thread pool. For instance, if the thread pool runs out of threads, the thread pool might start more threads to handle the load. This solution means that it takes a higher number of slow connections to make the server unresponsive. But remember, there is still an upper limit to how many threads you can have running. So, this would not scale well with 1.000.000 slow connections.
Basic Non-blocking IO Pipeline Design
A non-blocking IO pipeline can use a single thread to read messages from multiple streams. This requires that the streams can be switched to non-blocking mode. When in non-blocking mode, a stream may return 0 or more bytes when you attempt to read data from it. The 0 bytes is returned if the stream has no data to read. The 1+ bytes are returned when the stream actually has some data to read.
(When you read into a ByteBuffer, the number of bytes returned is at most the buffer's capacity, and may well be less.)
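As a minimal sketch of what this looks like in Java NIO (the host, port and buffer size are illustrative):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Minimal sketch: a non-blocking read that may legally return 0 bytes.
SocketChannel channel = SocketChannel.open(new InetSocketAddress("example.com", 80));
channel.configureBlocking(false);            // switch the channel to non-blocking mode

ByteBuffer buffer = ByteBuffer.allocate(4096);
int bytesRead = channel.read(buffer);        // returns 0 if no data is available right now
if (bytesRead > 0) {
    buffer.flip();                           // prepare the received bytes for reading
    // ... hand the bytes to a Message Reader ...
} else if (bytesRead == -1) {
    channel.close();                         // the peer closed the connection
}
```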
To avoid checking streams that have 0 bytes to read we use a Java NIO Selector. One or more SelectableChannel instances can be registered with a Selector. When you call select() or selectNow() on the Selector it gives you only the SelectableChannel instances that actually have data to read.
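A minimal sketch of such a selection loop, continuing the previous example (error handling omitted):

```java
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Minimal sketch: only channels that actually have data show up in selectedKeys().
Selector selector = Selector.open();
channel.register(selector, SelectionKey.OP_READ);  // the channel must be non-blocking

while (true) {
    int readyChannels = selector.selectNow();      // non-blocking variant of select()
    if (readyChannels == 0) continue;

    Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
    while (keys.hasNext()) {
        SelectionKey key = keys.next();
        keys.remove();                             // the Selector never removes keys itself
        if (key.isReadable()) {
            SocketChannel readyChannel = (SocketChannel) key.channel();
            // ... read from readyChannel and pass the bytes to its Message Reader ...
        }
    }
}
```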
This design is illustrated here:
[Image: non-blocking-server-4.png]
Reading Partial Messages
When we read a block of data from a SelectableChannel we do not know if that data block contains less or more than a message. A data block could potentially contain a partial message (less than a message), a full message, or more than a message, for instance 1.5 or 2.5 messages. The various partial message possibilities are illustrated here:
There are two challenges in handling partial messages:
- Detecting if you have a full message in the data block.
- What to do with partial messages until the rest of the message arrives.
Detecting full messages requires that the Message Reader looks at the data in the data block to see if the data contains at least one full message. If the data block contains one or more full messages, these messages can be sent down the pipeline for processing. The process of looking for full messages will be repeated a lot, so this process has to be as fast as possible.
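What counts as a full message is protocol specific. Purely as an illustration, assuming a hypothetical protocol where each message ends with a newline, the detection step could look like this:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: extract complete messages from a data block for a
// hypothetical protocol where every message is terminated by '\n'.
public static List<String> extractFullMessages(ByteBuffer block) {
    List<String> messages = new ArrayList<>();
    int messageStart = block.position();
    for (int i = block.position(); i < block.limit(); i++) {
        if (block.get(i) == '\n') {                 // found the end of a full message
            byte[] bytes = new byte[i - messageStart];
            for (int j = 0; j < bytes.length; j++) {
                bytes[j] = block.get(messageStart + j);
            }
            messages.add(new String(bytes));
            messageStart = i + 1;
        }
    }
    block.position(messageStart);                   // whatever is left is a partial message
    return messages;
}
```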
Whenever there is a partial message in a data block, either by itself or after one or more full messages, that partial message needs to be stored until the rest of that message arrives from the Channel.
Both detecting full messages and storing partial messages is the responsibility of the Message Reader. To avoid mixing message data from different Channel instances we will use one Message Reader per Channel. The design looks like this:
After retrieving a Channel instance which has data to read from the Selector, the Message Reader associated with that Channel reads data and attempts to break it into messages. If that results in any full messages being read, these messages can be passed down the read pipeline to whatever component needs to process them.
A Message Reader is of course protocol specific. A Message Reader needs to know the message format of the messages it is trying to read. If our server implementation is to be reusable across protocols, it needs to be able to have the Message Reader implementation plugged in - possibly by accepting a Message Reader factory as a configuration parameter somehow.
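As a rough sketch of such a pluggable design (the interface and type names here are hypothetical, not necessarily the ones used in the repository):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.List;

// Illustrative sketch of a pluggable, protocol-specific Message Reader.
// "Message" stands for some protocol-level message type of your choosing.
interface MessageReader {
    // Read available data from the channel and buffer partial messages internally.
    void read(SocketChannel channel, ByteBuffer readBuffer) throws IOException;

    // The full messages detected so far, ready for the rest of the read pipeline.
    List<Message> getMessages();
}

// The server core only depends on this factory, so the protocol can be swapped out.
interface MessageReaderFactory {
    MessageReader createMessageReader();
}
```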
Storing Partial Messages
Now that we have established that it is the responsibility of the Message Reader to store partial messages until a full message has been received, we need to figure out how this partial message storage should be implemented.
There are two design considerations we should take into account:
- We want to copy message data around as little as possible. The more copying, the lower the performance.
- We want full messages to be stored in consecutive byte sequences to make parsing messages easier.
A Buffer Per Message Reader
Obviously the partial messages need to be stored in some kind of buffer. The straightforward implementation would be to simply have one buffer internally in each Message Reader. However, how big should that buffer be? It would need to be big enough to be able to store even the biggest allowed messages. So, if the biggest allowed message is 1MB, then the internal buffer in each Message Reader would need to be at least 1MB.
Using 1MB per connection doesn't really work when we reach millions of connections. 1.000.000 x 1MB is still 1TB memory! And what if the maximum message size is 16MB? Or 128MB ?
Resizable Buffers
Another option would be to implement a resizable buffer for use inside each Message Reader. A resizable buffer will start small, and if a message gets too big for the buffer, the buffer is expanded. This way each connection will not necessarily require an e.g. 1MB buffer. Each connection only takes as much memory as it needs to hold the next message.
There are several ways to implement a resizable buffer. All of them have advantages and disadvantages, so I will discuss them all in the following sections.
Resize by Copy
The first way to implement a resizable buffer is to start with a small buffer of e.g. 4KB. If a message cannot fit into the 4KB buffer, a larger buffer of e.g. 8KB could be allocated, and the data from the 4KB buffer copied into the bigger buffer.
The advantage of the resize-by-copy buffer implementation is that all data for a message is kept together in a single consecutive byte array. This makes parsing the message much easier.
The disadvantage of the resize-by-copy buffer implementation is that it will lead to a lot of data copying for bigger messages.
To reduce data copying you could analyze the size of the messages flowing through your system to find some buffer sizes that would reduce the amount of copying. For instance, you might see that most messages are less than 4KB because they only contain very small requests / responses. That means that the first buffer size should be 4KB.
Then you might see that if a message is larger than 4KB it is often because it contains a file. You might then notice that most of the files flowing through the system is less than 128KB. Then it makes sense to make the second buffer size 128KB.
Finally you might see that once a message is above 128KB there is no real pattern(no real pattern 没有固定的模式) in how large the message is, so perhaps the final buffer size should just be the maximum message size.
With these 3 buffer sizes based on the size of messages flowing through your system, you will have reduced data copying somewhat. Messages below 4KB will never be copied. For 1.000.000 concurrent connections that results in 1.000.000 x 4KB = 4GB which is possible in most servers today (2015). Messages between 4KB and 128KB will get copied once, and only 4KB data will need to be copied into the 128KB buffer. Messages between 128KB and maximum message size will be copied twice. First time 4KB will get copied, second time 128KB will get copied, so a total of 132KB copying for the biggest messages. Assuming that there are not that many messages above 128KB this might be acceptable.
Once a message has been fully processed the allocated memory should be freed again. That way the next message received from the same connection starts with the smallest buffer size again. This is necessary to make sure that the memory can be shared more efficiently between connections. Most likely not all connections will need big buffers at the same time.
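Here is a hedged sketch of the resize-by-copy idea using the three tier sizes discussed above (the class name and the handling of oversized messages are illustrative):

```java
import java.nio.ByteBuffer;

// Illustrative resize-by-copy buffer with the three tiers discussed above.
public class ResizeByCopyBuffer {
    private static final int SMALL  = 4 * 1024;
    private static final int MEDIUM = 128 * 1024;
    private static final int LARGE  = 1024 * 1024;  // assumed maximum message size

    private ByteBuffer buffer = ByteBuffer.allocate(SMALL);

    public void write(ByteBuffer src) {
        if (src.remaining() > buffer.remaining()) {
            // messages above LARGE are assumed to be rejected elsewhere in this sketch
            grow(buffer.position() + src.remaining());
        }
        buffer.put(src);
    }

    private void grow(int requiredCapacity) {
        int newCapacity = requiredCapacity <= MEDIUM ? MEDIUM : LARGE;
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
        buffer.flip();
        bigger.put(buffer);   // the copy that gives this approach its name
        buffer = bigger;
    }

    public void reset() {     // free the big buffer once the message is fully processed
        buffer = ByteBuffer.allocate(SMALL);
    }
}
```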
I have a complete tutorial about how to implement such a memory buffer that supports resizable arrays here: Resizable Arrays . The tutorial also contains a link to a GitHub repository with code showing a working implementation.
Resize by Append
Another way to resize a buffer is to make the buffer consist of multiple arrays. When you need to resize the buffer you simply allocate another byte array and write the data into that.
There are two ways to grow such a buffer. One way is to allocate separate byte arrays and keep a list of these byte arrays. Another way is to allocate slices of a larger, shared byte array, and then keep a list of the slices allocated to the buffer. Personally, I feel the slices approach is slightly better, but the difference is small.
The advantage of growing a buffer by appending separate arrays or slices to it is that no data needs to be copied during writing. All data can be copied directly from a socket (Channel) into an array or slice.
The disadvantage of growing a buffer this way is that the data is not stored in a single, consecutive array. This makes message parsing harder, since the parsers need to look out for both the end of every individual array and the end of all arrays at the same time. Since you need to look for the end of a message in the written data, this model is not too easy to work with.
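Here is a hedged sketch of the append approach using a list of fixed-size chunks (names and chunk size are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative resize-by-append buffer: growth appends chunks, no old data is copied.
public class ResizeByAppendBuffer {
    private static final int CHUNK_SIZE = 4 * 1024;

    private final List<byte[]> chunks = new ArrayList<>();
    private int positionInLastChunk = CHUNK_SIZE;  // forces allocation on first write

    public void write(byte[] data, int offset, int length) {
        while (length > 0) {
            if (positionInLastChunk == CHUNK_SIZE) {
                chunks.add(new byte[CHUNK_SIZE]);  // grow by appending a new chunk
                positionInLastChunk = 0;
            }
            byte[] last = chunks.get(chunks.size() - 1);
            int toCopy = Math.min(length, CHUNK_SIZE - positionInLastChunk);
            System.arraycopy(data, offset, last, positionInLastChunk, toCopy);
            positionInLastChunk += toCopy;
            offset += toCopy;
            length -= toCopy;
        }
    }
    // Note: a parser now has to walk the chunk list, which is the drawback described above.
}
```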
TLV Encoded Messages
Some protocol message formats are encoded using a TLV format (Type, Length, Value). That means that when a message arrives, the total length of the message is stored in the beginning of the message. That way you know immediately how much memory to allocate for the whole message.
TLV encodings make memory management much easier. You know immediately how much memory to allocate for the message. No memory is wasted at the end of a buffer that is only partially used.
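As an illustration, assuming a hypothetical TLV header with a 1-byte type field and a 4-byte length field, the header check could look like this:

```java
import java.nio.ByteBuffer;

// Illustrative: a hypothetical TLV header of 1 byte type + 4 bytes length.
public static int messageLengthIfHeaderComplete(ByteBuffer block) {
    final int headerSize = 5;
    if (block.remaining() < headerSize) {
        return -1;                                     // header not fully received yet
    }
    byte type  = block.get(block.position());          // T: the message type
    int length = block.getInt(block.position() + 1);   // L: the length of the value
    // We now know exactly how much memory the whole message needs.
    return headerSize + length;
}
```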
One disadvantage of TLV encodings is that you allocate all the memory for a message before all the data of the message has arrived. A few, slow connections sending big messages can thus allocate all the memory you have available, making your server unresponsive.
A workaround for this problem would be to use a message format that contains multiple TLV fields inside. Thus, memory is allocated for each field, not for the whole message, and memory is only allocated as the fields arrive. Still, a large field can have the same effect on your memory management as a large message.
Another workaround is to time out messages which have not been received within e.g. 10-15 seconds. This can make your server recover from a coincidental, simultaneous arrival of many big messages, but it will still make the server unresponsive for a while. Additionally, an intentional DoS (Denial of Service) attack can still result in full allocation of the memory for your server.
TLV encodings exist in different variations. Exactly how many bytes are used to specify the type and length of a field depends on each individual TLV encoding. There are also TLV encodings that put the length of the field first, then the type, and then the value (an LTV encoding). While the sequence of the fields is different, it is still a TLV variation.
The fact that TLV encodings make memory management easier is one of the reasons why HTTP 1.1 is such a terrible protocol. That is also one of the problems they are trying to fix in HTTP 2.0, where data is transported in LTV encoded frames. This is also why we have designed our own network protocol for our VStack.co project that uses a TLV encoding.
Writing Partial Messages
In a non-blocking IO pipeline writing data is also a challenge. When you call write(ByteBuffer) on a Channel in non-blocking mode there is no guarantee about how many of the bytes in the ByteBuffer will be written. The write(ByteBuffer) method returns how many bytes were written, so it is possible to keep track of the number of written bytes. And that is the challenge: keeping track of partially written messages so that in the end all bytes of a message have been sent.
To manage the writing of partial messages to a Channel we will create a Message Writer. Just like with the Message Reader, we will need a Message Writer per Channel we write messages to. Inside each Message Writer we keep track of exactly how many bytes have been written of the message it is currently writing.
In case more messages arrive at a Message Writer than it can write directly out to the Channel, the messages need to be queued up internally in the Message Writer. The Message Writer then writes the messages as fast as it can to the Channel.
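Here is a hedged sketch of such a Message Writer (the names are illustrative; the repository's implementation differs in its details):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative Message Writer: queues messages and tracks partially written ones.
public class MessageWriter {
    private final Queue<ByteBuffer> queue = new ArrayDeque<>();
    private ByteBuffer messageInProgress;

    public void enqueue(ByteBuffer message) {
        queue.add(message);
    }

    // Called whenever the Channel is ready for writing.
    public void write(SocketChannel channel) throws IOException {
        while (true) {
            if (messageInProgress == null) {
                messageInProgress = queue.poll();
                if (messageInProgress == null) return;  // nothing left to write
            }
            channel.write(messageInProgress);           // may write fewer bytes than remain
            if (messageInProgress.hasRemaining()) {
                return;                                 // channel is full; resume later
            }
            messageInProgress = null;                   // fully written; take the next message
        }
    }

    public boolean isEmpty() {
        return messageInProgress == null && queue.isEmpty();
    }
}
```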
Here is a diagram showing how the partial message writing is designed so far:
[Image: non-blocking-server-8.png]
For the Message Writer to be able to send messages that were only partially sent earlier, the Message Writer needs to be called from time to time, so it can send more data.
If you have a lot of connections you will have a lot of Message Writer instances. Checking e.g. a million Message Writer instances to see if they can write any data is slow. First of all, many of the Message Writer instances may not have any messages to send. We don't want to check those Message Writer instances. Second, not all Channel instances may be ready to write data to. We don't want to waste time trying to write data to a Channel that cannot accept any data anyway.
To check if a Channel is ready for writing you can register the channel with a Selector. However, we do not want to register all Channel instances with the Selector. Imagine if you have 1.000.000 connections which are mostly idle and all 1.000.000 connections were registered with the Selector. Then, when you call select() most of these Channel instances would be write-ready (they are mostly idle, remember?). You would then have to check the Message Writer of all those connections to see if they had any data to write.
To avoid checking all Message Writer instances for messages, and all Channel instances which anyway do not have any messages to be sent to them, we use this two-step approach:
- When a message is written to a Message Writer, the Message Writer registers its associated Channel with the Selector (if it is not already registered).
- When your server has time, it checks the Selector to see which of the registered Channel instances are ready for writing. For each write-ready Channel its associated Message Writer is requested to write data to the Channel. If a Message Writer writes all its messages to its Channel, the Channel is unregistered from the Selector again.
This little two-step approach makes sure that only Channel instances that have messages to be written to them are actually registered with the Selector.
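A hedged sketch of this two-step approach, reusing the MessageWriter sketch from above (the helper names are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Illustrative: a Channel is registered for OP_WRITE only while its writer has messages.
public class WriteRegistration {
    // Step 1: register the channel when its Message Writer receives a message.
    public static void enqueueMessage(SocketChannel channel, Selector writeSelector,
                                      MessageWriter writer, ByteBuffer message)
            throws IOException {
        writer.enqueue(message);
        // register() simply updates the existing key if the channel is already registered
        channel.register(writeSelector, SelectionKey.OP_WRITE, writer);
    }

    // Step 2: when the server has time, write to the ready channels and unregister
    // any channel whose Message Writer has run out of messages.
    public static void writeReadyChannels(Selector writeSelector) throws IOException {
        writeSelector.selectNow();
        for (SelectionKey key : writeSelector.selectedKeys()) {
            MessageWriter writer = (MessageWriter) key.attachment();
            writer.write((SocketChannel) key.channel());
            if (writer.isEmpty()) {
                key.cancel();  // unregister until the next message arrives
            }
        }
        writeSelector.selectedKeys().clear();
    }
}
```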
Putting it All Together
As you can see, a non-blocking server needs to check for incoming data from time to time to see if there are any new full messages received. The server might need to check multiple times until one or more full messages have been received. Checking one time is not enough, so this checking has to happen in a loop.
Similarly, a non-blocking server needs to check from time to time if there is any data to write. If yes, the server needs to check if any of the corresponding connections are ready to have that data written to them. Checking only when a message is queued up the first time is not enough, since the message might be written partially.
All in all a non-blocking server ends up with three "pipelines" it needs to execute regularly:
- The read pipeline which checks for new incoming data from the open connections.
- The process pipeline which processes any full messages received.
- The write pipeline which checks if it can write any outgoing messages to any of the open connections.
These three pipelines are executed repeatedly in a loop. You might be able to optimize the execution somewhat. For instance, if there are no messages queued up you can skip the write pipeline. Or, if there were no new, full messages received, perhaps you can skip the process pipeline.
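A hedged sketch of the resulting server loop (the four helper methods are hypothetical stand-ins for the components described in the previous sections):

```java
import java.io.IOException;

// Illustrative only: the three pipelines executed repeatedly in a single loop.
public class ServerLoop {
    public void run() throws IOException {
        while (true) {
            acceptNewConnections();   // register newly accepted sockets
            readFromReadyChannels();  // read pipeline: Message Readers detect full messages
            processFullMessages();    // process pipeline: protocol / business logic
            writeToReadyChannels();   // write pipeline: Message Writers flush responses
        }
    }

    private void acceptNewConnections()  throws IOException { /* omitted in this sketch */ }
    private void readFromReadyChannels() throws IOException { /* omitted in this sketch */ }
    private void processFullMessages()                      { /* omitted in this sketch */ }
    private void writeToReadyChannels()  throws IOException { /* omitted in this sketch */ }
}
```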
Here is a diagram illustrating the full server loop:
[Image: non-blocking-server-9.png]
If you still find this a bit complicated, remember to check out the GitHub repository:
https://github.com/jjenkov/java-nio-server
Maybe seeing the code in action might help you understand how to implement this.
Server Thread Model
The non-blocking server implementation in the GitHub repository uses a thread model with 2 threads. The first thread accepts incoming connections from a ServerSocketChannel. The second thread processes the accepted connections, meaning reading messages, processing the messages and writing responses back to the connections. This 2 thread model is illustrated here:
The server processing loop explained in the previous section is executed by the processing thread.
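Here is a hedged sketch of that two-thread setup (the queue capacity, port and names are illustrative):

```java
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative two-thread model: one thread accepts, one thread processes.
public class TwoThreadServer {
    public static void main(String[] args) throws Exception {
        BlockingQueue<SocketChannel> accepted = new ArrayBlockingQueue<>(1024);

        Thread acceptorThread = new Thread(() -> {
            try (ServerSocketChannel serverChannel = ServerSocketChannel.open()) {
                serverChannel.bind(new InetSocketAddress(9999));
                while (true) {
                    accepted.put(serverChannel.accept());  // hand off to the processor thread
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        Thread processorThread = new Thread(() -> {
            while (true) {
                try {
                    SocketChannel channel = accepted.take();
                    channel.configureBlocking(false);
                    // ... register the channel and run the read/process/write
                    //     server loop from the previous section ...
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });

        acceptorThread.start();
        processorThread.start();
    }
}
```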