
Speed Up Shadowsocks

This article also has a Chinese version.

This article briefly summarizes current methods for accelerating internet access and details an acceleration project I developed based on Shadowsocks: go-shadowsocks-magic.

UPDATE 2019-10: Building on the ideas from this project, I extended it to support arbitrary TCP connections by decoupling the multi-connection acceleration: Rabbit TCP, which shows some improvements in stability and latency.

Introduction

The working methods of most “ladders” (a colloquial term for VPN-like services in China) are similar.

Typically, traffic has to go through an overseas server for relay before reaching the intended destination server. Such proxy services include Shadowsocks, OpenVPN, ShadowVPN, and GoAgent. Another type involves a more complex topology that supports traffic transmission through a network of multiple nodes, with a specific node acting as a proxy, like V2Ray, Tinc, and Tor.

Classified by the type of traffic they can proxy: VPNs relay all IP packets, covering common protocols like TCP, UDP, and ICMP, and work in a global-routing mode. Socket-level proxies, typified by Shadowsocks and V2Ray, cannot proxy ICMP packets and usually provide service as a SOCKS5 proxy, which makes proxying UDP traffic less convenient (tools like SSTap are needed to forcibly forward it).

Currently, there are several methods for accelerating “ladders.”

Implement Congestion Control and Reliability by Yourself

Typical representatives: KCPTun, UDPSpeeder, FinalSpeed

These accelerators generally communicate over UDP and require deployment on both the client and the server. They speed up the connection by dynamically probing the available bandwidth and aggressively sending redundant packets to mask packet loss. However, because they occupy bandwidth so inefficiently, their acceleration algorithms are often considered selfish. In my test over a year ago, KCPTun used 8 times the traffic to double the speed (which was still quite slow).

TCP Congestion Control on One Side Only

Typical representatives: BBR, SharpSpeed

This kind of acceleration requires modifying the kernel, so it is not suitable for OpenVZ virtual machines (though there are workarounds that move it into user space). These algorithms aim to fully utilize bandwidth on links with a non-trivial packet loss rate.

Pre-BBR congestion control algorithms (CUBIC, Reno, etc.) use additive increase to probe for a larger send window and multiplicative decrease when packet loss is detected, without distinguishing congestion loss from random error loss. Public-network links often have high latency (and correspondingly high bandwidth), so a large send window is needed to fill them; but every loss halves the window, and at high RTT the additive recovery is slow. On high-latency, high-loss lines, these algorithms therefore converge to a relatively small send window.
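
As a concrete illustration, here is a toy Reno-style AIMD simulation in Go (a sketch, not any real TCP stack): the window grows by one segment per round trip and halves on any loss, congestion-induced or not. Even at 1% loss, the average window stays small, which on a high-RTT line caps throughput far below the physical bandwidth.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const lossRate = 0.01 // 1% loss, congestion or random error alike
	window := 1.0
	sum, rounds := 0.0, 10000

	for i := 0; i < rounds; i++ {
		lost := false
		for s := 0; s < int(window); s++ {
			if rand.Float64() < lossRate {
				lost = true
				break
			}
		}
		if lost {
			window /= 2 // multiplicative decrease on any loss
			if window < 1 {
				window = 1
			}
		} else {
			window++ // additive increase: one more segment per RTT
		}
		sum += window
	}
	// On a 200 ms RTT line, throughput is capped at roughly
	// avgWindow * segmentSize / RTT, no matter how fat the pipe is.
	fmt.Printf("average window: %.1f segments\n", sum/float64(rounds))
}
```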

BBR achieves more efficient bandwidth utilization by estimating bandwidth and latency and actively adjusting the send window.

Avoiding QoS

Typical representatives: SSR, UDP2RAW

To ensure Quality of Service (QoS) for their users, ISPs assign different priorities to different types of traffic. After all, if one person monopolizes the "pipe," everyone else's experience suffers.

The methods ISPs use to determine priority are not public. From some black-box observations, we can infer port priority (ports like 80 and 443 get preferential treatment), traffic-type priority (large amounts of UDP traffic can lead to disconnection), and line priority (you can pay for a better international route).

SSR disguises encrypted Shadowsocks traffic with HTTP headers, pretending to be HTTP traffic and thus fooling certain traffic type filters to some extent. UDP2RAW tricks hardware switches into treating UDP packets as TCP by adding TCP headers to UDP packets and simulating a TCP handshake, which can reduce packet loss.

Other software, like V2Ray, also supports dynamic ports to avoid excessive traffic on a single port. For unsupported software, iptables can achieve similar multiplexing.

Accelerating Shadowsocks with Multiple Connections

Project URL: https://github.com/ihciah/go-shadowsocks-magic

The communication model of today's Shadowsocks has not changed much from the version initially published by clowwindy.

The browser talks to ss-local over the SOCKS5 protocol. After the SOCKS5 handshake completes, ss-local establishes a connection to ss-server; from then on, everything on that connection is encrypted and decrypted with the pre-agreed algorithm and key. ss-local first sends the destination's IP/Hostname and Port to ss-server in a specific format; after receiving and decrypting it, ss-server proactively connects to the destination. Once that connection is established, the two connections are simply piped together (with encryption and decryption in between, of course).
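
For reference, the target address ss-local sends follows the SOCKS5 address format: a one-byte type (0x01 IPv4, 0x03 hostname, 0x04 IPv6), the address itself, and a two-byte big-endian port. A minimal Go sketch of encoding a hostname target:

```go
package main

import "fmt"

// marshalAddr encodes a hostname target in the SOCKS5-style address
// format that Shadowsocks uses: [type][address][port, big-endian].
// 0x03 marks a length-prefixed hostname.
func marshalAddr(host string, port uint16) []byte {
	buf := []byte{0x03, byte(len(host))}
	buf = append(buf, host...)
	return append(buf, byte(port>>8), byte(port))
}

func main() {
	fmt.Printf("% x\n", marshalAddr("example.com", 443))
}
```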

For some subpar connections with high packet loss rates, the actual bandwidth can be quite low, not even capable of handling YouTube at 720P. In these cases, the bandwidth bottleneck is not on the ss-server to the destination server connection but rather between the ss-local and ss-server.

Compare a browser's download speed with that of a multi-connection downloader like IDM: if a single connection converges to a small window, multiple connections can make much better use of the available bandwidth. I believe the same principle can be applied to the bottlenecked segment: ss-local <====> ss-server.

Protocol Design

First, the server side needs to cache more data, which is then passed back through multiple connections initiated by the client.

This can be done in two ways:

  • Inform the client of the range of optional data and give the client the right to choose the data segment; the client sends the target data segment, and the server sends back these segments.
  • Don’t worry about the client and let the server actively distribute data.

IDM’s download method is similar to the first one. However, there are clear differences between file downloads and data proxying: files have a fixed size, and servers have all the data.

Since the server's cache is constantly changing, the first method is inefficient and logically complex, so I chose the second. Managing multiple connections raises a similar question of who is in charge, and there the protocol gives the client the power to decide whether or not to create connections.

Protocol Details

In the following description, “client” refers to ss-client, and “server” refers to ss-server. I have omitted the parts about encryption and decryption.

  1. First, the client establishes a connection with the server.

  2. Send the Address. The first byte of the Address represents the address type. Here, I've added two flag bits to indicate the acceleration status: magic-main (0b01000) and magic-child (0b10000), marking the main connection and secondary connections respectively.

    Mark the Address with magic-main and send it. At this point, the server will respond with a 16-byte dataKey. Other connections holding this Key can legitimately request that data.

  3. At this point, we only have one connection, which will continuously send back data blocks. The data block format is:

    [BlockID(uint32)][BlockSize(uint32)][Data([BlockSize]byte)]

  4. We can create additional connections to accelerate data retrieval. In a new thread, we assemble the “Address”: [Type([1]byte)][dataKey([16]byte)], where Type is magic-child.

    Create an additional connection to the server, then send this Address, and then receive data blocks as needed.

    Ultimately, all received data blocks must be sorted by BlockID and returned to the SOCKS5 connection.

  5. If the main connection breaks, all secondary connections become invalid. However, if a secondary connection is interrupted, it does not affect the others.
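
A minimal Go sketch of the client-side block handling described above (the byte order is not spelled out in the protocol description, so big-endian is assumed; names are illustrative, not the project's actual code):

```go
package magic

import (
	"encoding/binary"
	"io"
)

const (
	magicMain  = 0b01000 // flag OR'ed into the address-type byte of the main connection
	magicChild = 0b10000 // address type of a secondary connection
)

type block struct {
	id   uint32
	data []byte
}

// readBlocks decodes [BlockID(uint32)][BlockSize(uint32)][Data] frames
// from one connection (main or secondary) until it closes.
func readBlocks(conn io.Reader, out chan<- block) error {
	var hdr [8]byte
	for {
		if _, err := io.ReadFull(conn, hdr[:]); err != nil {
			return err
		}
		id := binary.BigEndian.Uint32(hdr[:4])
		size := binary.BigEndian.Uint32(hdr[4:])
		data := make([]byte, size)
		if _, err := io.ReadFull(conn, data); err != nil {
			return err
		}
		out <- block{id: id, data: data}
	}
}

// reorder writes blocks arriving from all connections back to the SOCKS5
// side strictly in BlockID order, parking out-of-order blocks in a map.
func reorder(in <-chan block, w io.Writer) error {
	pending := make(map[uint32][]byte)
	var next uint32
	for b := range in {
		pending[b.id] = b.data
		for data, ok := pending[next]; ok; data, ok = pending[next] {
			if _, err := w.Write(data); err != nil {
				return err
			}
			delete(pending, next)
			next++
		}
	}
	return nil
}
```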

Code Implementation

My code is based on go-shadowsocks2.

Project URL: https://github.com/ihciah/go-shadowsocks-magic

In this initial version, bufferSize is set to 64 KB. To avoid establishing too many connections for small files, the client creates 2 secondary connections every 500 ms until MaxConnection (16 here) is reached. This logic is relatively simple and could be optimized further; a sketch of it follows.
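
A sketch of that pacing logic under the same parameters (the function and its arguments are illustrative, not the project's actual API):

```go
package magic

import "time"

// spawnChildren starts two more secondary connections every 500 ms until
// MaxConnection (16, counting the main connection) is reached or the
// transfer finishes. dialChild is assumed to send [magic-child][dataKey]
// and then read data blocks until its connection closes.
func spawnChildren(done <-chan struct{}, dialChild func()) {
	const (
		maxConnection = 16
		batch         = 2
		interval      = 500 * time.Millisecond
	)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for active := 1; active < maxConnection; {
		select {
		case <-done: // transfer finished; stop adding connections
			return
		case <-ticker.C:
			for i := 0; i < batch && active < maxConnection; i++ {
				go dialChild()
				active++
			}
		}
	}
}
```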

For compatibility reasons, the server side accepts existing versions of the Shadowsocks protocol, supporting various existing clients on the market. The client side accelerates by default but can only connect to servers that support the acceleration protocol.

Experiment

Experimental Environment:

Fudan University Campus Network (Zhangjiang Campus, 100M wired network) <—> AWS Lightsail Tokyo <—> qd.myapp.com

Client: Windows 10 Enterprise 1803 (original protocol: shadowsocks-windows, acceleration: statically compiled go-shadowsocks-magic v1.0)

Server: Ubuntu 18.04.1 LTS (with BBR)

Experimental script:

https://gist.github.com/ihciah/bda1fab5dc1a8d0f70d7b0199f46169a

Experimental Results:

```
PS C:\Users\ihciah\Desktop> python .\socks-test.py
Start Download https://qd.myapp.com/myapp/qqteam/pcqq/QQ9.0.8_3.exe.
<Response [200]>
[Original]Done 22.401 seconds.
Start Download https://qd.myapp.com/myapp/qqteam/pcqq/QQ9.0.8_3.exe.
<Response [200]>
[MY]Done 14.744 seconds.
==========
Original: 22.40120 sec
My : 14.74440 sec
==========
```

The experimental results show that in subpar networks, multiple connections can significantly speed up large data transfers.

This is a simple hobby project; contributions and suggestions are welcome.

Follow-up Work (2019-10)

On domestic (Chinese) ISPs, multiple connections can indeed provide some acceleration. However, the shadowsocks-magic described above can only be regarded as a proof of concept, as it has some practical issues.

Because every Shadowsocks-layer connection spawns multiple underlying connections on a delay, the total number of connections can surge, leaving a large number of sockets in TIME_WAIT after they close; and since secondary connections ramp up gradually, only long-lived transfers actually get accelerated.

Subsequently, I split the multi-connection acceleration out into a separate module that supports bilateral acceleration of arbitrary TCP traffic: Rabbit TCP. The principle is still to carry the data blocks of upper-layer connections over multiple connections, segmenting and reassembling them. The difference is that besides being embeddable in Golang code, it can act as a standalone proxy, carrying any number of upper-layer connections over a fixed number of underlying connections.

Unlike before, the underlying connections are persistent, which avoids the problems caused by frequently establishing and closing large numbers of connections; it also somewhat reduces connection-setup latency; another potential benefit is that it changes the timing characteristics of the traffic.
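
To illustrate the idea (the actual Rabbit TCP wire format is not documented here, so the frame layout below is assumed purely for illustration): extending each block header with a stream ID lets a fixed pool of persistent carrier connections multiplex any number of logical streams.

```go
package rabbitsketch

import (
	"encoding/binary"
	"io"
)

// frame is a hypothetical Rabbit-style block. The extra ConnID field lets
// a fixed set of persistent carrier connections multiplex any number of
// upper-layer TCP streams: [ConnID][BlockID][BlockSize][Data].
type frame struct {
	connID, blockID uint32
	data            []byte
}

// demux reads frames from one carrier connection and fans them out to
// per-stream channels; each stream then reorders by blockID as before.
func demux(carrier io.Reader, streams map[uint32]chan<- frame) error {
	var hdr [12]byte
	for {
		if _, err := io.ReadFull(carrier, hdr[:]); err != nil {
			return err
		}
		f := frame{
			connID:  binary.BigEndian.Uint32(hdr[0:4]),
			blockID: binary.BigEndian.Uint32(hdr[4:8]),
		}
		size := binary.BigEndian.Uint32(hdr[8:12])
		f.data = make([]byte, size)
		if _, err := io.ReadFull(carrier, f.data); err != nil {
			return err
		}
		if ch, ok := streams[f.connID]; ok {
			ch <- f
		}
	}
}
```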

Of course, there are also downsides. Since it carries arbitrary TCP traffic and needs to hide the Block packet format, the outermost layer has to be encrypted; for protocols like Shadowsocks whose traffic is already encrypted, that amounts to an extra round of encryption and decryption, which costs performance. Another gap, not yet implemented, is QoS: without it, all carried traffic is weighted equally, which might not be ideal for user experience.

Overall, the Rabbit TCP project is far more stable than the earlier go-shadowsocks-magic, and using the Rabbit Plugin to accelerate Shadowsocks traffic provides a good experience.
