Wednesday, December 31, 2014

Ceph Pi - performance of scp, ciphers, and the Raspberry Pi

This article is not really much about Ceph, but it does explore technologies very tangential to it.  So I decided to add it to the series.

I am sure this has been discussed many times before: what is the best cipher for SCP to use to get maximum throughput?  I decided to do a quick survey of the current state of the art and share it.

Some versions of SSH, including the one distributed in Debian Jessie, have a cool new option that lists all the supported ciphers.

 ssh -Q cipher  

In case yours does not, you can do

 man ssh_config  

So I put together a little bash script to run on my Debian server ( nothing special - a single CPU i5 with an SSD ) to copy the same 500MB file ( Debian ISO ) to a target computer via every supported cipher.  First I tried this against a Raspberry Pi, and then against my MacBook Pro connected to the same gig switch.  Here is the script for your amusement and pleasure.
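 #!/bin/bash  
 # Try every cipher the local ssh client knows about  
 for i in $(ssh -Q cipher)  
 do  
     echo "scp -o Ciphers=$i ./debian-live-7.7.0-amd64-standard.iso root@pi1:/mnt/sda1"  
     # time is a shell keyword, so it wraps scp directly (no backticks needed)  
     time scp -o Ciphers="$i" ./debian-live-7.7.0-amd64-standard.iso root@pi1:/mnt/sda1  
     echo ""  
     echo ""  
 done  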

If your particular system does not support the -Q option you can copy and paste the list of ciphers from the 'man ssh_config' and modify the second line of the script to read something like

 for i in aes128-ctr aes192-ctr aes256-ctr arcfour256 arcfour128 aes128-gcm@openssh.com aes256-gcm@openssh.com aes128-cbc 3des-cbc blowfish-cbc cast128-cbc aes192-cbc aes256-cbc arcfour  
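
The two ways of getting the cipher list can be folded into one helper.  This is only a sketch: list_ciphers and SSH_BIN are names I made up for illustration, and the fallback list is simply the one pasted above.

```shell
#!/bin/bash
# SSH_BIN is a hypothetical override, handy for pointing at another ssh binary.
SSH_BIN=${SSH_BIN:-ssh}

# Fallback list copied from the article's 'man ssh_config' example.
FALLBACK_CIPHERS="aes128-ctr aes192-ctr aes256-ctr arcfour256 arcfour128 aes128-gcm@openssh.com aes256-gcm@openssh.com aes128-cbc 3des-cbc blowfish-cbc cast128-cbc aes192-cbc aes256-cbc arcfour"

# list_ciphers: print one cipher name per line.
# Uses 'ssh -Q cipher' when it works, otherwise the fallback list.
list_ciphers() {
    local out
    if out=$("$SSH_BIN" -Q cipher 2>/dev/null) && [ -n "$out" ]; then
        printf '%s\n' "$out"
    else
        # Intentionally unquoted: split the list on spaces, one name per line.
        printf '%s\n' $FALLBACK_CIPHERS
    fi
}
```

The loop in the script then becomes 'for i in $(list_ciphers)' on either kind of system.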

First observation - not all ciphers that the ssh client supports are enabled by default on all ssh servers so you will get a bunch of errors.  I am not going to enable the additional ciphers, since this is more of a practical guide.
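
If the errors get in the way, one option is to probe the server first and only time the ciphers it accepts.  A rough sketch, assuming key-based login is already set up (BatchMode disables password prompts); cipher_accepted, usable_ciphers and SSH_BIN are hypothetical names, and note that any connection failure, not just a rejected cipher, counts as "not accepted" here.

```shell
#!/bin/bash
# SSH_BIN is a hypothetical override so the probe can be pointed at a stub.
SSH_BIN=${SSH_BIN:-ssh}

# cipher_accepted HOST CIPHER: succeed if a session to HOST can be opened
# when CIPHER is the only cipher offered.
cipher_accepted() {
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 -o Ciphers="$2" "$1" true >/dev/null 2>&1
}

# usable_ciphers HOST CIPHER...: print the accepted ciphers, one per line.
usable_ciphers() {
    local host=$1 c
    shift
    for c in "$@"; do
        if cipher_accepted "$host" "$c"; then
            printf '%s\n' "$c"
        fi
    done
}
```

Something like 'usable_ciphers root@pi1 $(ssh -Q cipher)' would then give a list that can be fed to the timing loop.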

Raspberry SCP results

Here are the throughput results copying the 500 MB file from the PC to Pi and vice versa.  
SCP Throughput with various algorithms using the SD card as storage on the Pi
I would think that the difference here comes down to one of two things: pipelining, or the fact that ciphers are computationally harder on the encryption side than on the decryption side.  I am honestly not too sure and would love some input on this nearly 30% difference.
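
For anyone repeating the test, throughput can be worked out from the file size and the wall-clock time that 'time' reports.  A small sketch; throughput_mbps is a made-up helper name, and I am assuming decimal megabits per second ( 1 Mb = 1,000,000 bits ).

```shell
#!/bin/bash
# throughput_mbps BYTES SECONDS: print throughput in megabits per second.
# Assumption: decimal units, 1 Mb = 1,000,000 bits.
throughput_mbps() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) { print "n/a"; exit }
        printf "%.1f\n", bytes * 8 / secs / 1000000
    }'
}

# Illustrative numbers only, not a measurement from this article:
# 500,000,000 bytes copied in 80 seconds.
throughput_mbps 500000000 80    # prints 50.0
```

The real byte count of the ISO can be taken with 'wc -c < file' instead of assuming an even 500 MB.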

Using the USB attached 7200RPM HDD on the Pi as a storage device

SCP Throughput with various algorithms using USB 7200RPM HDD as storage

I really expected the HDD to be faster than the SD card, but it was not.  It was consistently a little bit slower.  Nothing very significant, but certainly a fact.  Looking for an explanation, I looked at the CPU utilization on the system.  It should be noted that the Pi is pretty much pegged at 100% during this operation.  This is split between System and User.  System takes care of the Disk and Network IO work and User is what the encryption algorithms use up.  Writing to the USB mounted HDD takes some extra CPU processing, and the media IO is not a bottleneck at the speeds we are achieving.  So CPU is still our limiting factor.
CPU System usage SD vs HDD
Notice that there is a cool correlation - the higher the throughput of the algorithm, the higher the System CPU utilization, since it has to manage more IOPS both to the network and to the disk.  What is a bit surprising is how much CPU storage and network seem to require.  This would certainly be a place for future Pis to look for improvement.  The numbers seem to imply that even with no encryption whatsoever, the Raspberry Pi would be limited to about 60 - 65 Mb/s.

So what about User CPU per algorithm?

I did get some really interesting results here, though from the PC rather than the Pi.  The Pi looked pretty much as expected, with less CPU being used by the algorithms that achieved higher throughput.  
CPU Used by Cipher on the Raspberry Pi

This is not exactly the case on the PC.  Here our best performing algorithm - chacha20-poly1305@openssh.com is using way more CPU than its brethren.


The data makes almost no sense.  The worst performing algorithms in terms of throughput seem to use the least CPU.  So what gives?  How is it possible for chacha to be using the most CPU yet have the highest throughput?

Acceleration, my dear Watson, acceleration.  The Intel CPU on my PC supports the AES-NI instruction set, which accelerates AES based ciphers, especially the GCM family.  According to Wikipedia, the acceleration provides an 'increase in throughput from approximately 28.0 cycles per byte to 3.5 cycles per byte'.
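
On Linux you can check whether a CPU advertises AES instructions by looking for the 'aes' flag in /proc/cpuinfo.  A minimal sketch; has_aes_flag is a name I made up, and it only tells you what the kernel reports, not whether your ssh build actually uses the instructions.

```shell
#!/bin/bash
# has_aes_flag [CPUINFO]: succeed if the CPU feature list contains "aes".
# CPUINFO defaults to /proc/cpuinfo (Linux only).
has_aes_flag() {
    local cpuinfo=${1:-/proc/cpuinfo}
    # x86 kernels list features on "flags" lines, ARM kernels on "Features" lines.
    grep -E '^(flags|Features)' "$cpuinfo" 2>/dev/null | grep -qw aes
}

if has_aes_flag; then
    echo "CPU reports AES instructions"
else
    echo "no AES flag found"
fi
```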

This acceleration is not available on the Raspberry Pi CPU, which is actually the bottleneck in this transfer.  Since the Intel is not hitting anywhere near a hundred percent, the fact that chacha is using 7.77% vs aes128-gcm at 1.55% makes no difference to our session performance.

Conclusion

The clear winner here when considering single session performance between a Pi and a non-heavily utilized modern x86 is chacha20-poly1305@openssh.com.  If, however, the PC is doing a lot of other work, supporting many sessions and generally has its CPU peaking frequently, something like aes128-ctr, which gives us a good balance, may be a better choice.

PC and SCP - results (beta)

So now that we have looked at the goodness and limitations of the Raspberry Pi, I figured we should take a step further and check out what modern PCs can do.  The setup in this case is

MacBook Pro <-- 5 GHz WiFi --> Router <-- Gig Switch --> Linux PC

You may question the wisdom of using WiFi.  I would.

Well, it turns out that I only have a 100 Mb/s USB2 adapter to use as a wired connection on the Mac.  And as you will shortly see, the WiFi actually gives us pretty high throughput.  Not quite a gig, but halfway there.  Good enough for significant results.  I will repeat the test when I go back to work after the holidays and publish a follow-up.  Maybe even try to push a 10 Gig pipe.

So I did a quick wireless survey, found a clean 5 GHz channel, and put my MacBook about a foot away from the router.  Here is what we saw.

PC -> Mac

The thing I find most surprising about this chart is that two of the algorithms I thought would perform best simply did not work.  There was something in the way that Debian tried to initialize the connection with the AES-XXX-GCM algorithms that OSX found distasteful.  OSX also did not support the chacha algorithm.

The results are very interesting, but let's look a little deeper to find out more.

Let's examine the CPU utilization by cipher on the client machine.  This will tell us how scalable the algorithm is in supporting multiple connections.  This chart shows relative CPU percentage utilization during the time of the copy, scaled by the duration.  The empty space between the top of the column and 100% can roughly be interpreted as idle time.

Note: this should be interpreted as the utilization of a single core.
In this test setup I could only measure the CPU consumed on the client and not the server.  When I repeat this test I will make sure my setup can measure both sides.

I tried to think of better ways to analyze and present the data, but in the end I felt I couldn't tell the story without a raw representation.  It tells the story better than any massaging could.  This shows the amount of clock time (real), user time (cipher) and system time each algorithm used.
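
The relative CPU percentages can be derived from those same three numbers.  Here is a sketch of the arithmetic as I understand it ( user and system time divided by wall-clock time, with the remainder treated as idle on a single core ); cpu_split is a hypothetical helper name.

```shell
#!/bin/bash
# cpu_split REAL USER SYS: print user, system and idle time as a percentage
# of wall-clock time. Assumes the work ran on a single core.
cpu_split() {
    awk -v r="$1" -v u="$2" -v s="$3" 'BEGIN {
        if (r <= 0) { print "n/a"; exit }
        printf "user=%.1f%% sys=%.1f%% idle=%.1f%%\n", 100*u/r, 100*s/r, 100*(r-u-s)/r
    }'
}

# Illustrative numbers only: 10s real, 2s user, 3s system.
cpu_split 10 2 3    # prints user=20.0% sys=30.0% idle=50.0%
```

In bash, setting TIMEFORMAT='%R %U %S' makes 'time' print just those three values in seconds, which is convenient for feeding a helper like this.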


Mac -> PC


For completeness' sake, I tried to run the test from Mac -> PC.  There were only 3 supported ciphers - essentially the AES-XXX-CBC family.  The results followed the same pattern, with slightly better performance.  Yawn.

Here are the graphs

Throughput

Relative CPU

Time consumed by each function in seconds

Conclusion: 

When dealing with fully featured modern machines, I would probably choose AES-256-CBC.  It seems to have the best throughput to CPU utilization ratio and provides pretty decent security.