Tuesday, January 27, 2015

Every "Wish" Has A Price

Note: I saw this on Diana Deleva's blog and translated it from Bulgarian. She in turn had translated it from Russian. I do not yet know who the original author is. I will research it further, but if someone finds out in the meantime, let me know!

Every Wish Has a Price

In the far reaches of the Universe, there was a small shop.

It didn't have a sign hanging outside - the old one had been blown away by a hurricane, and the new proprietor decided not to hang another because all the locals knew what the store sold - wishes.

There was a huge selection.

Here you could buy practically anything: huge yachts, apartments, marriage, the post of vice president of a corporation, money, kids, a dream job, a beautiful body, victory in a contest, powerful cars, power, success, and much, much more.

The only things not on sale were life and death. Those were the purview of Headquarters, which was located in another galaxy.

Everyone who walked into the store (and, by the way, there were many wishers who never once entered the store but stayed at home wishing) wanted to know the price of their wish.

Prices were different.

For example, a dream job cost giving up stability and predictability, being ready to plan and structure your life all by yourself, believing in yourself, and allowing yourself to work where you would like and not where you have to.

Power cost a little more. You had to give up some of your convictions, to find a rational explanation for everything, to be able to say no to others, to know your own price (and that price needed to be high enough), to allow yourself to say 'I', putting yourself forward no matter the approval or disapproval of those around you.

Some prices looked strange - marriage, for instance, could be had almost for free. However, a happy life was really expensive: to be personally responsible for your own happiness, to be able to get joy out of life, to know your true desires, to give up on wanting to conform to or imitate those around you, to value what you have, to allow yourself to be happy, to be conscious of your own value and importance, to give up the perks of being a 'victim', to take the risk that you will part with some friends and acquaintances.

Not everyone who came into the store was ready to buy their wish immediately. Some, seeing the price, turned around and walked away. Others stood there for a long time, thinking and calculating how much they had right now and where they could get the rest.

Some started complaining about the too-high prices, and begged for a discount or inquired about when there would be a sale.

And there were those who gave all their savings and got their most intimate wish wrapped in beautiful crinkly paper.

Other shoppers looked upon the happy buyers with envy, and gossiped that the proprietor was a friend of theirs and that they got their wish just like that, with no effort at all.

People often suggested that the owner lower his prices to attract more buyers. But he always refused, because then the quality of the wishes would suffer.

When they asked if he was not worried that he would go broke, he always shook his head and answered that in every age you can find a few brave souls who are ready to take a risk and change their lives; who would give up their ordinary, predictable life; who are capable of believing in themselves; who have the strength and what it takes to pay to get their wish.

And on the door of the store there was written, at least for the past hundred years:
'If your wish is not coming true, it means it hasn't been paid for yet'.

Tuesday, January 6, 2015

Ceph Pi - Adding an OSD, and more performance

One of the most important features of a distributed storage array is that, being distributed, it should let you expand capacity quickly and get better throughput and performance as you add more nodes.

We are going to add another Raspberry Pi + USB attached HDD (pi1) as an OSD.

This will make our cluster look like

  1. Monitor, Admin, Ceph-client - ( x64, gig port, hostname: charlie )
  2. OSD ( Raspberry, 100 Mbps eth, USB attached 1TB HDD, pi2 )
  3. OSD ( Raspberry, 100 Mbps eth, USB attached 1TB HDD, pi3 )
  4. New OSD ( Raspberry, 100 Mbps eth, USB attached 1TB HDD, pi1 )

Attaching another OSD is pretty easy! (Note - I am using pi1, which was already initialized in previous guides, so there are some steps missing here.)


From the Monitor/Admin node (now running on my x64 box) we run
 ceph-deploy install pi1  
 ceph-deploy osd prepare pi1:/mnt/sda1  
 ceph-deploy osd activate pi1:/mnt/sda1  

Here is the performance of the cluster while the OSD was being added. It's a pretty complex graph, but it does cover all the KPIs we are tracking.

Saving data to the ceph-cluster

We are getting measurably better performance from the 3 OSD ceph-cluster.  The throughput is at about 44Mbps.

This compares very favorably to the 2 OSD test, which ran at about 30Mbps. Here is the data as a reminder
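To put the two write numbers side by side, here is a minimal Python sketch that works out how close the jump from 2 to 3 OSDs came to linear scaling. The function name is made up for illustration, and the inputs are just the rounded figures quoted above (about 30 and about 44 Mbps), so treat the result as a ballpark.

```python
def scaling_efficiency(old_nodes, old_mbps, new_nodes, new_mbps):
    """Return (speedup, efficiency) for a cluster that grew in node count.

    speedup    = new throughput / old throughput
    efficiency = speedup / (new_nodes / old_nodes); 1.0 means perfectly linear.
    """
    speedup = new_mbps / old_mbps
    ideal = new_nodes / old_nodes
    return speedup, speedup / ideal


# Approximate write throughput from the 2 OSD and 3 OSD tests.
speedup, efficiency = scaling_efficiency(2, 30, 3, 44)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

With these rounded inputs the write side scales almost linearly with the number of OSDs, which fits the OSD CPUs being the limit when writing.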

The CPUs on the Raspberry Pis were pretty much pegged, so I am not sure we can squeeze a whole lot more out of them.

Here is the ethernet port utilization on all the OSD Raspberrys.

Loading data from the ceph cluster

Performance when loading data from the ceph cluster was, surprisingly, lower than when writing it! It was stuck at 40 Mbps. As a matter of fact, we are running at exactly the same speed as we did when we had only 2 OSDs involved (see the previous article).

Here are the graphs of the CPU on the Raspberry OSDs.

The utilization of the CPUs is pretty low and we have quite a bit of head room.

Finally, here is the network throughput of each individual OSD.

And just because it looks cool and perfectly matches the throughput of the ceph-client host, here is the stacked line graph of the same data.


Adding an extra OSD gave us a measurable boost on the writing end of the equation.  It did nothing for reading performance.  We did observe lower loads both on the OSD CPUs and network utilization, yet the numbers on the client were unmoved.  

I am at a loss as to how to explain this. There is a core on the client that is spending an inordinate amount of time in Wait...  but why? Maybe it is network latency of some sort? Though I am on a gig switch that has all devices plugged in directly...  I do not know.
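One way to put a number on that Wait is to sample the per-core counters in /proc/stat on the client while a copy is running. The sketch below is an assumption about how one might do it, not what produced the graphs here; the function names are hypothetical, it only works on Linux, and it relies on the standard /proc/stat layout where iowait is the fifth value on each cpuN line.

```python
import time


def parse_cpu_times(stat_text):
    """Map each per-core 'cpuN' line of /proc/stat to (iowait_ticks, total_ticks)."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like: cpu0 user nice system idle iowait irq softirq steal ...
        if len(fields) >= 6 and fields[0].startswith("cpu") and fields[0] != "cpu":
            ticks = [int(value) for value in fields[1:9]]  # user .. steal
            times[fields[0]] = (ticks[4], sum(ticks))
    return times


def iowait_percent(before, after):
    """Percentage of elapsed ticks each core spent in iowait between two samples."""
    result = {}
    for core, (wait_after, total_after) in after.items():
        wait_before, total_before = before.get(core, (0, 0))
        elapsed = total_after - total_before
        if elapsed > 0:
            result[core] = 100.0 * (wait_after - wait_before) / elapsed
    return result


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return iowait % per core."""
    with open(path) as stat_file:
        before = parse_cpu_times(stat_file.read())
    time.sleep(interval)
    with open(path) as stat_file:
        after = parse_cpu_times(stat_file.read())
    return iowait_percent(before, after)
```

Calling `sample_iowait()` on the ceph-client during a read should show one core with a much higher percentage than the others if the single-core Wait pattern holds.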

The next step (and probably the last in this series) will be measuring concurrent access.  See where that takes us.  

Ideas and requests are welcome.

Monday, January 5, 2015

Ceph Pi - ... and now for some production numbers!

Note: This is a follow up to this article.  This article will make a lot more sense if you read it after reviewing the previous one.

Copying to the Ceph Cluster

So here we are.  Ceph-client and Mon are installed on a micro x64 system.  2 OSDs are installed on 2 Raspberry Pis.  And here are the numbers.

We are running consistently in the 30 megabit range when copying to the Raspberry Pis.

The limiting factor is the CPU on the Raspberrys.  Here they are.

They are below 100%, though, which is slightly puzzling.

For completeness, here are the network utilization numbers from the OSD RPis

...as well as the HDD utilization

Copying From the Ceph Cluster

Copying from the cluster is moving at a decent 40 megabits/second.

The CPUs of the Ceph Nodes are not pegged.  I am not sure why we are not getting better performance.  There really does not seem to be a bottleneck anywhere in sight.

The disks are doing well too


While we are getting decent numbers, we are not pegging anything.  I am not sure why we are not getting better numbers.  Any clues would be REALLY appreciated.

The only clue I have is that there is pretty massive Wait time going on in the ceph-client node.  We are getting pretty much a whole core pegged.

Next Step

We are going to add one more Raspberry Pi OSD and observe the impact.  Hopefully we are going to get an increase in throughput.  

Again - comments and suggestions are very welcome!

Sunday, January 4, 2015

Ceph Pi - oDroid Fail

I had a frustrating weekend. I was really excited about making my oDroid U3 the controller for the Ceph cluster. I was pumped. The CPU runs like magic (compile time of the kernel on board was comparable to an x64), and I am still excited about its 2GB of RAM (but not the 100 Mbps ethernet).

And yet...  I failed.  A lot.

It really seems to have to do with the kernel. The oDroid comes with a 3.0 kernel. I had to compile the RBD module. And it was hell. I failed to get the on-board compile to work. When it failed, there was no way to find out why: I would have it plugged into a screen and get no feedback whatsoever. I finally got it working by doing a cross compile. Technically there is a 3.8 branch as well, but I did not manage to get it working (well, I really did not get to it in the end, so just keep reading).

So then I tried to set it up as a mon / ceph-client. And I failed. And failed. And failed. In the end, I found this article, which describes the features included in various versions of the kernel. It turns out my 3.0 was just too old. And even if I got 3.8 working, it would likely be trouble as well, since it is missing a ton of features. I tried every workaround I could, but most posts really talked about 3.9 as being the earliest viable version. And the oDroid does not seem to support it... at least with Ubuntu.

So I gave up and set up my Intel NUC mini PC, which is running x64 with Debian Jessie.

I have not given up on the oDroid yet, but it will be a couple of weeks before I forget my current level of frustration and give it another whirl.

This may be a blessing in disguise because I now have a gig port.  

Friday, January 2, 2015

Ceph Pi - Initial Performance Measurements

Tests to be run

We are going to run 4 tests.

  • Local copy HDD to SD (control of maximum throughput)
  • Local copy SD to Ceph mount
  • SCP
  • SAMBA (network share)
We are going to look at 
  • Network Throughput
  • Throughput by protocol
  • CPU usage overall
  • CPU per key processes


This is a review of what the current setup looks like.  We are going to be using a 3.5 GB gzip file of a SD card backup for our testing purposes.

It would be also interesting to try it with a directory of small files and see what that will do.

Local Copy HDD to SD

The first test we are doing is to see what the maximum throughput of the Raspberry Pi is when copying from one on-board device to another. We will also try to see if we can pinpoint the limiting factor - disk throughput, IOPS or CPU.

The first hump in the graph is copy from the SD card (mmcblk0p6) to USB attached HDD (sda1).  The second is the reverse - HDD to SD card.

As we can see, CPU is certainly the limiting factor, though SD card throughput is a close second.

Here is another graph showing the actual throughput.

The good news is that the SD card is plenty fast enough to read at 100 Mbps.

Local Copy SD to Ceph

We have the file loaded on the SD of Pi1, which is also our ceph-client. Let's go ahead and copy it from the SD over to the Ceph cluster mounted partition.

We are getting decent throughput, certainly good enough to stream video, at about 30 megabits / sec. Let's go ahead and check out how the OSDs did. Unfortunately I seem to have a bit of a problem with pi3, so I only have data from pi2. In this setup I expect the two OSDs to be fairly close, though.

The OSD has some more breathing room. It is interesting that there is a lot of traffic in the reverse direction coming out of the OSD! That traffic is headed to the other OSD, though. Here is a composite graph... a bit complex for my taste...

Local copy from Ceph to SD

Now let's take a look in the opposite direction - downloading the file from the ceph cluster to the local storage. Here is the utilization on the Monitor / Client node.

Here is the utilization on the OSD

Notice that there is almost no chatter.  CPU is low as well.  So this makes it pretty clear that the client node is CPU limiting us.  

Network SCP to Ceph

Given that we are already being CPU limited, adding the extra overhead of encryption is going to make things poor indeed.  I am not even going to bother testing this, unless someone in the comments specifically requests it.

Network SAMBA ( network share to ceph ).

Uploading data to SAMBA we see

And then downloading it we get
Interestingly, uploading is a little faster than downloading.

At 25 Mbps we are doing well, especially considering the 35 Mbps best case we got from the local copy, and the fact that we are limited to 100 Mbps. It is workable for most home applications (streaming movies), but we will have to optimize a bit before we can consider this anything like production ready (uploading or downloading will take ages), even for a home storage solution.
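To make "ages" a bit more concrete, here is a small Python sketch that estimates how long the 3.5 GB test file takes to move at each of the rates mentioned above. It assumes the rates are megabits per second and that GB means decimal gigabytes; the function name and labels are only for illustration, and the results are estimates rather than measurements.

```python
def transfer_seconds(size_gb, rate_mbps):
    """Seconds needed to move size_gb gigabytes at rate_mbps megabits per second.

    Uses decimal units: 1 GB = 1000 MB, 1 MB = 8 megabits.
    """
    megabits = size_gb * 1000 * 8
    return megabits / rate_mbps


# 3.5 GB is the size of the test file used in this post.
for label, rate in [("Samba to Ceph", 25), ("local copy", 35), ("100 Mbps link limit", 100)]:
    minutes = transfer_seconds(3.5, rate) / 60
    print(f"{label:>20}: {minutes:5.1f} min")
```

At 25 Mbps the test file alone needs a bit under 19 minutes, so a full disk backup would be an overnight job.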


While this is a somewhat workable setup with enough bandwidth for most everyday uses, it is pretty obvious that we can get a huge improvement by moving the client to a more powerful box.  OSDs seem to be keeping up well.  I am looking forward to expanding the test cases.

Right now we are performing at pretty much the same level as if we used a low-end SDHC card in our PC as storage.