Sunday, December 7, 2014

Ceph Pi - The Plan

Time for a bit of background.  What is Ceph really, and what does it do?  In the project's own words:
"Ceph is a distributed object store and file system [and can also be exposed as a block device] designed to provide excellent performance, reliability and scalability. "
The idea is that multiple servers (distributed), each with a number of hard drives, participate in a cluster.  The setup is highly resilient and performant.  The cluster can then provide network storage exposed as an Object Store, a File System or even a Block Device.  If you want to learn more, you could do worse than to start from the top.

I won't lie - I really like the base architecture.  Every node runs an instance of a hash algorithm called CRUSH which deterministically positions data on one of the participating nodes.  Thus any node running the algorithm can quickly determine which one of its peers has the data.  Sounds a bit like SevOne, doesn't it?
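To make the "any node can work out where the data lives" idea concrete, here is a minimal sketch.  It is not the real CRUSH algorithm (CRUSH also takes weights, the cluster hierarchy and replica counts into account); it uses plain rendezvous hashing as a stand-in, and the node names and the `locate` helper are made up for illustration.

```python
import hashlib


def score(node: str, key: str) -> int:
    """Deterministic pseudo-random score for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def locate(key: str, nodes: list[str]) -> str:
    """Pick the node with the highest score for this key.

    Every participant that knows the same node list computes the same
    answer, so nobody has to ask a central directory where the data is.
    """
    return max(nodes, key=lambda node: score(node, key))


# Hypothetical node names, not taken from a real cluster.
nodes = ["pi1", "pi2", "pi3"]

if __name__ == "__main__":
    for name in ["cat.jpg", "taxes-2014.pdf", "holiday.mov"]:
        print(name, "->", locate(name, nodes))
```

The only property that matters here is that the placement is a pure function of the object name and the list of nodes, which is the part of the design that appeals to me.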

Object Stores are the new hotness in the industry.  If an application needs to store images or documents and largely treats them as blobs (retrieving all or nothing), why bother going through the overhead of a DB and a File System (FS hereon) when we can provide a key/value lookup API that the application talks to natively?  It simply skips 2-3 abstraction layers and hopefully reduces the overhead significantly.  Ceph should be great at this, because it is built on top of RADOS, which is an object store.
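The shape of such an API is about as small as it gets.  The sketch below is a toy in-memory stand-in, not the RADOS interface; the class and method names are invented purely to show the "whole blob in, whole blob out, addressed by key" model.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: whole blobs addressed by key."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, blob: bytes) -> None:
        # Objects are written whole; there is no partial update in this toy.
        self._objects[key] = bytes(blob)

    def get(self, key: str) -> bytes:
        # Objects are read whole; a missing key raises KeyError.
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]


if __name__ == "__main__":
    store = ToyObjectStore()
    store.put("images/cat.jpg", b"\xff\xd8 pretend this is a JPEG")
    print(len(store.get("images/cat.jpg")), "bytes stored")
```

No tables, no directories, no inodes: the application hands over a key and a blob, and later asks for the blob back by key.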

The Ceph file system (CephFS) seems OK, but I didn't find anything particularly appealing about it.  Using another FS on top of the cluster seems to work just fine, so I am opting for XFS for now, but I may choose to go with Btrfs, based on the recommendations here (and the fact this cluster is a testbed only).  CephFS itself seems to be a thin veneer on top of the object store.  Part of the reason I chose not to play with it for now is that it requires the cluster to have a Metadata Server (MDS).  The MDS needs compute resources, and my little cluster is already below the minimum system requirements.

Since I don't have a particular application in mind and I am looking for some needlessly complicated general purpose storage for my Boston place, I opted to go with the block storage device, which I would eventually run XFS on and expose to the rest of the network via Samba... or something...   streaming...  who needs applications when you have cool technology to build!

I would have loved to use Odroid U3s for this build, since they have substantially more oomph than Raspberry Pis, but the order time was long, and with my holiday a day away, and Micro Center selling the Raspberry Pi B+ for $30...  who could resist?  Even though they have only 512MB of RAM....

So I bought:

  • 3 Raspberry Pi B+
  • 3 Plastic cases for the Pis (overpriced)
  • 3 Travelstar 1TB 7200RPM 2.5'' SATA drives
  • 3 SATA to USB connectors
  • 3 16GB SDHC cards
  • 3 Powered USB hubs (1 initially, but the poor Pis had no hope of powering the Travelstars, so I had to cowboy up with 2 more)
  • 4 CAT5e patch cables
  • 1 Netgear 8-port gigabit switch
About $600 later I had everything I needed.  And the Pis really comprised only $90 of that... So if I end up having to replace them with something ooomphier...  eh...  I am sure I can work something out.  But I digress!  This is the CephPi project, and you can't have CephPi without Pi.  Though I would have preferred apple... or cherry...

  • OSD: Object Storage Device
    • A physical or logical storage unit (e.g., LUN). Sometimes, Ceph users use the term “OSD” to refer to Ceph OSD Daemon, though the proper term is “Ceph OSD”. Think of this as the drives.
  • MON: Ceph Monitor 
    • The Ceph monitor software.  Ceph Monitors maintain a “master copy” of the cluster map, which means a Ceph Client can determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and retrieving a current cluster map.  See here.
The initial plan looked like this picture, which comes from the install guide.

My plan is a little different, because I did not have a node for admin.  The entire compute I owned (that I do not take in my bag in the morning) was the 3 Pis.  And I wanted all 3 of them to provide storage.

So I had Node 1 running both a MON and an OSD instance.  I also ran all ceph-deploy commands from Node 1.  It was fine.
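Written down as data, the layout is roughly the following.  This is only a sketch for illustration: the host names `node1` to `node3` and the helper functions are my own shorthand, not anything Ceph or ceph-deploy produces.

```python
# Which roles each Pi plays in my plan (hypothetical host names).
layout = {
    "node1": {"mon", "osd", "admin"},  # monitor, storage, and where ceph-deploy runs
    "node2": {"osd"},
    "node3": {"osd"},
}


def nodes_with(role: str) -> list[str]:
    """Return the hosts that carry the given role, in a stable order."""
    return sorted(host for host, roles in layout.items() if role in roles)


def toy_cluster_map() -> dict[str, list[str]]:
    """A toy 'cluster map': what a client would learn from the monitor."""
    return {"mon": nodes_with("mon"), "osd": nodes_with("osd")}


if __name__ == "__main__":
    print(toy_cluster_map())
```

One monitor means no monitor redundancy, which is fine for a testbed but not something to copy for real data.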

So next - building the Pi (if you've messed with Raspberry Pis before, this may be boring, but it will be short).