[big shoutout to to Dixon for giving Polymur a glance ahead of this post and adding it to the Graphite tools page]
About two years ago, I wrote about my initial adventures with shoving metrics into Graphite.
Since, our integration into the Graphite ecosystem in particular has slowly petrified. Our metrics system was unfortunately grown somewhat organically; with many dozens of applications using Statsd/Collectd, several homegrown data collection tools, and various alerting systems built around the Graphite API, we’re fairly hooked into it for now. We also have dozens of heavy internal users and a somewhat extensive level of dashboards and workflows built on top of it. It sort of just happened amidst our race to solve lots of other problems. Our metrics systems would coast along in the background, and as long as it wasn’t broken, we’d add things we needed here and there (this sounds 100% ideal, I know).
It’s not entirely to say “welp, we’re still using Graphite”, but that our Graphite infrastructure has also served us well. And while there’s certainly lots of features we could benefit from in other systems, Graphite doesn’t fail to deliver exactly what it’s supposed to: simple, raw metrics in and out. With our relatively tame volume of data points/sec., it’s still holding together. we 10x’d our volume, we’d definitely have to do be doing something much bigger.
fixing some of the problems we experienced with Graphite ended up being a path of less resistance than a ground-up migration to another system. What were the primary problems?
- Scaling Graphite is conceptually straightforward but operationally difficult
- Changes in production almost always resulted in large metrics loss (reconfigure/restart a relay = gaps in your dashboards)
- Shuffling around or adding storage backends is difficult to do in a transparent fashion
- Plaintext streams over TCP isn’t great for global collection unless you like a spiderweb of VPNs (and still think it’s 2005)
Originally this felt like a large enough list to consider moving on, until I realized it could essentially be addressed with a small collection of software:
So I created it, and named it Polymur
Polymur was originally whipped together to solve Graphite Carbon-relay scaling issues where our upstream relay nodes each had multiple relay daemons behind HAproxy, and the relay nodes themselves were behind a top-level load balancer (ELB). It was load balancers all the way down.
Polymur is a highly concurrent service that accepts plaintext ‘Graphite protocol’ (LF-delimited
'name value timestamp') message streams from thousands of connections and forwards (either broadcasting or hash-routing) everything to downstream Graphite-compatible services. For instance, this means it can accept all of your Collectd and Statsd output and write it straight into Carbon-cache for storage. Additionally, it allows runtime changes to where metrics were being sent to solve the interruption issues of reconfiguring relays.
Dedicated relay node, Carbon-relays -> Polymur:
It massively simplified the upstream relay design, improved performance (15K data points/sec. per core went from ~85% CPU utilization down to ~5%) and allowed us to make live changes without dropping metrics. If we needed to scale up a Graphite storage server (running Whisper, Carbon-cache, etc.), we could simply add in a new node and send a mirror of the metrics to it through Polymur’s simple API:
% echo putdest 10.0.5.25:2003 | nc localhost 2030 Registered destination: 10.0.5.25:2003
And the entire stream of metrics would storm down on the new node. Adding it to our top-level Graphite webapp
CARBONLINK_HOSTS lists allowed the results of the old and new storage server to combined for API or dashboard requests. When the new machine reached the retention we required, we could drop the original machine. Likewise, it created an easy pathway to try out new things. One day I decided to play with InfluxDB, put up a node and mirrored our full metrics stream to it with no impact to our production metrics systems.
Since Carbon-relay is used in both dedicated relay nodes and locally on Graphite servers (to distribute metrics across multiple Carbon-cache daemons for the same scaling limitations of Carbon-relay), I decided it would be nice to plant Polymur there as well. After porting Carbon’s consistent-hashing implementation to Go, this was a possibility.
Graphite server, Carbon-relays -> Polymur:
I also learned that it was important to mirror the actual mechanics of the Graphite consistent-hashing algo after initially creating my own implementation. The Graphite web app actually uses the same mechanism for cached metric lookups, meaning that in-flight metrics that don’t yet have Whisper DBs on disk can’t even be discovered if they’re not in memory on the exact Cache daemon that the web app is expecting (according to the CH lookup).
However once this was functioning, I was able to simplify the Graphite server design while dropping multiple relays and another HAproxy instance. It also removed yet another scaling layer since my local relays were becoming CPU bound and causing congestion.
Graphite server CPU, two Carbon-relays behind HAproxy -> Polymur
Ultimately, the operational flexibility and performance made it easy to come up with alternative design patterns on the fly. Given my choice of instances (since this setup runs on AWS), one design that worked out quite well was creating a hybrid fast/short + long/slow pair of Graphite servers.
Top level relays would terminate thousands of connections and ship a mirror of the metrics stream to both an i2 instance and a d2 instance. The i2 instance (tier 1) was configured for high resolution / short retention and really aggressive Carbon-cache write rates. The d2 (tier 2) machine had a much longer retention and more relaxed write rates (but would also keep much more in-flight). A top level Graphite web app was used to query / merge the data.
If we decided to fire up an environment that would introduce thousands of new metrics, the i2’s SSD stripe would handle the rush of new Whisper DBs and would allow for immediate visibility in dashboards. The majority of dashboard reads (which fits into our short retention) would also be served from low latency storage, whereas the occasional long retention fetch would just draw from the tier 2 storage.
And in complete isolation, an InfluxDB machine was receiving a stream copy for asynchronous evaluation.
Creating Polymur-proxy & Polymur-gateway
One of the last requirements that I imagined could be accomplished with a single class of tooling was the issue of creating a globally central metrics collection system that didn’t require a mess of network plumbing. We have dozens of sites across many VPCs, multiple AWS accounts, in addition to physical data center PoPs and labs.
The setup is pretty straightforward. Essentially you stick a Polymur-gateway instance at the edge of your centrally located metrics infrastructure and drop a Polymur-proxy instance at each of your remote sites. Polymur-proxy accepts the same inputs as Polymur (Collectd, Statsd, etc.), but batches, compresses and ships via HTTPS up to the Polymur-gateway. The gateway receives and unpacks data points then writes them out using the same internals as a standard Polymur instance (to other Polymur/Carbon daemons or otherwise compatible services).
Additionally, a somewhat simple API key layer backed with Consul is used to authorize connecting proxies (and even includes a key management helper tool). While I was at it, the key functionally was also used to knock out another problem: joining together many previously isolated metrics that weren’t name spaced for a global system. A key in Polymur-gateway has a key name and the key itself. By specifying the
-key-prefix parameter, you can have all received metrics automatically name space bucketed under the key name used by the origin Polymur-proxy.
For instance, if you registered a key named
product-A, any metrics shipped through a proxy configured with product-A’s key would be nested under
product-A in Graphite.
product-A.web01.some.metric. Name space (and even access) isolation across boundaries like teams or environments was handled with this simple trick.
And lastly, Polymur-gateway was designed with the need for varying topologies in mind. Having the only state synchronized in Consul means that propagating keys across a fleet of gateways is automatic and scaling out behind a load balancer is seamless. Optionally providing both HTTP and HTTPS listeners with automatic XFF handling means that you can connect proxies directly to gateways, terminate SSL at a L7 load balancer, or use a L4 load balancer and terminate SSL on multiple backend Polymur-gateway instances.
If you use Graphite in any capacity, this tool was made for you!