Building Polymur: a tool for global Graphite-compatible metrics ingestion

[big shoutout to Dixon for giving Polymur a glance ahead of this post and adding it to the Graphite tools page] About two years ago, I wrote about my initial endeavors with shoving metrics into Graphite. Since then, our integration into the Graphite ecosystem in particular has slowly petrified. Our…

Field notes - ElasticSearch at petabyte scale on AWS

I manage a somewhat sizable fleet of ElasticSearch clusters. How large? Well, "large" is relative these days. Strictly in ElasticSearch data nodes, it's currently operating on the order of: several petabytes of provisioned data-node storage; thousands of Xeon E5 v3 cores; 10s of terabytes of memory; indexing 10s of billions…

Async counter updates in a global rate limiter

For a while, I've known that my tool Sangrenel (used for load testing Kafka) had some inefficiencies, particularly in the global counter. Sangrenel effectively fires up many workers that generate random messages. A global counter is used to periodically dump the rate of message generation in addition to controlling a global…

Load testing Apache Kafka on AWS

[update: This was written before EC2 d2 instances were released, which I'm currently a fan of. I would generally recommend them over r3s.] Notice my careful usage of the phrase "load testing" vs "benchmarking". Why is that? I think we've all learned by now that benchmark tests are often limited…

The architecture of clustering Graphite

[Note: It's not quite 'clustering' by my definition, but this post is linked to enough that it's too late to change the title. I based this on the Graphite config naming conventions for consistency.] [Note #2 - April, 2015: the purpose of this post was originally to describe the logical architecture…