PUG Life: How we run a Grafana instance for each user with Docker
At Hosted Graphite, we receive a lot of metric data from our customers. They rely on us to ingest the data, store it, and make it available to query -- often viewing their data through Grafana.
Over the years, we've run Grafana in a number of different ways. What started as duct-tape reactions to early Grafana updates has become a mature infrastructure that supports hundreds of customers sending us millions of metrics per second.
Now, we run an independent Grafana server in Docker containers for each one of our users. This post describes the long road to where we are today, and the many bumps along the way to getting it right.
The story of Grafana at Hosted Graphite is also a parallel story of my own progress from Junior Developer to Technical Lead of the Development team, and of the lessons learned along the way.
Oh, and this “Per User Grafana” project was (obviously) named “PUG”, so you can expect some cute dog photos as well.
I'll begin with a description of how we introduced Grafana as a backend, by showing outlines and simple descriptions of how Grafana can be run on a small scale with Docker, and continue by describing how we evolved into the full scale operation we use today. Along the way, I’ll describe the tooling we use to orchestrate everything and the monitoring we’ve found most useful to employ.
The Grafana Journey at Hosted Graphite
Grafana, before v2, was very simple. It had no back-end server or integrated database, so we ran it for our users and saved the JSON for each dashboard into our regular user database. The next version of Grafana would make some pretty big changes, which cemented the project's plans for the future and led to it being as popular as it is today.
In 2015, we began work to upgrade to "Grafana 2". This marked a big change for Grafana, introducing a Golang-based server and an integrated database. To continue hosting Grafana for our users, we needed to learn and adapt to this new set-up.
As Hosted Graphite was a smaller company at the time, the task of upgrading to Grafana 2 became my responsibility. I was in my 3rd or 4th month of being a Junior Developer at Hosted Graphite, and decisions I made on that project had a longer lasting impact than I anticipated.
I wrote almost every line of code that was part of this upgrade, beginning with the first PR, and my PRs for the next 2 months consisted almost entirely of Grafana-based bug-fixes. This was a large project with pretty high impact, and I learned a great deal in the 2 or 3 months it took to complete that first upgrade.
There were several goals with this first integration, but the overall theme was consistency for our users. Things that worked before Grafana 2 would need to work after. For example, a pattern of "https://www.hostedgraphite.com/unique-id/grafana/" was going to need to exist within Grafana 2, and users' dashboards would have to continue to exist at known URLs.
It’s worth noting that at the time, Grafana didn’t have nearly the same number of features and options for configuration that it has today, so I worked with what we had and ploughed ahead with what turned out to be some rather duct-tape solutions.
Initially, this was ok. The new Grafana did work, things were kept consistent for our users, and the upgrade went pretty smoothly (only a very small percentage of dashboards went temporarily missing, and all was fine after a few quick fixes).
We made a lot of small changes to the Grafana app: including changing which URLs were requested, changing the format of the URL, swapping icons, hiding features we couldn’t yet support, and even removing large sections of code to get rid of some random behaviours we thought could cause trouble.
For a long time, we changed which HTTP method was used for all Graphite renders (Grafana defaults to using a POST, we chose to make it a GET). These small changes (duct tape!) would, as time went on, set the precedent for more and more duct tape.
By the time we deployed “Grafana 2”, I was happy with things. Our Grafana set-up was a bit unique and a little rough in places, but could essentially be summarised as looking like this:
- All of our users were using a single shared instance of Grafana, and talking to a single Grafana database.
- HG Users and teams were kept separate from each other using Grafana's concept of "Organisations". The logic was maintained by passing all requests through our webapp.
- We handled all Grafana’s session tokens, cookies, and authorizations within our webapp.
- We proxied API endpoints for Grafana through our own API endpoints in our webapp.
A month or two after the first set of Grafana 2 deploys, I graduated from Jr. Developer to full time developer. The bosses were happy with things so far.
However, as Grafana kept churning out new releases, we started to fall behind. The many little hacks we had made to make the project work became very difficult to maintain. We were a small team and had plenty of other projects to work on. With more and more Grafana upgrades coming out, it felt like we were constantly chasing our tails.
Our Development team changed over time, and now that I was no longer a Jr. Developer, I could delegate Grafana upgrades to newer members of the team. Completing a Grafana upgrade was almost a rite-of-passage when someone joined the development team. It was time consuming, boring and quite confusing at times. After completing one, however, you came away with a very clear picture of the intricacies of how our various webapps hooked together. While everyone was delighted when it was done, nobody wanted to do another one. This was not ideal when you consider that this was a key part of our infrastructure.
It became clear that this Grafana situation was not ideal and, besides the dread of having to work on another Grafana version upgrade, could be broken up into a number of key issues.
- Grafana wanted to run at a “root URL”:
Probably the biggest issue (or at least the one with the biggest impact on development work for us) was that we couldn't make use of a unique per-user "root URL" for Grafana (in the examples below that will be "http://0.0.0.0:3000/"). This led to us making a boatload of otherwise useless changes to Grafana so that our single instance could serve requests for every user: https://www.hostedgraphite.com/myid123/grafana/ had to work just as well as https://www.hostedgraphite.com/myid789/grafana/.
- Using “Organisations” to separate users wasn’t very comfortable:
As time went on, we became increasingly concerned about how users were split by “organisation.” It led to a lot of code changes in our webapp dealing with users’ permission levels, and to some unintuitive mappings between our users in the webapp and their counterparts in Grafana. In this world, a small bug would have been enough to suddenly have customer A see all of customer B’s dashboards and graphs. Thankfully, this never happened. Our unit, integration, and manual tests proved savvy enough to catch these bugs, although the manual tests were very time consuming.
One mildly interesting quirk was that, at some point in time (it’s different now), Grafana saved the default light vs dark theme as a per-instance setting. This was not ideal when we had hundreds of customers all with different preferences all living blissfully unaware of each other in the same Grafana instance. As a result, we introduced yet another hack with some JS which would load the different CSS files based on URL parameters.
- Grafana introduced installable plugins (which are great), but with all the duct-tape solutions we’d implemented, it was almost impossible to swiftly add them in. It took us a few weeks to add one which should have been a 2 minute “install and restart”.
- Feature clashes:
Grafana supplied some things we already had in our stack and some newer features we didn’t yet have the capacity to support - Alerting and Datasources being two good examples (these are now part of our mature product).
- Our package build and test configuration began to differ from Grafana’s over time. We also used CircleCI, but our initial configs were quite different, and when we tried to upgrade, the test environment would have incompatible package dependencies, leading to tests that were working in Grafana’s project suddenly failing in ours. There was even a scary few days in the early stages of the “Grafana 2” deployment when the only place we could successfully build the package was on my own laptop.
Just to be clear, these issues weren’t Grafana’s fault -- they all sprang out of different design decisions we made over the course of 2.5 years as we gradually moved toward the mature solution we have today.
Enough is enough
After a while, we got fed up with this and knew we wanted a different set-up. Here were our goals:
- Isolate our users into separate Grafana instances. This gives us:
- Better overall security.
- Individual fail case tolerance (one instance going down impacts 1 user not all users).
- More customizability: each user’s Grafana can be loaded with its own config file.
- Make it possible to upgrade our Grafana versions quickly:
- We wanted to get rid of as many of the duct-tape solutions as possible.
- Provide a way to test that features work - no more searching files and replacing seemingly random lines of code.
- Support newer Grafana features.
- Easily install plugins to existing Grafana instances, aiming for a scale of days rather than weeks.
Docker seemed like the obvious solution for our first goal. Docker provided a simple way of running one Grafana per user, thus isolating each user from all of the others.
For a super quick proof of concept, I searched Grafana documentation, found a way to run a simple Grafana server (described below), then reworked sections of our webapp to make it forward any Grafana requests for a particular user to the container. Within a morning's work, I had a clear idea that this was a viable solution, though the path to release would be a long one.
Running Grafana in Docker
This should work for you, too, and serve as a nice break from our early Grafana woes.
It’s easy to recreate the test I did as a proof of concept. While the version of Grafana may differ, the end result would essentially be the same. Of course, you probably don’t have another webapp managing users, logins, or teams that you need to work around.
You need to have installed Docker, and https://docs.docker.com/install/ should make that pretty easy no matter what operating system you have.
Grafana provides some good documentation on running a single Grafana server with what I call "stock settings". Run this command:
$ docker run -d -p 3000:3000 grafana/grafana
"-d" tells Docker to start the container in the background (detached) and just print back the container ID.
"-p 3000:3000" tells Docker to publish port 3000 inside the container on port 3000 of the host machine (where you typed in the "$ docker run" command).
Once the command runs successfully:
Head over to http://0.0.0.0:3000/ in your browser and log in with "admin" as both username & password (this is the default for Grafana and you will be prompted to change it).
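If you would rather script this step than type it, the same command can be assembled and checked from Python. This is only a minimal sketch and not part of our tooling: the helper names are made up for illustration, and it assumes Docker is on your PATH and that Grafana's "/api/health" endpoint is reachable on the published port.

```python
import json
import subprocess
import time
import urllib.error
import urllib.request


def docker_run_command(image="grafana/grafana", host_port=3000, container_port=3000):
    """Build the `docker run` argument list used above (hypothetical helper)."""
    return ["docker", "run", "-d", "-p", f"{host_port}:{container_port}", image]


def wait_for_grafana(base_url="http://localhost:3000", timeout=60.0):
    """Poll Grafana's health endpoint until it answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
                if resp.status == 200:
                    return json.load(resp)
        except (urllib.error.URLError, OSError):
            pass  # the container is probably still starting up
        time.sleep(1)
    raise TimeoutError(f"Grafana did not become healthy at {base_url}")


def start_and_wait():
    """Start the container in the background, then block until Grafana responds."""
    result = subprocess.run(
        docker_run_command(), check=True, capture_output=True, text=True
    )
    container_id = result.stdout.strip()  # `docker run -d` prints the container ID
    return container_id, wait_for_grafana()
```

Calling start_and_wait() on a machine with Docker installed should return the container ID along with whatever the health endpoint reports.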
For the PUG project’s proof of concept, I then pointed our webapp to talk to 0.0.0.0:3000 for all Grafana requests, and most things loaded just fine! However, we didn’t have any access to any metric data.
How about graphing some data
So, I had a new Grafana in Docker, and I had my webapp talking to the Grafana instance. Now I needed to configure a Grafana “data source,” which would allow me to graph the data we care about.
Adding a data source via the UI is fairly easy in Grafana. If you ran the commands above, you can follow these steps to graph the metric data you have in Hosted Graphite:
- Go to: http://0.0.0.0:3000/datasources
- Click "add data source"
- Choose the "Graphite" option
- Browse to https://www.hostedgraphite.com/app/sharing/ (log in if prompted)
- Copy the "Graphite" key from this page and paste it into the data source's URL field. The URL should look something like: https://www.hostedgraphite.com/<ID>/<TOKEN>/graphite/
- Choose "Browser" under the dropdown for "Access Type" (both work in the latest Grafana versions, but that was not always the case).
- Click "Save & Test"
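The same data source can also be created without the UI, through Grafana's HTTP API ("POST /api/datasources"). The sketch below is an illustration rather than what we ran for the proof of concept: the function names are invented, it assumes the default admin login is still in place, and it assumes that "access": "direct" is the API value matching the "Browser" option in the Grafana versions discussed here.

```python
import base64
import json
import urllib.request


def graphite_datasource_payload(graphite_url, name="Hosted Graphite", access="direct"):
    """Request body for POST /api/datasources.

    access="direct" is assumed to match "Browser" in the UI; "proxy" matches "Server".
    """
    return {"name": name, "type": "graphite", "url": graphite_url, "access": access}


def add_datasource(grafana_url, payload, user="admin", password="admin"):
    """Create the data source using HTTP basic auth and return Grafana's JSON reply."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)
```

To use it, pass your own key URL from the sharing page to graphite_datasource_payload(), then hand the result to add_datasource("http://localhost:3000", payload).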
Now that we had a working Grafana with a working Graphite data source, I concluded that we could scale, automate and configure to run Per User Grafana at Hosted Graphite.
Docker, at scale, for many users
We knew we could do this for one user. Now, we needed to determine how to do it for a multi-tenant setup. Put simply, we wanted to run a Grafana server, in Docker, for each user.
Knowing it is possible to hook the two together and graph data, the first question we asked was "How reliable will that be?". One Grafana server in one container didn't sound ideal, and we run our services to very high standards of reliability. Docker Swarm https://docs.docker.com/engine/swarm/ was an obvious first place to start looking for another solution.
A Swarm of PUGs
Using Docker Swarm, we could run a Docker "service" for each user. A service consists of several Grafana containers (the service calls these “replicas”), each running an identical Grafana server. Ideally, the containers would be spread around several different machines or nodes. Note: we opted to use an external MySQL database for user sessions/Grafana.
It seemed simple to configure each user to have a Grafana container running on a unique port, spread across a Docker Swarm, and we added some changes to our webapp to route Grafana requests to the Docker Swarm rather than the old-school "shared Grafana on the server".
A simplistic view of the path taken by a request now looked something like this:
frontend -> webapp -> Docker-swarm -> Docker-container -> Grafana-server -> MySQL DB (not in Docker)
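To make the webapp's part of that path concrete, here is a rough sketch of how a public Grafana URL could be translated into a per-user upstream on the swarm. Everything in it is an assumption for illustration: the port table, the "swarm.internal" host name, and the choice to strip the "/<id>/grafana" prefix before forwarding are not taken from our real routing code.

```python
import re

# Hypothetical mapping of user id -> published port of that user's Grafana service.
PORT_BY_USER = {"myid123": 30001, "myid789": 30002}

# Matches /<user_id>/grafana and anything below it.
GRAFANA_PATH = re.compile(r"^/(?P<user_id>[^/]+)/grafana(?P<rest>/.*)?$")


def upstream_for(path, swarm_host="swarm.internal", ports=PORT_BY_USER):
    """Translate a public Grafana path into the per-user upstream URL.

    Returns None when the path is not a Grafana path or the user is unknown,
    so the caller can fall through to its normal 404 handling.
    """
    match = GRAFANA_PATH.match(path)
    if match is None:
        return None
    port = ports.get(match.group("user_id"))
    if port is None:
        return None
    return f"http://{swarm_host}:{port}{match.group('rest') or '/'}"
```

Because every user has their own port, two users can never be routed to the same Grafana by a path mix-up alone; an unknown id simply yields no upstream.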
As you might expect, we created a test environment we felt would represent a percentage of what our production swarm might look like.
Quickly, we began to see the swarm fail. One issue was found when we were testing how quickly we could get some services up and running after some kind of "total failure". We saw a lot of RAM usage and it bogged down the system - https://github.com/moby/moby/issues/36264 . Eventually, this was deemed to either be a load issue or one that disappeared if we ran the swarm with more nodes (maybe they were the same thing). Either way, we mitigated the problem with more nodes. Perhaps we were expecting a bit more from the swarm than it could handle.
It took time and some tweaking, but we managed to scale the test environment and run Grafana for all our users (more on how we actually orchestrated that later). The PUGs were running!
The test environment had everything we needed to get our SRE team involved (to be fair, they already had helped with most steps along the way). Before taking the next steps toward production, we consulted with the SRE team for clear direction on determining how to monitor the swarm.
As we ran more tests over a couple of weeks, we developed ideas of what metrics to look at to determine swarm health. We built tools to gather these metrics and began building dashboards to monitor them. This is a typical process at Hosted Graphite, as dashboards are considered part of any new project and are developed over time with the project.
As we became more confident in the swarm, we moved it into production. We spun up a service per user with 3 replicas (so 3 Grafana-servers for each user), initially running silently alongside everything else, but not serving any requests. Once the SRE team cleared everything, we began a migration of users from the single-shared Grafana to each having their own unique instance of Grafana.
This was the first time we ran Docker in production at Hosted Graphite and we did quite a lot of testing and research. Now it’s just another tool in our belt for when we're designing new services.
From a user point of view, this changeover was only cosmetic. They got a slightly newer Grafana version, which came as part of getting our (still hacked) version of Grafana built into a Docker image. The large changes lay in the point of view of the development team. We had discovered a whole new way of working with Grafana that provided a better experience for our users. In some ways, the work was only beginning.
How we orchestrate the Docker swarm and its PUGs
We had a pre-existing service at Hosted Graphite which used Python RQ to schedule jobs based on user data pulled from our webapp. The shape of this service fit the bill for what we needed on our Docker swarm.
As a result, we extended the existing project to include new jobs for adding, removing, and upgrading Docker services. We're running a Grafana service (services contain multiple containers) per user, and we store any information about how we want that service to run in our user database. However, the actual amount of information needed to spin up the Grafana service is minimal:
- Port number to run the Docker service on
- A Grafana Docker image tag (this is how we handle new versions)
- Username and password (to talk to the DB)
- A few other trivial pieces of information that aren’t relevant here
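As a sketch of how little is needed, the items above map quite directly onto a Swarm service definition. The code below is a hedged illustration, not our production job: the field names, the "example/grafana" image repository and the "mysql.internal" host are placeholders, and it assumes Grafana's "GF_<SECTION>_<KEY>" environment overrides are used to point each instance at the external database. The create_service() function needs the docker-py package and access to a swarm manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrafanaServiceConfig:
    """Per-user settings pulled from the user database (hypothetical field names)."""
    user_id: str
    port: int
    image_tag: str
    db_user: str
    db_password: str


def service_spec(cfg, image_repo="example/grafana", db_host="mysql.internal:3306", replicas=3):
    """Plain-dict description of one user's Grafana service."""
    return {
        "name": f"grafana-{cfg.user_id}",
        "image": f"{image_repo}:{cfg.image_tag}",
        "replicas": replicas,
        "published_port": cfg.port,
        "target_port": 3000,  # Grafana's default HTTP port inside the container
        "env": [
            "GF_DATABASE_TYPE=mysql",
            f"GF_DATABASE_HOST={db_host}",
            f"GF_DATABASE_USER={cfg.db_user}",
            f"GF_DATABASE_PASSWORD={cfg.db_password}",
        ],
    }


def create_service(spec):
    """Create the service with docker-py; must run against a swarm manager."""
    import docker  # imported lazily so service_spec() works without docker-py
    from docker.types import EndpointSpec, ServiceMode

    client = docker.from_env()
    return client.services.create(
        spec["image"],
        name=spec["name"],
        env=spec["env"],
        mode=ServiceMode("replicated", replicas=spec["replicas"]),
        endpoint_spec=EndpointSpec(ports={spec["published_port"]: spec["target_port"]}),
    )
```

Keeping the spec as a plain dict makes it easy to unit test and to diff against what is already running before touching the swarm.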
We already had a secure, internal API for pulling some user info, so it was easy to add a new endpoint that returns a list of per-user Grafana configs. We pop each config into an RQ job, which uses something like https://github.com/docker/docker-py to tell the swarm what to do based on the information passed in. Where an endpoint isn't available through the client, we make direct requests to the Docker API.
We added a second job to RQ to remove a service once a user was deemed "not active" anymore.
A slight issue we had was that (at the time), RQ automatically logged out all the parameters we were passing to each job. This wasn't ideal as it included some sensitive information, so I made a little addition to the RQ project to solve that: https://github.com/rq/rq/pull/991. We can now log the job status, and not worry about logging sensitive user information.
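The general idea of keeping secrets out of job logs can be shown in a few lines. To be clear, this is not the change made in the RQ pull request, just a generic sketch of masking parameters before they are logged; the key names and helper functions are assumptions.

```python
SENSITIVE_KEYS = {"password", "db_password", "token"}  # assumed field names


def redact(params, sensitive=SENSITIVE_KEYS, mask="***"):
    """Return a copy of the job parameters that is safe to log."""
    return {
        key: mask if key.lower() in sensitive else value
        for key, value in params.items()
    }


def describe_job(name, params):
    """One log-friendly line describing a job without its secrets."""
    safe = redact(params)
    args = ", ".join(f"{key}={safe[key]!r}" for key in sorted(safe))
    return f"{name}({args})"
```

For example, describe_job("create_service", {"user_id": "myid123", "db_password": "s3cret"}) yields a line with the password replaced by the mask, while the original dict is left untouched for the job itself.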
This system has worked really well for us for about a year now. As time went on, we’ve improved our monitoring, our SRE team have runbooks for how to recover from various failures (thankfully these haven’t been needed very often), and we even added some Grafana specific deploy commands to our ChatOps Slack deploy-bot.
As a bonus, this section produced a depiction of a shepherd minding a flock of Docker containers (something we’re all very proud of):
Monitoring the swarm
For a better explanation of the following terms, have a look here: https://docs.docker.com/engine/swarm/key-concepts/.
Each node in our swarm reports:
- Ingress peer count
- Leader status
- Manager status
- Count of containers
All of this information is available through the Docker API. We’ve used https://github.com/docker/docker-py at times or made direct requests to the Docker API, usually having to do some snooping to figure out the exact endpoints.
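As a rough sketch of that reporting step, the function below turns the node list returned by the Docker API into Graphite plaintext lines ("path value timestamp"). It only covers manager and leader status; the metric prefix and function name are made up for illustration, and the sample data is a trimmed-down guess at what two nodes might look like, not output from our swarm.

```python
import time


def node_metrics(nodes, prefix="swarm.nodes", now=None):
    """Turn Docker's node list into Graphite plaintext lines."""
    timestamp = int(time.time() if now is None else now)
    lines = []
    for node in nodes:
        # Dots separate path components in Graphite, so keep hostnames flat.
        host = node["Description"]["Hostname"].replace(".", "_")
        manager_status = node.get("ManagerStatus") or {}  # absent on workers
        is_manager = int(node["Spec"]["Role"] == "manager")
        is_leader = int(bool(manager_status.get("Leader", False)))
        lines.append(f"{prefix}.{host}.manager {is_manager} {timestamp}")
        lines.append(f"{prefix}.{host}.leader {is_leader} {timestamp}")
    return lines


# Trimmed-down sample of the node objects, for illustration only.
SAMPLE_NODES = [
    {
        "Description": {"Hostname": "node-1"},
        "Spec": {"Role": "manager"},
        "ManagerStatus": {"Leader": True},
    },
    {"Description": {"Hostname": "node-2"}, "Spec": {"Role": "worker"}},
]
```

With docker-py, the input could come from [n.attrs for n in docker.from_env().nodes.list()] when run on a manager node.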
Like our own services, we send this information through our own monitoring stack and then use Grafana to visualise it, giving us a nice overview of what sort of state our swarm is in. Sticking this info beside some standard machine metrics (which we collect using Diamond) puts us in a great position to monitor the swarm.
Our Grafana service gets monitored differently. We treat it as part of our “webapp” stack, focusing on the user experience, and checking status codes and response times to gauge user impact.
What about the rest?
Once we had Grafana pushed into our Docker swarm, this opened the rest of our goals and made them much more approachable. This “phase” of work happened in the first half of 2019, and, as I had taken on the newer role of Tech Lead, this was a phase for which I wrote the least code. Instead, I defined each of the individual tasks and added them to our team’s sprints over a number of weeks, while liaising with our team manager. I was busy writing code elsewhere and was heavily involved in code reviews.
The project was a success, everything got deployed, and the teams were great at reporting back or highlighting newer tasks that emerged as the project progressed. We had our “hack free” version of Grafana deployed and running. Another great outcome of this phase was spreading knowledge around the team with regards to how our Grafana was integrated with our webapp.
Duct-tape free since 2k19
By simply deploying a clone of the official Grafana image to our swarm, taking advantage of the many configuration options provided, and then making a handful of small changes to our webapp, we got a version of Grafana “without all the hacks” up and running in production with just a few weeks of development work.
We’ve baked different plugins into the Docker image we provide for users. In most cases, it takes only a day or two to test and deploy a new one, and involves almost zero code changes.
We allow users to configure external data sources in their hosted Grafana, so you can graph data from many other sources alongside your Hosted Graphite metrics.
Upgrading to newer Grafana versions
Grafana upgrades are more or less stress free for our developers. Within a month of getting everyone to the “hack free” version of Grafana (initially 5.0.3), we upgraded to Grafana 6.0 and then straight to 6.1.
A nice coincidence occurred as I was writing this post. I got a message from one of the team, where the conversation started with a link to a pull request for a Grafana upgrade to version 6.2:
I’d call this a roaring success, though maybe we can improve that build time.
Where are we now?
Moving forward, we can focus on new features rather than the hours of work required to maintain what we have. During this process, we wrote a lot of internal documentation making the project(s) clear to any newer hires. It is an easy process to upgrade our Grafana version.
Personally, I learnt an awful lot over the years with this project, from the first deployment all the way to the current (much more frequent) ones. I ended up involved in most of these in some way. Whether that was planning and task specification, running proofs of concept, or digging deep into the code, the journey Grafana at Hosted Graphite has taken us on has taught us a lot that we can apply to future projects.
So you’ve read this far and only seen one pug? Have another: