But first, on-call: SRE onboarding (skydiving for nerds) 1/2

April 9, 2018

Onboarding a new hire is tricky to get right. I’ve worked at and with companies that had either zero onboarding or far too much. In the past, either I was pushed out of the plane without a parachute, or the parachute was already deployed and I didn’t make it to production for three months.

Hosted Graphite’s process for SRE onboarding taught me to walk, talk and git commit with no ragrets within a week, explain our entire architecture by the end of the month and build a copy of our infrastructure from scratch after two.

My name is Evan, I’m twenty-three and a recent Computer Science graduate from Cork. Before Hosted Graphite, I spent six years on freelance work, short-term contracts and internships as a (“full stack”) web developer before deciding I much preferred systems and automation to clients and web apps. I started in my current position as a full-time Site Reliability Engineer back in December ’17, and my first months have been a wonderful transition from practice jumps to paired jumps to jumping out of planes I built just for the fun of it. Here’s how it happened.

Workflow Introduction

Parachutes and why you need one.

My first day was like most first days: awkward, feeling a little out of my depth, with a lot of imposter syndrome in the air. I had already met the entire SRE team at different stages of my interview, which certainly helped with my jitters: they had agreed as a team to take me on. They wouldn’t think of me as a kid they had to babysit; I was a colleague who needed training.

We have a set process for the first day built around setting up a development environment, getting your SSH credentials into our config management and accepting the bazillion service invites for IMs, HR, email, etc. As the week went on, my first set of tasks was to make basic edits in our config management as an introduction to our repos and our git process.

  1. All changes are developed on a branch of the main repository.
  2. Create a Pull Request on GitHub with your changes.
  3. Have it reviewed by another SRE on the team.

This version control process exists across the company at all levels – it is not a newbie thing. Regardless of seniority or experience, everyone has to go through this lifecycle before changes are pushed to master. The founders are not exempt.
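
The rule can be sketched in a few lines of Python. This is an illustrative sketch only, not Hosted Graphite's actual tooling: the `PullRequest` class, the `can_merge` function and the names in the example are all hypothetical. It assumes a change needs its own branch and at least one approval from a team member other than its author, with no exemption for seniority.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    author: str
    branch: str
    approvals: set = field(default_factory=set)  # names of people who approved


def can_merge(pr: PullRequest, team: set) -> bool:
    """True only if the change lives on a branch and a teammate reviewed it."""
    on_branch = pr.branch != "master"
    # Self-approval does not count, and neither does approval from outside the team.
    independent_reviewers = (pr.approvals & team) - {pr.author}
    return on_branch and bool(independent_reviewers)


team = {"evan", "founder", "veteran"}  # hypothetical team members

# A founder approving their own change is still blocked.
print(can_merge(PullRequest("founder", "fix-typo", {"founder"}), team))

# The same change passes once another team member has reviewed it.
print(can_merge(PullRequest("founder", "fix-typo", {"veteran"}), team))
```

Nothing in the check looks at who the author is beyond excluding them from their own review, which is the point: the same gate applies to everyone.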

Every time I make a change, it gets reviewed by a fresh set of eyes, which helps to catch stupid bugs, typos and mistakes while also bringing in someone else’s knowledge. Everyone in the company has their own areas of focus. Choosing the right person for a pull request has led to a vast improvement in my Python work, my Puppet configs and my systems theory, just because of that extra level of feedback on the changes I make.

It’s worth noting that the reverse also happens: when we have code reviews, two people know something about every change. It’s an opportunity to spread knowledge in both directions, and it reduces the chance of one person holding all the information.

By the end of the first week, I had pushed a feature to our internal ChatOps service, Glitter, that allowed us to provision new servers with any RAID level to help with expanding our ELK cluster.

Kanban and the People

Choose your own parachute.

In the SRE team we employ a kanban style of task management using Jira. At first it was pretty overwhelming to see five years’ worth of tasks, but I was given free rein to tackle anything I felt comfortable with. The rest of the team made themselves available at any moment to answer questions, explain how a service worked or just help me think of a solution.

A mixture of kanban and a great team gave me a sense of freedom, productivity and camaraderie. It was my second week in the job but I was already contributing to our production and internal services - tools that some people relied on every day to do their jobs. It’s been a great feeling to be productive from day one and not feel like a burden to anyone.

Architecture Overview

Four years of flight school in an evening.

[Image: Evan Smith's notes on Hosted Graphite's infrastructure, end-to-end]

Sometime in week two, I was given The Talk. A veteran of the team and the previous newest hire sat me down in front of a whiteboard and explained our entire infrastructure end-to-end. Just at a high level, though. We stopped just before my brain melted out through my ears. It was an incredibly dense infodump to process.

The month after, our team lead “volunteered” me (it wasn’t voluntary) for our Weekly Wednesday Talk. The topic? Our Ingestion and Render layers, a.k.a. the biggest sections of our infrastructure.

I spent a day doing some research and compiling drawings from past presentations on similar topics. I gave a brief run-through of my understanding of the system to the coworkers who taught me originally. They added helpful tidbits, highlighted what was important and, most usefully for me, told me what I could trim back to a higher level.

The Monday before The Wednesday, I stood in front of the whiteboard with the SRE and Dev Team Leads. They gave me a fresh explanation of both layers and answered all of my pretty tough questions. “How does Kafka work into this?” became a back-and-forth of more questions as we ventured further down the ever-present rabbit hole of technology.

I stuttered and fumbled my way through presenting the pair of service diagrams to the entire company. Honestly, I was a little amazed at how much I knew and how much I could talk about. It was a great help to have all the people who taught me the material in the room, so I could defer questions to them if they were just a little too difficult for me.

Overall, this was a great idea. A little over a month in and I could recite our pipeline from start to finish. It also gave me a great excuse to ask all those (what I thought were) stupid questions.

On Call

Your first jump.

A vital part of being an SRE is that you are going to be on-call. That means you are available, and near a laptop and an internet connection, 24/7 for a week. There are a couple of things we do at Hosted Graphite that make this kind of work a lot easier.

For one, shifts are Thursday to Thursday and those on-call get the Friday off to recuperate and relax. Being on-call can be pretty stressful at times and sometimes leads to late nights and frustrating problems. That Friday off means the world. Just an extra day of the weekend to recharge the batteries and fully unwind.

Secondly, if you’re starting out, you get a secondary for your on-call shift and it’s repeated again and again and again: you are free to escalate everything to your secondary and you should only tackle issues you are 100% confident in. You are also free to escalate to your secondary if you just want to say “Hey, can you watch so I don’t screw this up?” Also, you’re both on-call so you both get Friday off.
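
The two rules above can be sketched in Python. This is a hypothetical illustration rather than how the rotation is actually scheduled: the function names are made up, and it assumes the recovery Friday is the one immediately after the Thursday handover.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() counts Monday as 0


def shift_dates(start: date) -> dict:
    """Key dates for a one-week shift that starts on a Thursday."""
    if start.weekday() != THURSDAY:
        raise ValueError("shifts run Thursday to Thursday")
    end = start + timedelta(days=7)  # handover on the following Thursday
    # Assumption: the day off is the Friday right after handover.
    return {"start": start, "end": end, "day_off": end + timedelta(days=1)}


def should_escalate(fully_confident: bool, wants_second_pair_of_eyes: bool = False) -> bool:
    """A newcomer escalates unless 100% confident, or whenever they want someone watching."""
    return (not fully_confident) or wants_second_pair_of_eyes


shift = shift_dates(date(2018, 4, 12))  # 12 April 2018 was a Thursday
print(shift["end"], shift["day_off"])
print(should_escalate(fully_confident=False))
```

The escalation check deliberately defaults to handing over: anything short of full confidence goes to the secondary.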

Before I started, everybody made it extremely clear: you start on-call as early as possible. It makes a lot of sense to me. The best way to tackle something like this is to be the one who receives the alerts and has to make the decisions – even if that decision is to let someone more experienced take over.

I got to jump out of a plane and feel what it was like to be a first responder. Not gonna lie, feels kinda cool.

In the next part of this article, I talk about duplicating our entire production environment (and purposefully breaking it), making my first change to a vital component of our pipeline and... causing an incident.

Evan Smith

SRE at Hosted Graphite. A lover of systems, security, and spreadsheets.
