Thoughtfully Training SRE Apprentices: Establishing Padawan and Jedi Matches

Aug 04, 2021 15 min read
Tammy Bryant Butow
reviewed by
Ben Linders
In this article, I will share how Padawans and Jedis can inspire and teach us how to help people of a wide variety of backgrounds, ages, and experience levels to observe and understand failures in production. I will share how I worked with a colleague to create an SRE Apprentice program to hire and train new SREs who wanted a career change. I will cover practical lessons learned and things I’d change, and I’ll explain how you can create and roll out a program for SRE apprentices within your organization. I will also share feedback from the SRE apprentices themselves.
I originally created this SRE Apprentice program with a director of engineering while we were both at Dropbox. We both realised that it was difficult to hire the talent we needed for our SRE teams. We also knew that there were many folks hungry for the opportunity to both become an SRE and work at Dropbox. We initially rolled out the program in 2016 and onboarded four SREs. We decided this would be a six-month program, and at the completion of the program we would determine if the apprentices had learned enough to be invited to join as full-time SREs. All four apprentices successfully completed the program and they are still working as engineers to this day. After the success of this program, the next batch of apprentices was hired and the program was repeated.
Since the creation of this program I have thought long and hard about how this training could be provided in a more scalable model. At Gremlin, I decided to create a short, condensed version in the form of "Gremlin Bootcamps". These bootcamps are now offered for free all over the world. We’ve trained thousands of engineers and helped them learn critical SRE skills as documented by the Google Service Reliability Pyramid. This program now has a full team of contributors who keep it running on a day-to-day basis.
The program was initially a lot more work than we expected it to be because we had a lot to teach the SRE apprentices, and they were very hungry to learn. It was fantastic to see all of the SRE teachers grow and develop their leadership skills through the program. Many of them went on to become engineering leaders and even startup CEOs.
When I asked the SRE apprentices to share what the program meant to them, I was told "the SRE apprenticeship was critical for my career – it was my foot into the door of the tech industry, when it can be hard to break in as a newcomer without the usual credentials. But getting your foot in the door is just the first step."
When you are starting out as an engineer, it’s difficult to identify if something is or is not a failure and to understand why it occurred. This is the art of troubleshooting. I call it an art because it really does take years to achieve mastery.
If I explain it in simple terms, imagine you are eating a cake and it doesn’t taste quite right. You’re not sure why; it's difficult to put your finger on it. Now imagine you can go back through and review everything that occurred to create the cake – how hot the oven was, the ingredients used, the amount of ingredients, etc. This would enable you to identify potential problems that occurred during the baking of the cake. However, if you don’t know what baking a cake is supposed to be like, it will be very difficult to know if something was correct or incorrect – you might need to ask someone with more experience in baking cakes to help you troubleshoot and understand. Through this experience, you become much wiser and it better prepares you for future troubleshooting when cakes don’t turn out as you expect.
When we think of this in terms of computer science and distributed systems, there are many areas of expertise to achieve mastery in – observability, databases, traffic management, caching, performance, availability, durability, and more. It takes us many years to develop our skills. Of course, software can help us on this journey, but knowing the right tools to use for specific tasks is a skill in itself.
I think the best way to learn to develop your skills is to follow a model we created when developing the SRE Apprentice (aka Padawan) program. This involves finding a dedicated mentor (SRE Teacher aka Jedi) who can guide you through your journey for six months. Asking this person to commit six months to helping you level up will be a big ask but they too will learn from this experience and it’s also an excellent way for them to develop their own leadership skills.
With your mentor by your side, I’d recommend following a structured approach which can be broken down into four phases:

Diagram 1.0 – SRE Apprentice Program
We came up with these phases by reflecting on past experience. We also spoke with the students and mentors to hear how they’d prefer to learn and share knowledge.  
Each of the four phases can be described as:
Something we pondered was "what is the best way to learn the required practical skills to be a successful SRE?" We knew you didn’t learn these skills in the university classroom, or at a coding bootcamp. We realised the majority of SREs learn skills on the job through real life practice. Generally speaking this is because students do not have access to production systems, production environments and production-grade SRE software.

Diagram 2.0 – Trust-Building Loop
When you are first starting out, you can’t take on projects that are too large, so it’s important to work on bite-size tasks that let you succeed, learn, and be rewarded. Generally, the reward for completing a task is not money or praise; it’s usually more work. You did a great job, so you are rewarded with more work because you are now trusted by your team. If you do very good work, you will be rewarded with more complicated work (an increase in scope or in the length of time required to complete tasks; see diagram "Trust-Building Loop").
Learning via osmosis is very powerful. There are many jargon and technical terms that are best learned just by hearing others use them in context. For example, if you ask someone who doesn’t work in technology to pronounce nginx, they will likely say it incorrectly. This is very common for new engineers too. It’s not a problem; it just means there is a lot to learn which experienced engineers may take for granted. What if you asked a group of people who don’t work in technology to spell nginx? I’m sure you’d get many different answers.
How does this change in a remote world? Really, it’s the same. You’ll still be attending meetings and hearing new terms, you can still attend standup, and you can still continue to google the terms you don’t know to build your vocabulary. For example, imagine you are in a meeting on the topic of incident management and you are reviewing metrics as a team. As a new SRE apprentice you might wonder, what does MTTD mean? If you hear or see this term in a meeting you can quickly google it and learn on the job. Encourage your SRE apprentices to do this during meetings; give them permission to do this. I also recommend asking them to write down a weekly list of questions to review at their weekly 1:1 or during a daily end-of-day check-in.
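To make the MTTD example concrete: MTTD stands for mean time to detect, the average gap between when a failure starts and when the team notices it. The sketch below shows the arithmetic an apprentice might work through after looking up the term. The incident timestamps and the `started` and `detected` field names are made up for illustration and do not come from any real incident tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident records (illustrative data only):
# when each failure began and when it was detected.
incidents = [
    {"started": datetime(2021, 8, 1, 10, 0), "detected": datetime(2021, 8, 1, 10, 12)},
    {"started": datetime(2021, 8, 2, 14, 30), "detected": datetime(2021, 8, 2, 14, 33)},
    {"started": datetime(2021, 8, 3, 9, 15), "detected": datetime(2021, 8, 3, 9, 30)},
]


def mean_time_to_detect(records):
    """Return the average time between failure start and detection."""
    if not records:
        raise ValueError("need at least one incident")
    # Sum the per-incident detection delays, starting from a zero timedelta.
    total = sum((r["detected"] - r["started"] for r in records), timedelta())
    return total / len(records)


mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:10:00 (detection delays of 12, 3, and 15 minutes average to 10)
```

Walking through a small calculation like this in a 1:1 can be a quick way to check that a newly learned term has actually landed.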
There is more to mentoring than just learning the technical skills required to be an engineer. If you are lucky enough to find yourself in the role of a mentor, I encourage you to also teach your mentee:
Determining the appropriate task for an apprentice engineer can be quite difficult. Here are a number of examples that you can use to help your engineer learn critical skills for their long-term career. Below are examples of the specific types of tasks that can be assigned to an SRE apprentice:
It is important for SRE apprentices to take on tasks of increasing complexity. Task complexity can be altered by pulling one of the following levers:
Mentors need to learn their apprentice’s skill level quickly as they will be responsible for the estimation component of their apprentices’ first set of tasks. I recommend giving your apprentice a week of one-day tasks, then gradually increasing either the scope of work or the length of time required.
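One way to picture this ramp is as a simple plan that starts with one-day tasks and then pulls one lever at a time. The sketch below is only an illustration under stated assumptions: the `TaskSize` type, the numeric `scope` score, and the rule of alternating between the two levers each week are hypothetical choices, not part of the original program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSize:
    week: int
    days: int   # estimated length of time for each task that week
    scope: int  # hypothetical scope score, e.g. number of components touched


def ramp_plan(weeks: int) -> list[TaskSize]:
    """Start with one-day, scope-1 tasks, then increase one lever per week."""
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    plan = [TaskSize(week=1, days=1, scope=1)]
    for week in range(2, weeks + 1):
        prev = plan[-1]
        if week % 2 == 0:
            # Even weeks (assumed rule): make tasks longer, keep scope fixed.
            plan.append(TaskSize(week, prev.days + 1, prev.scope))
        else:
            # Odd weeks (assumed rule): widen scope, keep length fixed.
            plan.append(TaskSize(week, prev.days, prev.scope + 1))
    return plan


for step in ramp_plan(4):
    print(step)
```

Changing only one lever per step keeps it easy for the mentor to tell whether a missed estimate came from the longer duration or the wider scope.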

Diagram 3.0 – Task Complexity Matrix for SRE Apprentices
When planning out work for your engineer, it’s important to refer to it as "tasks" and not "projects". Your apprentice should be doing bite-size pieces of work allocated to them that fit within larger holistic projects. I recommend allowing them to work on tasks that fall within one project only for their first month; for example, you could give them the task of improving monitoring and alerting for a specific system. If this is not possible, I recommend scoping the tasks down to only two different projects for their first month. It’s important to not introduce too much scope complexity in the first month of their apprenticeship as they will be learning many new concepts and terms as well as meeting many new people. It’s important for them to have time to build relationships with their coworkers who they will also be learning from. This will help set them up for success and make their time more enjoyable. The diagram below "Learn By Practice" is an example program that you can use to allocate task work to your SRE Apprentice:

Diagram 4.0 – Learn By Practice
It’s important to give your SRE apprentice individual and private 1:1 time where they can ask you questions and get answers. I recommend allocating one hour a week for their apprenticeship. This may seem like a lot of time but it’s incredibly important. I recommend running this meeting on a Wednesday morning (11am) to give your apprentice a good chance to get unblocked on any task they may be stuck on.
Create a running agenda – create a shared document for your apprentice and you to add items to as the week progresses. Encourage your apprentice to add items to this document as they come up so they don’t forget them. This also gives you a chance to know what any issues are in advance; for example, is your apprentice blocked by another team member? Perhaps you can speak to that team member and determine why.
An important tip I’ll share is to never make assumptions. Do not assume anything when it comes to your apprentice. If you think they might not be ready for more complex tasks, ask them. You might be surprised! I find that the best mentors do not make assumptions; they ask questions and keep an open mind.
Here is an example of a great mentoring conversation:
Mentor: "Hey there, I noticed when you were working on delivering your monitoring and alerting task you took longer than expected to get it done. What was the main thing you got blocked on and spent time on?"
Apprentice: "I was actually hoping I’d get it done in one day but when I asked my apprentice friend from another team to review my work they said I should do it differently so I completely redid it."
Mentor: "Oh interesting, it’d be great to see your original version of that task. Could you share it with me?"
Apprentice: "Sure, here it is."
Mentor: "This would have been perfect actually, you were spot on. You could have submitted this and finished the work on time. This is a great learning opportunity though. What you did on the second day is called refactoring – where you take your existing code and modify it in the hope of making improvements."
Apprentice: "Oh really, wow I wish I would have asked you before redoing it. It’s hard to know if something is done in the best way possible."
Mentor: "You can always share your work with me, that’s what I am here for. Especially if you are going to do a big refactor when your code is already working. In engineering, there are always many ways to do the same thing. Both ways were right and actually would have the same performance results, so the main benefit of not refactoring your code would have been that you’d have completed the task on time. Next task, let’s agree that you’ll share it with me and your apprentice friend before doing a major refactor of your work."
Apprentice: "Sure, right on! Thanks."
This conversation shows how not making assumptions is really important. Your apprentice actually completed the work correctly and on time but they just didn’t share it with you. They were sent off in the wrong direction by a friend.  Now you and your apprentice have built more trust and you have a more accurate picture of their current skill level.
What are the overall learnings we gained from this new program?
The best advice I can share comes from the SRE apprentices themselves. Here are three tips they shared for SRE teachers:
These tips also come from the SRE apprentices. This is what they’d say:
In summary, this article shares how you can create your own SRE Apprentice program and learn from the experiences of the Dropbox SRE team who created this original program and structure. This practical step-by-step approach will enable you to add new SRE talent to your team.
Tammy Bryant Butow is the principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Butow previously led SRE teams at Dropbox responsible for databases and storage systems used by over 500 million customers. Prior to this, she worked at DigitalOcean and at one of Australia’s largest banks in security engineering, product engineering, and infrastructure engineering. Butow is the co-founder of Girl Geek Academy, a movement to teach 1 million girls technical skills by 2025. Butow spoke about training SRE apprentices at QCon Plus May 2021. You can find her on Twitter at @tambryantbutow.
