Hire a site reliability engineer

Almost overnight, Site Reliability Engineer (SRE) has become one of the hottest job titles in the IT industry. With a growing number of companies struggling to hire site reliability engineers, LinkedIn named it the second most promising job in the US in 2019, before the pandemic hit. So why the sudden buzz?

Reliable, scalable, and maintainable applications are a top priority for businesses. The problem is that conventional operations teams often fail to keep up with the rapidly growing complexity of modern software. This is where SRE expertise comes into play.

Oleksii Glib, CEO at Acropolium, says that Site Reliability Engineers often act as firefighters. When an emergency arises, they put out fires in your business without others noticing the problem. But this can only be achieved if SREs have enough competence and knowledge in development, IT operations, and business integrations.

In this article, we’ll focus on the benefits of having an SRE on board. We’ll also give you some tips on how to find and interview site reliability engineers, so keep reading.

Executive summary

Along with supporting DevOps' success, SREs provide measurable improvements for businesses

While developers need to release new or updated products with better functionality, the operations team needs to ensure they won’t cause outages and performance degradation. SREs remove or reduce much of this traditional conflict.

SRE responsibilities closely align with those of DevOps – they both deal with automating IT operations but do it from different perspectives. While DevOps engineers are responsible for production environments, SREs focus on the whole infrastructure, its reliability and performance at scale.

Supporting DevOps success

The SRE team serves as a bridge between development and operations teams. Before software is put into production, developers and DevOps engineers need to provide SREs with evidence such as automated test results or instrumentation. If they can’t show that the code is operable enough to meet the system’s availability target, SREs will reject it. But once the SRE team is happy with the results, they take over production support from the DevOps team.

SREs also help companies establish Service Level Agreements (SLAs), along with the Service Level Indicators (SLIs) and Service Level Objectives (SLOs) behind them, and error budgets. An error budget is derived from the agreed availability target. For example, an uptime target of 99.8% leaves 0.2% of unavailability covered by the error budget.

As long as the product runs within this budget, developers feel free to update the product or modernize its functionality. However, exceeding the error budget freezes updates and forces developers and SREs to focus on fixing bugs and restoring services.
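To make the error budget idea concrete, here is a minimal sketch of the arithmetic. The 30-day measurement window and the function names are our assumptions for illustration; your own SLA may define the window and the freeze rule differently.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target allows over a window."""
    return window * (1 - slo_percent / 100)


def releases_frozen(downtime_so_far: timedelta, budget: timedelta) -> bool:
    """Updates are frozen once consumed downtime exceeds the budget."""
    return downtime_so_far > budget


window = timedelta(days=30)          # assumed measurement window
budget = error_budget(99.8, window)  # 0.2% of 30 days

print("Allowed downtime:", budget)
print("Frozen after 50 minutes down:", releases_frozen(timedelta(minutes=50), budget))
print("Frozen after 2 hours down:", releases_frozen(timedelta(hours=2), budget))
```

With these assumptions, a 99.8% target over 30 days allows roughly 1 hour and 26 minutes of downtime before updates are put on hold.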

SRE benefits for businesses

DevOps success is not the only benefit you get after hiring an SRE specialist. Their approach is more holistic and will affect many aspects of your business environment.

An SRE specialist:

  • Informs companies about service health by setting and tracking metrics and KPIs across all services.
  • Improves system performance and reduces downtimes by establishing strict SLA standards for development and operations teams.
  • Helps to identify root causes of incidents and optimize incident response by designing alerting workflows and building more effective on-call processes.
  • Helps management to measure how system reliability affects sales, marketing, revenue, and other business functions.
  • Creates and maintains documentation for resources, services, infrastructure, and automation.
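As an illustration of the metric tracking and alerting work listed above, the sketch below compares an observed error rate with the rate the availability target allows. The function names, the request counts, and the paging threshold of 10 are hypothetical values chosen for the example, not a recommended configuration.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Ratio of observed error rate to allowed error rate.

    A value of 1.0 means errors arrive exactly as fast as the target allows.
    """
    if total == 0:
        return 0.0  # no traffic, nothing to report
    observed_error_rate = failed / total
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate


def should_page(failed: int, total: int, slo: float, threshold: float) -> bool:
    """Page the on-call engineer when errors arrive much faster than allowed."""
    return burn_rate(failed, total, slo) >= threshold


SLO = 0.998            # 99.8% availability target
PAGE_THRESHOLD = 10.0  # hypothetical: page at 10x the allowed error rate

print(should_page(failed=20, total=10_000, slo=SLO, threshold=PAGE_THRESHOLD))
print(should_page(failed=400, total=10_000, slo=SLO, threshold=PAGE_THRESHOLD))
```

In this sketch, 20 failures in 10,000 requests sit right at the allowed rate and stay quiet, while 400 failures are about 20 times over it and trigger a page.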

Though SREs may seem like a one-size-fits-all solution, not all businesses will benefit equally from having an SRE on their team. This brings us to the next point.

When you need to hire a site reliability engineer

The role of SRE is to maintain highly reliable, scalable, and available applications

You’ll get the most from hiring an SRE specialist if any of these cases applies to you:

Downtimes affect your revenue and customers

One of the key roles of SREs is to minimize or avoid downtime of your services. According to Statista, 25% of enterprises worldwide report that an hour of downtime costs them between $301,000 and $400,000. That’s roughly $5,000 to $6,700 per minute. For more than a third of small and medium businesses, downtime causes customer loss, and for 17%, it leads to revenue loss.

So if you run a real-time system, SREs will ensure it remains available for users without lengthy or frequent downtimes.
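The per-minute figures follow directly from the hourly ones. Here is a short sketch of that arithmetic, which you can reuse to estimate the cost of an outage of any length. The function name and the 15-minute outage are assumptions for illustration, and the model simply assumes cost grows linearly with duration.

```python
def outage_cost(minutes: float, cost_per_hour: float) -> float:
    """Estimate outage cost, assuming cost scales linearly with duration."""
    return cost_per_hour * minutes / 60


LOW_HOURLY, HIGH_HOURLY = 301_000, 400_000  # hourly range cited above

# Cost of a single minute at each end of the range: roughly 5017 and 6667.
print(round(outage_cost(1, LOW_HOURLY)), round(outage_cost(1, HIGH_HOURLY)))

# Hypothetical 15-minute outage at each end of the range.
print(outage_cost(15, LOW_HOURLY), outage_cost(15, HIGH_HOURLY))
```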

You need to reduce risks and improve security

A quarter of enterprises faced a data breach in 2019, while 38% of companies aren’t even sure whether they had a breach or not. Half of the companies believe they will face a cyberattack within the next 12 months.

Hiring an SRE specialist makes a lot of sense if your business is at risk or subject to regulatory compliance and security requirements. By setting measurable objectives and designing alerting workflows, SREs can monitor risks and allocate resources for their mitigation beforehand.

You need to accelerate the development cycle

SREs enhance and implement the DevOps principles of automating the product delivery cycle and sharing responsibilities across engaged teams. Setting and monitoring metrics related to code quality and errors allows companies with SREs on board to deliver higher-quality applications quickly and more predictably.

You need to improve cost-efficiency

SREs dramatically reduce the risk of downtimes and outages, and if you run a real-time system, you know how costly these can be. One of the most expensive downtime cases so far belongs to AWS. Though the exact number is unknown, research estimated that the notorious AWS downtime of 2017 cost S&P 500 companies at least $150 million.

Now, with a clearer picture of the SRE team’s role and responsibilities, you may feel more confident about whether you need one or not. And if you decide that you do, the next point to consider is the integration model.

SRE as a managed service vs. hiring in-house

In most cases, an SRE as a managed service is a better option than hiring in-house

Let’s not beat around the bush: in most cases, you will benefit more from outsourcing SRE as a service than hiring these specialists in-house. Before digging into details, consider these two questions first:

Will an SRE’s expertise be enough for your project? An SRE should be competent both in coding and business functioning. And the more complex your project is, the broader expertise your SRE team should bring along.

If you’re looking for a reliability engineer, you need to ensure that the limits of their knowledge won’t set your project back. If they do, you’ll need to hire more SREs to cover the knowledge gaps. This brings us to the second question.

How much are you ready to pay? It’s simple math: more SREs means more expenses. If you have a large-scale project with millions of users, you may consider building an SRE team in-house. For example, Google has nearly 2,000 in-house SREs and keeps hiring more. But if your project is small, site reliability engineering managed services will save you budget and headaches. We’ll leave the pros and cons here for you to decide.

In-house SRE team

Pros

  • Loyalty. You build a long-term team that knows your business inside out.
  • Safety. Sensitive information doesn’t leave your company.
  • Control. You control the entire progress and timeframe.

Cons

  • Time. Building an in-house team requires time to find the right specialists.
  • Knowledge gaps. If your team lacks expertise for handling specific tasks, you’ll need to hire more experts. What’s worse, with the lack of knowledge in certain fields, the team can continuously generate worse results. And with time, this can snowball into performance degradation or even failure.
  • Management. To organize the team, you’ll need strong project managers who can drive the progress.
  • High cost. The lengthy hiring process, salaries, and retaining of SREs will cost you a fortune.

SRE as a managed service

Pros

  • Cost-efficiency. Pay for the services you receive when you need them.
  • Time savings. Instead of looking through dozens of candidates, you’ll only need to find a reliable development vendor who’ll be in your corner.
  • Broader expertise. Outsourcing solves the knowledge gap issue. Your development vendor will be responsible for site reliability, so if the team lacks knowledge, the vendor will simply allocate more specialists.
  • Shared management. You control the crucial stages while the vendor takes care of managing the daily progress of the team.
  • No administrative burden. The outsourcing vendor handles searching, hiring, and retaining specialists.

Cons

  • Safety. You’ll need to transfer technical information to a third-party team, so make sure the vendor has physical, technological, and administrative security measures in place.

If you opt for SRE as a service, consider development vendors who offer site reliability engineering consulting and audits, as Acropolium does. They are usually more versatile as they work with more industries and cover a broader set of tasks.

Requirements & skills for an SRE

When hiring an SRE, look for an unusual skill set including expertise in development, operations, and business

When you hire a site reliability engineer, you need to look for an unusual skill set. It includes competency in development, DevOps, and system administration. On top of that, a good SRE should have certain personality traits and a business mind to focus on the company’s objectives first.

Technical skills

SREs should be versatile to really see the big picture. A person with a narrow focus on tech is not who you need to be looking for. Here are several tech criteria you may consider when hiring the specialist.

  • Software development experience and knowledge of major languages such as Python, C++, Go, or Java. An SRE should be able to create tools for managing and automating infrastructure.
  • Comprehensive knowledge of continuous integration, delivery, and deployment pipeline as well as tools like GitLab, Jenkins, and SonarQube
  • Competency in major operating systems and experience in their administration (for example, Linux distributions)
  • Experience in networking, visualization, and network monitoring tools such as Splunk, Nagios, or Grafana
  • Knowledge of network protocols, hosting services, and cloud platforms (TCP/IP, DNS, etc.)
  • Expertise in IT troubleshooting and root cause analysis (RCA) to mitigate service downtimes

Soft skills

SREs usually work under pressure to fix multiple issues simultaneously, so the right soft skill set is a must to prevent ineffective communications with other teams and hasty decisions.

  • Working under pressure. An SRE should be well-organized and demonstrate good performance even in a critical or high-volume production environment.
  • Problem solving. A good SRE needs to be detail-oriented to define the problem, detect the cause, and choose the solution.
  • Business-centred approach. SREs should focus on cross-functional metrics and the benefits of improved system reliability for the business. This perspective allows SREs to lead developer teams toward better business outcomes rather than simple system optimization.
  • Communication skills. Along with technical communication, SREs need to pitch their ideas to the less tech-savvy management. So their fluency in both technical and business languages will be a plus.

Now that you know the right set of site reliability engineer skills to include in your job description, let’s move to the interviewing process.

How to run an SRE interview

Let your developers, DevOps engineers, and system administrators interview SREs

The interview will depend on how high your bar for SRE talent is. For a high-scale project that handles millions of queries per second, you may consider interviewing candidates through five or more stages.

A short pre-screen

You can start by checking the candidates’ basic knowledge for the SRE role. Consider asking them simple, unambiguous questions with only one possible answer. For example, you can name several IP addresses and ask the candidate to tell which are public and routable on the internet and which are not.
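If you want an answer key for the IP address question, a few lines of Python can produce one. This is a minimal sketch using the standard library `ipaddress` module. The sample addresses and the helper name are our own picks for illustration.

```python
import ipaddress


def is_publicly_routable(address: str) -> bool:
    """True if the address belongs to globally routable address space."""
    return ipaddress.ip_address(address).is_global


# Sample pre-screen addresses (chosen for illustration).
candidates = [
    "10.0.0.1",     # private range 10.0.0.0/8
    "172.16.5.4",   # private range 172.16.0.0/12
    "172.32.0.1",   # just outside 172.16.0.0/12, so public
    "192.168.1.1",  # private range 192.168.0.0/16
    "127.0.0.1",    # loopback
    "8.8.8.8",      # public
]

for address in candidates:
    label = "public" if is_publicly_routable(address) else "not public"
    print(f"{address}: {label}")
```

An address such as 172.32.0.1 works well as a follow-up, because it looks private at first glance but sits just outside the reserved range.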

Technical pre-interview

Your engineers may interview candidates at this stage through Skype, Zoom, or other video conferencing apps with a screen sharing function. They may assess the candidates’ coding knowledge by asking them to solve some tasks, starting from elementary problems and ending with those your SREs perform daily. Some companies use collaborative coding tools like CoderPad to check how candidates review or write code in real time.

Coding is not the only technical knowledge to check when hiring an SRE specialist. You can also check the candidates’ understanding of how the system works, its architecture, and monitoring.

The main interview

Once the candidates pass the initial stages, you can start the main interview with many specialists involved. Consider dividing this interview into several steps with separate interviewers covering specific topics at each step. For example, you may invite your developers, system administrators, and DevOps engineers to interview candidates.

The goal is to discover the limit of the candidates’ knowledge. Ideally, the interview will show how they reason and handle problems they’ve never faced before. After all, this is what an SRE’s job is about.

For example, ask your candidates to design a large-scale system. Check whether they can estimate system components considering costs and tradeoffs for reliability at the same time.
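During a design exercise like this, a strong candidate will usually back up their choices with rough reliability arithmetic. The sketch below shows the kind of estimate you might expect. It assumes that components fail independently, which real systems only approximate, and the availability figures are hypothetical.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a request path that needs every component to work."""
    return prod(components)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of a component when any one of its replicas is enough."""
    return 1 - (1 - single) ** replicas


# Three services in a chain, each available 99.9% of the time (hypothetical).
print(serial_availability([0.999, 0.999, 0.999]))

# One 99% component run as two replicas: more cost, fewer outages.
print(redundant_availability(0.99, replicas=2))
```

Under these assumptions, chaining three 99.9% services drops the path to about 99.7%, while doubling a 99% component lifts it to about 99.99%. This is the cost versus reliability tradeoff the candidate should be able to discuss.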

Whatever questions you choose, make sure you don’t use trivia, brainteasers, or whiteboard coding to assess your candidates. Such tasks take a lot of time and effort but provide minimal insight into how well the candidate fits the role.

How Acropolium can help

Acropolium has been delivering business solutions for 18 years now. During this time, we have acquired extensive expertise in software development and IT operations, so we know how to keep the lights on in a highly scalable and available IT environment.

Here are some cases in which improving service reliability helped our clients to:

  • Spend one month instead of three years. Due to problems in the code, one operation was causing downtime for several hours. To address this, the client had been building a new platform for three years by that time. We managed to detect the problem and solve it within a month.
  • Reduce downtime by 87%. We managed to reduce release-related downtime from 48 hours to six hours by reviewing the code.
  • Speed up response time 150-fold. After a code revision, a booking system displayed information about hotel rooms within 200 ms instead of 30 s.

Whether you want SRE as a managed service or want to consult an SRE team about your system, we’re ready to go the extra mile for you.

The bottom line

An SRE with the right set of skills and experience can be the silver bullet that takes your business a step further. Like DevOps engineers, SREs focus on process automation, but their role is far more business-oriented and impacts the entire system infrastructure rather than a single product.

SREs can improve your system performance, reduce downtime, and improve incident response. If you’re ready for a leap forward and looking for site reliability engineering services, contact us to learn how we can help refine your business.