Mid DevOps / SRE Interview Questions

What a Mid DevOps / SRE interview focuses on, the questions you'll face, and how to practice them with instant AI feedback.

What's expected at the Mid level

Expect questions on CI/CD, Kubernetes, and observability, plus evidence that you can own on-call independently.

Sample DevOps / SRE interview questions

Technical: A service is slow. Walk through how you debug it from the OS up.
What a strong answer covers
- Check CPU, memory, disk I/O with top, iostat
- Analyze network latency and drops with netstat, traceroute
- Profile application using strace or perf
- Identify slow database queries via slow query log
- Review application logs and metrics dashboards
Sample answer
Start with OS-level checks: run `top` for CPU and memory usage, `iostat -x 1` for disk I/O, and `vmstat 1` for swap activity and context switches. Next, check the network: `netstat -s` reports TCP retransmits, and `traceroute` shows per-hop latency. Then profile the application: `strace -c -p <pid>` counts system calls, and `perf` finds hot functions. At the database level, enable the slow query log and look for high-latency queries. Finally, correlate these findings with application logs and metrics dashboards to pinpoint the bottleneck. A common pitfall is jumping straight into the code without first ruling out resource exhaustion or network issues.
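As a rough illustration of that first step, the same OS-level signals can be read programmatically. This is only a sketch: it assumes Linux-style `/proc/meminfo` and `/proc/loadavg` text, the function names are invented for this example, and the point at which a value counts as "too high" is left to you.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_available_ratio(meminfo_text):
    """Fraction of RAM the kernel estimates is available without swapping."""
    info = parse_meminfo(meminfo_text)
    return info["MemAvailable"] / info["MemTotal"]


def load_per_cpu(loadavg_text, cpu_count):
    """1-minute load average divided by the number of CPUs.

    On Linux the load average also counts tasks blocked on I/O, so a high
    value points at CPU or disk contention, not CPU alone.
    """
    return float(loadavg_text.split()[0]) / cpu_count
```

On a Linux host you could pass in `open("/proc/meminfo").read()` and `open("/proc/loadavg").read()`, with `os.cpu_count()` as the CPU count.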
Technical: What happens, step by step, when you type a URL and hit enter?
What a strong answer covers
- DNS resolution via caching and recursion
- TCP three-way handshake and possible TLS handshake
- HTTP request processing at server (load balancer, web server, app)
- Database queries and response generation
- Rendering on client side
Sample answer
First, the browser resolves the hostname: it checks its own DNS cache, then the OS cache, and finally queries a recursive resolver. Once it has the IP address, a TCP three-way handshake (SYN, SYN-ACK, ACK) establishes the connection. For HTTPS, a TLS handshake follows (ClientHello, server certificate, key exchange). The browser then sends an HTTP request (e.g., GET), which may pass through a CDN and a load balancer before reaching the web server (e.g., Nginx). The web server hands the request to the application server, which may query databases, caches, or external APIs. The server returns an HTTP response (HTML, JSON, etc.), and the browser parses and renders it. Common sources of latency are slow DNS lookups, TLS handshake overhead, and chatty backend calls.
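To see where the time goes in the connection-setup phases described above, you can time them separately. The sketch below uses only the Python standard library; `time_phases` is a name invented for this example, and it measures just DNS resolution, TCP connect, and the TLS handshake, not the HTTP exchange or rendering.

```python
import socket
import ssl
import time


def time_phases(host, port=443, use_tls=True, timeout=5.0):
    """Return seconds spent in DNS resolution, TCP connect, and TLS handshake."""
    timings = {}

    # DNS: resolve the host name to a socket address.
    start = time.perf_counter()
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    timings["dns"] = time.perf_counter() - start

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # TCP: connect() returns once the three-way handshake completes.
        start = time.perf_counter()
        sock.connect(sockaddr)
        timings["tcp_connect"] = time.perf_counter() - start

        if use_tls:
            # TLS: wrapping an already connected socket performs the handshake.
            context = ssl.create_default_context()
            start = time.perf_counter()
            sock = context.wrap_socket(sock, server_hostname=host)
            timings["tls_handshake"] = time.perf_counter() - start
    finally:
        sock.close()
    return timings
```

Calling `time_phases("example.com")` returns a dictionary with one entry per phase. A repeated call is often faster in the `dns` phase because of resolver caching.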
Technical: How do Kubernetes liveness and readiness probes differ?
What a strong answer covers
- Liveness probe failure: the kubelet restarts the container
- Readiness probe failure: the pod is removed from Service endpoints and stops receiving traffic
- Liveness for deadlocks, readiness for startup/backpressure
- Both can be HTTP, TCP, or command
- Common pitfall: using the same endpoint for both probes can cause cascading restarts
Sample answer
A liveness probe tells the kubelet whether the container is still healthy; if it fails, the kubelet restarts the container. It catches states the process cannot recover from on its own, such as deadlocks. A readiness probe tells Kubernetes whether the container can serve traffic; if it fails, the pod is removed from Service endpoints but is not restarted. Both can be an HTTP GET, a TCP socket check, or a command executed in the container. The key difference is the consequence of failure: a restart for liveness, removal from load balancing for readiness. A common pitfall is using the same endpoint for both probes: if the app becomes temporarily overloaded, the liveness probe fails too, the container restarts, and the remaining pods take on even more load. Instead, readiness should reflect the ability to accept traffic (e.g., queue depth), while liveness should be a basic health check.
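One way to keep the two probes separate on the application side is to serve them from different paths with different logic. The sketch below is illustrative only: the paths `/healthz` and `/readyz`, the `MAX_QUEUE_DEPTH` limit, and the module-level `queue_depth` variable are all assumptions made for this example, not anything Kubernetes requires.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_QUEUE_DEPTH = 100  # hypothetical backpressure limit
queue_depth = 0        # hypothetical gauge the application would keep updated


def probe_status(path, depth, limit=MAX_QUEUE_DEPTH):
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: being able to answer at all is enough.
        return 200
    if path == "/readyz":
        # Readiness: refuse new traffic while the work queue is full.
        return 200 if depth < limit else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, queue_depth))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def serve(port=8080):
    """Serve probe endpoints until interrupted."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

With this split, an overloaded instance answers 503 on `/readyz` and is taken out of rotation, while `/healthz` keeps returning 200, so the container is not restarted.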
Coding: Write a script to parse logs and report the top error rates.
What a strong answer covers
- Parse log lines with regex or split
- Count errors by type or minute
- Use dictionary for accumulation
- Sort results and output top N
- Complexity O(n) time, O(m) space for m error types
Sample answer
I'll write a Python script that reads a log file line by line, finds error lines (e.g., lines containing 'ERROR'), groups them by error category (e.g., by extracting the token after 'ERROR' with a regex), counts occurrences, and prints the top 5. It accumulates counts in a dictionary and ranks the entries by count. Time complexity is O(n) for n lines, plus the cost of ranking the m distinct error types; space is O(m).
Reference solution (Python)

import re
from collections import Counter

def parse_log_and_report(file_path, top_n=5):
    # Regex to capture error type (e.g., "ERROR: <type>")
    error_pattern = re.compile(r'\bERROR\b\s*:?\s*(\w+)')
    error_counts = Counter()

    with open(file_path, 'r') as f:
        for line in f:
            match = error_pattern.search(line)
            if match:
                error_type = match.group(1)
                error_counts[error_type] += 1

    # Sort by count descending and report top N
    for error_type, count in error_counts.most_common(top_n):
        print(f"{error_type}: {count}")

# Example usage
parse_log_and_report('app.log', top_n=5)
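The reference solution reports counts per error type. If the interviewer means rates in the literal sense, one possible variant groups by minute and divides errors by total lines. This is a sketch under an assumed log format (`<ISO timestamp> <LEVEL> <message>`, e.g. `2024-05-01T12:03:45 ERROR timeout`); the function name and the format are not from the original question.

```python
from collections import Counter


def error_rate_by_minute(lines, top_n=5):
    """Return the top_n (minute, error_rate) pairs, highest rate first."""
    totals = Counter()
    errors = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed format
        minute = parts[0][:16]  # "YYYY-MM-DDTHH:MM"
        totals[minute] += 1
        if parts[1] == "ERROR":
            errors[minute] += 1

    # Rate = error lines / all lines seen in that minute.
    rates = {minute: errors[minute] / totals[minute] for minute in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Because it accepts any iterable of lines, it can be fed an open file object directly and still runs in a single pass over the input.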
System Design: Design a CI/CD pipeline with safe, gradual production rollouts.
What a strong answer covers
- Build stage: compile, unit tests, static analysis
- Package stage: build container image and push to registry
- Deploy to staging: automated integration and smoke tests
- Progressive rollout: canary or blue-green with automated gating
- Rollback strategy: revert to previous version automatically
Sample answer
The pipeline starts when a commit triggers a build that compiles the code and runs unit tests and static analysis. On success, a container image is built and pushed to a registry. The pipeline then deploys to a staging environment for automated integration, security, and smoke tests. Production rollout is gradual: first deploy a small canary (e.g., 1% of traffic), monitor error rates and latency for a few minutes, then increase to 10%, 50%, and 100%, with each step gated automatically on those metrics. If any gating metric breaches its threshold, rollback is triggered automatically. Feature flags can also decouple deployment from feature activation. A common pitfall is insufficient monitoring during the canary phase, which lets regressions go unnoticed.
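The gating loop in that rollout can be sketched in a few lines. Everything here is hypothetical: `set_traffic`, `get_metrics`, and `soak` stand in for whatever your deploy tool and metrics backend provide, and the thresholds are placeholders rather than recommendations.

```python
CANARY_STEPS = (1, 10, 50, 100)  # percent of traffic, as in the answer above


def metrics_healthy(metrics, max_error_rate=0.01, max_p99_ms=500.0):
    """Gate: both error rate and p99 latency must stay within their limits."""
    return (
        metrics["error_rate"] <= max_error_rate
        and metrics["p99_ms"] <= max_p99_ms
    )


def progressive_rollout(set_traffic, get_metrics, steps=CANARY_STEPS,
                        soak=lambda: None):
    """Shift traffic step by step; roll back to 0% on the first failed gate.

    set_traffic(percent) and get_metrics() are injected callables, and
    soak() is where a real pipeline would wait a few minutes per step.
    Returns True if the rollout reached the final step.
    """
    for percent in steps:
        set_traffic(percent)
        soak()
        if not metrics_healthy(get_metrics()):
            set_traffic(0)  # automatic rollback
            return False
    return True
```

Injecting the callables keeps the gating logic testable without a cluster, which is itself a point worth making in a design interview.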
System Design: Design monitoring and alerting for a multi-region service.
What a strong answer covers
- Metrics per region: latency, error rate, throughput, saturation
- Synthetic transactions (probes) from multiple regions
- Centralized aggregation with separate alerting per region
- Alerting: threshold-based and anomaly detection (e.g., 3σ)
- On-call rotation with escalation policies
Sample answer
We monitor each region separately with a standardized set of metrics: latency (p50, p99), error rate (HTTP 5xx), request throughput, and resource saturation (CPU, memory). Synthetic transactions run from outside each region to measure end-to-end health. A centralized monitoring system (e.g., Prometheus + Thanos) aggregates metrics, but alerts are evaluated per region to avoid noisy global alerts. We use threshold-based alerts for critical metrics (e.g., error rate > 1% for 5 minutes) and anomaly detection (e.g., a latency spike more than 3σ above baseline) for pre-failure signals. Alerts page the on-call engineer via PagerDuty and escalate to a second-level responder if not acknowledged. A common pitfall is making alerts too sensitive, which causes alert fatigue; we tune thresholds using historical data.
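The two alert styles mentioned above can be expressed as small predicates. This is a simplified sketch with invented function names; it assumes one sample per minute and a baseline list of recent latency values, whereas a real system would express these rules in its alerting engine.

```python
from statistics import mean, stdev


def sustained_breach(samples, threshold, window):
    """True if each of the last `window` samples exceeds `threshold`.

    With one sample per minute, threshold=0.01 and window=5 corresponds to
    "error rate > 1% for 5 minutes".
    """
    if len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


def is_anomalous(baseline, value, sigmas=3.0):
    """True if `value` is more than `sigmas` standard deviations above the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    return value > mean(baseline) + sigmas * stdev(baseline)
```

Requiring the breach to be sustained over the whole window is what keeps a single noisy sample from paging anyone.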
Behavioral: Walk me through an incident you led and the postmortem actions.
What a strong answer covers
- Situation: database replication lag caused read timeouts
- Task: restore normal service and fix root cause
- Action: scaled read replicas, routed read-heavy APIs to replicas, patched connection pool settings
- Result: service recovered in 30 minutes, no data loss
- Postmortem: blameless, added monitoring, improved deploy process
Sample answer
Situation: During a traffic spike, replication lag on our primary database grew to multiple seconds, causing read timeouts for users. Task: I led the incident response to restore service and identify the root cause. Action: I first scaled up the read replicas and moved read-heavy APIs onto them, which reduced load on the primary. I then updated the application logic to handle stale reads gracefully, falling back to the primary when needed. Finally, we patched the database connection pool settings. Result: Service fully recovered in 30 minutes with no data loss. Postmortem actions: We ran a blameless review, added replication lag monitoring and alerting, implemented a circuit breaker for upstream services, and added a pre-deployment load test to catch scaling issues before they reach production.
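If the interviewer digs into the stale-read handling, it helps to be able to describe it concretely. The sketch below is one hypothetical shape for it: `replica` and `primary` are callables standing in for database clients, and the 2-second lag tolerance is an assumed value, not one from the incident.

```python
def read_with_fallback(query, replica, primary, replica_lag_seconds,
                       max_lag_seconds=2.0):
    """Serve a read from the replica unless it is too stale or times out.

    Returns (result, source) so callers and metrics can see which backend
    answered.
    """
    if replica_lag_seconds <= max_lag_seconds:
        try:
            return replica(query), "replica"
        except TimeoutError:
            pass  # fall through to the primary
    return primary(query), "primary"
```

Returning the source alongside the result makes it easy to graph how often reads fall back, which ties into the replication lag monitoring added in the postmortem.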
Behavioral: How do you balance reliability work against feature requests?
What a strong answer covers
- Define SLOs and error budget (e.g., 99.9% availability)
- Use error budget burn rate to prioritize reliability work
- Quantify cost of outages vs. feature velocity
- Involve product stakeholders in prioritization
- Allocate dedicated reliability sprints (e.g., every fourth sprint)
Sample answer
I balance reliability and features by establishing clear SLOs and an error budget. For example, if our service targets 99.9% monthly availability, we have about 43 minutes of allowable downtime in a 30-day month. As long as we stay within budget, we prioritize feature work; when we burn budget faster than expected, we shift time to reliability improvements. I involve product managers by quantifying the impact, for example how many users a 5% increase in error rate would affect. We also schedule regular reliability sprints (e.g., every fourth sprint) dedicated to tech debt and incident prevention. A common pitfall is treating reliability as an afterthought; it should be a shared responsibility, with SLIs and SLOs that everyone understands.
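The arithmetic behind the 43-minute figure, and behind "burning budget faster than expected", is worth being able to reproduce on a whiteboard. A minimal sketch, assuming a 30-day window and function names chosen for this example:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(bad_minutes, elapsed_days, slo, window_days=30):
    """Budget consumed so far, relative to an even spend over the window.

    1.0 means the budget will run out exactly at the end of the window;
    above 1.0 means it will run out early.
    """
    budget_so_far = error_budget_minutes(slo, window_days) * elapsed_days / window_days
    return bad_minutes / budget_so_far
```

For a 99.9% SLO, `error_budget_minutes(0.999)` comes to roughly 43.2 minutes (0.1% of 43,200 minutes). A burn rate above 1.0 is the signal to shift effort from features to reliability.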

What interviewers assess

Linux & networking

Processes, file systems, DNS, TCP/IP and debugging tools.

CI/CD & IaC

Pipelines, Terraform, and reproducible, automated deploys.

Containers & orchestration

Docker, Kubernetes scheduling, networking and scaling.

Observability

Metrics, logs, traces, SLOs and meaningful alerting.

Reliability

Incident response, blameless postmortems and capacity planning.

How to prepare

Debug from first principles: interviewers want a systematic, layered approach.
Frame answers around SLOs, error budgets and blameless postmortems.
Automate everything in your answers; manual steps read as red flags.

Frequently asked questions

What kind of coding do DevOps/SRE interviews include?

Usually scripting (log parsing, automation) and sometimes data-structure problems, alongside heavy troubleshooting and system design.

How important is Kubernetes for SRE interviews?

Very common at mid and senior levels; expect questions on scheduling, networking, probes and scaling.

How do I prepare for an SRE interview?

Practice layered troubleshooting, review reliability concepts like SLOs and incident response, and rehearse scenarios in mock interviews.

Practice DevOps / SRE questions with instant AI feedback

Offersly runs a mock interview tailored to your resume and target role, then scores every answer on relevance, depth, clarity and correctness.

