Senior Data Engineer Interview Questions

What a Senior Data Engineer interview focuses on, the questions you'll face, and how to practice them with instant AI feedback.

Run a free AI mock interview

What's expected at the Senior level

Expect platform/warehouse architecture, reliability and governance ownership.

Junior Mid

Sample Data Engineer interview questions

CodingWrite a SQL query to find the second-highest salary per department.
What a strong answer covers
- Use DENSE_RANK() or ROW_NUMBER() window function
- Partition by department, order by salary descending
- Handle ties correctly (DENSE_RANK) vs unique ranking (ROW_NUMBER)
- Filter where rank = 2
- Consider performance with large tables (indexes)
View a sample answer
To find the second-highest salary per department, use a window function like DENSE_RANK() partitioned by department and ordered by salary descending. DENSE_RANK ensures that if multiple employees have the same highest salary, they all get rank 1, and the next distinct salary gets rank 2. Then filter where rank = 2. Alternatively, ROW_NUMBER() assigns unique ranks, which may skip rank 2 if there are ties for rank 1. The query is: WITH ranked AS (SELECT *, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk FROM employees) SELECT department, salary FROM ranked WHERE rnk = 2. This returns all employees with the second-highest salary per department. If no second-highest exists (e.g., department has only one salary), the query returns no rows. For large tables, ensure an index on (department, salary) for efficiency. A common pitfall is using RANK() or ROW_NUMBER() without understanding tie-handling requirements.
Reference solutionsql
WITH ranked AS ( SELECT *, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk FROM employees ) SELECT department, salary FROM ranked WHERE rnk = 2;
TechnicalExplain the difference between ETL and ELT and when you would use each.
What a strong answer covers
- ETL: Extract, Transform, Load (transform before loading)
- ELT: Extract, Load, Transform (transform in target system)
- ETL use case: complex transformations, regulatory compliance, lower target system power
- ELT use case: big data, cloud warehouses (Snowflake, BigQuery), schema-on-read
- Tradeoffs: ETL more upfront work, ELT requires powerful target compute
View a sample answer
ETL (Extract, Transform, Load) processes data by extracting from sources, transforming it in a staging area (e.g., using Spark or a dedicated ETL tool), and then loading it into the target system. ELT (Extract, Load, Transform) first extracts and loads raw data into the target system (e.g., a cloud data warehouse), then performs transformations within that system using SQL or its compute. Choose ETL when you need to apply complex business logic before loading, when the target system has limited compute resources, or when compliance requires masking data early. Choose ELT when dealing with large volumes of semi-structured data (like JSON logs), when using cost-effective cloud warehouses with elastic compute (Snowflake, BigQuery, Redshift), or when you want to keep raw data for reprocessing. A common pitfall is assuming ELT is always better; in reality, ETL can reduce load on the target system and simplify monitoring. For example, a financial report requiring strict aggregation and validation is better suited to ETL, while a clickstream analysis exploring unstructured data benefits from ELT.
TechnicalHow would you design an idempotent pipeline that can safely re-run?
What a strong answer covers
- Idempotency: re-running produces same result
- Use unique identifiers and upsert (merge) logic
- Watermark or partition cuts to avoid duplicates
- Store deduplicated state (e.g., job run IDs table)
- Handle partial failures with atomic commits or checkpointing
View a sample answer
An idempotent pipeline ensures that re-running from the same input yields identical output, preventing duplicates or inconsistent aggregations. Key design elements: First, each run should have a unique run ID, and the pipeline should store a table of processed runs (e.g., `pipeline_runs` with run_id and watermark). The pipeline checks this table before processing a batch to avoid reprocessing fully completed intervals. For incremental loads, use append-only logs with deduplication at insert time via MERGE (UPSERT) statements based on a record’s natural key. For streaming, use exactly-once semantics from the source (e.g., Kafka with transactional producers) and checkpoint offsets in a sink (e.g., a database or object store). Additionally, partition or window your data by time (e.g., processing_date), and re-run only failed partitions. For batch jobs, compare source data fingerprints (e.g., checksums) to skip unchanged files. A common pitfall is assuming idempotency with only a uniqueness constraint; handle retries during partial runs by making transforms deterministic. For example, a pipeline that computes daily sales aggregates should store results keyed by (date, store_id) and use INSERT … ON CONFLICT UPDATE to recalc safely.
TechnicalHow do you handle late-arriving data in a streaming pipeline?
What a strong answer covers
- Define handling strategy: discard, reprocess, maintain state
- Use watermarks and allowed lateness thresholds
- Use side inputs for late data merges
- Store raw events in a persistent log for replay
- Handle late data via upserts or incremental recompute
View a sample answer
Late-arriving data in streaming pipelines requires a clear strategy based on business requirements. The typical approach: set a watermark (e.g., max event time observed minus allowed lateness) and discard data older than that. But to avoid data loss, store raw events in a persistent log (Kafka topic with retention) so you can reprocess later if needed. For aggregations, use a windowed approach with allowed lateness (e.g., in Apache Flink, set `.allowedLateness(Time.hours(1))`) and trigger early firings with updates. For joins, use state with a required key and store late events in a side output, then merge them into the main result via a separate batch job or a lookup table. Alternatively, use a lambda architecture: streaming provides low-latency approximations, and batch recomputes exact results including late data. A common pitfall is assuming streaming systems handle late data seamlessly; always plan for out-of-order events by using event-time processing and checkpointing. For example, in a real-time dashboard, accept updates to metrics for up to 1 hour, then finalize the hour's aggregates.
CodingUse window functions to compute a 7-day rolling average.
What a strong answer covers
- Window function with ROWS or RANGE frame
- Partition by entity if needed (e.g., user, product)
- Use AVG() with ORDER BY date and ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
- Handle missing dates (gaps) by using RANGE or filling dates
- Time complexity: O(n log n) due to sorting
View a sample answer
To compute a 7-day rolling average, use a window function like AVG() with a frame clause. For each row, the average is computed over the current row and the preceding 6 rows (by date). Ensure data is sorted by date; if there are multiple entities (e.g., multiple stores), partition by entity. A common query is: SELECT date, sales, AVG(sales) OVER (PARTITION BY store_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg_7d. The ROWS clause assumes one row per day per store; if there are gaps (missing days), ROWS counts preceding rows, not days. If you need a strictly 7-day calendar window, use RANGE INTERVAL '7' DAY PRECEDING if supported (e.g., in PostgreSQL). Note that RANK or other window functions are not needed. The time complexity is O(n log n) due to the ORDER BY in the window clause. A common pitfall is forgetting to partition, causing the average to span across different groups.
Reference solutionsql
-- Assume table daily_sales(store_id, date, sales) -- For each store, compute 7-day rolling average (rows, assuming no gaps) SELECT store_id, date, sales, AVG(sales) OVER ( PARTITION BY store_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS rolling_avg_7d FROM daily_sales ORDER BY store_id, date;
System DesignDesign a data warehouse for an e-commerce company's analytics.
What a strong answer covers
- Requirements: customer analytics, sales, inventory, marketing
- Star schema: fact tables (sales, inventory, clicks) and dimension tables (customer, product, time, store)
- ETL/ELT strategy: from OLTP and logs; incremental loads
- Scaling: partitioning by date, using columnar storage, aggregations
- Consider slowly changing dimensions (SCD) for customer and product attributes
View a sample answer
For an e-commerce company, the data warehouse should support analytics on sales, customer behavior, inventory, and marketing. I would design a star schema: Fact tables include Sales (foreign keys to date, customer, product, store, promotion; measures: quantity, price, discount), Inventory (date, product, store; quantity_on_hand), and Clickstream (session_id, date, customer, product, action). Dimension tables: Customer (SCD Type 2 for changing attributes like address), Product (SCD Type 2 for category, price changes), Date (pre-populated with calendar attributes), Store (location, type), Promotion (type, discount). The ETL pipeline would extract from transactional databases via CDC (Debezium) and from web logs via Kafka, land raw data in a staging area, then load into dimensional models. Use ELT for heavy transformations using the warehouse's SQL engine (e.g., Redshift, BigQuery). Partition fact tables by date (e.g., monthly); use columnar storage and compression. Create aggregate tables (e.g., daily sales by product) for common queries. A common pitfall is over-normalizing dimensions that don't change; prefer degenerate dimensions for low-cardinality attributes.
BehavioralTell me about a data-quality issue you caught and fixed.
What a strong answer covers
- Situation: production pipeline outputting incorrect aggregates
- Task: identify root cause, fix, prevent recurrence
- Action: used automated data profiling, added validation checks, wrote reconciliation script
- Result: error caught intra-day, zero business impact, implemented monitoring
- Soft skills: cross-team communication, documentation
View a sample answer
In a previous role, I noticed that a daily sales aggregation pipeline was showing a $1M discrepancy compared to the source ERP system. The issue was a changed join key in a dimension table that caused some records to be silently dropped. I first ran a row-count reconciliation to confirm the gap. Then I profiled the data and found that the product dimension's key had been updated from a legacy integer to a UUID, but the source feed was still sending the old ID plus a new column. I worked with the upstream data engineering team to add a crosswalk table and modified the ETL to use the correct join. I also added automated checks: row count validation, sum-of-sales comparison, and a daily email report to stakeholders. As a result, the error was caught within the first hour of the run with zero business impact. The key lesson was to always monitor data quality with automated checks, not manual spot checks. A common pitfall is assuming source data consistency; always profile and validate upstream changes.
BehavioralHow do you prioritize pipeline reliability vs. new requests?
What a strong answer covers
- Establish SLA/SLOs for reliability (e.g., 99.9% uptime)
- Quantify impact of pipeline failures and maintenance costs
- Use incident severity: P0 (data loss) overrides new features
- Invest in monitoring, alerting, and automated remediation
- Balance: allocate 70% time to reliability, 30% to new features; adjust
View a sample answer
Prioritizing pipeline reliability over new requests requires a systematic approach. First, define clear Service Level Objectives (SLOs) for latency, completeness, and uptime (e.g., 99.9% of daily batches complete within 4 hours). Any new feature must not degrade these SLOs. I use a severity framework: P0 incidents (data loss or major outage) block all feature work; P1 (partial delay) triggers a 24-hour fix window. For new requests, I evaluate the cost of reliability debt: if the request would increase complexity without proper monitoring, I propose a reliability-first design. I allocate roughly 70% of engineering time to maintaining pipelines (monitoring, alerts, failover, disaster recovery) and 30% to new features. For example, I pushed back on a request to add a new API feed because the team had no test harness for it; instead, we first built a schema validation layer. I also communicate tradeoffs to business stakeholders using data: "Adding this feed without reliability improvements increases probability of missed SLAs by 15%." A common pitfall is treating reliability as a one-time project; it must be ongoing, with blameless postmortems driving improvements.

What interviewers assess

SQL

Window functions, joins, query optimization and execution plans.

Data modeling

Star/snowflake schemas, partitioning and slowly changing dimensions.

Pipelines

ETL/ELT, orchestration, idempotency and backfills.

Batch & streaming

Spark, Kafka and trade-offs between latency and throughput.

Data quality

Validation, lineage and handling late or duplicate data.

How to prepare

Sharpen SQL — window functions and optimization come up in almost every loop.
For pipeline design, lead with idempotency, retries and backfills.
Mention data quality and lineage proactively; it signals production maturity.

Frequently asked questions

How much SQL do data engineer interviews require?

A lot — expect multiple SQL rounds covering joins, window functions and optimization, often on realistic schemas.

What system design appears in data engineering interviews?

Designing a data warehouse, an event pipeline, or a batch/streaming architecture with attention to partitioning, idempotency and data quality.

How do I prepare for a data engineering interview?

Drill SQL daily, practice pipeline design out loud, and use mock interviews to rehearse trade-off explanations.

Practice Data Engineer questions with instant AI feedback

Offersly runs a mock interview tailored to your resume and target role, then scores every answer on relevance, depth, clarity and correctness.

Start free All Data Engineer interview questions

Sample Data Engineer interview questions

CodingWrite a SQL query to find the second-highest salary per department.

What a strong answer covers

Use DENSE_RANK() or ROW_NUMBER() window function
Partition by department, order by salary descending
Handle ties correctly (DENSE_RANK) vs unique ranking (ROW_NUMBER)
Filter where rank = 2
Consider performance with large tables (indexes)

View a sample answer

To find the second-highest salary per department, use a window function like DENSE_RANK() partitioned by department and ordered by salary descending. DENSE_RANK ensures that if multiple employees have the same highest salary, they all get rank 1, and the next distinct salary gets rank 2. Then filter where rank = 2. Alternatively, ROW_NUMBER() assigns unique ranks, which may skip rank 2 if there are ties for rank 1. The query is: WITH ranked AS (SELECT *, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk FROM employees) SELECT department, salary FROM ranked WHERE rnk = 2. This returns all employees with the second-highest salary per department. If no second-highest exists (e.g., department has only one salary), the query returns no rows. For large tables, ensure an index on (department, salary) for efficiency. A common pitfall is using RANK() or ROW_NUMBER() without understanding tie-handling requirements.

Reference solutionsql

WITH ranked AS (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
  FROM employees
)
SELECT department, salary
FROM ranked
WHERE rnk = 2;

TechnicalExplain the difference between ETL and ELT and when you would use each.

What a strong answer covers

ETL: Extract, Transform, Load (transform before loading)
ELT: Extract, Load, Transform (transform in target system)
ETL use case: complex transformations, regulatory compliance, lower target system power
ELT use case: big data, cloud warehouses (Snowflake, BigQuery), schema-on-read
Tradeoffs: ETL more upfront work, ELT requires powerful target compute

View a sample answer

ETL (Extract, Transform, Load) processes data by extracting from sources, transforming it in a staging area (e.g., using Spark or a dedicated ETL tool), and then loading it into the target system. ELT (Extract, Load, Transform) first extracts and loads raw data into the target system (e.g., a cloud data warehouse), then performs transformations within that system using SQL or its compute. Choose ETL when you need to apply complex business logic before loading, when the target system has limited compute resources, or when compliance requires masking data early. Choose ELT when dealing with large volumes of semi-structured data (like JSON logs), when using cost-effective cloud warehouses with elastic compute (Snowflake, BigQuery, Redshift), or when you want to keep raw data for reprocessing. A common pitfall is assuming ELT is always better; in reality, ETL can reduce load on the target system and simplify monitoring. For example, a financial report requiring strict aggregation and validation is better suited to ETL, while a clickstream analysis exploring unstructured data benefits from ELT.

TechnicalHow would you design an idempotent pipeline that can safely re-run?

What a strong answer covers

Idempotency: re-running produces same result
Use unique identifiers and upsert (merge) logic
Watermark or partition cuts to avoid duplicates
Store deduplicated state (e.g., job run IDs table)
Handle partial failures with atomic commits or checkpointing

View a sample answer

An idempotent pipeline ensures that re-running from the same input yields identical output, preventing duplicates or inconsistent aggregations. Key design elements: First, each run should have a unique run ID, and the pipeline should store a table of processed runs (e.g., `pipeline_runs` with run_id and watermark). The pipeline checks this table before processing a batch to avoid reprocessing fully completed intervals. For incremental loads, use append-only logs with deduplication at insert time via MERGE (UPSERT) statements based on a record’s natural key. For streaming, use exactly-once semantics from the source (e.g., Kafka with transactional producers) and checkpoint offsets in a sink (e.g., a database or object store). Additionally, partition or window your data by time (e.g., processing_date), and re-run only failed partitions. For batch jobs, compare source data fingerprints (e.g., checksums) to skip unchanged files. A common pitfall is assuming idempotency with only a uniqueness constraint; handle retries during partial runs by making transforms deterministic. For example, a pipeline that computes daily sales aggregates should store results keyed by (date, store_id) and use INSERT … ON CONFLICT UPDATE to recalc safely.

TechnicalHow do you handle late-arriving data in a streaming pipeline?

What a strong answer covers

Define handling strategy: discard, reprocess, maintain state
Use watermarks and allowed lateness thresholds
Use side inputs for late data merges
Store raw events in a persistent log for replay
Handle late data via upserts or incremental recompute

View a sample answer

Late-arriving data in streaming pipelines requires a clear strategy based on business requirements. The typical approach: set a watermark (e.g., max event time observed minus allowed lateness) and discard data older than that. But to avoid data loss, store raw events in a persistent log (Kafka topic with retention) so you can reprocess later if needed. For aggregations, use a windowed approach with allowed lateness (e.g., in Apache Flink, set `.allowedLateness(Time.hours(1))`) and trigger early firings with updates. For joins, use state with a required key and store late events in a side output, then merge them into the main result via a separate batch job or a lookup table. Alternatively, use a lambda architecture: streaming provides low-latency approximations, and batch recomputes exact results including late data. A common pitfall is assuming streaming systems handle late data seamlessly; always plan for out-of-order events by using event-time processing and checkpointing. For example, in a real-time dashboard, accept updates to metrics for up to 1 hour, then finalize the hour's aggregates.

CodingUse window functions to compute a 7-day rolling average.

What a strong answer covers

Window function with ROWS or RANGE frame
Partition by entity if needed (e.g., user, product)
Use AVG() with ORDER BY date and ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
Handle missing dates (gaps) by using RANGE or filling dates
Time complexity: O(n log n) due to sorting

View a sample answer

To compute a 7-day rolling average, use a window function like AVG() with a frame clause. For each row, the average is computed over the current row and the preceding 6 rows (by date). Ensure data is sorted by date; if there are multiple entities (e.g., multiple stores), partition by entity. A common query is: SELECT date, sales, AVG(sales) OVER (PARTITION BY store_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg_7d. The ROWS clause assumes one row per day per store; if there are gaps (missing days), ROWS counts preceding rows, not days. If you need a strictly 7-day calendar window, use RANGE INTERVAL '7' DAY PRECEDING if supported (e.g., in PostgreSQL). Note that RANK or other window functions are not needed. The time complexity is O(n log n) due to the ORDER BY in the window clause. A common pitfall is forgetting to partition, causing the average to span across different groups.

Reference solutionsql

-- Assume table daily_sales(store_id, date, sales)
-- For each store, compute 7-day rolling average (rows, assuming no gaps)
SELECT 
  store_id,
  date,
  sales,
  AVG(sales) OVER (
    PARTITION BY store_id
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_7d
FROM daily_sales
ORDER BY store_id, date;

System DesignDesign a data warehouse for an e-commerce company's analytics.

What a strong answer covers

Requirements: customer analytics, sales, inventory, marketing
Star schema: fact tables (sales, inventory, clicks) and dimension tables (customer, product, time, store)
ETL/ELT strategy: from OLTP and logs; incremental loads
Scaling: partitioning by date, using columnar storage, aggregations
Consider slowly changing dimensions (SCD) for customer and product attributes

View a sample answer

For an e-commerce company, the data warehouse should support analytics on sales, customer behavior, inventory, and marketing. I would design a star schema: Fact tables include Sales (foreign keys to date, customer, product, store, promotion; measures: quantity, price, discount), Inventory (date, product, store; quantity_on_hand), and Clickstream (session_id, date, customer, product, action). Dimension tables: Customer (SCD Type 2 for changing attributes like address), Product (SCD Type 2 for category, price changes), Date (pre-populated with calendar attributes), Store (location, type), Promotion (type, discount). The ETL pipeline would extract from transactional databases via CDC (Debezium) and from web logs via Kafka, land raw data in a staging area, then load into dimensional models. Use ELT for heavy transformations using the warehouse's SQL engine (e.g., Redshift, BigQuery). Partition fact tables by date (e.g., monthly); use columnar storage and compression. Create aggregate tables (e.g., daily sales by product) for common queries. A common pitfall is over-normalizing dimensions that don't change; prefer degenerate dimensions for low-cardinality attributes.

BehavioralTell me about a data-quality issue you caught and fixed.

What a strong answer covers

Situation: production pipeline outputting incorrect aggregates
Task: identify root cause, fix, prevent recurrence
Action: used automated data profiling, added validation checks, wrote reconciliation script
Result: error caught intra-day, zero business impact, implemented monitoring
Soft skills: cross-team communication, documentation

View a sample answer

In a previous role, I noticed that a daily sales aggregation pipeline was showing a $1M discrepancy compared to the source ERP system. The issue was a changed join key in a dimension table that caused some records to be silently dropped. I first ran a row-count reconciliation to confirm the gap. Then I profiled the data and found that the product dimension's key had been updated from a legacy integer to a UUID, but the source feed was still sending the old ID plus a new column. I worked with the upstream data engineering team to add a crosswalk table and modified the ETL to use the correct join. I also added automated checks: row count validation, sum-of-sales comparison, and a daily email report to stakeholders. As a result, the error was caught within the first hour of the run with zero business impact. The key lesson was to always monitor data quality with automated checks, not manual spot checks. A common pitfall is assuming source data consistency; always profile and validate upstream changes.

BehavioralHow do you prioritize pipeline reliability vs. new requests?

What a strong answer covers

Establish SLA/SLOs for reliability (e.g., 99.9% uptime)
Quantify impact of pipeline failures and maintenance costs
Use incident severity: P0 (data loss) overrides new features
Invest in monitoring, alerting, and automated remediation
Balance: allocate 70% time to reliability, 30% to new features; adjust

View a sample answer

Prioritizing pipeline reliability over new requests requires a systematic approach. First, define clear Service Level Objectives (SLOs) for latency, completeness, and uptime (e.g., 99.9% of daily batches complete within 4 hours). Any new feature must not degrade these SLOs. I use a severity framework: P0 incidents (data loss or major outage) block all feature work; P1 (partial delay) triggers a 24-hour fix window. For new requests, I evaluate the cost of reliability debt: if the request would increase complexity without proper monitoring, I propose a reliability-first design. I allocate roughly 70% of engineering time to maintaining pipelines (monitoring, alerts, failover, disaster recovery) and 30% to new features. For example, I pushed back on a request to add a new API feed because the team had no test harness for it; instead, we first built a schema validation layer. I also communicate tradeoffs to business stakeholders using data: "Adding this feed without reliability improvements increases probability of missed SLAs by 15%." A common pitfall is treating reliability as a one-time project; it must be ongoing, with blameless postmortems driving improvements.

What interviewers assess

SQL

Window functions, joins, query optimization and execution plans.

Data modeling

Star/snowflake schemas, partitioning and slowly changing dimensions.

Pipelines

ETL/ELT, orchestration, idempotency and backfills.

Batch & streaming

Spark, Kafka and trade-offs between latency and throughput.

Data quality

Validation, lineage and handling late or duplicate data.

Frequently asked questions

How much SQL do data engineer interviews require?

A lot — expect multiple SQL rounds covering joins, window functions and optimization, often on realistic schemas.

What system design appears in data engineering interviews?

Designing a data warehouse, an event pipeline, or a batch/streaming architecture with attention to partitioning, idempotency and data quality.

How do I prepare for a data engineering interview?

Drill SQL daily, practice pipeline design out loud, and use mock interviews to rehearse trade-off explanations.