Senior Data Engineer Interview Questions
What a Senior Data Engineer interview focuses on, the questions you'll face, and how to practice them with instant AI feedback.
What's expected at the Senior level
Expect platform/warehouse architecture, reliability and governance ownership.
Sample Data Engineer interview questions
- CodingWrite a SQL query to find the second-highest salary per department.What a strong answer covers
- Use DENSE_RANK() or ROW_NUMBER() window function
- Partition by department, order by salary descending
- Handle ties correctly (DENSE_RANK) vs unique ranking (ROW_NUMBER)
- Filter where rank = 2
- Consider performance with large tables (indexes)
View a sample answer
To find the second-highest salary per department, use a window function like DENSE_RANK() partitioned by department and ordered by salary descending. DENSE_RANK ensures that if multiple employees have the same highest salary, they all get rank 1, and the next distinct salary gets rank 2. Then filter where rank = 2. Alternatively, ROW_NUMBER() assigns unique ranks, which may skip rank 2 if there are ties for rank 1. The query is: WITH ranked AS (SELECT *, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk FROM employees) SELECT department, salary FROM ranked WHERE rnk = 2. This returns all employees with the second-highest salary per department. If no second-highest exists (e.g., department has only one salary), the query returns no rows. For large tables, ensure an index on (department, salary) for efficiency. A common pitfall is using RANK() or ROW_NUMBER() without understanding tie-handling requirements.
Reference solutionsql WITH ranked AS ( SELECT *, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk FROM employees ) SELECT department, salary FROM ranked WHERE rnk = 2; - TechnicalExplain the difference between ETL and ELT and when you would use each.What a strong answer covers
- ETL: Extract, Transform, Load (transform before loading)
- ELT: Extract, Load, Transform (transform in target system)
- ETL use case: complex transformations, regulatory compliance, lower target system power
- ELT use case: big data, cloud warehouses (Snowflake, BigQuery), schema-on-read
- Tradeoffs: ETL more upfront work, ELT requires powerful target compute
View a sample answer
ETL (Extract, Transform, Load) processes data by extracting from sources, transforming it in a staging area (e.g., using Spark or a dedicated ETL tool), and then loading it into the target system. ELT (Extract, Load, Transform) first extracts and loads raw data into the target system (e.g., a cloud data warehouse), then performs transformations within that system using SQL or its compute. Choose ETL when you need to apply complex business logic before loading, when the target system has limited compute resources, or when compliance requires masking data early. Choose ELT when dealing with large volumes of semi-structured data (like JSON logs), when using cost-effective cloud warehouses with elastic compute (Snowflake, BigQuery, Redshift), or when you want to keep raw data for reprocessing. A common pitfall is assuming ELT is always better; in reality, ETL can reduce load on the target system and simplify monitoring. For example, a financial report requiring strict aggregation and validation is better suited to ETL, while a clickstream analysis exploring unstructured data benefits from ELT.
- TechnicalHow would you design an idempotent pipeline that can safely re-run?What a strong answer covers
- Idempotency: re-running produces same result
- Use unique identifiers and upsert (merge) logic
- Watermark or partition cuts to avoid duplicates
- Store deduplicated state (e.g., job run IDs table)
- Handle partial failures with atomic commits or checkpointing
View a sample answer
An idempotent pipeline ensures that re-running from the same input yields identical output, preventing duplicates or inconsistent aggregations. Key design elements: First, each run should have a unique run ID, and the pipeline should store a table of processed runs (e.g., `pipeline_runs` with run_id and watermark). The pipeline checks this table before processing a batch to avoid reprocessing fully completed intervals. For incremental loads, use append-only logs with deduplication at insert time via MERGE (UPSERT) statements based on a record’s natural key. For streaming, use exactly-once semantics from the source (e.g., Kafka with transactional producers) and checkpoint offsets in a sink (e.g., a database or object store). Additionally, partition or window your data by time (e.g., processing_date), and re-run only failed partitions. For batch jobs, compare source data fingerprints (e.g., checksums) to skip unchanged files. A common pitfall is assuming idempotency with only a uniqueness constraint; handle retries during partial runs by making transforms deterministic. For example, a pipeline that computes daily sales aggregates should store results keyed by (date, store_id) and use INSERT … ON CONFLICT UPDATE to recalc safely.
- TechnicalHow do you handle late-arriving data in a streaming pipeline?What a strong answer covers
- Define handling strategy: discard, reprocess, maintain state
- Use watermarks and allowed lateness thresholds
- Use side inputs for late data merges
- Store raw events in a persistent log for replay
- Handle late data via upserts or incremental recompute
View a sample answer
Late-arriving data in streaming pipelines requires a clear strategy based on business requirements. The typical approach: set a watermark (e.g., max event time observed minus allowed lateness) and discard data older than that. But to avoid data loss, store raw events in a persistent log (Kafka topic with retention) so you can reprocess later if needed. For aggregations, use a windowed approach with allowed lateness (e.g., in Apache Flink, set `.allowedLateness(Time.hours(1))`) and trigger early firings with updates. For joins, use state with a required key and store late events in a side output, then merge them into the main result via a separate batch job or a lookup table. Alternatively, use a lambda architecture: streaming provides low-latency approximations, and batch recomputes exact results including late data. A common pitfall is assuming streaming systems handle late data seamlessly; always plan for out-of-order events by using event-time processing and checkpointing. For example, in a real-time dashboard, accept updates to metrics for up to 1 hour, then finalize the hour's aggregates.
- CodingUse window functions to compute a 7-day rolling average.What a strong answer covers
- Window function with ROWS or RANGE frame
- Partition by entity if needed (e.g., user, product)
- Use AVG() with ORDER BY date and ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
- Handle missing dates (gaps) by using RANGE or filling dates
- Time complexity: O(n log n) due to sorting
View a sample answer
To compute a 7-day rolling average, use a window function like AVG() with a frame clause. For each row, the average is computed over the current row and the preceding 6 rows (by date). Ensure data is sorted by date; if there are multiple entities (e.g., multiple stores), partition by entity. A common query is: SELECT date, sales, AVG(sales) OVER (PARTITION BY store_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg_7d. The ROWS clause assumes one row per day per store; if there are gaps (missing days), ROWS counts preceding rows, not days. If you need a strictly 7-day calendar window, use RANGE INTERVAL '7' DAY PRECEDING if supported (e.g., in PostgreSQL). Note that RANK or other window functions are not needed. The time complexity is O(n log n) due to the ORDER BY in the window clause. A common pitfall is forgetting to partition, causing the average to span across different groups.
Reference solutionsql -- Assume table daily_sales(store_id, date, sales) -- For each store, compute 7-day rolling average (rows, assuming no gaps) SELECT store_id, date, sales, AVG(sales) OVER ( PARTITION BY store_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS rolling_avg_7d FROM daily_sales ORDER BY store_id, date; - System DesignDesign a data warehouse for an e-commerce company's analytics.What a strong answer covers
- Requirements: customer analytics, sales, inventory, marketing
- Star schema: fact tables (sales, inventory, clicks) and dimension tables (customer, product, time, store)
- ETL/ELT strategy: from OLTP and logs; incremental loads
- Scaling: partitioning by date, using columnar storage, aggregations
- Consider slowly changing dimensions (SCD) for customer and product attributes
View a sample answer
For an e-commerce company, the data warehouse should support analytics on sales, customer behavior, inventory, and marketing. I would design a star schema: Fact tables include Sales (foreign keys to date, customer, product, store, promotion; measures: quantity, price, discount), Inventory (date, product, store; quantity_on_hand), and Clickstream (session_id, date, customer, product, action). Dimension tables: Customer (SCD Type 2 for changing attributes like address), Product (SCD Type 2 for category, price changes), Date (pre-populated with calendar attributes), Store (location, type), Promotion (type, discount). The ETL pipeline would extract from transactional databases via CDC (Debezium) and from web logs via Kafka, land raw data in a staging area, then load into dimensional models. Use ELT for heavy transformations using the warehouse's SQL engine (e.g., Redshift, BigQuery). Partition fact tables by date (e.g., monthly); use columnar storage and compression. Create aggregate tables (e.g., daily sales by product) for common queries. A common pitfall is over-normalizing dimensions that don't change; prefer degenerate dimensions for low-cardinality attributes.
- BehavioralTell me about a data-quality issue you caught and fixed.What a strong answer covers
- Situation: production pipeline outputting incorrect aggregates
- Task: identify root cause, fix, prevent recurrence
- Action: used automated data profiling, added validation checks, wrote reconciliation script
- Result: error caught intra-day, zero business impact, implemented monitoring
- Soft skills: cross-team communication, documentation
View a sample answer
In a previous role, I noticed that a daily sales aggregation pipeline was showing a $1M discrepancy compared to the source ERP system. The issue was a changed join key in a dimension table that caused some records to be silently dropped. I first ran a row-count reconciliation to confirm the gap. Then I profiled the data and found that the product dimension's key had been updated from a legacy integer to a UUID, but the source feed was still sending the old ID plus a new column. I worked with the upstream data engineering team to add a crosswalk table and modified the ETL to use the correct join. I also added automated checks: row count validation, sum-of-sales comparison, and a daily email report to stakeholders. As a result, the error was caught within the first hour of the run with zero business impact. The key lesson was to always monitor data quality with automated checks, not manual spot checks. A common pitfall is assuming source data consistency; always profile and validate upstream changes.
- BehavioralHow do you prioritize pipeline reliability vs. new requests?What a strong answer covers
- Establish SLA/SLOs for reliability (e.g., 99.9% uptime)
- Quantify impact of pipeline failures and maintenance costs
- Use incident severity: P0 (data loss) overrides new features
- Invest in monitoring, alerting, and automated remediation
- Balance: allocate 70% time to reliability, 30% to new features; adjust
View a sample answer
Prioritizing pipeline reliability over new requests requires a systematic approach. First, define clear Service Level Objectives (SLOs) for latency, completeness, and uptime (e.g., 99.9% of daily batches complete within 4 hours). Any new feature must not degrade these SLOs. I use a severity framework: P0 incidents (data loss or major outage) block all feature work; P1 (partial delay) triggers a 24-hour fix window. For new requests, I evaluate the cost of reliability debt: if the request would increase complexity without proper monitoring, I propose a reliability-first design. I allocate roughly 70% of engineering time to maintaining pipelines (monitoring, alerts, failover, disaster recovery) and 30% to new features. For example, I pushed back on a request to add a new API feed because the team had no test harness for it; instead, we first built a schema validation layer. I also communicate tradeoffs to business stakeholders using data: "Adding this feed without reliability improvements increases probability of missed SLAs by 15%." A common pitfall is treating reliability as a one-time project; it must be ongoing, with blameless postmortems driving improvements.
What interviewers assess
SQL
Window functions, joins, query optimization and execution plans.
Data modeling
Star/snowflake schemas, partitioning and slowly changing dimensions.
Pipelines
ETL/ELT, orchestration, idempotency and backfills.
Batch & streaming
Spark, Kafka and trade-offs between latency and throughput.
Data quality
Validation, lineage and handling late or duplicate data.
How to prepare
- Sharpen SQL — window functions and optimization come up in almost every loop.
- For pipeline design, lead with idempotency, retries and backfills.
- Mention data quality and lineage proactively; it signals production maturity.
Frequently asked questions
How much SQL do data engineer interviews require?
A lot — expect multiple SQL rounds covering joins, window functions and optimization, often on realistic schemas.
What system design appears in data engineering interviews?
Designing a data warehouse, an event pipeline, or a batch/streaming architecture with attention to partitioning, idempotency and data quality.
How do I prepare for a data engineering interview?
Drill SQL daily, practice pipeline design out loud, and use mock interviews to rehearse trade-off explanations.
Practice Data Engineer questions with instant AI feedback
Offersly runs a mock interview tailored to your resume and target role, then scores every answer on relevance, depth, clarity and correctness.