| name | career-growth |
| description | Portfolio building, technical interviews, job search strategies, and continuous learning |
| sasmp_version | 1.3.0 |
| bonded_agent | 01-data-engineer |
| bond_type | SUPPORT_BOND |
| skill_version | 2.0.0 |
| last_updated | 2025-01 |
| complexity | foundational |
| estimated_mastery_hours | 40 |
| prerequisites | |
| unlocks | |
Career Growth
Professional development strategies for data engineering career advancement.
Quick Start
# Data Engineer Portfolio Checklist
## Required Projects (Pick 3-5)
- [ ] End-to-end ETL pipeline (Airflow + dbt)
- [ ] Real-time streaming project (Kafka/Spark Streaming)
- [ ] Data warehouse design (Snowflake/BigQuery)
- [ ] ML pipeline with MLOps (MLflow)
- [ ] API for data access (FastAPI)
## Documentation Template
Each project should include:
1. Problem statement
2. Architecture diagram
3. Tech stack justification
4. Challenges & solutions
5. Results/metrics
6. GitHub link with clean code
Core Concepts
1. Technical Interview Preparation
# Common coding patterns for data engineering interviews
# 1. SQL Window Functions
"""
Write a query to find the running total of sales by month,
and the percentage change from the previous month.
"""
sql = """
SELECT
    month,
    sales,
    SUM(sales) OVER (ORDER BY month) AS running_total,
    100.0 * (sales - LAG(sales) OVER (ORDER BY month))
        / NULLIF(LAG(sales) OVER (ORDER BY month), 0) AS pct_change
FROM monthly_sales
ORDER BY month;
"""
# 2. Data Processing - Find duplicates
def find_duplicates(data: list[dict], key: str) -> list[dict]:
    """Find duplicate records based on a key."""
    seen = {}
    duplicates = []
    for record in data:
        k = record[key]
        if k in seen:
            duplicates.append(record)
        else:
            seen[k] = record
    return duplicates
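# Example usage (the records below are illustrative):
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},  # duplicate email
]
assert find_duplicates(records, "email") == [{"id": 3, "email": "a@example.com"}]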
# 3. Implement rate limiter
from collections import defaultdict
import time
class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(list)

    def is_allowed(self, user_id: str) -> bool:
        now = time.time()
        # Remove requests that have fallen out of the sliding window
        self.requests[user_id] = [
            t for t in self.requests[user_id]
            if now - t < self.window
        ]
        if len(self.requests[user_id]) < self.max_requests:
            self.requests[user_id].append(now)
            return True
        return False
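# Example usage: at most 3 requests per user in a 60-second window (values illustrative).
limiter = RateLimiter(max_requests=3, window_seconds=60)
print([limiter.is_allowed("user-123") for _ in range(5)])  # [True, True, True, False, False]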
# 4. Design question: Data pipeline for e-commerce
"""
Requirements:
- Process 1M orders/day
- Real-time dashboard updates
- Historical analytics
Architecture:
1. Ingestion: Kafka for real-time events
2. Processing: Spark Streaming for aggregations
3. Storage: Delta Lake for ACID, Snowflake for analytics
4. Serving: Redis for real-time metrics, API for dashboards
"""
2. Resume Optimization
## Data Engineer Resume Template
### Summary
Data Engineer with X years of experience building scalable data pipelines
processing Y TB/day. Expert in [Spark/Airflow/dbt]. Reduced pipeline
latency by Z% at [Company].
### Experience Format (STAR Method)
**Senior Data Engineer** | Company | 2022-Present
- **Situation**: Legacy ETL system processing 500GB daily with 4-hour latency
- **Task**: Redesign for real-time analytics
- **Action**: Built Spark Streaming pipeline with Delta Lake, implemented
incremental processing
- **Result**: Reduced latency to 5 minutes, cut infrastructure costs by 40%
### Skills Section
**Languages**: Python, SQL, Scala
**Frameworks**: Spark, Airflow, dbt, Kafka
**Databases**: PostgreSQL, Snowflake, MongoDB, Redis
**Cloud**: AWS (Glue, EMR, S3), GCP (BigQuery, Dataflow)
**Tools**: Docker, Kubernetes, Terraform, Git
### Quantify Everything
- "Built data pipeline" → "Built pipeline processing 2TB/day with 99.9% uptime"
- "Improved performance" → "Reduced query time from 30min to 30sec (60x improvement)"
3. Interview Questions to Ask
## Questions for Data Engineering Interviews
### About the Team
- What does a typical data pipeline look like here?
- How do you handle data quality issues?
- What's the tech stack? Any planned migrations?
### About the Role
- What would success look like in 6 months?
- What's the biggest data challenge the team faces?
- How do data engineers collaborate with data scientists?
### About Engineering Practices
- How do you handle schema changes in production?
- What's your approach to testing data pipelines?
- How do you manage technical debt?
### Red Flags to Watch For
- "We don't have time for testing"
- "One person handles all the data infrastructure"
- "We're still on [very outdated technology]"
- Vague answers about on-call and incident response
4. Learning Path by Experience Level
## Career Progression
### Junior (0-2 years)
Focus Areas:
- SQL proficiency (complex queries, optimization)
- Python for data processing
- One cloud platform deeply (AWS/GCP)
- Git and basic CI/CD
- Understanding ETL patterns
### Mid-Level (2-5 years)
Focus Areas:
- Distributed systems (Spark)
- Data modeling (dimensional, Data Vault)
- Orchestration (Airflow)
- Infrastructure as Code
- Data quality frameworks
### Senior (5+ years)
Focus Areas:
- System design and architecture
- Cost optimization at scale
- Team leadership and mentoring
- Cross-functional collaboration
- Vendor evaluation and selection
### Staff/Principal (8+ years)
Focus Areas:
- Organization-wide data strategy
- Building data platforms
- Technical roadmap ownership
- Industry thought leadership
Resources
Learning Platforms
Interview Prep
Community
Books
- "Fundamentals of Data Engineering" - Reis & Housley
- "Designing Data-Intensive Applications" - Kleppmann
- "The Data Warehouse Toolkit" - Kimball
Best Practices
# ✅ DO:
- Build public projects on GitHub
- Write technical blog posts
- Contribute to open source
- Network at meetups/conferences
- Keep skills current (follow trends)
# ❌ DON'T:
- Apply without tailoring resume
- Neglect soft skills
- Stop learning after getting hired
- Ignore feedback from interviews
- Burn bridges when leaving jobs
Skill Certification Checklist:
- [ ] Have 3+ portfolio projects on GitHub
- [ ] Can explain system design decisions
- [ ] Can solve SQL problems efficiently
- [ ] Have updated LinkedIn and resume
- [ ] Active in data engineering community