Data Engineering Skill
Quick Reference
| Role | Focus | Timeline | Entry From |
|------|-------|----------|------------|
| Data Engineer | Pipelines, Infra | 12-24 mo | Backend Dev |
| ML Engineer | Models, Features | 12-24 mo | Data Scientist |
| AI Engineer | LLMs, Agents | 6-12 mo | Any Developer |
Learning Paths
Data Engineer
[1] SQL Mastery (4-6 wk)
│ └─ Window functions, CTEs, optimization
│
▼
[2] Python for Data (4-6 wk)
│ └─ Pandas, file formats, scripting
│
▼
[3] ETL/ELT Pipelines (6-8 wk)
│ └─ Extract, transform, load patterns
│
▼
[4] Big Data: Spark (8-12 wk)
│ └─ PySpark, DataFrames, partitioning
│
▼
[5] Data Warehouse (4-6 wk)
│ └─ Star schema, dbt, Snowflake/BQ
│
▼
[6] Orchestration (4-6 wk)
└─ Airflow/Prefect, scheduling, monitoring
2025 Stack: Python + Spark + Airflow + dbt + Snowflake/BigQuery
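Steps [1] to [3] of this path can be illustrated together in a minimal sketch that uses only the Python standard library. The `RAW_CSV` data, the `orders` table, and all function names are hypothetical, chosen for illustration. The query assumes a SQLite build with window-function support (3.25 or newer).

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,alice,30.0
2,bob,20.0
3,alice,50.0
4,bob,
5,carol,10.0
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and cast column types."""
    return [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in rows
        if r["amount"]
    ]

def load(conn, rows):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

def running_totals(conn):
    """A CTE plus a window function: running total per customer."""
    query = """
        WITH valid AS (
            SELECT order_id, customer, amount FROM orders
        )
        SELECT customer,
               order_id,
               SUM(amount) OVER (
                   PARTITION BY customer ORDER BY order_id
               ) AS running_total
        FROM valid
        ORDER BY customer, order_id
    """
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
result = running_totals(conn)
```

In an ELT variant, the raw rows would be loaded first and the cleaning step would run as SQL inside the warehouse, which is the part dbt typically manages.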
ML Engineer
[1] Python + NumPy (4-6 wk)
│
▼
[2] Math Foundations (6-8 wk)
│ └─ Linear algebra, calculus, statistics
│
▼
[3] Classical ML (8-12 wk)
│ └─ scikit-learn, XGBoost, evaluation
│
▼
[4] Deep Learning (8-12 wk)
│ └─ PyTorch, CNNs, Transformers
│
▼
[5] MLOps (6-8 wk)
└─ MLflow, model serving, monitoring
2025 Stack: Python + PyTorch + scikit-learn + MLflow + W&B
AI Engineer (2025 Hot Path)
[1] LLM Fundamentals (2-3 wk)
│ └─ Tokens, embeddings, context windows
│
▼
[2] Prompt Engineering (2-3 wk)
│ └─ Few-shot, CoT, structured output
│
▼
[3] RAG Systems (3-4 wk)
│ └─ Embeddings, vector DBs, retrieval
│
▼
[4] AI Agents (4-6 wk)
│ └─ Tool calling, agent loops, memory
│
▼
[5] Production Deploy (ongoing)
└─ Evaluation, guardrails, monitoring
2025 Stack: Python + LangChain/LlamaIndex + OpenAI/Anthropic + ChromaDB
2025 Tool Matrix
Data Processing
| Tool | Scale | Use Case |
|------|-------|----------|
| Pandas | <10 GB | Prototyping, small data |
| Polars | <100 GB | Fast local processing |
| Spark | >100 GB | Distributed processing |
| dbt | Any | Transformations, testing |
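The size thresholds in the table can be read as a simple rule of thumb. The sketch below encodes them in a hypothetical helper, `suggest_tool`. Treating exactly 100 GB as Spark territory is an assumption, since the table leaves that boundary open. dbt is left out because it is a transformation layer that applies at any scale.

```python
def suggest_tool(size_gb: float) -> str:
    """Map an approximate dataset size in GB to a processing tool.

    Thresholds follow the table above; they are rough guidance,
    not hard limits.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if size_gb < 10:
        return "Pandas"
    if size_gb < 100:
        return "Polars"
    return "Spark"  # assumption: the 100 GB boundary goes to Spark
```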
ML Frameworks
| Framework | Best For | Complexity |
|-----------|----------|------------|
| scikit-learn | Classical ML | Low |
| XGBoost | Tabular data | Low |
| PyTorch | Research, flexibility | Medium |
| TensorFlow | Production, mobile | Medium |
LLM/AI Tools
| Tool | Use Case |
|------|----------|
| LangChain | LLM orchestration |
| LlamaIndex | RAG systems |
| Claude/OpenAI | LLM APIs |
| ChromaDB | Vector storage |
Algorithm Reference
Classical ML
| Type | Algorithms |
|------|------------|
| Regression | Linear, Ridge, Lasso, ElasticNet |
| Classification | Logistic, SVM, Decision Tree |
| Ensemble | Random Forest, XGBoost, LightGBM |
| Clustering | K-Means, DBSCAN, Hierarchical |
Deep Learning
| Architecture | Use Case |
|--------------|----------|
| CNN | Images, vision |
| RNN/LSTM | Sequences |
| Transformer | NLP, LLMs |
| Diffusion | Image generation |
AI Agent Architecture (2025)
┌─────────────────────────────────────────┐
│              AGENTIC LOOP               │
├─────────────────────────────────────────┤
│  PERCEIVE → REASON → ACT → REFLECT      │
│     │         │       │       │         │
│     │         │       │       └─► Loop  │
│     │         │       └─► Execute tools │
│     │         └─► LLM decides action    │
│     └─► Gather context, observations    │
└─────────────────────────────────────────┘
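The loop in the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production agent: `fake_llm` is a stand-in for a real model call, and the `TOOLS` registry, `run_agent`, and the `"finish"` action name are all hypothetical. The `max_iterations` cap is the exit condition that keeps the loop from running forever.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: tool name -> callable taking a string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def fake_llm(goal: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the REASON step; returns (action, argument).

    A real implementation would send the goal and observations to a model.
    """
    if not observations:
        return "add", goal                 # first step: call the tool
    return "finish", observations[-1]      # then stop with the last result

def run_agent(
    goal: str,
    decide: Callable[[str, List[str]], Tuple[str, str]] = fake_llm,
    max_iterations: int = 5,
) -> str:
    observations: List[str] = []           # PERCEIVE: context gathered so far
    for _ in range(max_iterations):
        action, arg = decide(goal, observations)   # REASON: pick an action
        if action == "finish":
            return arg
        tool = TOOLS.get(action)
        # ACT: execute the chosen tool, or report that it does not exist
        result = tool(arg) if tool else f"unknown tool: {action}"
        observations.append(result)        # REFLECT: feed the result back
    raise RuntimeError("max_iterations reached without a final answer")

answer = run_agent("2+3")
```

Raising on the iteration cap, rather than returning a partial answer, is a design choice here; a caller could instead catch the error and fall back to a default response.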
Design Patterns (from Anthropic's "Building Effective Agents", published December 2024):
• Prompt Chaining - Sequential fixed steps
• Routing - Classify and dispatch
• Parallelization - Concurrent subtasks
• Orchestrator-Workers - Central delegation
• Evaluator-Optimizer - Generate + critique
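Three of these patterns are small enough to sketch as plain functions. The names `chain`, `route`, and `evaluate_optimize` are hypothetical, and each step, classifier, or generator would normally wrap an LLM call; here they are ordinary callables so the control flow is easy to see.

```python
from typing import Callable, Dict, List, Optional

def chain(steps: List[Callable[[str], str]], text: str) -> str:
    """Prompt chaining: each step's output becomes the next step's input."""
    for step in steps:
        text = step(text)
    return text

def route(
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    text: str,
    default: str,
) -> str:
    """Routing: classify the input, then dispatch to the matching handler."""
    label = classify(text)
    handler = handlers.get(label, handlers[default])
    return handler(text)

def evaluate_optimize(
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Evaluator-optimizer: regenerate until the critic has no feedback."""
    draft = generate(task, None)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:       # critic is satisfied
            return draft
        draft = generate(task, feedback)
    return draft                   # best effort after max_rounds

# Usage with trivial stand-ins for model calls.
chained = chain([str.strip, str.lower], "  Hello World ")
routed = route(
    lambda t: "billing" if "refund" in t.lower() else "general",
    {"billing": lambda t: "billing:" + t, "general": lambda t: "general:" + t},
    "Refund please",
    default="general",
)
refined = evaluate_optimize(
    lambda task, fb: task if fb is None else task + "!",
    lambda d: None if d.endswith("!") else "add emphasis",
    "ship it",
)
```

Parallelization and orchestrator-workers follow the same shape, with the loop body fanned out across concurrent calls.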
Troubleshooting
Which path to choose?
├─► Love building infrastructure? → Data Engineer
├─► Love algorithms/math? → ML Engineer
├─► Want fastest AI entry? → AI Engineer
└─► Uncertain? → Start with Python + SQL
Model not performing well?
├─► Data quality issues? → Clean data first
├─► Weak features? → Engineer better features
├─► Wrong algorithm? → Try different models
├─► Overfitting? → More data, regularization
└─► Untuned hyperparameters? → Grid/random search
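The random-search branch can be sketched without any ML library. The helper `random_search` and the objective `toy_validation_loss` are hypothetical; in practice the objective would train a model with the sampled parameters and return its validation loss.

```python
import random
from typing import Callable, Dict, Tuple

def random_search(
    objective: Callable[[Dict[str, float]], float],
    space: Dict[str, Tuple[float, float]],
    n_trials: int = 50,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Sample parameters uniformly from `space` and keep the lowest loss."""
    rng = random.Random(seed)      # seeded for reproducible trials
    best_params: Dict[str, float] = {}
    best_score = float("inf")
    for _ in range(n_trials):
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in space.items()
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_loss(params: Dict[str, float]) -> float:
    """Hypothetical loss surface with its minimum at alpha = 0.3."""
    return (params["alpha"] - 0.3) ** 2

best, loss = random_search(
    toy_validation_loss, {"alpha": (0.0, 1.0)}, n_trials=200
)
```

Grid search differs only in how candidates are produced: it enumerates fixed values per parameter instead of sampling them.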
LLM giving bad answers?
├─► Prompt too vague? → Be more specific
├─► Missing context? → Add relevant info
├─► Hallucinating? → Use RAG, verify facts
└─► Calling the wrong tool? → Improve tool descriptions
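The "use RAG" branch amounts to retrieving relevant text and placing it in the prompt. The sketch below shows that flow with a toy bag-of-words similarity in place of a real embedding model and vector database. `embed`, `retrieve`, `build_prompt`, and the `DOCS` list are all hypothetical.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system uses a model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: List[str], k: int = 1) -> str:
    """Ground the question in retrieved context before calling an LLM."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

DOCS = [
    "airflow schedules and monitors pipelines",
    "chromadb stores embedding vectors",
    "dbt runs sql transformations and tests",
]
top = retrieve("what stores vectors", DOCS)
prompt = build_prompt("what stores vectors", DOCS)
```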
Common Failure Modes
| Symptom | Root Cause | Recovery |
|---------|------------|----------|
| Model fails in prod | Data drift | Monitor distributions |
| Pipeline always late | Unoptimized queries | Profile, partition |
| RAG finds wrong docs | Bad chunking | Tune chunk size, overlap |
| Agent loops forever | No exit condition | Add max iterations |
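Two of the recoveries in this table can be sketched briefly. `chunk_words` shows chunking with overlap, and `population_stability_index` is one common way to monitor a feature's distribution for drift. Both function names, the bin count, and the epsilon floor are assumptions made for illustration.

```python
import math
from typing import List, Sequence

def chunk_words(
    words: Sequence[str], chunk_size: int, overlap: int
) -> List[List[str]]:
    """Split a token sequence into chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(list(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break              # the last chunk already reaches the end
    return chunks

def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a reference sample and a new sample of one feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0     # avoid zero-width bins

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - low) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clip out-of-range
        # a small floor keeps log() defined for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

chunks = chunk_words(list("abcdefghij"), chunk_size=4, overlap=1)
reference = [float(v) for v in range(100)]
drifted = [v + 50.0 for v in reference]
psi_same = population_stability_index(reference, reference)
psi_drift = population_stability_index(reference, drifted)
```

A PSI above roughly 0.25 is often treated as a sign of significant drift, though that threshold is a rule of thumb and should be calibrated per feature.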
Portfolio Projects
Data Engineering
- ETL Pipeline (Airflow + dbt)
- Real-time Streaming (Kafka + Spark)
- Data Warehouse Design
ML Engineering
- Classification Model (scikit-learn)
- Deep Learning Model (PyTorch)
- ML Pipeline (MLflow)
AI Engineering
- RAG Chatbot (LangChain + ChromaDB)
- AI Agent with Tools
- Multi-Agent System
Next Actions
Specify your target role for a detailed learning plan.