Stream Processing
Patterns and technologies for real-time data processing, event streaming, and stream analytics.
When to Use This Skill
- Designing real-time data pipelines
- Choosing stream processing frameworks
- Implementing event-driven architectures
- Building real-time analytics
- Understanding streaming vs batch trade-offs
Batch vs Streaming
Comparison
| Aspect | Batch | Streaming |
|--------|-------|-----------|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data | Bounded (finite) | Unbounded (infinite) |
| Processing | Process all at once | Process as it arrives |
| State | Recompute each run | Maintain continuously |
| Complexity | Lower | Higher |
| Cost | Often lower | Often higher |
When to Use Streaming
Use streaming when:
- Real-time responses required (<1 minute)
- Events need immediate action (fraud, alerts)
- Data arrives continuously
- Users expect live updates
- Time-sensitive business decisions
Use batch when:
- Daily/hourly reports sufficient
- Complex transformations needed
- Cost optimization priority
- Historical analysis
- One-time processing
Stream Processing Concepts
Event Time vs Processing Time
Event Time: When event actually occurred
Processing Time: When event is processed
Example:
┌─────────────────────────────────────────────────────────┐
│ Event: Purchase at 10:00:00 (event time) │
│ Network delay: 5 seconds │
│ Processing: 10:00:05 (processing time) │
└─────────────────────────────────────────────────────────┘
Why it matters:
- Late events need handling
- Ordering not guaranteed
- Watermarks track progress
Watermarks
Watermark = an assertion that "all events with timestamps before this time have (very likely) arrived"
Event stream:
──[10:01]──[10:02]──[10:00]──[10:03]──[Watermark: 10:00]──
Allows system to:
- Know when window is complete
- Handle late events
- Balance latency vs completeness
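A minimal sketch of the mechanics in plain Java (the `WatermarkTracker` class is illustrative, not from any framework):

```java
import java.time.Duration;

// Tracks the highest event time seen and emits a watermark that lags
// it by a fixed allowed lateness.
class WatermarkTracker {
    private long maxEventTimeMillis = Long.MIN_VALUE;
    private final long maxLatenessMillis;

    WatermarkTracker(Duration maxLateness) {
        this.maxLatenessMillis = maxLateness.toMillis();
    }

    // Call for every incoming event, in whatever order events arrive.
    void observe(long eventTimeMillis) {
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
    }

    // Watermark = max event time seen - allowed lateness. Events with
    // timestamps below this value are treated as late.
    long currentWatermark() {
        return maxEventTimeMillis - maxLatenessMillis;
    }
}
```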
Windows
Tumbling Window (fixed, non-overlapping):
|─────|─────|─────|
0 5 10 15 (seconds)
Sliding Window (fixed, overlapping):
|─────|
|─────|
|─────|
Size: 5s, Slide: 2s
Session Window (activity-based):
|──────| |───────────| |───|
User activity with gaps defines windows
Count Window:
Process every N events
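Each shape corresponds to a built-in assigner in Flink's DataStream API. The fragment below assumes a `DataStream<Event>` named `events` with a `getUserId()` accessor:

```java
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Tumbling: fixed 5-second buckets, no overlap.
events.keyBy(e -> e.getUserId())
      .window(TumblingEventTimeWindows.of(Time.seconds(5)));

// Sliding: 5-second windows, a new one starting every 2 seconds.
events.keyBy(e -> e.getUserId())
      .window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(2)));

// Session: the window closes after 1 minute of inactivity per key.
events.keyBy(e -> e.getUserId())
      .window(EventTimeSessionWindows.withGap(Time.minutes(1)));

// Count window: fires every 100 events per key.
events.keyBy(e -> e.getUserId())
      .countWindow(100);
```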
State Management
Stateful operations need state that is maintained across events:
- Aggregations (sum, count, avg)
- Joins between streams
- Pattern detection
- Deduplication
State backends:
- In-memory (fast, limited)
- RocksDB (larger, persistent)
- External (Redis, database)
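As a sketch of framework-managed state, a per-key running count using Flink's `ValueState`; the `Event` type is assumed, and the state lives in whichever backend is configured:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits a running count of events per key; the count is checkpointed
// with the job, so it survives restarts.
class RunningCount extends KeyedProcessFunction<String, Event, Long> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Long> out)
            throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(updated);
    }
}
```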
Stream Processing Frameworks
Apache Kafka Streams
Characteristics:
- Library (not a cluster)
- Exactly-once semantics
- Kafka-native
- Java/Scala
Best for:
- Kafka-centric architectures
- Simpler transformations
- Microservices
Example topology:
source → filter → map → aggregate → sink
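That topology could look roughly like this in the Kafka Streams DSL (topic names and the `Order` type are illustrative):

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("orders");         // source

KTable<String, Long> largeOrdersPerUser = orders
    .filter((key, order) -> order.getAmount() > 1000)              // filter
    .map((key, order) -> KeyValue.pair(order.getUserId(), order))  // map (re-key)
    .groupByKey()                                                  // repartitions by new key
    .count();                                                      // aggregate

largeOrdersPerUser.toStream().to("large-orders-per-user");         // sink
```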
Apache Flink
Characteristics:
- Distributed cluster
- True streaming (not micro-batch)
- Advanced state management
- SQL support
Best for:
- Complex event processing
- Large-scale streaming
- Low-latency requirements
Example:
DataStream<Event> events = env.addSource(kafkaSource);

events
    .keyBy(e -> e.getUserId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new CountAggregator())
    .addSink(sink);
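`CountAggregator` is not a built-in; one plausible implementation of it with Flink's `AggregateFunction`:

```java
import org.apache.flink.api.common.functions.AggregateFunction;

// Counts events per window: the accumulator is a running Long.
class CountAggregator implements AggregateFunction<Event, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long add(Event value, Long acc) { return acc + 1; }
    @Override public Long getResult(Long acc) { return acc; }
    @Override public Long merge(Long a, Long b) { return a + b; }
}
```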
Apache Spark Structured Streaming
Characteristics:
- Micro-batch processing
- Unified batch + streaming API
- Wide ecosystem
- Python, Scala, Java, R
Best for:
- Teams with Spark experience
- Batch + streaming unified
- Machine learning integration
Latency: Seconds (micro-batch)
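A minimal Structured Streaming job in Java, counting words from a socket source (host and port are placeholders). Note the DataFrame code is identical to what a batch job would use:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.*;

SparkSession spark = SparkSession.builder()
    .appName("StreamingWordCount")
    .getOrCreate();

// Unbounded input: each micro-batch reads new lines from the socket.
Dataset<Row> lines = spark.readStream()
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load();

// Same DataFrame API as batch: split lines into words, count per word.
Dataset<Row> counts = lines
    .select(explode(split(col("value"), " ")).alias("word"))
    .groupBy("word")
    .count();

StreamingQuery query = counts.writeStream()
    .outputMode("complete")   // re-emit full counts each micro-batch
    .format("console")
    .start();

query.awaitTermination();
```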
Kafka Streams vs Flink vs Spark
| Factor | Kafka Streams | Flink | Spark |
|--------|---------------|-------|-------|
| Deployment | Library | Cluster | Cluster |
| Latency | Low | Lowest | Medium |
| State | Good | Excellent | Good |
| Exactly-once | Yes | Yes | Yes |
| Complexity | Low | High | Medium |
| Scaling | With Kafka | Independent | Independent |
| SQL | Limited | Yes | Yes |
| ML integration | Limited | Limited | Excellent |
Stream Processing Patterns
Filtering
Input: All events
Output: Events matching criteria
Example: Only process events where amount > 1000
Mapping/Transformation
Input: Event type A
Output: Event type B
Example: Enrich order events with customer data
Aggregation
Input: Multiple events
Output: Single aggregated result
Examples:
- Count events per window
- Sum amounts per user
- Average latency per endpoint
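As a concrete instance of the last example (average latency per endpoint), a windowed average expressed as a Flink `AggregateFunction`; the `Metric` type and its `getLatencyMs()` accessor are assumed:

```java
import org.apache.flink.api.common.functions.AggregateFunction;

// Accumulator is (sum, count); the result is sum / count per window.
class AverageLatency implements AggregateFunction<Metric, long[], Double> {
    @Override public long[] createAccumulator() { return new long[]{0L, 0L}; }
    @Override public long[] add(Metric m, long[] acc) {
        acc[0] += m.getLatencyMs();  // running sum
        acc[1] += 1;                 // running count
        return acc;
    }
    @Override public Double getResult(long[] acc) {
        return acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1];
    }
    @Override public long[] merge(long[] a, long[] b) {
        return new long[]{a[0] + b[0], a[1] + b[1]};
    }
}
```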
Join Patterns
Stream-Stream Join:
┌─────────────┐ ┌─────────────┐
│ Orders │ ──► │ Join │
└─────────────┘ │ (by order_id│
┌─────────────┐ │ in window) │
│ Shipments │ ──► │ │
└─────────────┘ └─────────────┘
Stream-Table Join (Enrichment):
┌─────────────┐ ┌─────────────┐
│ Events │ ──► │ Join │
└─────────────┘ │ (lookup by │
┌─────────────┐ │ customer) │
│ Customer │ ──► │ │
│ Table │ └─────────────┘
└─────────────┘
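A stream-table join sketch in the Kafka Streams DSL; topic names and the `Order`, `Customer`, and `EnrichedOrder` types are illustrative:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("orders");        // stream
KTable<String, Customer> customers = builder.table("customers"); // table: latest value per key

// Each order is joined against the current customer record for its key.
KStream<String, EnrichedOrder> enriched = orders.join(
    customers,
    (order, customer) -> new EnrichedOrder(order, customer));

enriched.to("enriched-orders");
```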
Deduplication
Problem: Duplicate events from at-least-once delivery
Solution:
1. Track seen IDs in state (with TTL)
2. If seen, drop
3. If new, process and store ID
State: {event_id: timestamp}
TTL: Based on expected duplicate window
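A minimal sketch of that logic in plain Java. A real job would keep `seen` in a framework state store rather than a `HashMap`, so it survives restarts and scales past one instance:

```java
import java.util.HashMap;
import java.util.Map;

class Deduplicator {
    private final Map<String, Long> seen = new HashMap<>(); // {event_id: timestamp}
    private final long ttlMillis;

    Deduplicator(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Returns true if the event is new and should be processed.
    boolean shouldProcess(String eventId, long nowMillis) {
        // Evict IDs older than the expected duplicate window (the TTL)
        // so state stays bounded.
        seen.values().removeIf(ts -> nowMillis - ts > ttlMillis);
        if (seen.containsKey(eventId)) {
            return false;            // seen before: drop
        }
        seen.put(eventId, nowMillis);
        return true;                 // new: process and remember the ID
    }
}
```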
Event Delivery Guarantees
At-Most-Once
May lose events, never duplicates
Commit → Process → (if processing fails after the commit, the event is lost)
Use when: Loss acceptable, simplicity preferred
At-Least-Once
Never loses, may have duplicates
Process → Commit → (if failure occurs before the commit, the event is reprocessed)
Use when: No loss acceptable, handle duplicates downstream
Exactly-Once
Never loses, never duplicates
Requires:
- Idempotent operations, OR
- Transactional processing
How it works:
1. Read from source transactionally
2. Process and update state
3. Write output and commit together
Flink: Checkpointing + two-phase commit
Kafka Streams: Transactional producers + the exactly-once semantics (EOS) processing guarantee
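In Kafka Streams this is a single configuration property (`exactly_once_v2` needs brokers on 2.5+; older releases used `exactly_once`):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-app"); // illustrative app ID
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Turns on transactional producers and read_committed consumption.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
```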
Late Event Handling
Strategies
1. Drop late events
Simple, may lose data
2. Allow late events (allowed lateness)
Process if within lateness threshold
3. Side output late events
Main stream processes on-time
Side stream handles late separately
4. Reprocess historical
Batch job fixes late data impact
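Strategies 2 and 3 compose directly in Flink; `events` and `CountAggregator` are carried over from the earlier example:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Anonymous subclass preserves the element type for Flink's type system.
OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Long> counts = events
    .keyBy(e -> e.getUserId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowedLateness(Time.minutes(1))   // strategy 2: accept and update results
    .sideOutputLateData(lateTag)        // strategy 3: capture anything later
    .aggregate(new CountAggregator());

// Handle the side output separately (alert, archive, reprocess).
DataStream<Event> lateEvents = counts.getSideOutput(lateTag);
```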
Watermark Strategies
Bounded Out-of-Orderness:
watermark = max_event_time - max_lateness
Example:
max_event_time = 10:00:00
max_lateness = 5 seconds
watermark = 09:59:55
Events before 09:59:55 considered complete
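The same policy via Flink's `WatermarkStrategy` (the `getTimestamp()` accessor on `Event` is assumed):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Watermark trails the max observed event time by 5 seconds.
WatermarkStrategy<Event> strategy = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, recordTs) -> event.getTimestamp());
```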
Scalability Patterns
Partitioning
Partition by key for parallel processing:
┌─────────────────────────────────────────────────────┐
│ Kafka Topic (3 partitions) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐│
│ │ Partition 0 │ │ Partition 1 │ │ Partition 2 ││
│ │ user_a, b │ │ user_c, d │ │ user_e, f ││
│ └─────────────┘ └─────────────┘ └─────────────────┘│
└─────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Worker 0 │ │Worker 1 │ │Worker 2 │
└─────────┘ └─────────┘ └─────────┘
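Partition assignment is driven by the record key. A sketch with the plain Kafka producer, keying by user ID so one user's events stay ordered on one partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Records with the same key hash to the same partition
    // (murmur2(key) % partition count by default).
    producer.send(new ProducerRecord<>("events", "user_a", "{\"action\":\"click\"}"));
}
```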
Backpressure
When downstream can't keep up:
1. Buffer (risk: OOM)
2. Drop (risk: data loss)
3. Backpressure (slow down source)
Flink: Backpressure propagates automatically
Kafka: Consumer lag indicates backpressure
Monitoring Streaming Applications
Key Metrics
Throughput:
- Events per second
- Bytes per second
Latency:
- Processing latency
- End-to-end latency
Health:
- Consumer lag
- Checkpoint duration
- Backpressure rate
- Error rate
Consumer Lag
Lag = latest produced offset - committed consumer offset
High lag indicates:
- Processing too slow
- Need more parallelism
- Downstream bottleneck
Monitor: Set alerting thresholds
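Lag can be computed with Kafka's `Admin` client by comparing the group's committed offsets with each partition's latest offset (the group ID is a placeholder; error handling omitted):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (Admin admin = Admin.create(props)) {
    // Offsets the group has committed, per partition.
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("my-group")
             .partitionsToOffsetAndMetadata().get();

    // Latest (end) offsets for the same partitions.
    Map<TopicPartition, OffsetSpec> request = new HashMap<>();
    committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
        admin.listOffsets(request).all().get();

    // Lag = latest offset - committed offset, per partition.
    committed.forEach((tp, meta) ->
        System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
}
```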
Best Practices
1. Design for exactly-once when needed
2. Handle late events explicitly
3. Use event time, not processing time
4. Monitor consumer lag closely
5. Plan for state recovery
6. Test with realistic data volumes
7. Implement backpressure handling
8. Keep processing idempotent when possible
Related Skills
- message-queues - Messaging patterns
- data-architecture - Data platform design
- etl-elt-patterns - Data pipeline patterns