| name | cloud-storage-optimization |
| description | Optimize cloud storage across AWS S3, Azure Blob, and GCP Cloud Storage with compression, partitioning, lifecycle policies, and cost management. |
Cloud Storage Optimization
Overview
Optimize cloud storage cost and performance across multiple cloud providers using compression, intelligent tiering, data partitioning, and lifecycle management. The aim is to cut storage spend while keeping data accessible and meeting compliance requirements.
When to Use
- Reducing storage costs
- Optimizing data access patterns
- Implementing tiered storage strategies
- Archiving historical data
- Improving data retrieval performance
- Managing compliance requirements
- Organizing large datasets
- Optimizing data lakes and data warehouses
Implementation Examples
1. AWS S3 Storage Optimization
# Enable Intelligent-Tiering
aws s3api put-bucket-intelligent-tiering-configuration \
--bucket my-bucket \
--id OptimizedStorage \
--intelligent-tiering-configuration '{
"Id": "OptimizedStorage",
"Filter": {"Prefix": "data/"},
"Status": "Enabled",
"Tierings": [
{
"Days": 90,
"AccessTier": "ARCHIVE_ACCESS"
},
{
"Days": 180,
"AccessTier": "DEEP_ARCHIVE_ACCESS"
}
]
}'
# List existing metrics configurations
aws s3api list-bucket-metrics-configurations --bucket my-bucket
# Enable CloudWatch request metrics for the whole bucket (useful for cost analysis)
aws s3api put-bucket-metrics-configuration \
--bucket my-bucket \
--id EntireBucket \
--metrics-configuration '{
"Id": "EntireBucket",
"Filter": {"Prefix": ""}
}'
# Use S3 Batch Operations for bulk tagging
aws s3control create-job \
  --account-id ACCOUNT_ID \
  --operation '{"S3PutObjectTagging": {"TagSet": [{"Key": "tier", "Value": "archive"}]}}' \
  --manifest '{
    "Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]},
    "Location": {
      "ObjectArn": "arn:aws:s3:::my-bucket/manifest.csv",
      "ETag": "MANIFEST_ETAG"
    }
  }' \
  --report '{
    "Bucket": "arn:aws:s3:::my-bucket",
    "Prefix": "reports/batch-operation",
    "Format": "Report_CSV_20180820",
    "Enabled": true,
    "ReportScope": "AllTasks"
  }' \
  --priority 10 \
  --role-arn arn:aws:iam::ACCOUNT_ID:role/S3BatchOperationsRole
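Request metrics and tiering address storage cost; retrieval cost can also be trimmed by scanning only the bytes a query needs. Below is a minimal sketch using S3 Select through boto3, assuming an illustrative bucket, key, and column names and that S3 Select is available on the account; it filters a gzip-compressed CSV server-side and returns only matching rows.
# Query a compressed CSV in place with S3 Select (bucket/key/columns are illustrative)
import boto3

s3 = boto3.client('s3')
response = s3.select_object_content(
    Bucket='my-bucket',
    Key='data/2024/events.csv.gz',
    Expression="SELECT s.user_id, s.amount FROM S3Object s WHERE CAST(s.amount AS FLOAT) > 100",
    ExpressionType='SQL',
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}, 'CompressionType': 'GZIP'},
    OutputSerialization={'CSV': {}},
)
# The response payload is an event stream; 'Records' events carry the result bytes
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode(), end='')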
2. Data Compression and Partitioning Strategy
# Python data optimization
import gzip
from datetime import datetime

import boto3
import pandas as pd


class StorageOptimizer:
    def __init__(self, bucket_name):
        self.s3_client = boto3.client('s3')
        self.bucket = bucket_name

    def compress_and_upload(self, file_path, key):
        """Gzip-compress a local file and upload it to S3."""
        with open(file_path, 'rb') as f_in:
            compressed = gzip.compress(f_in.read())
        self.s3_client.put_object(
            Bucket=self.bucket,
            Key=f'{key}.gz',
            Body=compressed,
            ContentEncoding='gzip',
            ServerSideEncryption='AES256'
        )
    def partition_csv_data(self, csv_path, partition_columns):
        """Partition a CSV by date and a secondary column, then upload as Parquet."""
        df = pd.read_csv(csv_path)
        df['date'] = pd.to_datetime(df['date'])
        partition_col = partition_columns[0]
        for date, date_group in df.groupby(df['date'].dt.date):
            for partition_val, partition_group in date_group.groupby(partition_col):
                # Parquet is columnar and compresses far better than raw CSV
                file_key = f"data/date={date}/{partition_col}={partition_val}/data.parquet"
                local_path = f"/tmp/{partition_val}.parquet"
                partition_group.to_parquet(
                    local_path,
                    compression='snappy',
                    index=False
                )
                self.upload_parquet_file(local_path, file_key)

    def upload_parquet_file(self, local_path, s3_key):
        """Upload a Parquet file directly into the Intelligent-Tiering storage class."""
        with open(local_path, 'rb') as data:
            self.s3_client.put_object(
                Bucket=self.bucket,
                Key=s3_key,
                Body=data.read(),
                ContentType='application/octet-stream',
                ServerSideEncryption='AES256',
                StorageClass='INTELLIGENT_TIERING'
            )
    def analyze_storage_patterns(self):
        """Summarize object sizes, types, and age to guide optimization."""
        paginator = self.s3_client.get_paginator('list_objects_v2')
        stats = {
            'total_size': 0,
            'file_count': 0,
            'by_extension': {},
            'old_files': []
        }
        for page in paginator.paginate(Bucket=self.bucket, Prefix='data/'):
            for obj in page.get('Contents', []):
                size = obj['Size']
                key = obj['Key']
                modified = obj['LastModified']
                stats['total_size'] += size
                stats['file_count'] += 1
                ext = key.split('.')[-1]
                stats['by_extension'][ext] = stats['by_extension'].get(ext, 0) + 1
                # Flag files older than 90 days as archive candidates
                days_old = (datetime.now(modified.tzinfo) - modified).days
                if days_old > 90:
                    stats['old_files'].append({
                        'key': key,
                        'size': size,
                        'days_old': days_old
                    })
        return stats
    def implement_lifecycle_optimization(self):
        """Apply a comprehensive lifecycle policy to the bucket."""
        lifecycle_config = {
            'Rules': [
                # Recent data: move noncurrent versions to Standard-IA
                {
                    'ID': 'KeepRecentStandard',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'data/'},
                    'NoncurrentVersionTransitions': [
                        {
                            'NoncurrentDays': 30,
                            'StorageClass': 'STANDARD_IA'
                        }
                    ]
                },
                # Archive old data through progressively colder tiers
                {
                    'ID': 'ArchiveOldData',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'archive/'},
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'},
                        {'Days': 180, 'StorageClass': 'DEEP_ARCHIVE'}
                    ],
                    'Expiration': {
                        'Days': 2555  # 7 years
                    }
                },
                # Delete incomplete multipart uploads across the whole bucket
                {
                    'ID': 'CleanupIncompleteUploads',
                    'Status': 'Enabled',
                    'Filter': {},
                    'AbortIncompleteMultipartUpload': {
                        'DaysAfterInitiation': 7
                    }
                }
            ]
        }
        self.s3_client.put_bucket_lifecycle_configuration(
            Bucket=self.bucket,
            LifecycleConfiguration=lifecycle_config
        )
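A minimal usage sketch tying the class together; the bucket name and log path are illustrative placeholders.
# Example usage of StorageOptimizer (bucket and paths are illustrative)
optimizer = StorageOptimizer('my-bucket')
optimizer.compress_and_upload('/var/log/app/events.log', 'logs/2024-01-15/events.log')
stats = optimizer.analyze_storage_patterns()
print(f"{stats['file_count']} objects, {stats['total_size'] / 1e9:.1f} GB total, "
      f"{len(stats['old_files'])} archive candidates older than 90 days")
optimizer.implement_lifecycle_optimization()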
3. Terraform Multi-Cloud Storage Configuration
# storage-optimization.tf
# AWS S3 with tiering
resource "aws_s3_bucket" "data_lake" {
bucket = "my-data-lake-${data.aws_caller_identity.current.account_id}"
}
resource "aws_s3_bucket_intelligent_tiering_configuration" "archive" {
bucket = aws_s3_bucket.data_lake.id
name = "archive-tiering"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
status = "Enabled"
}
# Azure Blob storage with lifecycle
resource "azurerm_storage_account" "data_lake" {
name = "mydatalake"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
account_tier = "Standard"
account_replication_type = "LRS"
access_tier = "Hot"
}
resource "azurerm_storage_management_policy" "data_lifecycle" {
storage_account_id = azurerm_storage_account.data_lake.id
rule {
name = "ArchiveOldBlobs"
enabled = true
  filters {
    prefix_match = ["data/"]
    blob_types   = ["blockBlob"]
  }
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = 30
tier_to_archive_after_days_since_modification_greater_than = 90
delete_after_days_since_modification_greater_than = 2555
}
snapshot {
delete_after_days_since_creation_greater_than = 90
}
version {
tier_to_cool_after_days_since_creation_greater_than = 30
tier_to_archive_after_days_since_creation_greater_than = 90
delete_after_days_since_creation_greater_than = 365
}
}
}
}
# GCP Cloud Storage with lifecycle
resource "google_storage_bucket" "data_lake" {
name = "my-data-lake-${data.google_client_config.current.project}"
location = "US"
uniform_bucket_level_access = true
storage_class = "STANDARD"
lifecycle_rule {
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
condition {
age = 30
}
}
lifecycle_rule {
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
condition {
age = 90
}
}
lifecycle_rule {
action {
type = "Delete"
}
condition {
age = 2555
}
}
lifecycle_rule {
action {
type = "Delete"
}
  condition {
    num_newer_versions = 3
    with_state         = "ARCHIVED" # noncurrent versions only; requires bucket versioning
  }
}
}
data "aws_caller_identity" "current" {}
data "google_client_config" "current" {}
4. Data Lake Partitioning Strategy
# Optimized partitioning for data lakes
def create_partitioned_data_lake(source_file, bucket):
    """Repartition a Parquet file into Hive-style date/region paths on S3."""
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read data
    df = pq.read_table(source_file).to_pandas()
    df['date'] = pd.to_datetime(df['date'])

    # Group by all partition keys in one pass instead of nested loops
    partition_keys = [
        df['date'].dt.year.rename('year'),
        df['date'].dt.month.rename('month'),
        df['date'].dt.day.rename('day'),
        df['region']
    ]
    for (year, month, day, region), group in df.groupby(partition_keys):
        # Hive-style partition path, e.g. data/year=2024/month=01/day=15/region=us-east-1
        path = (f"s3://{bucket}/data/year={year}/month={month:02d}/"
                f"day={day:02d}/region={region}/data.parquet")
        # Save as Parquet with compression; pyarrow resolves the s3:// filesystem
        pq.write_table(
            pa.Table.from_pandas(group, preserve_index=False),
            path,
            compression='snappy',
            use_dictionary=True
        )
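The payoff of this layout is partition pruning on the read side. A sketch follows, assuming the Hive-style layout written above and pyarrow's dataset API; only the partitions matching the filter are fetched from S3, the bucket name and filter values being illustrative.
# Read only one month/region slice instead of scanning the whole prefix
import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-bucket/data/", format="parquet", partitioning="hive")
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("month") == 1) & (ds.field("region") == "us-east-1")
)
print(table.num_rows)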
Best Practices
✅ DO
- Use Parquet or ORC formats for analytics (see the format comparison sketch after this list)
- Implement tiered storage strategy
- Partition data by time and queryable dimensions
- Enable versioning for critical data
- Use compression (gzip, snappy, brotli)
- Monitor storage costs regularly
- Implement data lifecycle policies
- Archive infrequently accessed data
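To put the Parquet and compression recommendations above in concrete terms, here is a small sketch comparing the on-disk footprint of the same DataFrame as raw CSV, gzip CSV, and snappy Parquet. The synthetic data is illustrative and assumes pyarrow is installed; actual savings depend on your data.
# Compare storage footprint of CSV, gzip CSV, and Parquet for the same data
import os
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': np.random.randint(0, 10_000, 1_000_000),
    'event': np.random.choice(['view', 'click', 'purchase'], 1_000_000),
    'amount': np.random.rand(1_000_000) * 100,
})
df.to_csv('/tmp/events.csv', index=False)
df.to_csv('/tmp/events.csv.gz', index=False, compression='gzip')
df.to_parquet('/tmp/events.parquet', compression='snappy', index=False)

for path in ('/tmp/events.csv', '/tmp/events.csv.gz', '/tmp/events.parquet'):
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")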
❌ DON'T
- Store uncompressed data
- Keep raw logs long-term
- Ignore storage optimization
- Use only hot storage tier
- Store duplicate data (a duplicate-detection sketch follows this list)
- Forget to delete old test data
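One way to act on the duplicate-data point is to group objects by ETag and size. This is a heuristic sketch: ETags equal the MD5 only for non-multipart, non-KMS uploads, and the bucket and prefix are illustrative.
# Find likely duplicate objects by (ETag, size); heuristic only
import boto3
from collections import defaultdict

s3 = boto3.client('s3')
groups = defaultdict(list)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        groups[(obj['ETag'], obj['Size'])].append(obj['Key'])

for (etag, size), keys in groups.items():
    if len(keys) > 1:
        wasted = size * (len(keys) - 1)
        print(f"{len(keys)} copies ({wasted / 1e6:.1f} MB redundant): {keys}")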
Cost Optimization Tips
- Use Intelligent-Tiering for variable access patterns
- Archive data older than 90 days
- Map cold data to each provider's archive tier (S3 Glacier/Deep Archive, Azure Archive, GCS Coldline/Archive)
- Delete incomplete multipart uploads
- Monitor usage with cloud tools
- Estimate costs before large uploads (see the estimator sketch below)
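A back-of-the-envelope estimator tying the tips together: it sums object sizes per storage class and applies per-GB-month rates. The prices are illustrative placeholders, not official pricing, and the bucket name is an assumption; check your region's current rates.
# Rough monthly cost estimate per storage class; prices are illustrative placeholders
import boto3
from collections import defaultdict

PRICE_PER_GB_MONTH = {  # assumed example rates, not official pricing
    'STANDARD': 0.023,
    'STANDARD_IA': 0.0125,
    'INTELLIGENT_TIERING': 0.023,
    'GLACIER': 0.004,
    'DEEP_ARCHIVE': 0.00099,
}

s3 = boto3.client('s3')
bytes_by_class = defaultdict(int)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        bytes_by_class[obj.get('StorageClass', 'STANDARD')] += obj['Size']

total = 0.0
for storage_class, size in bytes_by_class.items():
    cost = size / 1024**3 * PRICE_PER_GB_MONTH.get(storage_class, 0.023)
    total += cost
    print(f"{storage_class}: {size / 1024**3:.1f} GiB ~ ${cost:.2f}/month")
print(f"Estimated total ~ ${total:.2f}/month")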