Difficulties & Development Direction

This section documents the difficulties encountered while deploying each infrastructure step for Money Manager in section 5.3 (VPC/EC2, RDS/ElastiCache, DynamoDB/SQS, S3/VPC Endpoints, CloudWatch). It compares the result against the High Availability architecture proposed in section 2, explains how the problems were resolved, and outlines future development directions.

Difficulties Encountered

Single EC2 instead of Multi-AZ Auto Scaling Group: The proposal in section 2, item 4 set the goal of High Availability using EC2 + Auto Scaling Group across 2 Availability Zones. Within the workshop timeframe, I only managed a single t3.micro EC2 instance running two containers, moneymanager-api and moneymanager-worker (Step 1, section 5.3). I did not have time to set up a Launch Template, ALB Target Group, and Auto Scaling Policy, because the health checks would first have needed testing before the instance could be replicated.

Confusion between Gateway Endpoint and Interface Endpoint for S3: Initially I thought a single type of VPC Endpoint would be sufficient. After reading the documentation carefully, I understood that the Gateway Endpoint is free but only serves traffic that originates inside the VPC, while the Interface Endpoint (PrivateLink) is billed by the hour and per GB of data processed, but provides private IPs and can be reached from VPN/on-prem networks, with DNS resolution handled through Route 53 Resolver. Since Step 4 (section 5.3) required both EC2 inside the VPC and office workstations over VPN to access S3, I ended up using a combination of both types.

Endpoint Policy and S3 Bucket Policy conflicting: During the first setup, I configured the Bucket Policy to block all access that did not go through the VPC Endpoint before the Endpoint Policy had the correct bucket Resource ARN. As a result, even the API running on EC2 received Access Denied when calling S3. I had to walk through each step and test directly from EC2 using aws s3 cp before I found that the problem was the order of configuration.
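The ordering problem above can be illustrated with a minimal sketch of the two policy documents. This is not the exact configuration used in the workshop: the bucket name, the endpoint IDs, and the list of allowed actions are hypothetical placeholders. The only AWS-specific piece relied on is the aws:SourceVpce condition key.

```python
import json


def build_endpoint_policy(bucket_name: str) -> dict:
    """Endpoint Policy: allow S3 calls through the endpoint, scoped to one bucket.

    The action list is an assumption for illustration, not the workshop's list.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBucketThroughEndpoint",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                # Both the bucket ARN and the object ARN pattern are needed:
                # ListBucket targets the bucket, Get/PutObject target objects.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def build_vpce_only_bucket_policy(bucket_name: str, vpce_ids: list[str]) -> dict:
    """Bucket Policy: deny any request that does not arrive via the listed endpoints."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpcEndpoints",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_ids}},
            }
        ],
    }


# Hypothetical names, for illustration only.
BUCKET = "example-moneymanager-bucket"
ENDPOINT_IDS = ["vpce-0123456789abcdef0", "vpce-0fedcba9876543210"]

# Step 1: attach and test the permissive Endpoint Policy first.
print(json.dumps(build_endpoint_policy(BUCKET), indent=2))
# Step 2: only then apply the restrictive Bucket Policy.
print(json.dumps(build_vpce_only_bucket_policy(BUCKET, ENDPOINT_IDS), indent=2))
```

Applying the Deny statement last means that a failed aws s3 cp after step 1 points at the Endpoint Policy, and a failure after step 2 points at the Bucket Policy. Note that a blanket Deny like this also blocks requests made from outside the listed endpoints, including the Console.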

Route 53 Resolver Inbound Endpoint was entirely new territory: Previously I had only used Route 53 for regular public DNS records, and had never configured an internal DNS resolver so that office machines on the VPN could resolve S3 domain names to the Interface Endpoint’s IP. This was the most time-consuming part of Step 4 to learn.
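One way to confirm from an office workstation that the resolver setup works is to check whether the S3 hostname resolves to private addresses only. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and which hostname to check (the regional S3 name or the endpoint-specific DNS name) depends on how private DNS is configured, so the hostname in the usage comment is an assumption.

```python
import ipaddress
import socket
from typing import Callable


def is_private_ip(address: str) -> bool:
    """Return True when the address is in a private (non-public) range."""
    return ipaddress.ip_address(address).is_private


def resolve_ipv4(hostname: str) -> list[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


def resolves_privately(
    hostname: str, resolver: Callable[[str], list[str]] = resolve_ipv4
) -> bool:
    """True only if the name resolves and every returned address is private."""
    addresses = resolver(hostname)
    return bool(addresses) and all(is_private_ip(a) for a in addresses)


# Usage from a workstation on the VPN (hostname and region are placeholders):
#   resolves_privately("s3.ap-southeast-1.amazonaws.com")
# A False result suggests the query is still answered by public DNS rather
# than by the Route 53 Resolver Inbound Endpoint.
```

The resolver is passed in as a parameter so the check can be exercised with fixed addresses before it is run against the real network.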

Migration from MongoDB Atlas to DynamoDB and redesigning the SQS queue: When implementing Step 3 (section 5.3), removing the old MongoDB Atlas library and redesigning the Partition Key for the two tables chat_sessions and chat_messages using the AWS SDK v2 took considerable trial and error. The Redrive Policy that routes failed jobs to the DLQ moneymanager-async-jobs-dlq (maximum 3 retries) also had to be adjusted several times, because the initial retry count was too low and legitimate jobs were being moved to the DLQ prematurely.
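The Redrive Policy described above is a small JSON document set as an attribute on the source queue. The sketch below builds that attribute and models when SQS moves a message to the DLQ. The account ID and region in the ARN are placeholders, and the helper names are illustrative rather than part of the project code.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict[str, str]:
    """Build the queue attribute that points a source queue at its DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy itself as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


def goes_to_dlq(receive_count: int, max_receive_count: int = 3) -> bool:
    """A message is moved once its receive count exceeds maxReceiveCount."""
    return receive_count > max_receive_count


# Placeholder ARN: the region and account ID are not from the workshop.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:moneymanager-async-jobs-dlq"

attributes = build_redrive_attributes(DLQ_ARN, max_receive_count=3)
print(attributes["RedrivePolicy"])

# With maxReceiveCount = 3, a job may be received three times; it is moved
# to the DLQ instead of being delivered a fourth time.
for count in range(1, 5):
    print(count, goes_to_dlq(count))
```

With boto3, the resulting dictionary can be passed as the Attributes argument of set_queue_attributes on the source queue. A value that is too low sends jobs to the DLQ after transient failures, which matches the behaviour observed before the count was raised.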

How They Were Resolved

  • Compared the proposed architecture in section 2 against what could be completed within the workshop timeframe, and prioritized building the main data flow correctly first (VPC, EC2 running containers, RDS/Redis, DynamoDB/SQS, S3/VPC Endpoint, CloudWatch). The Auto Scaling Group/Multi-AZ ALB was recorded as a missing item to be added later, rather than attempting everything at once and finishing no part thoroughly.
  • Re-read the official AWS documentation on VPC Endpoints and Route 53 Resolver, comparing Gateway vs Interface Endpoint using a table to select the right type for each traffic flow.
  • Tested in small incremental steps instead of configuring everything at once: created the Gateway Endpoint first, tested access from EC2 using CLI (aws s3 cp, aws dynamodb scan, aws sqs receive-message), then added the Interface Endpoint and Endpoint Policy, testing after each change to easily isolate errors.
  • When encountering access permission errors, checked CloudWatch Logs and VPC Flow Logs to determine which layer was blocking the request (Security Group, Bucket Policy, or Endpoint Policy) instead of guessing.

Future Development Directions

  • Add a Multi-AZ Auto Scaling Group and Application Load Balancer as specified in the High Availability architecture proposed in section 2, item 4, replacing the current single EC2 instance.
  • Apply Reserved Instances for stable components (EC2, RDS, ElastiCache) as outlined in the cost optimization strategy in section 2, item 8 (risk assessment).
  • Migrate all infrastructure currently configured manually through the Console (VPC, Endpoint, Route Table, Bucket Policy…) to Infrastructure as Code (CloudFormation or Terraform) for easier reproduction, review, and version control.
  • Extend VPC Endpoints to the other services currently routing through the NAT Gateway (a Gateway Endpoint for DynamoDB, an Interface Endpoint for SQS) to further reduce NAT Gateway costs and keep that traffic off the public internet.
  • Consolidate the currently scattered CloudWatch Alarms and Log Groups into a single unified Dashboard for faster monitoring when system issues occur.
  • Perform recovery/failover testing for RDS Multi-AZ and ElastiCache HA as planned in Week 12, section 2, item 6. This was not yet tested within the workshop scope.