Stories of forgotten EC2 instances racking up massive weekend bills

Friday at 6 PM, your engineers log off. Monday at 9 AM, they log back on.

In between: 63 hours. Your staging environment noticed none of it.

This is not a monitoring story. It is a billing story - and the math is specific enough to be uncomfortable.


The Weekend Stack, Left Running

Let's construct a typical mid-size dev/staging environment. Nothing exotic. The kind that exists at hundreds of software companies right now, this weekend, fully running, fully billing, with zero engineers looking at it.

EC2 - staging application servers

Two m5.xlarge instances (4 vCPU, 16 GB RAM each). On-demand in us-east-1: ~$0.192/hour each.

63 weekend hours × 2 instances × $0.192 = $24.19

Common objection: "We use Reserved Instances, so that's already paid for." True - but a reservation is a billing commitment, not a reason to leave things on. You prepaid for capacity. Leaving it idle doesn't recover that cost; it just adds opportunity cost on top of the fixed spend. For teams on on-demand (which is most early-stage and mid-size teams), every idle hour is a direct charge.

RDS - staging database

A db.r6g.large instance (2 vCPU, 16 GB RAM), PostgreSQL, Multi-AZ disabled for staging. On-demand: ~$0.24/hour.

63 hours × $0.24 = $15.12

RDS also charges for storage separately - a 100 GB gp3 volume runs ~$0.115/GB/month ($11.50/month), or roughly $1 for a 63-hour weekend. Not the main driver, and storage keeps billing even while the instance is stopped, but it adds up over time.

ECS - background worker service

A Fargate task: 2 vCPU, 4 GB memory. Fargate pricing in us-east-1: $0.04048/vCPU-hour + $0.004445/GB-hour.

vCPU: 63 × 2 × $0.04048 = $5.10

Memory: 63 × 4 × $0.004445 = $1.12

Total ECS: $6.22

NAT Gateway - the one nobody thinks about

Here is the cost that surprises people when they finally notice it. A NAT Gateway charges two ways: hourly existence ($0.045/hour) plus data processing ($0.045/GB).

Hourly alone: 63 × $0.045 = $2.84

Data processing during the weekend is low if nothing is actively running, but if your ECS tasks are polling, your EC2 instances are pulling package updates, or your RDS is syncing replicas - even light background traffic adds $1-3 on a quiet weekend.

Conservative NAT total: $4-6

ALB - Application Load Balancer

ALB pricing is $0.008/LCU-hour plus $0.0225/hour for the load balancer itself. With near-zero traffic, you're paying mostly the base rate.

63 × $0.0225 = $1.42

With even minimal idle traffic (health checks, keep-alives), round up to $2-3.


The Weekend Total

| Resource | Cost (63 hrs) |
|---|---|
| EC2 (2× m5.xlarge) | $24.19 |
| RDS (db.r6g.large) | $15.12 |
| ECS Fargate worker | $6.22 |
| NAT Gateway | ~$5.00 |
| ALB | ~$2.50 |
| Total | ~$53/weekend |

$53 per weekend × 52 weekends = ~$2,756/year. For one staging environment. Before you account for evenings (say, 7 PM to 8 AM on weekdays - another ~65 hours/week of potential idle time).
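The line items above are simple enough to recompute for your own stack. The sketch below reproduces only the fixed hourly charges, using the rates quoted in this article (us-east-1, on-demand). The names `HOURLY_RATES` and `weekend_floor` are illustrative, and the traffic-dependent NAT and ALB estimates are deliberately left out, so the result lands a few dollars below the ~$53 in the table.

```python
WEEKEND_HOURS = 63  # Friday 6 PM to Monday 9 AM

# Hourly charges per resource, taken from the rates quoted in this article.
HOURLY_RATES = {
    "EC2 (2x m5.xlarge)": 2 * 0.192,
    "RDS (db.r6g.large)": 0.24,
    "ECS Fargate (2 vCPU, 4 GB)": 2 * 0.04048 + 4 * 0.004445,
    "NAT Gateway (hourly only)": 0.045,
    "ALB (base rate only)": 0.0225,
}


def weekend_floor(hours=WEEKEND_HOURS, rates=HOURLY_RATES):
    """Return the per-resource cost of the idle window, before traffic charges."""
    return {name: rate * hours for name, rate in rates.items()}


costs = weekend_floor()
total = sum(costs.values())

if __name__ == "__main__":
    for name, cost in costs.items():
        print(f"{name:<30} ${cost:7.2f}")
    print(f"{'Fixed-charge total':<30} ${total:7.2f}")
    print(f"{'Annualized (52 weekends)':<30} ${total * 52:,.2f}")
```

Swapping in your own instance types and region is a matter of editing the dictionary; the fixed-charge floor for this example stack works out to about $49.79 per weekend.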

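Add evenings and the idle hours roughly double. At the same blended rate (about $53 / 63 hours, or $0.84/hour), a single environment's idle spend comes to roughly $5,600/year, and larger instance types, extra services, or pricier regions push it higher.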

For a company with two or three staging environments - one per team, one for QA, one for integration testing - multiply accordingly.


Why Don't Teams Just Turn It Off?

This is the fair pushback. The math above is not a secret. The bill is visible to anyone with access to Cost Explorer. So why does the staging environment keep running all weekend?

A few reasons, all real:

  1. The "what if" anxiety

Someone, at some point, needs to hotfix something on a Sunday. Turning staging off means spin-up time when the incident hits. So the default is "leave it on just in case." Reasonable for production. Overkill for a staging database that nobody will query until Monday.

  2. Pipeline coupling

CI/CD pipelines often assume the environment is always up. If your GitHub Actions workflow deploys to a fixed EC2 IP, and that instance is stopped, the deploy fails. Teams that have been burned by this once tend to stop touching the scheduling problem entirely.

  3. Automation scripts that break quietly

A lot of teams have tried this. They wrote a Lambda that calls ec2.stop_instances() at 7 PM Friday. It worked for a month. Then someone added an Auto Scaling Group, or an ECS service with a minimum desired count, or an RDS instance the script didn't know about - and the Lambda started failing silently. Nobody noticed until the bill didn't change.

  4. The cost is invisible until it isn't

AWS Cost Explorer shows you monthly totals and service breakdowns. It does not, by default, show you a time-of-day heat map against your engineering calendar. The idle spend is real but it looks like normal compute cost. It blends in.
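The quiet-failure scenario described above is partly a design choice: a script that swallows errors will always look healthy. One way to make it noisy, sketched here with hypothetical names (`run_stop_actions`, `ShutdownIncomplete`), is to attempt every stop action, record what failed, and raise at the end so the invocation itself is marked as failed and shows up in the function's error metrics.

```python
import logging

logger = logging.getLogger("weekend_shutdown")


class ShutdownIncomplete(RuntimeError):
    """Raised when at least one stop action failed."""


def run_stop_actions(actions):
    """Run every (name, callable) pair in the list; raise afterwards if any failed.

    Each action is attempted even if an earlier one raised, so one broken
    resource does not leave the rest of the stack running.
    """
    failures = {}
    for name, action in actions:
        try:
            action()
            logger.info("stopped %s", name)
        except Exception as exc:  # broad on purpose: record it, then continue
            logger.error("failed to stop %s: %s", name, exc)
            failures[name] = str(exc)
    if failures:
        # Raising marks the whole run as failed instead of letting it "succeed".
        raise ShutdownIncomplete(
            f"{len(failures)} action(s) failed: {sorted(failures)}"
        )
    return len(actions)
```

This does not catch resources the script never knew about; it only guarantees that the failures it does hit are visible. Pairing it with an alarm on the function's error count is what turns "nobody noticed" into a notification.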


What Actually Works - and What Doesn't

Fixed-schedule Lambda shutdown

The most common DIY approach: an EventBridge (formerly CloudWatch Events) rule triggers a Lambda at 6 PM to stop instances and at 8 AM to start them.

    import boto3

    def lambda_handler(event, context):
        # Stop a fixed list of staging instances (placeholder IDs).
        ec2 = boto3.client('ec2', region_name='us-east-1')
        ec2.stop_instances(InstanceIds=['i-0abc123def456', 'i-0def789ghi012'])

This works - until it doesn't. The gaps:

- Hardcoded instance IDs break when the infrastructure changes
- No awareness of whether anyone is actually using the environment at 6 PM (working late, running a load test)
- Does nothing for RDS (separate API), ECS services (task count ≠ instance stop), or ASGs (which treat a stopped instance as unhealthy and launch a replacement)
- No multi-account support without significant additional scaffolding
- No visibility into what was stopped, when, or why

For a single EC2 instance in a single account, this is fine. For a realistic staging stack, you end up writing - and maintaining - a fairly complex piece of automation.
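The hardcoded-ID gap is the easiest one to close. A minimal sketch, assuming staging instances carry an `Environment: staging` tag: look instances up by tag at run time instead of listing them in code. The helper names are hypothetical; the boto3 calls (`describe_instances` via a paginator, `stop_instances`) are the standard EC2 client methods. The EC2 client is passed in as a parameter so the logic can be exercised with a stub.

```python
def find_staging_instance_ids(ec2, environment="staging"):
    """Return IDs of running instances tagged Environment=<environment>.

    Instances launched by an Auto Scaling group carry the
    aws:autoscaling:groupName tag and are skipped, because stopping them
    directly would only prompt the group to replace them.
    """
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": [environment]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = []
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tag_keys = {tag["Key"] for tag in instance.get("Tags", [])}
                if "aws:autoscaling:groupName" in tag_keys:
                    continue
                instance_ids.append(instance["InstanceId"])
    return instance_ids


def stop_staging_instances(ec2, environment="staging"):
    """Stop whatever currently matches the tag; return the IDs targeted."""
    instance_ids = find_staging_instance_ids(ec2, environment)
    if instance_ids:  # nothing matched: skip the API call entirely
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above run without AWS access

    ec2 = boto3.client("ec2", region_name="us-east-1")
    return {"stopped": stop_staging_instances(ec2)}
```

Returning the list of stopped IDs also gives you a basic audit trail in the function's logs. It still covers only EC2 in one account and one region; the other gaps in the list remain.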

AWS Instance Scheduler

AWS's own solution is more complete: a CloudFormation template that deploys a scheduled Lambda with DynamoDB-backed configuration. It handles EC2 and RDS, supports multiple accounts, and is maintained by AWS.

It is schedule-based. You define windows: "running Mon - Fri 08:00-20:00". Everything outside that window is stopped.

The limitation is the same one that affects any fixed-schedule approach: the schedule does not know what your engineers are actually doing. If you run a regression suite at 9 PM on a Tuesday, or if your team is heads-down before a release and working until midnight, the scheduler stops the environment anyway - or you disable it, and it never turns back on.
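To make the limitation concrete, here is what a fixed window reduces to. This is an illustration of the idea, not Instance Scheduler's implementation: the constants below are a hypothetical encoding of "running Mon - Fri 08:00-20:00", and timestamps are assumed to already be in the team's local time.

```python
from datetime import datetime, time

# Hypothetical encoding of the window "running Mon - Fri 08:00-20:00".
RUNNING_DAYS = {0, 1, 2, 3, 4}  # datetime.weekday(): Monday is 0
WINDOW_START = time(8, 0)
WINDOW_END = time(20, 0)


def should_be_running(now):
    """Return True if `now` falls inside the fixed running window."""
    in_days = now.weekday() in RUNNING_DAYS
    in_hours = WINDOW_START <= now.time() < WINDOW_END
    return in_days and in_hours


if __name__ == "__main__":
    tuesday_regression_run = datetime(2026, 1, 6, 21, 0)  # a Tuesday, 9 PM
    print(should_be_running(tuesday_regression_run))
```

The function takes only a timestamp. Nothing about who is working or what is running enters the decision, which is why the 9 PM regression suite gets stopped along with everything else.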

Activity-driven pause/resume

"Activity-driven" means the system observes whether engineers are actually present and working before making pause/resume decisions - rather than inferring state from a clock. A desktop agent on each engineer's machine detects active work sessions (IDE open, terminal active, work-tool focus) and signals the infrastructure layer. Resources pause when everyone's genuinely idle; they resume automatically when work starts.

This closes the gap that schedules can't: the 3-hour afternoon meeting where nobody's touching staging, the late-evening session that runs past the scheduled shutdown window, the Monday morning that starts at 7 AM for one engineer but 10 AM for another.
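The decision rule behind this can be sketched in a few lines. Everything here is an assumption made for illustration: the shape of the presence data (a mapping from engineer to last-activity timestamp), the 30-minute grace period, and the function name. It is not a description of how any particular product implements it.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=30)  # assumed value; tune to taste


def desired_state(last_activity_by_engineer, now, grace=GRACE_PERIOD):
    """Return "running" if anyone was active within the grace period, else "paused".

    `last_activity_by_engineer` maps an engineer identifier to the timestamp
    of their most recent detected work activity.
    """
    if not last_activity_by_engineer:
        return "paused"
    most_recent = max(last_activity_by_engineer.values())
    return "running" if now - most_recent <= grace else "paused"


if __name__ == "__main__":
    now = datetime(2026, 1, 6, 21, 0)
    presence = {
        "alice": datetime(2026, 1, 6, 18, 5),   # logged off hours ago
        "bob": datetime(2026, 1, 6, 20, 50),    # still working late
    }
    print(desired_state(presence, now))
```

The grace period is the main tuning knob: too short and the environment flaps during lunch, too long and it behaves like a schedule again.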

Tools like Trigops take this approach - managing EC2, RDS, ECS, and ASG across accounts based on actual user presence rather than static windows. The setup is a desktop agent plus a connected cloud account; no Lambda maintenance, no hardcoded IDs.

For teams that want the DIY path: the core challenge is building a reliable presence signal and wiring it to the AWS APIs for all four resource types without creating new failure modes. The schedule approach is a good starting point for EC2 in a single account; the complexity grows sharply with resource diversity and account count.
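For a sense of what "all four resource types" means in practice, each one pauses through a different API call. The sketch below assumes a hypothetical `inventory` dictionary describing the stack and a `clients` dictionary of boto3 clients keyed by service name; the calls themselves (`stop_instances`, `stop_db_instance`, `update_service`, `update_auto_scaling_group`) are the standard boto3 methods.

```python
def pause_stack(clients, inventory):
    """Issue the type-specific pause call for each resource in `inventory`.

    Returns a list of labels for the actions that were issued.
    """
    issued = []

    # EC2: stop standalone instances in one call.
    ec2_ids = inventory.get("ec2_instance_ids", [])
    if ec2_ids:
        clients["ec2"].stop_instances(InstanceIds=ec2_ids)
        issued.append("ec2")

    # RDS: one call per DB instance.
    for db_id in inventory.get("rds_instance_ids", []):
        clients["rds"].stop_db_instance(DBInstanceIdentifier=db_id)
        issued.append(f"rds:{db_id}")

    # ECS: services are scaled to zero tasks rather than stopped.
    for cluster, service in inventory.get("ecs_services", []):
        clients["ecs"].update_service(
            cluster=cluster, service=service, desiredCount=0
        )
        issued.append(f"ecs:{cluster}/{service}")

    # ASG: lower the group's own targets so it does not relaunch instances.
    for asg_name in inventory.get("asg_names", []):
        clients["autoscaling"].update_auto_scaling_group(
            AutoScalingGroupName=asg_name, MinSize=0, DesiredCapacity=0
        )
        issued.append(f"asg:{asg_name}")

    return issued
```

Resume is the harder half and is not shown: ECS and ASG need their previous counts stored somewhere before they are zeroed, and RDS takes several minutes to come back. That state-keeping is where most of the new failure modes come from.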


What the Bill Actually Shows You

AWS Cost Explorer will show you $X in EC2 spend last month. It will not, by default, show you that $Y of that landed between Friday night and Monday morning.

You can get close with Cost Explorer's daily granularity and some tag discipline. Tag your staging resources with Environment: staging, activate it as a cost allocation tag, and then build a Cost Explorer report filtered to that tag and grouped by day. Then compare weekend vs. weekday spend manually. It is tedious, but it surfaces the real number.

Alternatively, enable AWS Cost and Usage Reports (CUR) to S3 and query with Athena:

    SELECT
      DATE_TRUNC('day', line_item_usage_start_date) AS usage_date,
      day_of_week(line_item_usage_start_date) AS day_of_week,
      SUM(line_item_unblended_cost) AS daily_cost
    FROM your_cur_table
    WHERE resource_tags_user_environment = 'staging'
      AND line_item_usage_start_date BETWEEN DATE '2026-01-01' AND DATE '2026-06-01'
    GROUP BY 1, 2
    ORDER BY 1;

Filter day_of_week to 6 (Saturday) and 7 (Sunday); Athena's day_of_week() returns the ISO weekday, where Monday is 1. The result is your idle staging spend, quantified. Most teams who run this query find the weekend number is larger than they expected.
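Once the daily rows are exported, the weekend/weekday split is a few lines of post-processing. This sketch assumes the results have been loaded as `(usage_date, daily_cost)` pairs; the function name is hypothetical and the sample figures are made up purely to show the shape of the output. It classifies days with Python's own `date.weekday()` (Saturday is 5, Sunday is 6), so it does not depend on how the SQL engine numbers weekdays.

```python
from datetime import date


def weekend_share(daily_costs):
    """Split (usage_date, daily_cost) rows into weekend and weekday totals."""
    weekend = 0.0
    weekday = 0.0
    for usage_date, cost in daily_costs:
        if usage_date.weekday() >= 5:  # Saturday or Sunday
            weekend += cost
        else:
            weekday += cost
    total = weekend + weekday
    share = weekend / total if total else 0.0
    return {"weekend": weekend, "weekday": weekday, "weekend_share": share}


if __name__ == "__main__":
    # Made-up illustrative rows, not real billing data.
    rows = [
        (date(2026, 1, 3), 20.0),  # Saturday
        (date(2026, 1, 4), 20.0),  # Sunday
        (date(2026, 1, 5), 60.0),  # Monday
    ]
    print(weekend_share(rows))
```

A calendar-day split undercounts slightly, since Friday evening and early Monday morning land in weekday rows. Hourly rows from CUR would tighten the estimate.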


The Actual Takeaway

The staging environment running all weekend is not a DevOps failure. It is a default state - one that exists because turning things off reliably is harder than it looks, and because the cost is invisible until it accumulates.

The path to fixing it is: measure first (CUR + Athena query above), then pick a solution matched to your complexity. A single-account, few-instance setup can go far with AWS Instance Scheduler. A multi-account, multi-resource setup - EC2, RDS, ECS, ASGs across teams - needs either a more complete automation layer or an activity-aware tool.

The architecture of the problem is solvable. The bill keeps arriving because most teams never build the system to solve it.


If you want to see what activity-driven pause/resume looks like without building it yourself, Trigops connects to your cloud account and starts tracking idle time - the dashboard alone tends to make the case.