Finding, Tracking, and Holding Teams Accountable for Savings Opportunities (aka Cloud Waste)

This was originally going to come after a post on maximizing savings with Reserved Instances and Savings Plans – but due to popular demand, I'm moving it forward. This blog post is intended to equip its reader with a proven approach for measuring and driving reduction in wasteful cloud use and costs. While AWS services will be mentioned, none of the principles are specific to that cloud.

As an organization grows its cloud use – whether through migration or organic growth – the parties involved reasonably expect costs to grow in kind. Despite this, most orgs inevitably face one or more “cost crises” on their journey – moments where costs pass certain thresholds (like the first time the cloud bill grows a new digit), or the first time costs exceed budgetary projections by a large amount. These occasions spawn fresh examinations of cloud use, seeking to “get things back under control”.

Deploying the magnifying glass reveals places where some costs contributing to the crisis could have been avoided, if only people had known about them sooner. These opportunities take the shape of orphaned or not-ideally-configured resources, which, with better and more timely remediation, could have saved the organization real money.

The all-too-common pattern is to address the surface findings of the crisis investigation, mark cloud efficiency “back on track”, and consider it done. In reality, your cloud probably had a dozen new inefficient resources provisioned before that Mission Accomplished! email even finished circulating. You need a process of continual monitoring, measurement, and remediation to ensure cloud efficiency is always on track.

Before getting deeper into this subject, a couple of points. First, attribution is mandatory to succeed in a cloud waste management endeavor. You need to know who (could be a team, an org, or an individual) is responsible for the efficiency of every resource in your fleet. Resource doesn’t always mean an EC2 or other instance – it can be an S3 bucket, a DynamoDB table, an elastic IP address – there are hundreds of different types of resources available.

The next point, as you go to address this space for your organization, is to be realistic and focused about what you can make happen given your resources. I think of this as the “depth and breadth” problem. Some organizations intentionally limit the variety of cloud usage patterns for compliance or security purposes, while others might arrive at a homogeneous technological approach based on the nature of their business. Other organizations might see tremendous variation from one team to the next not just in technology used, but in the means of implementing each technology. Where a wider variety of tech patterns exist (more “breadth”), you will be less able to provide robust pathways to remediation (“depth”) for a given level of resourcing in your program.

The 11-step program of cloud efficiency:

  1. Identify the technology costing your org the most $
  2. Within that technology, identify patterns of wasteful use
  3. Develop software to assess your entire cloud for these patterns
  4. Run the software regularly and comprehensively
  5. Calculate the savings opportunity of each observation
  6. Make the data available along appropriate channels
  7. For every facet of waste exposed, provide pathways to remediation
  8. Establish a metric for cloud efficiency and solidify executive support
  9. Repeat Steps 1-7, continuously expanding the set of wasteful patterns
  10. If you must, allow exceptions
  11. When and where appropriate, auto remediate!

1. Identify the technology costing your org the most $

This should be fairly common sense, but think of it this way – if your org is spending $1M/mo on EC2, and $10k/mo on SageMaker – even if somehow 100% of the SageMaker spend was waste and you brought it to $0, odds are even greater opportunities exist within that $1M/mo EC2 bill. Start with your biggest spend.

2. Within that technology, identify patterns of wasteful or sub-optimal use

Things get trickier here, and here’s where having cloud-tech-savvy members of the team (as discussed in the blog’s first post) pays dividends. In the case of AWS, Amazon can get you started with services like Trusted Advisor and Compute Optimizer. For most, underutilized EC2 instances are the biggest opportunity area, hence the variety of perspectives on EC2 right-sizing from AWS and others.
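If you want to pull those native signals programmatically, a minimal sketch with boto3 might look like the following. It assumes Compute Optimizer is already enabled for the account, and the exact finding strings and response field names should be double-checked against the current API documentation.

```python
# Sketch: pull EC2 rightsizing findings from AWS Compute Optimizer via boto3.
# Assumes Compute Optimizer is enabled; verify field names/enums against the API docs.
import boto3

def fetch_overprovisioned_instances(region="us-east-1"):
    client = boto3.client("compute-optimizer", region_name=region)
    results, token = [], None
    while True:
        kwargs = {"nextToken": token} if token else {}
        resp = client.get_ec2_instance_recommendations(**kwargs)
        for rec in resp.get("instanceRecommendations", []):
            # Normalize casing/underscores so we don't depend on the exact enum format.
            if rec.get("finding", "").replace("_", "").lower() == "overprovisioned":
                options = rec.get("recommendationOptions", [])
                results.append({
                    "instance_arn": rec.get("instanceArn"),
                    "current_type": rec.get("currentInstanceType"),
                    "suggested_type": options[0].get("instanceType") if options else None,
                })
        token = resp.get("nextToken")
        if not token:
            break
    return results

if __name__ == "__main__":
    for finding in fetch_overprovisioned_instances():
        print(finding)
```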

EC2 might not be the technical heart of your business – maybe it’s S3, DynamoDB, or any other service lacking rich native utilization efficiency indicators. In those cases, it’ll be up to you to understand how your teams are wielding these services, and determine which usage patterns are OK and which are not. You might also find the out-of-the-box recommendation sets insufficient and wish to generate your own.

3. Develop software to assess your entire cloud for these patterns

With a wasteful pattern identified, you now need software to generate and collect the data required to determine the efficiency of a service’s use. This data might be in the form of already-available metrics from AWS, but it might also be manifest in your own application or service logs and traces. It might be something new that needs code changes to create. Every one is its own adventure.

Within these processes, you’ll need to choose where to draw the line between efficient and inefficient use. In most cases usage efficiency falls upon a spectrum, and as the owner of the algorithm, you’ll need to be able to defend the choice you’ve made. Too low/conservative, and you end up ignoring potentially vast amounts of addressable waste and achievable savings. Too high/aggressive, and you risk incurring operational impacts, wasting time chasing low-value opportunities, and sparking the ire of teams held to unattainable standards. Don’t expect to get these all right out of the gate – do the best you can and iterate over time. Sometimes it makes sense to start with coarse, broadsword-style approaches, achieving greater surgical precision over time as more and better data becomes available. If you wait for perfect, you’ll miss months of real opportunity.

A completed software component that evaluates a facet of cloud operation and returns opportunities for efficiency gains is what I call a Waste Sensor.
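As a concrete (if tiny) example, here’s a sketch of a Waste Sensor for one of the simplest patterns – Elastic IP addresses allocated but not attached to anything. The hourly rate is an assumed placeholder; substitute the figure from your own price list.

```python
# Sketch of a simple Waste Sensor: find unattached Elastic IP addresses.
import boto3

# Illustrative rate only; look up the actual charge for an idle public IP in your price list.
ASSUMED_IDLE_EIP_HOURLY_USD = 0.005
HOURS_PER_MONTH = 730

def unattached_eip_sensor(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    observations = []
    for addr in ec2.describe_addresses().get("Addresses", []):
        if "AssociationId" not in addr:  # not attached to an instance or network interface
            observations.append({
                "resource_id": addr.get("AllocationId") or addr.get("PublicIp"),
                "pattern": "unattached-elastic-ip",
                "monthly_savings_usd": round(ASSUMED_IDLE_EIP_HOURLY_USD * HOURS_PER_MONTH, 2),
            })
    return observations

if __name__ == "__main__":
    for obs in unattached_eip_sensor():
        print(obs)
```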

4. Run the software regularly and comprehensively

All this work spotting waste won’t bring full returns if only done occasionally, or against a subset of your cloud. Ensure your waste-sensing software runs regularly, in every corner, evaluating all occurrences of potentially wasteful usage.

Something you’ll likely find once you’re measuring everyone with these sensors is their desire to immediately see their remediation activities “move the needle” of measured waste. If the sensor is only running weekly, it could be a week before teams see and can measure the results of their work. Daily is better, and even then you’re likely to have people asking for faster!
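On AWS, one lightweight way to get that daily cadence is an EventBridge rule targeting whatever runs your sensors. A sketch, with a placeholder Lambda ARN, assuming that Lambda already permits EventBridge to invoke it:

```python
# Sketch: run the sensor fleet daily via an EventBridge schedule.
# The Lambda ARN is a placeholder; it must already allow events.amazonaws.com to invoke it.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="run-waste-sensors-daily",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)

events.put_targets(
    Rule="run-waste-sensors-daily",
    Targets=[{
        "Id": "waste-sensor-runner",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:waste-sensor-runner",
    }],
)
```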

In addition, I recommend warehousing or otherwise tracking history from these sensors. With tracking you gain the ability to demonstrate how teams have taken action to reduce their measured waste over time. Along the lines of “If a tree falls in the forest…” is “If a cost savings effort was taken but not measured, did its savings really happen?” Clear, data-driven before & afters are feathers in the cap not only of the teams having taken the cleanup actions, but for you as a driver of the organization’s cloud efficiency.
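The warehouse doesn’t have to be fancy on day one. A sketch along these lines – SQLite here, with hypothetical column names – is enough to show a team’s measured waste trending down over time:

```python
# Sketch: persist daily sensor observations so waste trends can be shown over time.
# SQLite and the column names are illustrative; use whatever warehouse you already have.
import sqlite3
from datetime import date

def record_observations(db_path, observations, run_date=None):
    run_date = (run_date or date.today()).isoformat()
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS waste_observations (
            run_date TEXT, sensor TEXT, owner TEXT,
            resource_id TEXT, monthly_savings_usd REAL
        )""")
    conn.executemany(
        "INSERT INTO waste_observations VALUES (?, ?, ?, ?, ?)",
        [(run_date, o["sensor"], o["owner"], o["resource_id"], o["monthly_savings_usd"])
         for o in observations])
    conn.commit()
    conn.close()

def waste_trend(db_path, owner):
    # Daily totals for one team: the "before & after" evidence for cleanup work.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT run_date, SUM(monthly_savings_usd) FROM waste_observations "
        "WHERE owner = ? GROUP BY run_date ORDER BY run_date", (owner,)).fetchall()
    conn.close()
    return rows
```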

5. Calculate the savings opportunity of each observation

Think of savings opportunity as the difference between the price paid for the resource and what it would have cost if the resource had been correctly configured, or retired if appropriate. Again, some judgment may be required; there is no standard way of doing this.

I would suggest first choosing a standard timeframe of savings, for use across your entire fleet of waste sensors. It becomes a confusing mess if you’re measuring DynamoDB savings in $/hour, EC2 rightsizing savings opportunity in $/day, S3 savings in $/week, RDS in $/month. I find monthly works well because people can directly relate it to what they see on the invoice, or in their internal billing each month.

In AWS most costs are metered on an hourly basis, so it becomes a matter of calculating savings at that grain, then extrapolating out to your chosen common span. In the simplest case, consider an instance you’ve found severely underutilized; the team has deployed an m5.16xlarge instance, but based on the utilization metrics you’ve collected, an m5.2xlarge instance (1/8th the size) would provide more than sufficient resources to cover the system’s encountered load without incurring operational risk. If an m5.16xlarge costs $2/hour to run (making this cost up) then an m5.2xlarge should only be $.25/hour, for a savings opportunity of $1.75/hour. Extrapolated over an average 730-hour month, this is a $1277.50 opportunity.

If it were the case this resource was orphaned and a candidate for termination, the savings opportunity is the full $2/hour, or $1460/month.
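In code, the calculation is nothing more than an hourly delta extrapolated to the common timeframe. Using the made-up prices from the example above:

```python
# Sketch: savings opportunity = (current hourly cost - corrected hourly cost) * hours per month.
HOURS_PER_MONTH = 730

def monthly_savings(current_hourly_usd, corrected_hourly_usd=0.0, hours=HOURS_PER_MONTH):
    return round((current_hourly_usd - corrected_hourly_usd) * hours, 2)

# Made-up prices from the example above:
print(monthly_savings(2.00, 0.25))  # rightsize m5.16xlarge -> m5.2xlarge: 1277.5
print(monthly_savings(2.00))        # orphaned instance, terminate entirely: 1460.0
```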

In some situations the math on hourly pricing deltas is not as straightforward. In others, it may not be appropriate to perform a linear extrapolation out to a full month, if there’s reasonable expectation the resource and its descendants will not live that long.

By putting a dollar figure on every observed savings opportunity, teams are able to prioritize their efforts. If you have a good fleet of waste sensors, odds are nobody will be able to fix everything and get to zero waste. But with this common meter by which all different elements of their waste have been measured, they will be able to ensure they’ve at least addressed their greatest “bang for the buck” opportunities first.

6. Make the data available along appropriate channels

At this point you have, in a database somewhere, a listing of all the savings opportunities your collection of waste sensors has uncovered. What next?

At a minimum you’ll want to have a dashboard, report, or similar business intelligence artifact created to track everybody’s waste day by day. As mentioned above, attribution is foundational to this sort of effort, so at this point it should be academic to create org-wide rollup numbers, and provide drill-down into groups and subgroups throughout the org hierarchy, to see who’s wasting what.

Scorecard-style views, where the metrics of one group are plainly laid alongside another, can be great motivators to group leaders. Nobody wants to be at the top of the “wall of shame”.

In addition to rollup reporting for leadership (and to show off the efficacy of your program!) this data may also be suitable for programmatic consumption by individual development teams. Offering your findings through an API or similar programmatic methods allows the software and automation-focused dev teams to consume and process your insights in a manner familiar to them. The fewer manually-built spreadsheets that have to get passed around, the quicker and more comprehensively waste can be addressed.
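Whichever channels you offer, it pays to settle on one record shape for every finding so dashboards, APIs, and consuming teams all read the same fields. The field names below are hypothetical; the point is a single schema across all sensors:

```python
# Sketch: one common record shape for every waste finding, regardless of sensor.
# Field names are hypothetical; the point is a single schema for all consumers.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class WasteFinding:
    resource_id: str
    resource_type: str
    sensor: str
    owner: str                  # team/org/individual from your attribution system
    monthly_savings_usd: float
    remediation_hint: str
    detected_at: str

finding = WasteFinding(
    resource_id="i-0123456789abcdef0",
    resource_type="ec2-instance",
    sensor="ec2-rightsizing",
    owner="team-checkout",
    monthly_savings_usd=1277.50,
    remediation_hint="Resize m5.16xlarge to m5.2xlarge in the service's Terraform config",
    detected_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(finding), indent=2))
```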

7. For every facet of waste exposed, provide contextual pathways to remediation

Knowing is half the battle.

–G.I. Joe

The work thus far exposes and quantifies your organization’s opportunities to operate more efficiently and save money – you now know your waste – but so far nothing’s actually been saved. An important aspect of your program is that it facilitates quick, easy, low-risk, and lasting resolutions to observed inefficiencies. Every waste sensor must be accompanied by a treatment plan, with clear guidance provided on criteria of efficient operation.

This is where the depth/breadth factor comes into play. If you have a large and well-staffed team of cost optimization engineers addressing a tiny number of services all with common architectures, CI/CD pipelines, and programming languages, then you may be in a position to “reach in” quite a ways to the software dev teams. In theory you could provide suggestions for remediation at the source code level, but also anywhere in between, like in their CloudFormation templates or Terraform configs. For common patterns of underutilization (EC2), there exists third-party software you can purchase which can even do some of this for you.

In other situations, with a small team up against hundreds of approaches to dozens of technologies with almost nothing in common, you may not have resources to programmatically suggest higher-level pathways to remediation. This reality should be a factor for your organization in deciding what stance it takes on technology standardization. There may be benefits to allowing each team ultimate flexibility in how they code, deploy, and run their services, but that flexibility comes at the cost of additional effort and complexity when it comes time to optimize their unique approach.

8. Establish a metric for cloud efficiency and solidify executive support

As not all teams generate the same volume of spend in the cloud, it might not be fair to present the absolute $ waste of one team against another. Considering waste alone, a team incurring $100,000/month in waste might look awful next to one incurring only $10,000. However, when it’s examined in the context of overall spend, where the first team is spending $5M/month and the second only $50,000/month, the perspective shifts.

To account for this, I’ve found it valuable to establish a common metric thusly:

(Total Savings Opportunity) / (Total Monthly Spend) = Percent of spend wasted

Using this measure on the above two teams, the first:

$100,000 / $5,000,000 = 2%

The second:

$10,000 / $50,000 = 20%

This approach normalizes efficiency across teams and spends of various sizes, allowing for use uniformly throughout the organization. With attribution in place, this measurement can be made at any level within the hierarchy – the whole company, large groups, all the way down to individual teams.
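Computing the metric at any level of the hierarchy is then just a division and a rollup. A sketch using the two teams above:

```python
# Sketch: percent-of-spend-wasted for individual teams and for any rollup of teams.
def waste_pct(savings_opportunity_usd, monthly_spend_usd):
    return 100.0 * savings_opportunity_usd / monthly_spend_usd

teams = {
    "team-a": {"waste": 100_000, "spend": 5_000_000},
    "team-b": {"waste": 10_000, "spend": 50_000},
}

for name, t in teams.items():
    print(name, f"{waste_pct(t['waste'], t['spend']):.1f}%")   # 2.0%, 20.0%

# Rollup: sum numerators and denominators; don't average the percentages.
total_waste = sum(t["waste"] for t in teams.values())
total_spend = sum(t["spend"] for t in teams.values())
print("org", f"{waste_pct(total_waste, total_spend):.1f}%")
```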

With a system of measurement in place, the next step is to set goals and drive teams to achieve them. This is where establishing executive sponsorship comes into play. As a practitioner or even leader of cloud excellence within your organization, you are likely not in a position of significant authority to compel or even encourage teams to strive for a cloud waste goal. You need backing from those of sufficient influence to get all cloud operators in your organization to agree to be measured not just by pure budget adherence (or whatever other metrics they might already be assessed by) but also by this new metric.

You also need to help make sure everyone party to this measurement understands the purpose and reasonability of the metric. You’ll need to help establish reasonable goals and timeframes. For an org with a substantial cloud footprint that’s just starting to address waste, “We’re gonna get to zero waste in 6 weeks!” may sound great, but probably isn’t reasonable. But cutting measured waste in half by the end of year 1? That might be achievable if teams are given adequate cycles to focus on it.

9. Repeat Steps 1-7, continuously expanding the set of wasteful patterns

Those now subject to your cloud waste management program need to be open to an ongoing evolution in the yardstick of efficient operation. Everyone needs to know you’re “moving the goalposts” continuously, on purpose, and with your team’s full effort, so as to discover and address new patterns of inefficiency. A rich new source of savings should always be surfaced as quickly as possible, and not be put off to the next quarterly (or worse) metric review cycle.

The underutilized EC2 instance stone

There are a couple analogies I think of in regards to this continuous expansion. The first is the classic “leave no stone unturned”. Every different style of use of every different cloud service is itself a stone, underneath which potential savings from inefficient operation may be hiding. The continuous change in the ways a large organization uses the cloud and the constant influx of new and expanded services from the cloud provider, mean a constant influx of new stones worthy of investigation.

Offshore drilling of unattached elastic IP addresses

The other analogy, more relevant as the program matures, relates the process of cloud waste discovery to that of oil & gas exploration. A large organization’s cloud is a vast and rich place to discover people wasting money, just as the earth is a vast place with lots of fossil fuels tucked under the surface. But as with oil & gas, once you’ve tapped the big and obvious stores of value, subsequent discoveries are likely to generate smaller and smaller returns. At some point you’re going to be fracking for savings.

Eventually you may come across a use pattern generating only a few hundred dollars a month in waste in total across the enterprise. If the code to capture those opportunities was small and cheap to maintain and run, and the pathway to remediate quick, simple, and lasting for the teams affected, then it might be worth going after. But if not – if it required a complex new data pipeline, or if the opportunities were each very small, with risky and time-consuming steps required to remediate, then your organization may never see ROI on going after that waste.

At the same time, if you’re somewhere really big, try not to lose perspective. It may sound incredible if you’ve come from a small shop, but it might not be unheard of for a large enterprise to have millions of dollars a month in observed waste. In that setting, a new waste sensor exposing “only” $25k/month of new savings opportunity might feel not worth pursuing. That’s still $300k/year – so even if it takes an engineer a month to build and lots of time to maintain, there’s still a great return potential there.

10. If you must, allow exceptions

Once teams are held to a waste standard, you’ll see lots of interesting behavioral changes. The costs of technical decisions made in the name of disaster recovery, business continuity, or other concerns – decisions potentially generating huge wasted cost in the form of underutilized standby resources – will be brought back to the surface for re-evaluation. It’s inevitable teams will seek exclusion for some of their resources from your waste sensors, and there’s one right and two wrong ways to handle this situation.

The first wrong way is to not allow any exceptions. In addition to making everybody mad at you and question the validity of the program, it ends up distracting people with savings opportunities which aren’t really addressable. Remember, it’s your objective to maximize the savings benefit to your organization, and that effort is impeded if teams are constantly having to filter-out things they can’t fix for various (and sometimes legitimate!) reasons.

The other wrong way is to allow anybody to exclude anything at any time without restriction. I can share an anecdote from a (large!) peer company that was struggling both to get tagging going and to get their waste program launched. They eventually prepared and launched their first two waste sensors, one for EC2 instances and the other for S3 buckets. Along with the sensors, they allowed any resource receiving a certain tag key:value pair to be excluded from evaluation.

The company’s name, which I shan’t mention, always reminds me of Beaker the Muppet

Well wouldn’t you know it, practically overnight every single bucket and instance in their fleet was tagged for exclusion! Somehow the pressure of waste scrutiny cured their inability to effectively deploy tags. Unrestrained exclusion isn’t the answer either.

Instead, an approach allowing for appropriate and justified exclusions works best. A workflow begins when somebody requests an exception for a very specific technical pattern for a very specific business purpose. They provide thorough justification, and the request for exception should be reviewed for both its technical and business merits by approvers equipped to understand the situation and empowered to say no when warranted. Part of the process should include documentation of a plan to eventually address and remediate the wasteful pattern.

Only after the exception is approved should the waste be excluded from the team and overall company metrics. I recommend continuing to track this waste closely (but separately) so it does not become forgotten. Technically, tags work well for resource-level exclusions, but the automation should be in place to ensure only approved exception tags, applied to resources defined under the scope of the exception, receive exclusion.
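A sketch of that guardrail: an approved-exception registry, and a check that an exclusion tag on a resource actually corresponds to an approved exception covering that resource. The tag key and registry shape are hypothetical:

```python
# Sketch: only honor exclusion tags that map to an approved, in-scope exception.
# The tag key and registry structure are hypothetical.
EXCLUSION_TAG_KEY = "cost-waste-exception"

# Populated from your exception-approval workflow.
APPROVED_EXCEPTIONS = {
    "EXC-0042": {"resource_ids": {"i-0123456789abcdef0"}, "expires": "2025-12-31"},
}

def is_excluded(resource_id, tags):
    exception_id = tags.get(EXCLUSION_TAG_KEY)
    if not exception_id:
        return False
    exception = APPROVED_EXCEPTIONS.get(exception_id)
    # The tag must name a real exception, and the resource must be within its approved scope.
    return bool(exception) and resource_id in exception["resource_ids"]
```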

11. When and where appropriate, auto remediate!

Keeping in mind the point of all of this is to save money, or at least, to have as little as possible going to wasteful ends – it ends up being the acts of remediation that take the effort across the finish line. Knowing what needs to be done is half the battle – getting people to act on those leads is the other.

When we know what needs to be done, and there’s a straightforward path to remediation, it raises the question: why not just fix it for them? The answer comes down to risk management.

Any change action, manual or automatic, brings with it risk of operational impact to the services or applications changed. There’s a fine balance of operational vs. financial risk which must be managed; part of what you need to do is help your organization find and maintain its ideal hinge point for each sensor.

By presenting opportunities but not acting upon them, you are offloading the risk management to the team closest to the infrastructure and best positioned to perform the analysis. When your systems are remediating issues on their behalf, you are taking on the risk of impact.

The risk equation can change as circumstances change. For instance, an automatic cost-saving measure that carries a small, but non-zero risk, may not make sense to run on the days leading up to Mother’s or Valentine’s Day, for a business that sells flowers online and counts on those narrow time windows for 90% of annual revenue.

It’s also important things be fixed the right way. In the classic example of an overprovisioned EC2 instance – it is not difficult to generate a routine which terminates, or even restacks the instance to a smaller type. However, in a modern infrastructure-as-code setting, all you’re doing is introducing configuration drift between the deployed infrastructure and the coded infrastructure; your fix will be undone the next time the infrastructure is restacked, which should be often.

We can’t do it from here, I’m telling you

Just like the electrician couldn’t properly shut off electricity to Nakatomi Plaza, there’ll be some cost inefficiencies you can’t fix by playing whack-a-mole in people’s accounts. It’s got to be done from downtown, or in our case, in the infrastructure’s source code.

For each type of waste uncovered, you’ll have to decide if there’s a path to automatic remediation, and when it is safe to act. You may be able to take a more aggressive stance on accounts or environments flagged as development/non-production, than you could in Production. You could also provide teams a means of tagging resources in a manner providing cost cleanup processes with additional guidance on what’s safe or not.
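A sketch of that kind of guard, deciding per resource whether automation may act right now; the tag keys and the blackout window (think the flower shop before Valentine’s Day) are assumptions to adapt to your own environment:

```python
# Sketch: decide whether a given resource is safe to auto-remediate right now.
# Tag keys and blackout dates are illustrative assumptions.
from datetime import date

NON_PROD_ENVIRONMENTS = {"dev", "test", "staging"}
BLACKOUT_WINDOWS = [(date(2024, 2, 7), date(2024, 2, 14))]  # e.g., days leading up to Valentine's Day

def safe_to_auto_remediate(tags, today=None):
    today = today or date.today()
    if any(start <= today <= end for start, end in BLACKOUT_WINDOWS):
        return False
    if tags.get("auto-remediate", "").lower() == "never":   # team opt-out tag
        return False
    return tags.get("environment", "").lower() in NON_PROD_ENVIRONMENTS
```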

Increased investment and sophistication make more and more possible – for instance, if you can automate all the technical aspects of a fix but not the risk assessment, the remediation could be distilled down to a “One Click Repair” for the operating team. As with everything else in the broader cost opt space, survey the landscape and invest in the areas presenting the greatest potential return.

Conclusion

It can be a bit of work to build steam on a cloud efficiency program, but once some momentum has been built, excitement around its continued savings should help you keep it going. As with everything else cloud-based, data is abundant and your friend – use it to demonstrate the efficacy of your program and to reward teams for their success in eliminating waste. Nobody outside of Sales in your company can make as clear a case of their financial bottom-line contribution as you can in this role.

Unlike billing, you’re not starting off with the backbone of a solid data feed from the vendor, so be prepared to build and maintain it internally.

Have some ideas for a great waste sensor? Contact me at jason@cloudbombe.com

In the next post, I intend to get into RIs and Savings Plans, which in my experience save 10x what a good waste reduction program can, despite requiring maybe 1/50th the effort. Despite this, people love waste sensors!