Q&A: Engineering Action

During a discussion in the FinOps Foundation’s (seriously, haven’t you joined yet?) Slack space, fellow practitioner Mike Bradbury had some good questions about getting engineers to act on cost optimization.

Pinning a number on the expected gain from cost optimisation recommendations is an interesting idea and could certainly make it easier to compare the value of competing priorities. … Do you envisage a process in which ‘quantified’ cost optimisation recommendations are submitted to engineering leadership who then weigh the effort/value against similarly quantified alternative actions and select those that will be actioned and those that will not?

I wonder if cost optimisation should be thought of as ‘good housekeeping’ rather than an initiative to be compared with other competing priorities?  …  Should we be thinking of cost optimisation as simply ‘good housekeeping’ that should happen no matter what?

Mike Bradbury, Director @ Juno Services

Q1

First question first – how do we communicate optimization recommendations to engineering? Here I feel the answer depends on the size and makeup of the organization, but the simple version is “in whatever way gets them to act best”.

For a small shop, and one where things are fairly static, the landscape of opportunities is not likely to change very fast. In this situation, it might make sense to distill down the observed recommendations to a manageable set, and to register them as backlog story items in the appropriate engineering team’s queue.

However, my experience is in a large organization with several thousand developers spread across hundreds of development teams, and a large and highly dynamic cloud. At this scale, when you are examining a gigantic cloud along dozens of optimization patterns, you will find tens or hundreds of thousands of individual opportunities, with thousands coming and going each day. Flooding teams’ JIRA queues with those at the line level would be a quick way to get cost optimization uninvited from the engineering priority list.

Instead, what I’ve found works is to provide tooling allowing each dev team to see and understand their opportunity set along a variety of data dimensions. These include things like the type of waste (underutilized EC2 instances, orphaned EBS volumes, misconfigured S3 buckets, etc.), accounts, tags, related business dimensions, opportunity value, and of course, time. Even as the individual opportunity sets ebb and flow day by day, teams can zoom out to see how their actions have had net positive or negative impact to their waste profile.
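
To make the idea concrete, here’s a minimal sketch (in Python, with hypothetical field names – your own dimensions will vary) of the kind of record each opportunity might carry so it can be sliced along those dimensions:

from dataclasses import dataclass
from datetime import date

@dataclass
class WasteOpportunity:
    # Hypothetical schema for a single observed savings opportunity
    observed_on: date            # when the sensor last saw this opportunity
    waste_type: str              # e.g. "underutilized_ec2", "orphaned_ebs"
    resource_id: str             # the specific cloud resource
    account_id: str              # owning account
    team: str                    # attribution from tags / internal mappings
    tags: dict                   # raw resource tags for further slicing
    monthly_savings_usd: float   # quantified value of fixing it

# Teams can then roll up or filter along any dimension, e.g.:
# sum(o.monthly_savings_usd for o in opportunities if o.team == "payments")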

As covered in Step 8 of the Waste Management post from this blog, it helps to communicate overall waste in context of the group’s spend. A team observed wasting $50k/mo with $500k/mo spend might be considered in poor shape (10% spend as waste); but if that team’s business has grown fast and now they’re spending $2M/mo but still wasting $50k/mo (2.5% spend as waste), we’d probably consider them in much better shape.

Dashboards aren’t a bad way to start and will always have use for rolled-up executive visibility. As your audience gets larger, you should expect to need to deliver information across an ever-broadening set of channels to best align with internal customer needs. You might need to offer up the data via API, scheduled or event-driven delivery over email or IM or SMS, raw data exports for downstream analysis, maybe even complete Jupyter notebooks. The more you are able to flex in getting data to people, the less everyone else will have to flex to get it, and the more likely they will be to take action.

This extends into management space; some teams may have cycles to absorb cloud optimization as part of their routine – others may not. For teams that don’t, the central cost opt team may need to provide supplementary, program-management-style assistance. This might take the shape of helping a group establish a taxonomy for organizing their corner of the organization’s cloud, attributing the right bits to the right people and teams, and teaching them how to prioritize their observed opportunities against competing business pressures.

Q2

Second, should we think of engineering-based cost optimization as “Good Housekeeping”, or as a set of discrete Initiatives? The short but not very helpful answer is: both.

To help illustrate efforts of both types, let’s use some examples. A simple Housekeeping example – orphaned resources left around after migration to a new technology. Maybe a set of RDS instances was consolidated into a new Redshift data warehouse. The old instances were left running but are no longer in use. Here, the team need only terminate those instances (actions which should have been part of the original Redshift migration plan), and the issue is permanently resolved.

Another example is unattached EBS volumes. For the longest time (it may still be the case) – when an EC2 instance was terminated, the default behavior was to not release its associated EBS data volume. Given how frequently deployments and re-stackings occur, a team could unwittingly generate dozens or hundreds of orphaned volumes almost overnight. Resolving this pattern requires not only cleaning up the pool of already-orphaned volumes, but also updating the source code responsible for the incomplete cleanup, amending it to cease generating new orphans.
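
As a rough illustration of the cleanup half of that work, here’s a minimal boto3 sketch (assuming standard AWS credentials and permissions are in place) that lists EBS volumes sitting in the “available” state, i.e. attached to nothing:

import boto3

ec2 = boto3.client("ec2")

# Volumes in the "available" state are not attached to any instance --
# prime candidates for review and (after verification) deletion.
paginator = ec2.get_paginator("describe_volumes")
orphans = []
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
    orphans.extend(page["Volumes"])

for vol in orphans:
    print(vol["VolumeId"], vol["Size"], "GiB", vol["CreateTime"])

# The lasting fix, though, is in the deployment code itself: ensure data
# volumes are created with DeleteOnTermination set (or are explicitly cleaned
# up) so new orphans stop being generated.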

The latter case – where code regularly generates new orphan or sub-optimal resources – is unfortunately the more common of the housekeeping variants. It’s also the reason why tools like Cloud Custodian (or other tools changing the runtime environment) have limitations when it comes to enforcing cost optimization.

Cloud Custodian and tools like it can address wasteful resources as they see them, but they’re essentially playing a game of “Cloud Whac-a-Mole” – one where the Moles never stop coming, and where every action taken adds another element of configuration drift between the team’s original code and their true runtime environment.

If you’re playing Cloud Cost Whac-A-Mole, good luck getting a high score!

The right solve for these situations is to identify and remediate the source code causing the creation of new waste. In some cases it may just take one line of code to fix; in others it may require extensive re-factoring, placing remediation at a higher “Initiative” level of effort as described next.

The first example I’ll give for Initiative-based cost optimization might look exactly like the first Good Housekeeping example above: a set of unused RDS instances. Maybe in this case, though, investigation into their disuse reveals they are intentionally on standby, as they’re the failover nodes in a multi-region DR (Disaster Recovery) strategy.

Now, maybe for your business this is acceptable. In that case, the right thing to do might be to allow for this waste, and to mask it from the set of actionable opportunities through a workflow-based exception process (see step 10 from the previous blog entry). In other situations, this might be seen as a cop-out, masking a shortcoming in the underlying application. Maybe the right thing to do, if the application needs multi-region fault tolerance, is to insist the engineering team work to make their application function in an active-active mode across regions. This way all resources would be utilized at all times.

This waste isn’t the result of sloppiness or bad code; it’s the result of a conscious decision based on limitations of current application architecture. For applications writing to a central database, refactoring for multi-region active-active can be a major undertaking – a big Initiative.

Side note: while I take care to not get too cloud-technical in this blog, AWS’s outward stance on DR and availability has evolved quite a bit over time. Many still-commonly-held perspectives on best practices have become outmoded. A recent AWS whitepaper on Cloud DR goes deeper on this, and may be of use to you in lobbying for more highly cost-optimized target technical states.

Another Initiative-style opportunity set might surround the availability of new tech. A recent example I can think of is Amazon’s release of Graviton2. In times like this, the cost optimization team can influence engineering behavior much like how a country’s tax code influences the behavior of its citizenry. If research indicates a new technology like Graviton can reduce the organization’s operating costs with no operational downsides, then use of an Intel- or AMD-based instance would henceforth be considered waste.

One needs to be prudent in this process. It is a non-trivial amount of work (i.e., an “Initiative”) for a mission-critical application team to fully test new instance types against their workloads before planning and executing a switch. In the case of Graviton, managed services like RDS or ElastiCache have complete feature parity and require no code changes to migrate. In this case, one might be justified in moving quickly to quantify non-Graviton RDS or ElastiCache instances as waste. However, with EC2, the implications are much more complex and factor in at the software-library level. Much, much more testing will be required, not just for operational stability but for compliance and security measures. For EC2, it’d be more appropriate to enact a gentler timeline before classifying compatible non-Graviton instances as waste. The means by which new waste is identified and levied against teams must be fair and consider level of effort. Push too hard, and the citizenry will revolt!
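
For the RDS case, here’s a hedged sketch of how one might flag non-Graviton instances as candidate waste – the Graviton instance-class prefixes below are illustrative, not exhaustive, and the decision to call these “waste” is a policy call, not a technical one:

import boto3

rds = boto3.client("rds")

# Illustrative (not exhaustive) set of Graviton-based RDS instance class prefixes
GRAVITON_PREFIXES = ("db.m6g", "db.r6g", "db.t4g")

paginator = rds.get_paginator("describe_db_instances")
for page in paginator.paginate():
    for db in page["DBInstances"]:
        if not db["DBInstanceClass"].startswith(GRAVITON_PREFIXES):
            print("Non-Graviton candidate:", db["DBInstanceIdentifier"], db["DBInstanceClass"])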

What you’re likely to see if you expect your enterprise to adopt Graviton in less than 3 months

Conclusion

Throughout, it is imperative to enumerate and communicate the full set of cost-savings opportunities to engineering teams in a manner that lets them quickly and easily weigh the level of effort of each fix against the benefit of the work. Some they can fix in a few seconds; others might take months of work. The composition of opportunities is expected to be a combination of mess left behind from prior work or bad code, a reflection of known-bad architectural choices they’ve made, and the showcasing of opportunities to reduce cost in light of new or changed technology from the cloud provider.

Quick Take: State of FinOps 2021 and Getting Engineers to Take Action

The FinOps Foundation recently released the results of a wide-scale survey of Cloud Cost practitioners around the world. If you’re reading this blog and aren’t already a member, I highly recommend joining the group to benefit from a broad array of perspectives and to connect with folks who may have experience solving the problems you face.

One of the key findings from the survey – the biggest challenges surrounded “Getting Engineers to Take Action”:

In this blog’s previous entry an approach for dealing with this challenge was described. While there’s a lot of actionable content there, given the prevalence of this as a pain point in the FinOps community, I thought it might make sense to zoom out and look at how to shift perspective and quickly soothe some pain amongst my peers.

Now, it would be the rare case where we as FinOps personnel are positioned organizationally to compel engineers to prioritize cost-saving measures above their competing pressures to grow/innovate, secure, and stabilize their services. This comes back to the four competing priorities engineering groups face, mentioned in the blog’s first post:

It’s human nature to believe what one toils away at all day long is critical to the organization, but an important truth to consider is that sometimes what we’re doing – while still of huge value and importance – may not be the most important thing to the organization, in that moment. If we are responsible for cost optimization but don’t see engineers actively performing cost-saving work, does that mean we’re failing at our jobs? If so what should we do about it?

My suggestion is to not judge your success in cost optimization by the completeness of every conceivable cost-opt checklist. Instead, measure it by the organization’s knowledge of, and comfort with, its position on efficiency – its attainment of a self-aware equilibrium between efficiency and competing business forces.

Engineering inaction in cost optimization is not necessarily a failure of the cost opt function. If the cost opt team has surfaced the universe of cost-saving opportunities, attributed them to the right dev teams, quantified them using metrics allowing for apples-to-apples comparison against competing priorities, provided concise assisted pathways to remediation, and presented a business case to engineering leadership for action – then the team is doing its job.

If there’s inaction, but engineering teams have only been given vague guidelines (“have you tried turning stuff off at night?”), no harvesting or quantification of their opportunity set has been performed so they can’t enumerate and prioritize their choices, and no guidance or help exists on how to fix things – then the cost opt team needs to look inward, as it still has work to do.

Engineering leadership is constantly adjusting the balance between the above four domains based on business climate and direction. A healthy balance exists when there’s a continuous evaluation of opportunities with action taken only on those offering suitably high ROI. In some quarters, you should expect cost opt to get little to no attention; the flip side is in quarters where the budget is tight you may find huge appetite to attack cost saving opps.

You can be certain the Security, Operations, and Product Management engineering-please-do lists are never complete; we shouldn’t expect the Cost Opt one to be, either. In fact, I’d worry about the prospects of a company choosing to fully prioritize today’s million-dollar cost savings opportunity over tomorrow’s billion-dollar blockbuster.

An alternative title to this post could be, “How to Sleep at Night as a FinOps Practitioner, Knowing Your Org is Wasting Megabucks in the Cloud” 😉

Finding, Tracking, and Holding Teams Accountable for Savings Opportunities (aka Cloud Waste)

This was originally going to come after a post on maximizing savings with Reserved Instances and Savings Plans – but due to popular demand, I’m moving it forward. This blog post is intended to equip its reader with a proven approach for measuring and driving reduction in wasteful cloud use and costs. While AWS services will be mentioned, none of the principles are specific to their cloud.

As an organization grows its cloud use – whether through migration or organic growth – the parties involved reasonably expect costs to grow in kind. Despite this, most orgs inevitably face one or more “cost crises” on their journey: moments where costs pass certain thresholds (like the first time the cloud bill grows a new digit), or the first time costs exceed budgetary projections by a large amount. These occasions spawn fresh examinations of cloud use, seeking to “get things back under control”.

Deploying the magnifying glass reveals places where some costs contributing to the crisis could have been avoided, if only people had known about them sooner. These opportunities take the shape of orphaned or not-ideally-configured resources, which, with better and more timely remediation, could have saved the organization real money.

The all-too-common pattern is to address the surface findings of the crisis investigation, mark cloud efficiency “back on track”, and consider it done. In reality, your cloud probably had a dozen new inefficient resources provisioned before that Mission Accomplished! email even finished circulating. You need a process of continual monitoring, measurement, and remediation to ensure cloud efficiency is always on track.

Before getting deeper on this subject, a couple of points. First, attribution is mandatory to succeed in a cloud waste management endeavor. You need to know who (it could be a team, an org, or an individual) is responsible for the efficiency of every resource in your fleet. Resource doesn’t always mean an EC2 or other instance – it can be an S3 bucket, a DynamoDB table, an elastic IP address – there are hundreds of different types of resources available.

The second point, as you go to address this space for your organization, is to be realistic and focused about what you work to make happen given your resources. I think of this as the “depth and breadth” problem. Some organizations intentionally limit the variety of cloud usage patterns for compliance or security purposes, while others might arrive at a homogeneous technological approach based on the nature of their business. Other organizations might see tremendous variation from one team to the next, not just in technology used but in the means of implementing each technology. Where a wider variety of tech patterns exists (more “breadth”), you will be less able to provide robust pathways to remediation (“depth”) for a given level of resourcing in your program.

The 11-step program of cloud efficiency:

  1. Identify the technology costing your org the most $
  2. Within that technology, identify patterns of wasteful use
  3. Develop software to assess your entire cloud for these patterns
  4. Run the software regularly and comprehensively
  5. Calculate the savings opportunity of each observation
  6. Make the data available along appropriate channels
  7. For every facet of waste exposed, provide pathways to remediation
  8. Establish a metric for cloud efficiency and solidify executive support
  9. Repeat Steps 1-7, continuously expanding the set of wasteful patterns
  10. If you must, allow exceptions
  11. When and where appropriate, auto remediate!

1. Identify the technology costing your org the most $

This should be fairly common sense, but think of it this way – if your org is spending $1M/mo on EC2, and $10k/mo on SageMaker – even if somehow 100% of the SageMaker spend was waste and you brought it to $0, odds are even greater opportunities exist within that $1M/mo EC2 bill. Start with your biggest spend.
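
On AWS, one quick way to get that ranking is the Cost Explorer API. A minimal sketch – the dates are placeholders, and it assumes the caller has the ce:GetCostAndUsage permission:

import boto3

ce = boto3.client("ce")  # Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-01-01", "End": "2021-02-01"},  # placeholder month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = resp["ResultsByTime"][0]["Groups"]
# Sort services by spend, descending -- start your waste hunt at the top.
top = sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True)
for g in top[:10]:
    print(g["Keys"][0], round(float(g["Metrics"]["UnblendedCost"]["Amount"]), 2))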

2. Within that technology, identify patterns of wasteful or sub-optimal use

Things get trickier here, and here’s where having cloud-tech-savvy members of the team (as discussed in the blog’s first post) pays dividends. In the case of AWS, Amazon can get you started with services like Trusted Advisor and Compute Optimizer. For most, underutilized EC2 instances are the biggest opportunity area, hence the variety of perspectives on EC2 right-sizing from AWS and others.

EC2 might not be the technical heart of your business – maybe it’s S3, DynamoDB, or any other service lacking rich native utilization efficiency indicators. In those cases, it’ll be up to you to understand how your teams are wielding these services, and determine what usage patterns are OK and which are not. You might also find the out-of-the-box recommendation sets insufficient and wish to generate your own.

3. Develop software to assess your entire cloud for these patterns

With a wasteful pattern identified, you now need software to generate and collect the data required to determine the efficiency of a service’s use. This data might be in the form of already-available metrics from AWS, but it might also be manifest in your own application or service logs and traces. It might be something new that needs code changes to create. Every one is its own adventure.

Within these processes, you’ll need to choose where to draw the line between efficient and inefficient use. In most cases usage efficiency falls upon a spectrum, and as the owner of the algorithm, you’ll need to be able to defend the choice you’ve made. Too low/conservative, and you end up ignoring potentially vast amounts of addressable waste and achievable savings. Too high/aggressive, and you risk incurring operational impacts, wasting time chasing low-value opportunities, and sparking the ire of teams held to unattainable standards. Don’t expect to get these all right out of the gate – do the best you can and iterate over time. Sometimes it makes sense to start with coarse, broadsword-style approaches, achieving greater surgical precision over time as more and better data becomes available. If you wait for perfect, you’ll miss months of real opportunity.

A completed software component that evaluates a facet of cloud operation and returns opportunities for efficiency gains is what I call a Waste Sensor.
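
As an illustrative sketch (not a production sensor), here’s what a simple underutilized-EC2 Waste Sensor might look like, assuming a deliberately coarse threshold: instances whose hourly-average CPU never exceeded 5% over a two-week lookback.

import boto3
from datetime import datetime, timedelta

CPU_THRESHOLD = 5.0          # percent -- a coarse, defensible starting line
LOOKBACK = timedelta(days=14)

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

def underutilized_ec2_sensor():
    """Yield (instance id, instance type) for running instances whose hourly
    average CPU never exceeded the threshold over the lookback window."""
    now = datetime.utcnow()
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=[{"Name": "instance-state-name", "Values": ["running"]}]):
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                stats = cw.get_metric_statistics(
                    Namespace="AWS/EC2",
                    MetricName="CPUUtilization",
                    Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                    StartTime=now - LOOKBACK,
                    EndTime=now,
                    Period=3600,
                    Statistics=["Average"],
                )
                points = stats["Datapoints"]
                if points and max(p["Average"] for p in points) < CPU_THRESHOLD:
                    yield inst["InstanceId"], inst["InstanceType"]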

4. Run the software regularly and comprehensively

All this work spotting waste won’t bring full returns if only done occasionally, or against a subset of your cloud. Ensure your waste-sensing software runs regularly, in every corner, evaluating all occurrences of potentially wasteful usage.

Something you’ll likely find once you’re measuring everyone with these sensors is their desire to immediately see their remediation activities “move the needle” of measured waste. If a sensor only runs weekly, it could be a week before teams see and can measure the results of their work. Daily is better, and even then you’re likely to have people asking for faster!

In addition, I recommend warehousing or otherwise tracking history from these sensors. With tracking you gain the ability to demonstrate how teams have taken action to reduce their measured waste over time. Along the lines of “If a tree falls in the forest…” is “If a cost savings effort was taken but not measured, did its savings really happen?” Clear, data-driven before & afters are feathers in the cap not only of the teams having taken the cleanup actions, but for you as a driver of the organization’s cloud efficiency.
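
A minimal sketch of that history-keeping, assuming sensor output is JSON-serializable and an S3 bucket of your choosing (the bucket name here is hypothetical):

import boto3
import json
from datetime import date

s3 = boto3.client("s3")
BUCKET = "my-waste-sensor-history"   # hypothetical bucket name

def archive_sensor_run(sensor_name, opportunities):
    """Store today's sensor output under a date-partitioned key so
    before/after trends can be reported later."""
    key = f"{sensor_name}/{date.today().isoformat()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(opportunities).encode("utf-8"))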

5. Calculate the savings opportunity of each observation

Think of savings opportunity as the difference between the price paid for the resource and what it would have cost if the resource had been correctly configured – or retired, if appropriate. Again, some judgment may be required; there is no standard way of doing this.

I would suggest first choosing a standard timeframe of savings, for use across your entire fleet of waste sensors. It becomes a confusing mess if you’re measuring DynamoDB savings in $/hour, EC2 rightsizing savings opportunity in $/day, S3 savings in $/week, RDS in $/month. I find monthly works well because people can directly relate it to what they see on the invoice, or in their internal billing each month.

In AWS most costs are metered on an hourly basis, so it becomes a matter of calculating savings at that grain, then extrapolating out to your chosen common span. In the simplest case, consider an instance you’ve found severely underutilized; the team has deployed an m5.16xlarge instance, but based on the utilization metrics you’ve collected, an m5.2xlarge instance (1/8th the size) would provide more than sufficient resources to cover the system’s encountered load without incurring operational risk. If an m5.16xlarge costs $2/hour to run (making this cost up) then an m5.2xlarge should only be $.25/hour, for a savings opportunity of $1.75/hour. Extrapolated over an average 730-hour month, this is a $1277.50 opportunity.

If it were the case this resource was orphaned and a candidate for termination, the savings opportunity is the full $2/hour, or $1460/month.
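
The arithmetic is trivial, but it’s worth standardizing in one place so every waste sensor extrapolates the same way. A sketch, assuming hourly pricing inputs:

HOURS_PER_MONTH = 730  # commonly used monthly average

def monthly_savings(current_hourly_usd, optimized_hourly_usd=0.0):
    """Savings opportunity extrapolated to a standard month.
    optimized_hourly_usd defaults to 0 for terminate/retire cases."""
    return (current_hourly_usd - optimized_hourly_usd) * HOURS_PER_MONTH

# The examples above:
# monthly_savings(2.00, 0.25) -> 1277.50   (rightsize m5.16xlarge -> m5.2xlarge)
# monthly_savings(2.00)       -> 1460.00   (orphaned instance, terminate)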

In some situations the math on hourly pricing deltas is not as straightforward. In others, it may not be appropriate to perform a linear extrapolation out to a full month, if there’s reasonable expectation the resource and its descendants will not live that long.

By putting a dollar figure on every observed savings opportunity, teams are able to prioritize their efforts. If you have a good fleet of waste sensors, odds are nobody will be able to fix everything and get to zero waste. But with this common meter by which all different elements of their waste have been measured, they will be able to ensure they’ve at least addressed their greatest “bang for the buck” opportunities first.

6. Make the data available along appropriate channels

At this point, in a database somewhere, you have a listing of all the savings opportunities your collection of waste sensors has uncovered. What next?

At a minimum you’ll want to have a dashboard, report, or similar business intelligence artifact created to track everybody’s waste day by day. As mentioned above, attribution is foundational to this sort of effort, so at this point it should be academic to create org-wide rollup numbers, and provide drill-down into groups and subgroups throughout the org hierarchy, to see who’s wasting what.

Scorecard-style views, where the metrics of one group are plainly laid alongside another, can be great motivators to group leaders. Nobody wants to be at the top of the “wall of shame”.

In addition to rollup reporting for leadership (and to show off the efficacy of your program!) this data may also be suitable for programmatic consumption by individual development teams. Offering your findings through an API or similar programmatic methods allows the software and automation-focused dev teams to consume and process your insights in a manner familiar to them. The fewer manually-built spreadsheets that have to get passed around, the quicker and more comprehensively waste can be addressed.

7. For every facet of waste exposed, provide contextual pathways to remediation

Knowing is half the battle.

–G.I. Joe

The work thus far exposes and quantifies your organization’s opportunities to operate more efficiently and save money – you now know your waste – but so far nothing’s actually been saved. An important aspect of your program is that it facilitates quick, easy, low-risk, and lasting resolutions to observed inefficiencies. Every waste sensor must be accompanied by a treatment plan, with clear guidance provided on criteria of efficient operation.

This is where the depth/breadth factor comes into play. If you have a large and well-staffed team of cost optimization engineers addressing a tiny number of services all with common architectures, CI/CD pipelines, and programming languages, then you may be in a position to “reach in” quite a ways to the software dev teams. In theory you could provide suggestions for remediation at the source code level, but also anywhere in between, like in their CloudFormation templates or Terraform configs. For common patterns of underutilization (EC2), there exists third-party software you can purchase which can even do some of this for you.

In other situations, with a small team up against hundreds of approaches to dozens of technologies with almost nothing in common, you may not have the resources to programmatically suggest higher-level pathways to remediation. This reality should be a factor for your organization in deciding what stance it takes on technology standardization. There may be benefits to allowing each team ultimate flexibility in how they code, deploy, and run their services, but that flexibility comes at the cost of additional effort and complexity when it comes time to optimize their unique approach.

8. Establish a metric for cloud efficiency and solidify executive support

As not all teams generate the same volume of spend in the cloud, it might not be fair to present the absolute $ waste of one team against another. Considering waste alone, a team incurring $100,000/month in waste might look awful next to one incurring only $10,000. However, when it’s examined in the context of overall spend, where the first team is spending $5M/month and the second only $50,000/month, the perspective shifts.

To account for this, I’ve found it valuable to establish a common metric thusly:

(Total Savings Opportunity) / (Total Monthly Spend) = Percent of spend wasted

Using this measure on the above two teams, the first:

$100,000 / $5,000,000 = 2%

The second:

$10,000 / $50,000 = 20%

This approach normalizes efficiency across teams and spends of various sizes, allowing for use uniformly throughout the organization. With attribution in place, this measurement can be made at any level within the hierarchy – the whole company, large groups, all the way down to individual teams.
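
Expressed as code the metric is a one-liner; the value comes from applying it uniformly at every level of the hierarchy:

def percent_of_spend_wasted(total_savings_opportunity_usd, total_monthly_spend_usd):
    """Normalized efficiency metric: share of monthly spend identified as waste."""
    return 100.0 * total_savings_opportunity_usd / total_monthly_spend_usd

# The two teams above:
# percent_of_spend_wasted(100_000, 5_000_000) ->  2.0
# percent_of_spend_wasted(10_000, 50_000)     -> 20.0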

With a system of measurement in place, the next step is to set goals and drive teams to achieve them. This is where establishing executive sponsorship comes into play. As a practitioner or even leader of cloud excellence within your organization, you are likely not in a position of significant authority to compel or even encourage teams to strive for a cloud waste goal. You need backing from those with sufficient influence to get all cloud operators in your organization to agree to be measured not just by pure budget adherence (or whatever other metrics they might already be assessed by) but also by this new metric.

You also need to help make sure everyone party to this measurement understands the purpose and reasonability of the metric. You’ll need to help establish reasonable goals and timeframes. For an org with a substantial cloud footprint that’s just starting to address waste, “We’re gonna get to zero waste in 6 weeks!” may sound great, but probably isn’t reasonable. But cutting measured waste in half by the end of year 1? That might be achievable if teams are given adequate cycles to focus on it.

9. Repeat Steps 1-7, continuously expanding the set of wasteful patterns

Those now subject to your cloud waste management program need to be open to an ongoing evolution in the yardstick of efficient operation. Everyone needs to know you’re “moving the goalposts” continuously, on purpose, and with your team’s full effort, so as to discover and address new patterns of inefficiency. A rich new source of savings should always be surfaced as quickly as possible, and not be put off to the next quarterly (or worse) metric review cycle.

The underutilized EC2 instance stone

There are a couple analogies I think of in regards to this continuous expansion. The first is the classic “leave no stone unturned”. Every different style of use of every different cloud service is itself a stone, underneath which potential savings from inefficient operation may be hiding. The continuous change in the ways a large organization uses the cloud and the constant influx of new and expanded services from the cloud provider, mean a constant influx of new stones worthy of investigation.

Offshore drilling of unattached elastic IP addresses

The other analogy, more relevant as the program matures, relates the process of cloud waste discovery to that of oil & gas exploration. A large organization’s cloud is a vast and rich place to discover people wasting money, just as the earth is a vast place with lots of fossil fuels tucked under the surface. But as with oil & gas, once you’ve tapped the big and obvious stores of value, subsequent discoveries are likely to generate smaller and smaller returns. At some point you’re going to be fracking for savings.

Eventually you may come across a use pattern generating only a few hundred dollars a month in waste in total across the enterprise. If the code to capture those opportunities was small and cheap to maintain and run, and the pathway to remediate quick, simple, and lasting for the teams affected, then it might be worth going after. But if not – if it required a complex new data pipeline, or if the opportunities were each very small, with risky and time-consuming steps required to remediate, then your organization may never see ROI on going after that waste.

At the same time, if you’re somewhere really big, try not to lose perspective. It may sound incredible if you’ve come from a small shop, but it might not be unheard of for a large enterprise to have millions of dollars a month in observed waste. In that setting, a new waste sensor exposing “only” $25k/month of new savings opportunity might feel not worth pursuing. That’s still $300k/year – so even if it takes an engineer a month to build and lots of time to maintain, there’s still a great return potential there.

10. If you must, allow exceptions

Once teams are held to a waste standard, you’ll see lots of interesting behavioral changes. The costs of technical decisions made in the name of disaster recovery, business continuity, or other concerns – decisions potentially generating huge wasted cost in the form of underutilized standby resources – will be brought back to the surface for re-evaluation. It’s inevitable that teams will seek exclusion of some of their resources from your waste sensors, and there’s one right and two wrong ways to handle this situation.

The first wrong way is to not allow any exceptions. In addition to making everybody mad at you and leading them to question the validity of the program, it ends up distracting people with savings opportunities which aren’t really addressable. Remember, it’s your objective to maximize the savings benefit to your organization, and that effort is impeded if teams are constantly having to filter out things they can’t fix for various (and sometimes legitimate!) reasons.

The other wrong way is to allow anybody to exclude anything at any time without restriction. I can share an anecdote from a (large!) peer company that was struggling both to get tagging going and to get their waste program launched. They eventually prepared and launched their first two waste sensors, one for EC2 instances and the other for S3 buckets. Along with the sensors, they allowed any resource receiving a certain tag key:value pair to be excluded from evaluation.

The company’s name, which I shan’t mention, always reminds me of Beaker the Muppet

Well wouldn’t you know it, practically overnight every single bucket and instance in their fleet was tagged for exclusion! Somehow the pressure of waste scrutiny cured their inability to effectively deploy tags. Unrestrained exclusion isn’t the answer either.

Instead, an approach allowing for appropriate and justified exclusions works best. A workflow begins when somebody requests an exception for a very specific technical pattern for a very specific business purpose. They provide thorough justification, and the request for exception should be reviewed for both its technical and business merits by approvers equipped to understand the situation and empowered to say no when warranted. Part of the process should include documentation of a plan to eventually address and remediate the wasteful pattern.

Only after the exception is approved should the waste be excluded from the team and overall company metrics. I recommend continuing to track this waste closely (but separately) so it does not become forgotten. Technically, tags work well for resource-level exclusions, but the automation should be in place to ensure only approved exception tags, applied to resources defined under the scope of the exception, receive exclusion.
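
A sketch of that guardrail, assuming the exception-approval workflow produces a registry of approved exceptions and the resources each one covers (the tag key and registry shape here are hypothetical):

# Hypothetical registry produced by the exception-approval workflow:
# exception tag value -> set of resource IDs it was approved to cover
APPROVED_EXCEPTIONS = {
    "EXC-0042-dr-standby": {"i-0abc123", "i-0def456"},
}

def is_excluded(resource_id, tags):
    """Honor an exclusion tag only if the exception is approved AND this
    resource is within the exception's approved scope."""
    exception_id = tags.get("cost-opt-exception")   # hypothetical tag key
    if not exception_id:
        return False
    return resource_id in APPROVED_EXCEPTIONS.get(exception_id, set())

# Excluded waste should still be tallied separately so it isn't forgotten.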

11. When and where appropriate, auto remediate!

Keeping in mind the point of all of this is to save money, or at least, to have as little as possible going to wasteful ends – it ends up being the acts of remediation that take the effort across the finish line. Knowing what needs to be done is half the battle – getting people to act on those leads is the other.

When we know what needs to be done, and there’s a straightforward path to remediation, it raises the question: why not just fix it for them? The answer comes down to a new question of risk management.

Any change action, manual or automatic, brings with it risk of operational impact to the services or applications changed. There’s a fine balance of operational vs. financial risk which must be managed; part of what you need to do is help your organization find and maintain its ideal hinge point for each sensor.

By presenting opportunities but not acting upon them, you are offloading the risk management to the team closest to the infrastructure and best positioned to perform the analysis. When your systems are remediating issues on their behalf, you are taking on the risk of impact.

The risk equation can change as circumstances change. For instance, an automatic cost-saving measure that carries a small, but non-zero risk, may not make sense to run on the days leading up to Mother’s or Valentine’s Day, for a business that sells flowers online and counts on those narrow time windows for 90% of annual revenue.

It’s also important things be fixed the right way. In the classic example of an overprovisioned EC2 instance – it is not difficult to generate a routine which terminates, or even restacks the instance to a smaller type. However, in a modern infrastructure-as-code setting, all you’re doing is introducing configuration drift between the deployed infrastructure and the coded infrastructure; your fix will be undone the next time the infrastructure is restacked, which should be often.

We can’t do it from here, I’m telling you

Just like the electrician couldn’t properly shut off electricity to Nakatomi Plaza, there’ll be some cost inefficiencies you can’t fix by playing whack-a-mole in people’s accounts. It’s got to be done from downtown, or in our case, in the infrastructure’s source code.

For each type of waste uncovered, you’ll have to decide if there’s a path to automatic remediation, and when it is safe to act. You may be able to take a more aggressive stance on accounts or environments flagged as development/non-production, than you could in Production. You could also provide teams a means of tagging resources in a manner providing cost cleanup processes with additional guidance on what’s safe or not.
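
Here’s a sketch of that kind of guarded auto-remediation for the orphaned-volume case, assuming hypothetical “environment” and “cost-opt-auto-clean” tags as the opt-in signals and defaulting to a dry run:

import boto3

ec2 = boto3.client("ec2")

def clean_orphaned_volumes(dry_run=True):
    """Delete unattached EBS volumes, but only in non-production environments
    whose owners have explicitly opted in via tag."""
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
        for vol in page["Volumes"]:
            tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
            if tags.get("environment") == "dev" and tags.get("cost-opt-auto-clean") == "true":
                print(("DRY RUN: would delete" if dry_run else "Deleting"), vol["VolumeId"])
                if not dry_run:
                    ec2.delete_volume(VolumeId=vol["VolumeId"])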

Increased investment and sophistication make more and more possible – for instance, if you can automate all the technical aspects of a fix but not the risk assessment, the remediation can be distilled down to a “One Click Repair” for the operating team. As with everything else in the broader cost opt space, survey the landscape and invest in the areas presenting the greatest potential return.

Conclusion

It can take a bit of work to build up steam on a cloud efficiency program, but once some momentum exists, excitement around its continued savings should help you keep it going. As with everything else cloud-based, data is abundant and your friend – use it to demonstrate the efficacy of your program and to reward teams for their success in eliminating waste. Nobody outside of Sales in your company can make as clear a case for their financial bottom-line contribution as you can in this role.

Unlike billing, you’re not starting off with the backbone of a solid data feed from the vendor, so be prepared to build and maintain it internally.

Have some ideas for a great waste sensor? Contact me at jason@cloudbombe.com

In the next post, I intend to get into RIs and Savings Plans, which in my experience save 10x what a good waste reduction program can, despite requiring maybe 1/50th the effort. Despite this, people love waste sensors!