Q&A: Engineering Action

During a discussion in the FinOps Foundation’s (seriously, haven’t you joined yet?) Slack space, fellow practitioner Mike Bradbury had some good questions about getting engineers to act on cost optimization.

Pinning a number on the expected gain from cost optimisation recommendations is an interesting idea and could certainly make it easier to compare the value of competing priorities. … Do you envisage a process in which ‘quantified’ cost optimisation recommendations are submitted to engineering leadership who then weigh the effort/value against similarly quantified alternative actions and select those that will be actioned and those that will not?

I wonder if cost optimisation should be thought of as ‘good housekeeping’ rather than an initiative to be compared with other competing priorities?  …  Should we be thinking of cost optimisation as simply ‘good housekeeping’ that should happen no matter what?

Mike Bradbury, Director @ Juno Services

Q1

First question first – how do we communicate optimization recommendations to engineering? Here I feel the answer depends on the size and makeup of the organization, but the simple version is “in whatever way gets them to act best”.

For a small shop, and one where things are fairly static, the landscape of opportunities is not likely to change very fast. In this situation, it might make sense to distill the observed recommendations down to a manageable set, and to register them as backlog story items in the appropriate engineering team’s queue.

However, my experience is in a large organization with several thousand developers spread across hundreds of development teams, and a large and highly dynamic cloud. At this scale, when you are examining a gigantic cloud along dozens of optimization patterns, you will find tens or hundreds of thousands of individual opportunities, with thousands coming and going each day. Flooding teams’ JIRA queues with those at the line level would be a quick way to get cost optimization uninvited from the engineering priority list.

Instead, what I’ve found works is to provide tooling allowing each dev team to see and understand their opportunity set along a variety of data dimensions. These include things like the type of waste (underutilized EC2 instances, orphaned EBS volumes, misconfigured S3 buckets, etc.), accounts, tags, related business dimensions, opportunity value, and of course, time. Even as the individual opportunity sets ebb and flow day by day, teams can zoom out to see how their actions have had net positive or negative impact to their waste profile.
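To make that concrete, here’s a minimal sketch of what slicing an opportunity set along those dimensions could look like. The record shape (waste_type, team, monthly_value, and so on) is hypothetical – in practice the rows would come from whatever scanner or billing-data pipeline produces your recommendations.

```python
from collections import defaultdict

# Hypothetical opportunity records; real ones would come from your own
# scanning/recommendation pipeline.
opportunities = [
    {"waste_type": "underutilized_ec2", "account": "111111111111", "team": "payments", "monthly_value": 1800.0},
    {"waste_type": "orphaned_ebs",      "account": "111111111111", "team": "payments", "monthly_value": 240.0},
    {"waste_type": "orphaned_ebs",      "account": "222222222222", "team": "search",   "monthly_value": 90.0},
]

def rollup(rows, dimension):
    """Sum opportunity value along a single dimension (waste_type, team, account, ...)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["monthly_value"]
    return dict(totals)

print(rollup(opportunities, "waste_type"))  # value by pattern of waste
print(rollup(opportunities, "team"))        # value by owning team
```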

As covered in Step 8 of the Waste Management post from this blog, it helps to communicate overall waste in the context of the group’s spend. A team observed wasting $50k/mo against $500k/mo of spend might be considered in poor shape (10% of spend as waste); but if that team’s business has grown fast and they’re now spending $2M/mo while still wasting $50k/mo (2.5% of spend as waste), we’d probably consider them in much better shape.
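In code, that framing is just a ratio – but it’s the ratio worth trending rather than the raw dollar figure. A quick sketch using the numbers above:

```python
def waste_ratio(monthly_waste, monthly_spend):
    """Waste expressed as a share of total spend."""
    return monthly_waste / monthly_spend

# The two scenarios from the paragraph above:
print(f"{waste_ratio(50_000, 500_000):.1%}")    # 10.0% of spend as waste -- poor shape
print(f"{waste_ratio(50_000, 2_000_000):.1%}")  # 2.5% of spend as waste -- much healthier
```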

Dashboards aren’t a bad way to start and will always have a use for rolled-up executive visibility. As your audience gets larger, expect to deliver information across an ever-broadening set of channels to best align with internal customer needs. You might need to offer up the data via API, scheduled or event-driven delivery over email or IM or SMS, raw data exports for downstream analysis, maybe even complete Jupyter notebooks. The more you are able to flex in getting data to people, the less everyone else will have to flex to get it, and the more likely they will be to take action.
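As one hedged illustration of the event-driven end of that spectrum, a per-team digest could be pushed to an SNS topic, from which email, IM, or SMS subscriptions fan out. The topic ARN and digest shape here are placeholders, not a prescription:

```python
import json
import boto3

sns = boto3.client("sns")

def publish_digest(team, topic_arn, rows):
    """Publish a team's daily opportunity digest to an SNS topic (hypothetical shape)."""
    digest = {
        "team": team,
        "opportunity_count": len(rows),
        "monthly_value": round(sum(r["monthly_value"] for r in rows), 2),
    }
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Daily cost-optimization digest for {team}",
        Message=json.dumps(digest),
    )
```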

This extends into management space; some teams may have cycles to absorb cloud optimization as part of their routine – others may not. For teams that don’t, the central cost optimization team may need to provide supplementary, program-management-style assistance. This might take the shape of helping a group establish a taxonomy for organizing their corner of the organization’s cloud, attributing the right bits to the right people and teams, and teaching them how to prioritize their observed opportunities against competing business pressures.

Q2

Second, should we think of engineering-based cost optimization as “Good Housekeeping”, or as a set of discrete Initiatives? The short but not very helpful answer is: both.

To help illustrate efforts of both types, let’s use some examples. A simple Housekeeping example: orphaned resources left around after migration to a new technology. Maybe a set of RDS instances were consolidated into a new Redshift data warehouse, and the old instances were left running but are no longer in use. Here, the team need only terminate those instances (an action that should have been part of the original Redshift migration plan), and the issue is permanently resolved.
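For detection, here’s a sketch of how one might flag RDS instances that look abandoned: zero client connections over a two-week lookback, via CloudWatch. The threshold and window are illustrative, and actual termination should still go through the owning team’s change process.

```python
from datetime import datetime, timedelta, timezone
import boto3

rds = boto3.client("rds")
cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

for db in rds.describe_db_instances()["DBInstances"]:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db["DBInstanceIdentifier"]}],
        StartTime=start,
        EndTime=end,
        Period=86400,            # one datapoint per day
        Statistics=["Maximum"],
    )
    # No datapoints above zero (or no datapoints at all) over the window.
    if all(point["Maximum"] == 0 for point in stats["Datapoints"]):
        print(f"{db['DBInstanceIdentifier']} had no connections in 14 days -- candidate for termination")
```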

Another example is unattached EBS volumes. For the longest time (it may still be the case), when an EC2 instance was terminated, the default behavior was to not release its associated EBS data volumes. Given how frequently deployments and re-stackings can occur, a team could unwittingly generate dozens or hundreds of orphaned volumes almost overnight. Resolving this pattern requires not only cleaning up the pool of already-orphaned volumes, but also updating the source code responsible for the incomplete cleanup so it stops generating new orphans.
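The cleanup half of that work is straightforward to enumerate. A short sketch listing volumes sitting unattached in the “available” state – whether to delete immediately or snapshot first is a policy call:

```python
import boto3

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_volumes")

# "available" status means the volume exists but is attached to nothing.
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
    for volume in page["Volumes"]:
        print(f"{volume['VolumeId']}: {volume['Size']} GiB, created {volume['CreateTime']:%Y-%m-%d}")
```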

The latter case – where code regularly generates new orphan or sub-optimal resources – is unfortunately the more common of the housekeeping variants. It’s also the reason why tools like Cloud Custodian (or other tools changing the runtime environment) have limitations when it comes to enforcing cost optimization.

Cloud Custodian and tools like it can address wasteful resources as they see them, but they’re essentially playing a game of “Cloud Whac-a-Mole“ – one where the moles never stop coming, and where every action taken is an additional element of configuration drift between the team’s original code and their true runtime environment.

If you’re playing Cloud Cost Whac-A-Mole, good luck getting a high score!

The right solve for these situations is to identify and remediate the source code causing the creation of new waste. In some cases it may take just one line of code to fix; in others it may require extensive refactoring, placing remediation at the higher “Initiative” level of effort described next.
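For the EBS case above, the one-line fix might be as simple as asking EC2 to release data volumes at termination time. Sketched here as a boto3 call with a placeholder AMI, device name, and sizes; the same DeleteOnTermination flag exists in launch templates, CloudFormation, and Terraform:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",          # placeholder AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sdf",
        "Ebs": {
            "VolumeSize": 100,
            "VolumeType": "gp3",
            "DeleteOnTermination": True,      # <-- the one-line fix: no new orphans
        },
    }],
)
```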

The first example I’ll give for Initiative-based cost optimization might look exactly like the first Good Housekeeping example above: a set of unused RDS instances. Maybe in this case, though, investigation into their disuse reveals they are intentionally on standby, as they’re the failover nodes in a multi-region DR (Disaster Recovery) strategy.

Now, maybe for your business this is acceptable. In that case, the right thing to do might be to allow for this waste, and to mask it from the set of actionable opportunities through a workflow-based exception process (see step 10 from the previous blog entry). In other situations, this might be seen as a cop-out, masking a shortcoming in the underlying application. Maybe the right thing to do, if the application needs multi-region fault tolerance, is to insist the engineering team work to make their application function in an active-active mode across regions. This way all resources would be utilized at all times.

This waste isn’t the result of sloppiness or bad code; it’s the result of a conscious decision based on limitations of current application architecture. For applications writing to a central database, refactoring for multi-region active-active can be a major undertaking – a big Initiative.
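If the business does decide to accept the waste, the workflow-based exception process mentioned above can be as simple as an expiring exemption list that masks resources from the actionable set while still letting them be reported as accepted waste. A minimal sketch, with hypothetical record shapes:

```python
from datetime import date

# Exceptions are granted per resource with an expiry date, forcing periodic review.
exceptions = {
    "rds-dr-standby-east": date(2026, 6, 30),   # DR standby, revisit quarterly
}

def actionable(opportunities, today=None):
    """Return only opportunities without an active (unexpired) exception."""
    today = today or date.today()
    return [
        opp for opp in opportunities
        if exceptions.get(opp["resource_id"], date.min) < today
    ]
```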

Side note: while I take care to not get too cloud-technical in this blog, AWS’s outward stance on DR and availability has evolved quite a bit over time. Many still-commonly-held perspectives on best practices have become outmoded. A recent AWS whitepaper on Cloud DR goes deeper on this, and may be of use to you in lobbying for more highly cost-optimized target technical states.

Another Initiative-style opportunity set might surround the availability of new technology. A recent example is Amazon’s release of Graviton2. In times like this, the cost optimization team can influence engineering behavior much as a country’s tax code influences the behavior of its citizenry. If research indicates a new technology like Graviton can reduce the organization’s operating costs with no operational downsides, then use of an Intel- or AMD-based instance would henceforth be considered waste.

One needs to be prudent in this process. It is a non-trivial amount of work (i.e., an “Initiative”) for a mission-critical application team to fully test new instance types against their workloads before planning and executing a switch. In the case of Graviton, managed services like RDS or ElastiCache have complete feature parity and require no code changes to migrate; there, one might be justified in moving quickly to classify non-Graviton RDS or ElastiCache instances as waste. With EC2, however, the implications are much more complex and reach down to the software-library level. Much more testing will be required, not just for operational stability but also for compliance and security. For EC2, it’d be more appropriate to enact a gentler timeline before classifying compatible non-Graviton instances as waste. The means by which new waste is identified and levied against teams must be fair and consider level of effort. Push too hard, and the citizenry will revolt!
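For the gentler managed-service case, flagging candidates can be quite simple. Here’s a hedged sketch that checks RDS instance classes against Graviton families (db.m6g, db.r6g, db.t4g, and so on) using a crude naming heuristic – validate it against current instance-class availability before anything is actually scored as waste:

```python
import boto3

rds = boto3.client("rds")

def is_graviton(instance_class):
    # e.g. "db.r6g.large" -> family "r6g"; Graviton families carry a "g"
    # after the generation digit. Crude heuristic, not an authoritative list.
    family = instance_class.split(".")[1]
    return "g" in family[2:]

for db in rds.describe_db_instances()["DBInstances"]:
    if not is_graviton(db["DBInstanceClass"]):
        print(f"{db['DBInstanceIdentifier']} on {db['DBInstanceClass']} -- Graviton candidate")
```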

What you’re likely to see if you expect your enterprise to adopt Graviton in less than 3 months

Conclusion

Throughout, it is imperative to enumerate and communicate the full world of cost-savings opportunities to engineering teams in a way that lets them quickly and easily weigh the level of effort to attain each against the benefit of the work. Some they can fix in a few seconds; others might take months of work. Expect the composition of opportunities to be a combination of mess left behind from prior work or bad code, a reflection of known-bad architectural choices they’ve made, and the showcasing of opportunities to reduce cost in light of new or changed technology from the cloud provider.