Cost and infrastructure optimization in the cloud - what do you need to focus on before migration?
One topic we address with customers almost daily is cloud costs and their optimization. The well-known State of the Cloud Report confirms that this is a crucial topic for organisations of all sizes. So where exactly should you look for savings before migration? The right choice of servers, disks and services is key.
Martin Gavanda
In this article I will primarily focus on Amazon Web Services (AWS) - after all, I am an AWS architect - but most of the recommendations mentioned are easily transferable to Microsoft Azure or Google Cloud.
This article is a loose continuation of my previous text, in which I discussed cost optimization in the cloud in general and described its basic principles. This time we'll go into more detail and focus on the technical aspects of cloud infrastructure from the point of view of its architect.
The State of the Cloud Report survey shows that regardless of the extent of cloud use in an organization, this topic is absolutely key for everyone, whether they are just starting out with the cloud or already run many applications on it.
What's ahead before migration to the cloud
The cloud brings a huge number of services, each with its own specific pricing and characteristics. On the one hand, this creates great opportunities for optimization; on the other, it places much greater demands on architects.
When designing on-premise infrastructure, we typically counted processors, RAM and disk capacity. Some of us may also have distinguished between different types of storage arrays (after all, storage built from SSDs has a different price than storage built from magnetic disks).
In the cloud, however, the situation is more complex and we need to know many more parameters. Or rather, we don't have to, but then the infrastructure will not be optimal, either technically or cost-wise. So what should we focus on?
Choose the right servers
When designing cloud infrastructure, most of us no longer consider a 1:1 copy of the existing environment, but let's stop there for a moment.
If we run a server with 4 CPUs and 16 GB of RAM on-premise, it does not automatically mean (though it might!) that we need the same server in the cloud. Always design infrastructure that meets the realistic performance requirements of your application. Which, of course, means that we have to monitor the existing infrastructure.
The following figure shows the real load of one database server (8 CPUs):
At first glance, we see that the server is "doing almost nothing" most of the time, but occasionally there is a small spike. How to deal with such a workload?
A conservative architect would conclude that peak server utilization exceeds 50 % and that halving the number of CPUs (from 8 to 4) is therefore not possible. And this may be true.
But let's now focus on the actual processor on which this virtual server is running - in this case an Intel Xeon Gold 6148. What is its real performance compared to, for example, C7i instances in AWS?
The current C7i compute-optimized instances run on Intel Xeon Platinum 8488 processors, which are much more powerful than the older generation. It can therefore be assumed that reducing the number of processors from 8 to 4 will have no impact on performance.
Cost optimization before migration with burstable instances
In addition, a cloud architect can consider burstable instances, which offer relatively high performance in the short term but cannot sustain this "burst" continuously. I don't usually recommend them for long-term use, but they can be an interesting alternative.
The following table shows the maximum "baseline" performance of these instances in the second half of 2024:
Source: https://aws.amazon.com/ec2/instance-types/t3/
AWS offers T3a burstable instances (running on the AMD EPYC 7571), which are roughly as powerful as the Intel Xeon Gold 6148. For example, the t3a.2xlarge instance, which has 8 processors, can run constantly at 40 % baseline and occasionally at full power. What would that look like in our case?
Source: https://www.cpubenchmark.net/compare/3543vs3176/AMD-EPYC-7571-vs-Intel-Xeon-Gold-6148
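Whether a burstable instance can absorb a given load profile comes down to simple credit arithmetic: one CPU credit equals one vCPU running at 100 % for one minute, and a t3a.2xlarge earns 192 credits per hour (with a ceiling of 4 608) according to the AWS documentation. The minimal sketch below models a hypothetical day of load similar to the database server above; the utilisation profile is an assumption, not measured data.

```python
# CPU-credit arithmetic for a t3a.2xlarge (8 vCPUs, 40 % baseline).
VCPUS = 8
CREDITS_EARNED_PER_HOUR = 192        # t3a.2xlarge earn rate per AWS docs
MAX_CREDIT_BALANCE = 4_608           # t3a.2xlarge credit ceiling per AWS docs

# Hypothetical hourly CPU utilisation (0.0-1.0): mostly idle, a few spikes.
hourly_utilisation = [0.10] * 20 + [0.90, 0.95, 0.60, 0.15]

balance = MAX_CREDIT_BALANCE         # assume a full credit bucket at the start
lowest = balance
for util in hourly_utilisation:
    spent = VCPUS * util * 60        # one credit = one vCPU at 100 % for 1 min
    balance = min(balance + CREDITS_EARNED_PER_HOUR - spent, MAX_CREDIT_BALANCE)
    lowest = min(lowest, balance)

print(f"lowest credit balance during the day: {lowest:.0f}")
if lowest <= 0:
    print("Credits run out - the instance would be throttled to its baseline.")
else:
    print("The spikes fit within the credit budget - a burstable instance is viable.")
```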
What can we take away from this example? That we have several options available, each with its own pros, cons and price:
- We can reach for a more powerful EC2 instance and reduce the number of CPUs (which can have a major impact on the price of licenses for a database server!).
- We can keep the 1:1 state and choose some "average" EC2 instance with 8 CPUs.
- We can use (cheaper) burstable instances to cover the peaks.
- We can possibly sacrifice a little performance in exchange for the best price.
Scenario | Preserving the state | More powerful instances | Burstable instance | Lower power |
---|---|---|---|---|
Instance type | m6a.2xlarge | m7i.xlarge | t3a.2xlarge | m7i.large |
Number of CPUs | 8 | 4 | 8 | 2 |
Price (PAYG, per month) | 302 USD | 176 USD | 252 USD | 88 USD |
Source: https://calculator.aws/#/estimate?id=0f7a94f7d6de4bf81decae4d07213921986c0226
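If you want to pull these on-demand prices programmatically instead of clicking through the calculator, the AWS Price List API can do it. Below is a rough boto3 sketch; the location string and filter values are assumptions you may need to adjust for your region and operating system.

```python
# Query on-demand Linux prices for the instance types from the table via the
# AWS Price List API. The "pricing" client is only served from selected
# regions such as us-east-1; filter values are assumptions - adjust as needed.
import json
import boto3

pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_hourly_price(instance_type: str, location: str = "EU (Frankfurt)") -> float:
    resp = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])
    on_demand = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(on_demand["priceDimensions"].values()))
    return float(dimension["pricePerUnit"]["USD"])

for itype in ["m6a.2xlarge", "m7i.xlarge", "t3a.2xlarge", "m7i.large"]:
    hourly = on_demand_hourly_price(itype)
    print(f"{itype:<12} {hourly:.4f} USD/h  ~{hourly * 730:.0f} USD/month")
```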
So the number of CPUs alone is not the deciding factor. When choosing the right server in EC2, we also take other aspects into account:
- How busy is my server? (A short measurement sketch follows this list.)
- What processor am I using (or can I use)?
- Do I need a constant load or burstable?
- If I reduce the server performance, will it have any effect on the application itself?
- Does the number of CPUs affect the licensing of an application or server?
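To answer the first question with data rather than gut feeling, two weeks of CPU metrics are usually a good start. Below is a minimal boto3 sketch that works for a server already running in EC2 (or one reporting to CloudWatch through the agent); the instance ID and region are placeholders.

```python
# Pull two weeks of hourly CPU utilisation from CloudWatch and summarise it.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)          # two weeks of history

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start,
    EndTime=end,
    Period=3600,                          # hourly data points
    Statistics=["Average", "Maximum"],
)

points = stats["Datapoints"]
if not points:
    raise SystemExit("No datapoints - check the instance ID and region.")

averages = [p["Average"] for p in points]
peaks = [p["Maximum"] for p in points]
print(f"mean of hourly averages: {sum(averages) / len(averages):.1f} %")
print(f"highest hourly peak:     {max(peaks):.1f} %")
```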
Choose the right disks
Choosing the right type of disk for a server can be as challenging as choosing the server itself. AWS offers six types of EBS disks with different characteristics:
- general purpose SSDs (gp2, gp3)
- provisioned IOPS SSDs (io1, io2)
- traditional HDDs (st1, sc1)
How do these disks differ, and what impact will our choice have on the price? Let's start with the SSDs:
Source: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volume-types.html
How to read the table above? If you don't know what you need, choose a standard gp3 SSD. If you do know what you need, then this section is not for you 🙂
General purpose SSDs
Compared to gp2, gp3 disks offer better performance characteristics. However, while with gp2 disks you only need to specify the size in GB, with gp3 you also need to define additional parameters (or you don't, if the baseline is enough):
It should also be noted that the original gp2 disks are burstable and do not provide continuous performance. I therefore recommend avoiding gp2 disks and reaching for gp3 instead.
You can then decide according to the disk size, which determines "how long" the disk will deliver the required IO performance:
Source: https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html
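For illustration, this is roughly what provisioning a gp3 volume with explicitly raised IOPS and throughput looks like in boto3. The size, availability zone and values are placeholders, not a recommendation.

```python
# Create a gp3 volume with IOPS and throughput set above the included
# baseline of 3 000 IOPS and 125 MB/s.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

volume = ec2.create_volume(
    AvailabilityZone="eu-central-1a",
    Size=500,                 # GiB
    VolumeType="gp3",
    Iops=6000,                # above the 3 000 IOPS included in the base price
    Throughput=250,           # MB/s, above the 125 MB/s baseline
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "Name", "Value": "db-data-gp3"}],
    }],
)
print(volume["VolumeId"], volume["Iops"], volume["Throughput"])
```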
Provisioned IOPS disks
If you need a really powerful disk, you'll probably reach for io1 or io2. This time you have to explicitly define the number of IOPS, which is reflected in the price. The maximum number of IOPS is tied to the size of the disk, and the ratio is as follows:
- io1: 50 IOPS per GB
- io2: 1 000 IOPS per GB
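To make the ratios tangible, here is a tiny sketch that answers "how many IOPS can I provision on a disk of this size?". The per-volume ceilings reflect the EBS documentation at the time of writing, so verify them before relying on the numbers.

```python
# Turn the per-GB ratios into a concrete provisionable IOPS figure.
RATIO = {"io1": 50, "io2": 1000}            # max IOPS per GiB
CEILING = {"io1": 64_000, "io2": 256_000}   # per-volume maxima (io1 on Nitro,
                                            # io2 Block Express) per AWS docs

def max_provisionable_iops(volume_type: str, size_gib: int) -> int:
    return min(size_gib * RATIO[volume_type], CEILING[volume_type])

print(max_provisionable_iops("io1", 100))   # 5 000
print(max_provisionable_iops("io2", 100))   # 100 000
print(max_provisionable_iops("io2", 1000))  # 256 000 (capped)
```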
Two technical notes:
- io2 disks created before November 2023 are not "Block Express" and therefore have different limits. To upgrade them, just modify the volume settings.
- Extremely large or extremely fast io1/io2 disks can only be attached to certain instance types - beware!
Sample characteristics of the io2 disk:
Source: https://docs.aws.amazon.com/ebs/latest/userguide/provisioned-iops.html
The graph above shows that disk throughput for small blocks (16 KB) is linear, whereas for large blocks (256 KB) you reach maximum throughput almost immediately.
io1/io2 disks are therefore suitable for extremely busy systems where you need really fast storage without compromise.
Magnetic disks
There are two types of magnetic disks available:
- st1 - so-called throughput optimized disks - offer a compromise between price and performance,
- sc1 - so-called cold disks - are suitable for archiving data with minimal access (low IO performance and low throughput).
In both cases, throughput is again "variable" and depends on the size of the disk.
The st1 disk has the following characteristics:
Source: https://docs.aws.amazon.com/ebs/latest/userguide/hdd-vols.html
The characteristics of the sc1 disk look like this:
Source: https://docs.aws.amazon.com/ebs/latest/userguide/hdd-vols.html
How to read these two charts? The st1 disk type offers higher performance than the sc1, but its price is logically higher.
So which disk to choose?
A brief guide might look like this (a small sketch after the list codifies these rules):
- If you are not sure, use gp3.
- When you're looking for an extra fast disk, io1 is a good candidate.
- If you need extremely high IOPS, or a very high IOPS-to-size ratio, use io2.
- Similarly, if you need a small disk with a lot of performance, you'll most likely reach for io2.
- If you want to store large amounts of data for a long time, use sc1.
- If you'll be continuously storing and working with large amounts of data, choose st1.
- Do not use gp2 disks.
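And the same guide once more, codified as a small illustrative helper. It captures only the rules of thumb above, not every EBS nuance.

```python
# The checklist above as a tiny decision helper (illustrative only).
def recommend_volume_type(*, extreme_iops: bool = False,
                          very_fast: bool = False,
                          large_sequential: bool = False,
                          cold_archive: bool = False) -> str:
    if extreme_iops:
        return "io2"        # extreme IOPS, or high IOPS on a small volume
    if very_fast:
        return "io1"        # "just" an extra fast disk
    if cold_archive:
        return "sc1"        # large, rarely touched data
    if large_sequential:
        return "st1"        # large data you keep streaming through
    return "gp3"            # the safe default; avoid gp2

print(recommend_volume_type())                        # gp3
print(recommend_volume_type(cold_archive=True))       # sc1
print(recommend_volume_type(extreme_iops=True))       # io2
```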
Sample price of a 1,000 GB drive with 3,000 IOPS:
Source: https://calculator.aws/#/estimate?id=03f50922e338d20afa465411f6f2ccd4067944f4
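For a rough feel of where these numbers come from, here is the back-of-the-envelope arithmetic with illustrative unit prices roughly matching us-east-1 list prices at the time of writing. Treat them as placeholders and always confirm with the calculator.

```python
# Back-of-the-envelope monthly cost of a 1 000 GB / 3 000 IOPS volume.
# Unit prices are illustrative approximations, not authoritative list prices.
SIZE_GB, IOPS = 1_000, 3_000

gp3 = SIZE_GB * 0.08                         # 3 000 IOPS and 125 MB/s included
io1 = SIZE_GB * 0.125 + IOPS * 0.065         # every provisioned IOPS is billed
io2 = SIZE_GB * 0.125 + IOPS * 0.065         # same idea, tiered above 32 000 IOPS

print(f"gp3: {gp3:6.0f} USD/month")          # ~80 USD
print(f"io1: {io1:6.0f} USD/month")          # ~320 USD
print(f"io2: {io2:6.0f} USD/month")          # ~320 USD
```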
Choose the right services
This area of decision-making is currently the most complicated one and would warrant a separate workshop. Moreover, the application itself may need to be rearchitected, which makes the whole change more expensive.
Example 1
Let's take a look at a hypothetical example in the data storage area:
- Imagine you are running an application such as a Document Management System which stores 10 TB of data.
- The app is used daily by 100 users.
- On average, each user creates five new documents and works with 20 existing documents.
- The majority of documents are not actively worked with (90 % of documents are only archived).
Where to store the data?
The first thing that comes to mind is a standard gp3 disk of 10 TB. However, this solution is not ideal in terms of high availability, because only one EC2 instance can work with the disk. If we need to work with the data from multiple instances, we need a shared file system - in this case, for example, the standard Elastic File System (EFS). Or we can choose the Simple Storage Service (S3) object storage.
How much would these options cost us? For the calculation I assume identical throughput requirements for EBS and EFS (i.e. 125 MB/s):
Scenario | EBS | EFS | S3 |
---|---|---|---|
Price | 974 USD | 1 145 USD | 252 USD |
Note | Problematic high availability | 10 % of data is actively used | Price not only per GB, but also API operations |
Source: https://calculator.aws/#/estimate?id=958f3d773dfa31f0ea5c260d9f78384e734096dd
Here we can clearly see that the cost of storing and working with this amount of data can differ by more than a factor of four (EFS vs. S3).
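The note in the S3 column deserves a quick sanity check: how much do the API operations actually cost for this workload? The sketch below assumes the per-user figures are daily and uses illustrative S3 Standard request prices; the point is that for this access pattern, the per-GB storage price dominates.

```python
# Rough monthly S3 request cost for the DMS workload above.
# Request prices (~0.005 USD per 1 000 PUTs, ~0.0004 USD per 1 000 GETs)
# are illustrative - verify current pricing for your region.
USERS = 100
NEW_DOCS_PER_USER = 5          # PUT requests per user per day (assumed daily)
READ_DOCS_PER_USER = 20        # GET requests per user per day (assumed daily)
WORKING_DAYS = 21

puts = USERS * NEW_DOCS_PER_USER * WORKING_DAYS      # ~10 500 per month
gets = USERS * READ_DOCS_PER_USER * WORKING_DAYS     # ~42 000 per month

request_cost = puts / 1_000 * 0.005 + gets / 1_000 * 0.0004
print(f"~{request_cost:.2f} USD/month for requests")  # well under 1 USD
# Storage, not requests, drives the bill here - which is why S3 comes out
# several times cheaper than EFS in the table above.
```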
Example 2
The second example corresponds to a real scenario from a project we are currently working on at ORBIT. It demonstrates (without going into details) the huge influence of the scenario on the price - and, in this particular case, the processing time.
Our scenario: we need to process roughly 1 million records. Processing a single record (a few kB in size) takes about 10 seconds of machine time (more powerful hardware does not speed it up, nor can the processing of a single record be reasonably parallelized).
We discussed three options:
- We use a standard EC2 instance and a Python script inside it to process the records (without parallelization).
- We use a standard EC2 instance and a Python script inside it, while working out how to parallelize the processing.
- We use cloud-native technology, in this case massively parallelized Lambda functions.
The conclusions were roughly as follows:
Scenario | EC2 without parallelization | EC2 and parallelization | cloud-native |
---|---|---|---|
Price of infrastructure | ± 35 USD | ± 125 USD | ± 14 USD |
Infrastructure | 1 CPU (t3a.micro) | 8 CPUs (t3a.2xlarge) | lambda function |
Running the "task" | 115 days | 14 days | 3 hours |
Price of parallelisation (man-days) | 0 MD | 1 MD = 10 000 CZK | 0 MD |
- If we don't parallelize the processing on EC2, a small (and cheap) server is enough, but the data processing will take extremely long.
- If we parallelize, we incur the cost of preparing the parallelization and cut the processing time by a factor of eight, but the infrastructure price roughly triples.
- If we use a cloud-native approach, the price will be minimal and the processing time extremely short.
Source: https://calculator.aws/#/estimate?id=42255e5155377257b4a3cd164f61e3063f674ec0
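The run times in the table follow directly from the numbers in the assignment, which is easy to verify:

```python
# The duration figures follow from the record count and the fixed 10 s of
# machine time per record.
RECORDS = 1_000_000
SECONDS_PER_RECORD = 10
TOTAL_SECONDS = RECORDS * SECONDS_PER_RECORD              # 10 000 000 s of work

def wall_clock_days(parallelism: int) -> float:
    return TOTAL_SECONDS / parallelism / 86_400

print(f"1 worker:      {wall_clock_days(1):6.1f} days")           # ~115.7 days
print(f"8 workers:     {wall_clock_days(8):6.1f} days")           # ~14.5 days
print(f"1 000 lambdas: {wall_clock_days(1000) * 24:6.1f} hours")  # ~2.8 hours
```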
So what did the final architecture look like?
- The data to be processed are stored in DynamoDB (1 million records).
- A Lambda function triggered by DynamoDB Streams inserts each record ID into a Simple Queue Service (SQS) queue.
- The SQS queue feeds another Lambda function (1 000 concurrent executions).
- The processing outputs are stored back in DynamoDB.
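For illustration, the worker part of this pipeline can be sketched as follows. The table names, message format and the process() body are my assumptions; only the flow (take a record ID from SQS, fetch the record from DynamoDB, store the result back) follows the architecture above.

```python
# Minimal sketch of the SQS-triggered worker Lambda.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
source_table = dynamodb.Table("records")          # hypothetical table names
results_table = dynamodb.Table("record-results")

def process(record: dict) -> dict:
    # ~10 s of CPU-bound work per record in the real project; stubbed here.
    return {"id": record["id"], "status": "processed"}

def handler(event, context):
    # SQS batches arrive in event["Records"]; each body carries one record ID.
    for message in event["Records"]:
        record_id = json.loads(message["body"])["id"]
        item = source_table.get_item(Key={"id": record_id})["Item"]
        results_table.put_item(Item=process(item))
```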
What can we take away from this example? That a more fundamental change of architecture can deliver extreme value - whether in terms of infrastructure cost or processing speed.
Cost and infrastructure optimization simply starts before migration
As our CEO Lukas Klášterský says: "Cloud is a bit of a different animal." And that's true in this case as well.
Designing the right cloud infrastructure can itself be a tough nut to crack. If you are unsure about it, or would like to discuss some of its specific aspects (and there are plenty!), feel free to contact me.
And what can you look forward to next? In the follow-up on how to optimise costs and infrastructure after migration to the cloud, we will focus mainly on:
- new services and their impact on price,
- application architecture optimization,
- continuous revision of the architecture.