It all started with a single question: “How can we cut cloud costs in the least possible time?”
Paytm is growing rapidly, and so is its cloud infrastructure. The main reason for choosing cloud computing is its ability to scale in and out as demand fluctuates. The cloud makes it easy to add or remove processing power, networking capacity, storage, memory, and other resources without disrupting operations. Paytm uses third-party cloud providers such as AWS, Azure, and Google Cloud to deliver a seamless experience to millions of customers every day.
While growing horizontally and vertically at the same time, Paytm created half of its existing infrastructure in the last 5 years. Today there are 100+ AWS accounts, more than 100 verticals, thousands of servers, multiple petabytes of cloud storage, and thousands of micro-services running at any given moment. Because there was no proper process for creating cloud resources in the beginning, these micro-services were set up in a very complicated manner, and it became really difficult to keep track of all the resources and their optimization. Paytm needed an organization-wide monitoring system: a place where vertical owners, technical leaders, and other stakeholders could see the current state of the cloud infrastructure and the exact cost associated with it. The tool was also expected to suggest how the entire infrastructure could be optimised for both cost and performance. We named this tool the ‘Paytm Optimizer’.
Introducing the Paytm Cloud Explorer Reports
The first step was the most difficult and time-consuming: getting every AWS account onboarded. Once all the accounts were onboarded, a program was written to check the current state of every AWS cloud resource. Based on that information, the program calculated the cost of each resource, its potential cost savings, and other monitoring parameters to help decision making. The resulting reports revealed potential cost savings across the cloud infrastructure.
The cost savings program started sending cost reports to all the AWS account owners and technical leaders at Paytm. It soon became clear that cost optimisation across Paytm could not be driven through email chains and email reports. This led to the creation of a tool in which cloud resources could be optimized directly. One of the first use cases was using EC2 spot instances instead of On-Demand instances to reduce compute cost by 50-60%.
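To make the 50-60% figure concrete, here is a minimal sketch of the arithmetic for a fleet that runs only part of its capacity on spot. The hourly price, fleet size, spot share, and discount below are illustrative assumptions, not Paytm's numbers, and `estimate_spot_savings` is a hypothetical helper rather than part of the Paytm Optimizer.

```python
def estimate_spot_savings(on_demand_hourly, instances, hours, spot_fraction, spot_discount):
    """Return (baseline_cost, mixed_cost, savings) for a partly-spot fleet.

    spot_fraction: share of instance-hours served by spot (0..1).
    spot_discount: spot discount relative to On-Demand (0..1), e.g. 0.55 for 55%.
    """
    if not 0 <= spot_fraction <= 1 or not 0 <= spot_discount <= 1:
        raise ValueError("spot_fraction and spot_discount must be between 0 and 1")
    # Cost if every instance-hour were billed at the On-Demand rate.
    baseline = on_demand_hourly * instances * hours
    # Spot hours are billed at the discounted rate, the rest at full price.
    spot_cost = baseline * spot_fraction * (1 - spot_discount)
    on_demand_cost = baseline * (1 - spot_fraction)
    mixed = spot_cost + on_demand_cost
    return baseline, mixed, baseline - mixed


# Assumed example: 100 instances at $0.10/hour for a 720-hour month,
# half of the hours on spot at a 55% discount.
baseline, mixed, savings = estimate_spot_savings(
    on_demand_hourly=0.10, instances=100, hours=720,
    spot_fraction=0.5, spot_discount=0.55,
)
print(f"baseline=${baseline:.2f} mixed=${mixed:.2f} savings=${savings:.2f}")
```

Note that the overall saving depends on how much of the fleet actually moves to spot: a 55% discount on half the instance-hours yields roughly half of that discount on the total bill.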
The Paytm Optimizer - ASG Module
EC2 spot instances, allowed only in Auto Scaling Groups (ASGs) on this platform, are cheaper than regular On-Demand instances, but they carry a risk: AWS can interrupt or terminate them at 2 minutes’ notice. Spot instances are therefore best suited to stateless, fault-tolerant micro-services. Improving their reliability requires an intermediate platform, one that ensures the proper infrastructure configuration and leverages data from thousands of other spot instances across all the onboarded accounts. Based on the data collected, the platform predicts an interruption early enough to make the necessary infrastructure changes for a graceful transition from the old machine to a newly launched one. The Paytm Optimizer became this intermediate platform, ensuring the highest level of reliability for discounted (spot) virtual machines. The image below shows how the largest ASG in Paytm’s biggest AWS account uses the Paytm Optimizer’s scheduler to adopt EC2 spot instances in non-peak and peak hours:
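For context on the 2-minute notice: AWS publishes it through the instance metadata item `spot/instance-action` as a small JSON document with an `action` and a UTC `time`. The sketch below only parses such a document and computes the time left; it does not fetch anything from the metadata service, and it does not show the Paytm Optimizer's own prediction logic, which is not described here. The function name and the sample timestamps are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json, now):
    """Parse a spot instance-action notice and return (action, seconds_left).

    notice_json: JSON text in the shape AWS documents for spot/instance-action,
                 e.g. {"action": "terminate", "time": "2017-09-18T08:22:00Z"}.
    now:         timezone-aware datetime to measure from.
    """
    notice = json.loads(notice_json)
    # The notice time is an ISO-8601 UTC timestamp with a trailing "Z".
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()


# Assumed sample notice, observed two minutes before the scheduled action.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
observed_at = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
action, seconds_left = seconds_until_interruption(sample, observed_at)
print(action, seconds_left)
```

Two minutes is enough to drain connections from a stateless service, but not much more, which is why predicting an interruption earlier than the official notice is valuable.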
The Paytm Optimizer - Cost Module
The cost explorer program gave an in-depth, resource-level analysis of how cost could be optimized, but acting on that analysis became an overhead for the DevOps engineers. The cost module of the Paytm Optimizer was launched to eliminate this overhead and fast-track the fixing of cost-violating resources. The cost explorer dashboard also gave a bird’s-eye view of the current state of the company’s infrastructure.
Under the Storage section of the Paytm Optimizer’s Cost Module, infrastructure owners can see which of their resources have cost violations, such as volumes with zero IOPS or GP2 volumes (costlier and lower-performing than GP3). A Fix button lets the user fix or negate these cost violations with a single click. For each fix action, the Paytm Optimizer runs a checklist to make sure that the fix will not impact production. This checklist was prepared by analyzing many corner cases together with cloud infrastructure experts. The image shows the list of 971 cost-violating volumes and their potential savings if fixed:
Similarly, the Compute section of the Paytm Optimizer’s Cost Module allows AWS account owners to dig deep into compute cost violations such as old-generation instances, instances left unused in non-peak hours, and over-provisioned resources. These cost violations can be fixed using the Fix button, which throws an error if the fix action poses a potential threat to the production environment. In one of the major accounts, more than 1400 resources were flagged for compute cost violations:
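The two storage violations named above can be sketched as a simple classifier. This is an illustration only, not the Paytm Optimizer's implementation: the `Volume` fields, the violation labels, and the per-GiB prices are assumptions (the prices are placeholder figures and vary by region), and volume types other than gp2 and gp3 are ignored.

```python
from dataclasses import dataclass

# Assumed illustrative storage prices in USD per GiB-month; real prices vary by region.
PRICE_PER_GIB_MONTH = {"gp2": 0.10, "gp3": 0.08}


@dataclass
class Volume:
    volume_id: str
    volume_type: str
    size_gib: int
    io_ops: int  # total read + write operations over the lookback window


def find_storage_violations(volumes):
    """Return (volume_id, violation, estimated_monthly_saving_usd) tuples."""
    findings = []
    for v in volumes:
        price = PRICE_PER_GIB_MONTH.get(v.volume_type)
        if price is None:
            continue  # volume types outside this sketch are skipped
        if v.io_ops == 0:
            # An idle volume could be removed: the whole storage cost is saved.
            findings.append((v.volume_id, "ZeroIOPS", round(v.size_gib * price, 2)))
        elif v.volume_type == "gp2":
            # Migrating gp2 to gp3 saves the per-GiB price difference.
            saving = v.size_gib * (price - PRICE_PER_GIB_MONTH["gp3"])
            findings.append((v.volume_id, "GP2Volume", round(saving, 2)))
    return findings


volumes = [
    Volume("vol-a", "gp2", 500, 0),
    Volume("vol-b", "gp2", 1000, 12000),
    Volume("vol-c", "gp3", 200, 500),
]
for finding in find_storage_violations(volumes):
    print(finding)
```

Summing the third field across all findings gives the kind of "potential savings if fixed" figure that a dashboard could display.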
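The idea of a fix that refuses to run when production might be affected can be sketched as a pre-flight checklist. The actual checklist is not published, so every check, field name, and threshold below is hypothetical; the sketch only shows the pattern of running all checks first and raising an error that names the ones that failed.

```python
class UnsafeFixError(Exception):
    """Raised when a fix could affect the production environment."""


# Hypothetical checks; each takes a resource dict and returns True when it is safe.
def is_idle(resource):
    return resource.get("avg_cpu_percent", 100.0) < 5.0


def not_serving_traffic(resource):
    return not resource.get("behind_load_balancer", True)


CHECKLIST = [
    ("resource is idle", is_idle),
    ("resource is not serving traffic", not_serving_traffic),
]


def failed_checks(resource):
    """Return the names of all checklist items the resource does not pass."""
    return [name for name, check in CHECKLIST if not check(resource)]


def apply_fix(resource, fix):
    """Run the checklist, then apply fix(resource) only if every check passes."""
    failed = failed_checks(resource)
    if failed:
        raise UnsafeFixError(f"{resource['id']}: failed checks: {', '.join(failed)}")
    return fix(resource)


idle = {"id": "i-idle", "avg_cpu_percent": 1.2, "behind_load_balancer": False}
busy = {"id": "i-busy", "avg_cpu_percent": 64.0, "behind_load_balancer": True}

print(apply_fix(idle, lambda r: f"stopped {r['id']}"))
try:
    apply_fix(busy, lambda r: f"stopped {r['id']}")
except UnsafeFixError as err:
    print("refused:", err)
```

Missing data is treated as unsafe here (the defaults fail the checks), which is a deliberate design choice for a one-click action: when in doubt, the fix should not run.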
The Paytm Optimizer - Cloud Protector
Cloud Protector helps AWS account owners keep their cloud infrastructure well managed. The module uses easy-to-understand YAML templates to define rules that improve security and cost optimisation. Users can select pre-defined scripts/policies that run periodically and make the required changes in the cloud infrastructure. For example, as shown in the image below, the Service Limit Policy runs daily and automatically creates a request ticket with AWS to increase a limit if the usage of any service has crossed the threshold value. There are many other policies that have saved a huge amount of DevOps effort.
Keeping track
The authorization module protects all the cloud resources from being misconfigured, or configured by an unauthorized person. Authentication allows only Paytm employees to log in. Further, every action taken from the Paytm Optimizer dashboard is recorded and displayed under the Audit Trail tab.
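The threshold logic behind a policy like the Service Limit Policy can be sketched as follows. The real YAML template format is not shown in this post, so the policy is represented here as the kind of Python dict a YAML file might load into, and every field name, service name, and number is a hypothetical placeholder. The sketch only decides which services need a limit increase; it does not create any ticket.

```python
# Hypothetical policy, shown as the dict a YAML template might be loaded into.
POLICY = {
    "name": "service-limit-policy",
    "schedule": "daily",
    "threshold_percent": 80,
    "action": "create-limit-increase-ticket",
}


def services_over_threshold(policy, usage):
    """Return (service, percent_used) for every service past the threshold.

    usage maps a service name to a (current_usage, limit) pair.
    """
    flagged = []
    for service, (used, limit) in sorted(usage.items()):
        if limit <= 0:
            continue  # skip entries without a meaningful limit
        percent = 100 * used / limit
        # "Crossed the threshold" is read here as strictly greater than it.
        if percent > policy["threshold_percent"]:
            flagged.append((service, round(percent, 1)))
    return flagged


# Assumed sample usage figures for two made-up quota names.
usage = {
    "ec2-on-demand-vcpus": (900, 1000),
    "ebs-storage-tib": (10, 50),
}
for service, percent in services_over_threshold(POLICY, usage):
    print(f"{POLICY['action']}: {service} at {percent}% of its limit")
```

Keeping the threshold and schedule in the template rather than in code is what lets account owners tune a policy without touching the script that enforces it.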
The Impact
The spot module of this platform currently hosts 70+ AWS accounts with 1800+ ASGs. More than 5000 AWS resources are monitored in real time to make the infrastructure highly reliable, and 250+ ASGs use the Paytm Optimizer to adopt spot instances. Within two months of its release, the tool achieved double-digit spot instance adoption coverage in non-prod environments. The highest Paytm-wide spot instance adoption coverage recorded so far is 17%, compared with 3% before the Paytm Optimizer. Huge monetary savings have been recorded in the last three months just from using spot instances. 90% of Paytm’s web traffic is served by spot instances managed by the Paytm Optimizer. Thanks to the constant monitoring of all resources across all onboarded AWS accounts, the Paytm Optimizer predicted instance termination 10 minutes ahead of time in 81% of cases. This prediction rate is among the highest when compared with the paid tools available in the market.
The cost module saved DevOps engineers a tremendous amount of time. It became very easy to make decisions and pinpoint the resources with the highest cost violations. 3500+ AWS resources were fixed to stop the cost leakage.
The Future
The vision for the Paytm Optimizer is to become a one-stop solution for cloud infrastructure cost savings, optimization, and self-healing. In the future, the platform will allow users to create automated scripts for tasks such as auto-tagging resources with their AWS creators, stopping EC2 instances that use unapproved AMIs, auto-enforcing SSL for critical services, deleting unencrypted volumes, and many other use cases. The spot module will support EMR clusters along with ASGs. Network cost violations and diagnosis will be added, along with Fix/Negate features.