The simplicity of the Paytm app belies the architecture behind it, which has been designed by our Infra Technology Labs team. As Paytm grows rapidly, the team is exploring the latest cloud optimization technologies to mitigate cloud infrastructure problems, introduce new frameworks, and build a highly scalable and available solution that handles our scale and supports business growth across the globe.
We started searching for an open-source monitoring solution that would be highly scalable, available, cost-effective, and easily supported across all Paytm stacks. While exploring options that would let us integrate with different notification and alerting channels, such as Slack, email, SMS, Microsoft Teams, and PagerDuty, our search led us to Prometheus.
Prometheus is a monitoring solution that gathers numerical time-series data. It pulls (scrapes) metrics from a client over HTTP and stores the data in its time-series database, which you can query using its own query language, PromQL.
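To make the pull model concrete, here is a minimal sketch of what a scraped payload looks like and how it breaks down into samples. It assumes the plain-text exposition format with simple label values; the function name `parse_exposition` and the sample payload are illustrative and are not part of Prometheus or of our setup.

```python
import re

# One sample line: metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Turn scraped text into a list of (name, labels, value) tuples.

    Simplified on purpose: comment lines (# HELP, # TYPE) are skipped and
    escape sequences inside label values are not decoded.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, as an application's metrics endpoint might return it.
payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
up 1
"""

for name, labels, value in parse_exposition(payload):
    print(name, labels, value)
```

Each distinct combination of metric name and labels becomes its own time series in the database, which is why label choices matter at scale.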
Why Prometheus?
- Prometheus doesn’t require users to install custom software or configuration on servers or container images to collect metrics, and it handles service failure or unavailability gracefully.
- It provides a Pushgateway that lets applications push metric data when pulling metrics is not feasible.
- All Prometheus components can run in containers, and it integrates well with Kubernetes.
- It integrates with Grafana for data visualization. Grafana includes built-in support for Prometheus and can also connect to other data sources, such as CloudWatch and Elasticsearch, which helps build a common dashboard view of applications. It lets you query, visualize, and alert on your metrics from both open source and commercial data sources.
- Prometheus client libraries support multiple programming languages. They let you define and expose internal metrics via an HTTP endpoint on your application’s instance. We can also drop sensitive or unwanted metrics and labels from the data.
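The last point, dropping sensitive or unwanted metrics and labels, can be illustrated with a small sketch. In Prometheus this is normally configured through relabeling rules rather than written as code; the function below only mimics the effect on already-parsed samples. The name `scrub` and the metric and label names in the example are hypothetical.

```python
def scrub(samples, drop_metrics=frozenset(), drop_labels=frozenset()):
    """Return samples without the unwanted metrics and labels.

    samples is a list of (name, labels, value) tuples. Note that removing a
    label can make two series identical, so in practice the dropped label
    should not be the only thing that distinguishes them.
    """
    cleaned = []
    for name, labels, value in samples:
        if name in drop_metrics:
            continue  # drop the whole metric
        kept = {k: v for k, v in labels.items() if k not in drop_labels}
        cleaned.append((name, kept, value))
    return cleaned


# Hypothetical samples: one carries a sensitive label, one is a debug metric.
raw = [
    ("payments_total", {"status": "ok", "customer_email": "a@example.com"}, 12.0),
    ("debug_heap_dump_bytes", {}, 4096.0),
]

print(scrub(raw, drop_metrics={"debug_heap_dump_bytes"}, drop_labels={"customer_email"}))
# [('payments_total', {'status': 'ok'}, 12.0)]
```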
Scaling Prometheus at Paytm
After exploring Prometheus, we found that it suited our system well. The paid monitoring tools we had been using were hard to work with given our system’s size and varied application needs, whereas Prometheus is a popular open-source monitoring tool for cloud-native applications. It offers a wide range of service discovery options to find your services and start retrieving metric data from them. We use EC2 and Kubernetes service discovery for this.
We use Thanos as a global querier to collect metrics from multiple Prometheus instances, and Alertmanager to send alerts to multiple channels such as Slack, Microsoft Teams, PagerDuty, email, and custom webhooks.
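Alertmanager decides where an alert goes by matching the alert's labels against a tree of routes defined in its configuration. The sketch below shows only that matching idea as a flat, first-match list; the severity values, receiver names, and the function `receivers_for` are assumptions for illustration and do not reflect our actual routing rules.

```python
# Hypothetical routes: (labels that must match, receivers to notify).
ROUTES = [
    ({"severity": "critical"}, ["pagerduty", "slack"]),
    ({"severity": "warning"}, ["slack"]),
]
DEFAULT_RECEIVERS = ["email"]


def receivers_for(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVERS):
    """Return the receivers of the first route whose matchers all match."""
    for matchers, receivers in routes:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receivers
    return default


print(receivers_for({"alertname": "HighLatency", "severity": "critical"}))
# ['pagerduty', 'slack']
print(receivers_for({"alertname": "DiskUsage", "severity": "info"}))
# ['email']
```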
Our NOC has a single view of all the systems in our ecosystem and enforces standardization, which helps streamline the incident management process. This solution can be hosted across AWS accounts and can provide centralized metric dashboards with the help of Grafana and Thanos. It also meets data localization and compliance guidelines.
We also use the Blackbox Exporter for health checks of our applications’ endpoints. High cardinality was a major challenge in keeping Prometheus available at such a large scale. To address it, we wrote custom Python code that monitors cardinality and takes appropriate action to avoid outages caused by high cardinality. We have also enforced limits on scraped metrics for each job/target to avoid unexpected issues caused by an excessive number of metrics.
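The idea behind cardinality monitoring can be shown with a minimal sketch: count the distinct label combinations (series) per metric and flag any metric that exceeds a limit. This is not our production code; the function names, the sample data, and the limit of 2 are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets per metric name.

    samples is an iterable of (name, labels) pairs, where labels is a dict.
    """
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def over_limit(samples, limit):
    """Return the sorted names of metrics with more than `limit` series."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > limit)


# Hypothetical scrape: a user_id label creates one series per user.
scraped = [
    ("login_total", {"user_id": "1"}),
    ("login_total", {"user_id": "2"}),
    ("login_total", {"user_id": "3"}),
    ("up", {}),
]

print(series_per_metric(scraped))  # {'login_total': 3, 'up': 1}
print(over_limit(scraped, limit=2))  # ['login_total']
```

A label whose values are unbounded, such as a user or request ID, is the typical cause of this kind of growth, so a check like this is most useful when it also reports which metric is responsible.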
We also use Amazon Simple Storage Service (S3) to store metrics for longer periods, and we leverage the Thanos Store capability to pull metrics from S3 when the data is not available on local EBS.
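The split between local storage and S3 comes down to the time range of a query. The sketch below illustrates that decision only; Thanos handles it internally, and the function `backends_for`, the backend labels, and the 15-day local retention are hypothetical values used for the example.

```python
from datetime import datetime, timedelta


def backends_for(query_start, query_end, now, local_retention):
    """Return which stores hold data for the queried time range."""
    local_floor = now - local_retention  # oldest data still on local disk
    backends = []
    if query_end >= local_floor:
        backends.append("local")
    if query_start < local_floor:
        backends.append("s3")
    return backends


now = datetime(2024, 1, 31)
retention = timedelta(days=15)  # assumed local retention

# Recent data only: served from local disk.
print(backends_for(datetime(2024, 1, 30), now, now, retention))  # ['local']
# Old data only: served from S3.
print(backends_for(datetime(2024, 1, 1), datetime(2024, 1, 5), now, retention))  # ['s3']
# Range spanning the retention boundary: both are needed.
print(backends_for(datetime(2024, 1, 10), datetime(2024, 1, 20), now, retention))  # ['local', 's3']
```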
As a Result
- We monitor millions of events per minute and billions of events per day.
- Since its launch, the system has maintained 100% availability over the last 40 months.
- Our total cost of running Prometheus is less than 2% of what we were paying for commercial tools, and the setup is managed by just one person.
Looking Ahead
We look forward to building a self-healing solution for some critical alerts. We have already built a solution that automatically increases or decreases disk space based on certain triggers, which avoids outages caused by full disks and prevents wasted resources.
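A trigger-based resize decision of this kind might look like the following sketch. The thresholds (grow above 80% usage, shrink below 30%, aim for 50%), the 50 GiB floor, and the function name `target_size_gib` are assumptions for illustration, not the values or code we run; applying the new size to a real volume is left out entirely.

```python
import math


def target_size_gib(size_gib, used_gib, grow_above=0.80, shrink_below=0.30,
                    target=0.50, min_size_gib=50):
    """Return the disk size to move to, or the current size if no change is needed."""
    usage = used_gib / size_gib
    if usage > grow_above:
        # Grow so that current data fills roughly `target` of the disk.
        return math.ceil(used_gib / target)
    if usage < shrink_below:
        # Shrink toward the same target, but never below the floor
        # and never above the current size.
        return min(size_gib, max(min_size_gib, math.ceil(used_gib / target)))
    return size_gib


print(target_size_gib(100, 90))  # 180 (90% used, grow)
print(target_size_gib(100, 50))  # 100 (within bounds, no change)
print(target_size_gib(400, 40))  # 80  (10% used, shrink)
```

Using separate grow and shrink thresholds with a target in between keeps the disk from being resized back and forth when usage hovers near a single cutoff.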
This blog was contributed by the Infra Tech Lab team at Paytm. Want to be a part of Paytm? Explore our career opportunities.