Samarth - Monitoring

Need

Monitoring a deployed system in production is an extremely important step.

Lack of monitoring leads to the below issues:

Deployed services & servers become black boxes
Systems can go down unexpectedly
Failures surface only when users report them
Issues are escalated to engineers / PMs, who fix them reactively
Constant firefighting
Identifying patterns is very difficult

With monitoring, the following advantages are seen:

Deployed systems & servers are monitored actively & automatically
Systems going down trigger alerts
Detecting system health degradation early helps fix issues proactively
With a more stable, proactively maintained system, we can focus on building features
Patterns are identified quickly

Our top goals in setting up monitoring for the Samarth systems are:

Measure the performance of backend systems
Set up alerting for when a system is degrading or offline

Implementation

Monitoring & alerting for the Samarth system is powered by a Grafana + Prometheus setup. Grafana and Prometheus are industry standards for monitoring and alerting on production deployments. We chose them because they are open source and can be self-hosted, both of which are important requirements for us.

Monitoring

Figure: Monitoring Working (Image Source)

Host Monitoring

Here we monitor the server processes, load average, uptime, storage, etc.
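As a rough illustration of the kind of host readings listed above, here is a minimal Python sketch that collects load average and storage usage with the standard library. The source does not say which exporter feeds Prometheus, so this is not the Samarth implementation; the function name `host_snapshot` and the returned field names are hypothetical, and `os.getloadavg()` only works on Unix-like hosts.

```python
import os
import shutil


def host_snapshot(path="/"):
    """Collect a few host-level readings (hypothetical helper, Unix only)."""
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    load_1m, load_5m, load_15m = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
        # Load average divided by CPU count gives a rough "how busy" ratio.
        "load_per_cpu": load_1m / cpus,
        "disk_used_percent": 100.0 * disk.used / disk.total,
    }


if __name__ == "__main__":
    for name, value in host_snapshot().items():
        print(f"{name}: {value:.2f}")
```

In a Grafana + Prometheus setup, values like these would normally be scraped on a schedule and plotted over time rather than printed once.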

Services Monitoring

Here we monitor the response time, uptime and load of each service.
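To make response time and uptime concrete, the sketch below times a single health check and computes an uptime percentage from a series of probe results. It assumes each service exposes an HTTP endpoint that can be polled; the source does not confirm this, and the names `http_check`, `probe` and `uptime_percent` are hypothetical.

```python
import time
import urllib.request


def http_check(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status (assumed health endpoint)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return 200 <= response.status < 300


def probe(check):
    """Run one check and report whether it succeeded and how long it took."""
    started = time.perf_counter()
    try:
        up = bool(check())
    except Exception:
        # Any error (timeout, refused connection, HTTP error) counts as down.
        up = False
    elapsed = time.perf_counter() - started
    return {"up": up, "response_seconds": elapsed}


def uptime_percent(results):
    """Share of probes that succeeded, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(1 for r in results if r["up"]) / len(results)
```

A probe against a real service could look like `probe(lambda: http_check("http://localhost:8080/health"))`, where the URL is only a placeholder.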

Load Monitoring

Here we monitor the load on the host. We use it extensively to see how much memory each service / container is consuming. It also helps in identifying unusual peaks in traffic.
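The two uses described above, ranking containers by memory and spotting unusual peaks, can be sketched in a few lines of Python. This is an illustration under assumptions, not the dashboard logic itself: the container names in the usage note are made up, and flagging samples more than three standard deviations above the mean is just one simple way to define a peak.

```python
from statistics import mean, pstdev


def rank_by_memory(usage_mb):
    """Sort containers from heaviest to lightest memory consumer."""
    return sorted(usage_mb.items(), key=lambda item: item[1], reverse=True)


def find_peaks(samples, sigmas=3.0):
    """Return indexes of samples more than `sigmas` standard deviations above the mean."""
    if len(samples) < 2:
        return []
    average = mean(samples)
    spread = pstdev(samples)
    if spread == 0:
        # All samples are identical, so nothing stands out.
        return []
    limit = average + sigmas * spread
    return [i for i, value in enumerate(samples) if value > limit]
```

For example, `rank_by_memory({"api": 512, "db": 2048, "cache": 256})` puts `db` first, and `find_peaks([100] * 20 + [1000])` flags only the last sample.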

Alerting

We have set up alerting to let us know when:

A system is above 80% load
A system is low on memory
A service goes down

These alerts are linked to our communication channels, so devs can check and fix issues proactively instead of having to watch the Grafana dashboard constantly.
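The three alert conditions and the hand-off to a channel can be sketched as plain Python. In the real setup these rules would presumably live in the Grafana or Prometheus alerting configuration, which the source does not show. The 80% load threshold comes from the list above; the 10% free-memory threshold, the message format and the function names are assumptions made for illustration.

```python
LOAD_THRESHOLD_PERCENT = 80.0    # from the text: alert above 80% load
MIN_FREE_MEMORY_PERCENT = 10.0   # assumption: the text only says "low on memory"


def evaluate_alerts(system, load_percent, free_memory_percent, service_up):
    """Return one alert line per condition that is currently violated."""
    alerts = []
    if load_percent > LOAD_THRESHOLD_PERCENT:
        alerts.append(
            f"{system}: load at {load_percent:.0f}% "
            f"(above {LOAD_THRESHOLD_PERCENT:.0f}%)"
        )
    if free_memory_percent < MIN_FREE_MEMORY_PERCENT:
        alerts.append(f"{system}: only {free_memory_percent:.0f}% memory free")
    for name, up in service_up.items():
        if not up:
            alerts.append(f"{system}: service {name} is down")
    return alerts


def format_notification(alerts):
    """Join alert lines into a single message for a chat channel."""
    return "\n".join(f"[ALERT] {line}" for line in alerts)
```

Calling `evaluate_alerts("host-1", 92.0, 25.0, {"api": True, "worker": False})` yields a load alert and a service-down alert, which `format_notification` turns into one message ready to post to whichever channel is configured.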

Benefits

Monitoring has helped us identify excessive load issues and fix them before the system goes down completely. It has also helped us spot services with long response times, which we can then fix proactively.
