Keep Django Under Control: Creating Custom Metrics with Prometheus

Table of Contents

What is Prometheus?
Building a Prometheus-based system

When we create an application, we often think that once it's deployed to production, our work is done. Nothing could be further from the truth!

What if an issue occurs that could not be foreseen during the implementation phase or detected during the tests? It’s our responsibility to fix this. Often, it means spending thousands of hours on debugging.

This process can be simplified and shortened thanks to the monitoring of our application. There are already a lot of useful monitoring tools that can be integrated with our application or infrastructure. But what if we want to monitor some custom features, which are not included in already existing solutions?

Here comes an amazing tool called Prometheus, which allows us to write our custom metric-based monitoring system.

In this blog post, I will show you, step by step, how to build a sample Prometheus-based monitoring system, using a Django REST API application as an example.

What is Prometheus?

Prometheus is an open-source, flexible, customizable, metrics-based monitoring system. It collects and stores data as metrics, has a simple but powerful data model, and provides a query language, PromQL, that lets you analyze how your applications and infrastructure are performing.

Besides that, it can be integrated with various third-party tools and plugins. It also has an extension called Alertmanager, which notifies us about current events through various channels such as email or Slack.

Building a Prometheus-based system

Let's see how to make a sample Prometheus-based monitoring system using Django REST API Application as an example.

Our application will be called the Coupon App, and it has the following requirements:

We have predefined various coupon types in the database
Users can draw a random coupon once every 5 minutes
Users can have only one valid coupon at a time
Users can use a coupon at any time
Coupons that have already been used cannot be reused

The code for the entire application is available in the repository; below, I'll describe the most important parts and the configuration.

Based on the requirements above, we will need a User model. For that purpose, we can use the default Django User model together with the TokenAuthentication provided by the Django REST Framework library.

Please note that the core here will be the coupon application. We will have two base models: Coupon and UserCoupon. The Coupon model will be used for defining available coupon types, and UserCoupon will be used for storing the User's coupon instances related to the Coupon.

Our coupon models will look like this:

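import uuid

from django.db import models
from django.utils.timezone import now  # assumed source of now()

# validate_amount and get_random_string come from the project code (not shown here).


class Coupon(models.Model):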
   AMOUNTS = (
       (500, "500"),
       (1000, "1000"),
       (1500, "1500"),
   )
   id = models.UUIDField(primary_key=True, default=uuid.uuid4)
   amount = models.PositiveSmallIntegerField(
       help_text="Amount in cents",
       unique=True,
       validators=[validate_amount],
       choices=AMOUNTS,
   )


class UserCoupon(models.Model):
   id = models.UUIDField(primary_key=True, default=uuid.uuid4)
   code = models.CharField(max_length=16, default=get_random_string)
   valid = models.BooleanField(default=True)
   created_at = models.DateTimeField(default=now)
   coupon = models.ForeignKey(Coupon, on_delete=models.PROTECT, related_name="user_coupons")
   user = models.ForeignKey(
       "users.User", on_delete=models.PROTECT, related_name="coupons", blank=True, null=True
   )
 
   class Meta:
       constraints = [
           models.UniqueConstraint(
               fields=["valid", "user", "coupon"],
               condition=models.Q(valid=True),
               name="unique_valid_user_coupon",
           )
       ]
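
The `get_random_string` callable used as the default for `code` is not shown in this post. Here is a minimal sketch of what such a helper could look like, assuming it takes no arguments and returns a 16-character alphanumeric code. The name `generate_coupon_code` and the alphabet are illustrative, not taken from the repository:

```python
import secrets
import string

# Assumption: uppercase letters and digits; the real alphabet is not specified in the post.
ALPHABET = string.ascii_uppercase + string.digits


def generate_coupon_code(length: int = 16) -> str:
    """Return a random code built from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


code = generate_coupon_code()
print(code)
```

Django calls a callable `default` without arguments, so the length has to come from a default parameter value, as it does here.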

We also need some logic for managing coupons. Let's assume that managing Coupon types will be done from the Django Admin panel, and let’s focus on the API logic for User coupons. We will definitely need a view for:

Showing all user coupons
Showing a particular coupon with details
Creating a new coupon (drawing it)
Using a coupon (changing its valid status)

We can accomplish that with a ModelViewSet that handles three HTTP methods: GET, POST, and PATCH. The ViewSet will return only the coupons that belong to the user who made the request. How the coupon data is displayed will be defined by serializers, so let's look at them first.

In the serializers, there will be a UserCouponSerializer for the coupon list view and a UserCouponDetailSerializer for the single coupon view. Let's also add custom validation as a safety mechanism that prevents an invalid coupon from being made valid again.

The serializers will look like this:


class UserCouponSerializer(serializers.ModelSerializer):
   coupon = serializers.StringRelatedField()
 
   class Meta:
       model = UserCoupon
       fields = [
           "id",
           "code",
           "valid",
           "coupon",
       ]
  
class UserCouponDetailSerializer(UserCouponSerializer):
   amount = serializers.ReadOnlyField(source="coupon.amount")
 
   class Meta:
       model = UserCoupon
       fields = UserCouponSerializer.Meta.fields + [
           "amount",
       ]
       read_only_fields = ("id", "code")
 
   def validate(self, validated_data):
       # Use .get() so that a PATCH request omitting "valid" does not raise a KeyError.
       if validated_data.get("valid") is True:
           raise ValidationError({"valid": "This field can only be set to 'false'."})

       return validated_data

Let's go back to the views.

In the ViewSet's get_queryset method, we will filter out coupons that don't belong to the user. Then we will choose the correct serializer depending on the ViewSet action.


class UserCouponViewSet(viewsets.ModelViewSet):
   http_method_names = ["get", "patch", "post"]
 
   def get_queryset(self):
       return UserCoupon.objects.filter(user__username=self.request.user.username).select_related(
           "user"
       )
 
   def get_serializer_class(self):
       return UserCouponSerializer if self.action == "list" else UserCouponDetailSerializer

And now for the main event: implementing the monitoring metrics!

It would be useful to track requests in the application, broken down by HTTP method, user, and endpoint. We would also like to know when each user was last active and what the status of their last coupon draw attempt was. Last but not least, it would be useful to monitor performance, namely the time it takes to create a user coupon.

That's possible thanks to the official Prometheus client library for Python, which provides objects for the basic Prometheus metric types. We can import Counter, Enum, Gauge, Info, and Summary from the library and instantiate our metrics. Let's take a look at the code snippet below:


from prometheus_client import Counter, Enum, Gauge, Info, Summary
 
 
info = Info(name="app", documentation="Information about the application")
info.info({"version": "1.0", "language": "python", "framework": "django"})
 
requests_total = Counter(
   name="app_requests_total",
   documentation="Total number of various requests.",
   labelnames=["endpoint", "method", "user"],
)
last_user_activity_time = Gauge(
   name="app_last_user_activity_time_seconds",
   documentation="The last time when user was active.",
   labelnames=["user"],
)
coupon_create_last_status = Enum(
   name="app_coupon_create_last_status",
   documentation="Status of last coupon create endpoint call.",
   labelnames=["user"],
   states=["success", "error"],
)
coupon_create_time = Summary(
   name="app_coupon_create_time_seconds", documentation="Time of creating coupon."
)

First, the app_requests_total Counter metric is created and assigned to the requests_total variable. This metric takes three labels: "endpoint", "method", and "user".

Next, we have the app_last_user_activity_time_seconds Gauge metric, assigned to the last_user_activity_time variable. It takes only the user label.

Then comes the app_coupon_create_last_status Enum metric, assigned to the coupon_create_last_status variable. This is a special kind of metric because, besides the label (user in this case), it requires a list of states, which here are "success" and "error".

app_coupon_create_time_seconds is a Summary metric assigned to the coupon_create_time variable. It doesn't define any labels; labels are optional, and it's totally fine to have metrics without them.

The last metric is app, which is of the Info type. It holds a simple dictionary of key-value pairs, so it can be set immediately after initialization.

All other metrics will be set in different places in the ViewSet.
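
To make the behaviour of these metric types more concrete, here is a small standalone sketch that updates each of them once and reads the values back. It uses a private `CollectorRegistry` so it does not interfere with the default one, and the label values (`alice`, `/coupons/`) are made up for illustration:

```python
from prometheus_client import CollectorRegistry, Counter, Enum, Gauge, Summary

# A private registry keeps this sketch isolated from the global default registry.
registry = CollectorRegistry()

requests_total = Counter(
    "app_requests_total",
    "Total number of various requests.",
    labelnames=["endpoint", "method", "user"],
    registry=registry,
)
last_user_activity_time = Gauge(
    "app_last_user_activity_time_seconds",
    "The last time when user was active.",
    labelnames=["user"],
    registry=registry,
)
coupon_create_last_status = Enum(
    "app_coupon_create_last_status",
    "Status of last coupon create endpoint call.",
    labelnames=["user"],
    states=["success", "error"],
    registry=registry,
)
coupon_create_time = Summary(
    "app_coupon_create_time_seconds", "Time of creating coupon.", registry=registry
)

# Label values below are illustrative only.
requests_total.labels(endpoint="/coupons/", method="GET", user="alice").inc()
last_user_activity_time.labels(user="alice").set(1700000000)
coupon_create_last_status.labels(user="alice").state("success")
coupon_create_time.observe(0.25)

# get_sample_value is intended for tests and quick inspection like this.
request_count = registry.get_sample_value(
    "app_requests_total", {"endpoint": "/coupons/", "method": "GET", "user": "alice"}
)
create_time_sum = registry.get_sample_value("app_coupon_create_time_seconds_sum")
print(request_count, create_time_sum)
```

A Counter can only go up, a Gauge can be set to any value, an Enum holds exactly one of its declared states per label set, and a Summary tracks the count and sum of observations.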

In the ViewSet, the built-in initial method needs to be overridden and extended to set two metrics, app_requests_total (requests_total) and app_last_user_activity_time_seconds (last_user_activity_time), with the proper label values:


def initial(self, request, pk=None, *args, **kwargs):
   user = self.request.user
   requests_total.labels(
       endpoint=self._get_endpoint(pk), method=self.request.method, user=user
   ).inc()
   last_user_activity_time.labels(user=user).set(now().timestamp())
   return super().initial(request, *args, **kwargs)
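
The `_get_endpoint` helper called above is not shown in this post. One possible shape for it, sketched here as a plain function, maps every detail request to a single template value so that each coupon ID does not create its own time series. The `/coupons/` prefix and the function name are assumptions for illustration; the real implementation is in the repository and may differ:

```python
from typing import Optional

# Assumption: the coupon routes live under this prefix.
BASE_PATH = "/coupons/"


def get_endpoint(pk: Optional[str] = None) -> str:
    """Map a request to a low-cardinality value for the endpoint label."""
    if pk is None:
        # List and create requests share the collection path.
        return BASE_PATH
    # Detail requests collapse to one template instead of one value per ID.
    return f"{BASE_PATH}{{id}}/"


print(get_endpoint())
print(get_endpoint("0b9f2c1e"))
```

Keeping label values to a small, fixed set matters because every distinct combination of label values becomes a separate time series in Prometheus.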

We also need to write the logic of the create method. Besides creating the coupon, this method sets a few metrics:

app_coupon_create_time_seconds (coupon_create_time), which records the time it takes to create the user_coupon instance
app_coupon_create_last_status (coupon_create_last_status), which is set at the end with the proper user label value and state

There are also calls to validation methods that enforce the coupon creation requirements. Each validation method also sets the app_coupon_create_last_status metric with the proper state value.


def create(self, request):
   user = self.request.user

   self._validate_valid_coupon_exists(user)
   self._validate_coupon_created_within_5_minutes(user)

   coupon = Coupon.objects.order_by("?").first()

   self._validate_coupon_type_exists(coupon, user)

   start_time = now().timestamp()
   user_coupon = UserCoupon.objects.create(coupon=coupon, user=user)
   coupon_create_time.observe(now().timestamp() - start_time)
   coupon_create_last_status.labels(user=user).state("success")

   return Response(UserCouponDetailSerializer(user_coupon).data)
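
The validation helpers called in `create` are not listed in this post. The sketch below shows how the five-minute rule might be checked while also recording the failure in the Enum metric. It is written as a standalone function under several assumptions: `CouponDrawError` stands in for the DRF `ValidationError`, the timestamps are passed in explicitly instead of being read from the database, and the function name is hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

from prometheus_client import CollectorRegistry, Enum

# Private registry so the sketch does not clash with metrics registered elsewhere.
registry = CollectorRegistry()
coupon_create_last_status = Enum(
    "app_coupon_create_last_status",
    "Status of last coupon create endpoint call.",
    labelnames=["user"],
    states=["success", "error"],
    registry=registry,
)

COOLDOWN = timedelta(minutes=5)


class CouponDrawError(Exception):
    """Stand-in for the DRF ValidationError raised in the real view."""


def validate_cooldown(
    user: str, last_created_at: Optional[datetime], current_time: datetime
) -> None:
    """Reject a draw made less than five minutes after the previous one."""
    if last_created_at is not None and current_time - last_created_at < COOLDOWN:
        # Record the failed attempt before interrupting the request.
        coupon_create_last_status.labels(user=user).state("error")
        raise CouponDrawError("A coupon was already drawn within the last 5 minutes.")


current = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
try:
    validate_cooldown("alice", current - timedelta(minutes=2), current)
    rejected = False
except CouponDrawError:
    rejected = True
print(rejected)
```

Setting the state before raising matters because nothing after the `raise` runs, so the metric would otherwise keep its previous value.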

The last thing we need is to expose the metrics on an endpoint. For that, let's write a simple view function that returns the most up-to-date metrics on each call, thanks to the generate_latest function provided by the Prometheus client library.


def metrics_view(request):
   return HttpResponse(generate_latest(), content_type=CONTENT_TYPE_LATEST)

Alright, once the application is ready, let's set up the Docker services. Besides the containers for the app server and the database, we will also need containers for Prometheus and Grafana. Optionally, an Alertmanager container can be added.


version: '3.5'
 
x-environment: &common-environment
 POSTGRES_DB: postgres
 POSTGRES_USER: postgres
 POSTGRES_PASSWORD: postgres
 
services:
 # Prometheus services
 prometheus:
   image: prom/prometheus
   container_name: prometheus
   volumes:
     - ./provisioning/prometheus:/etc/prometheus
   ports:
     - '9090:9090'
 grafana:
   image: grafana/grafana:6.7.4
   container_name: grafana
   ports:
     - '3000:3000'
   volumes:
     - ./provisioning/grafana/datasources:/etc/grafana/provisioning/datasources
     - ./provisioning/grafana/dashboards:/etc/grafana/provisioning/dashboards
 # Web App services
 app:
   build:
     context: .
     dockerfile: ./app/docker/Dockerfile
   environment:
     <<: *common-environment
     PORT: 8000
     HOST: 0.0.0.0
     DJANGO_DEBUG: "True"
   volumes:
     - ./app:/app
   ports:
     - "8000:8000"
   depends_on:
     - db
   command: /start.sh
 db:
   image: postgres
   environment:
     <<: *common-environment

Prometheus is configured via a YAML file, in which you define the scrape targets (in this case, our application). An example config file looks like this:


global:
 scrape_interval: 10s

alerting:
 alertmanagers:
   - static_configs:
     - targets:
       - alertmanager:9093
 
rule_files:
- /etc/prometheus/rules.yml
 
scrape_configs:
- job_name: web-apps
 static_configs:
 - targets:
   - app:8000

Now that Prometheus is configured and the Docker setup is done, let's run the application environment.

Metrics will be available on the /metrics endpoint and, in raw form, will look like this:

raw form of metrics
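
To get a feel for the raw text format without running the whole stack, the following sketch registers a single counter in a private registry and prints what `generate_latest` produces for it. The label values are made up for illustration:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Private registry so only this one metric appears in the output.
registry = CollectorRegistry()
requests_total = Counter(
    "app_requests_total",
    "Total number of various requests.",
    labelnames=["endpoint", "method", "user"],
    registry=registry,
)

# Illustrative label values, not real traffic.
requests_total.labels(endpoint="/coupons/", method="GET", user="alice").inc()

# generate_latest returns bytes in the Prometheus text exposition format.
payload = generate_latest(registry).decode("utf-8")
print(payload)
```

Each metric is rendered as `# HELP` and `# TYPE` comment lines followed by one line per label combination with its current value.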

The Prometheus instance is running on port 9090, and we can explore metrics via the Prometheus explorer.

metrics in Prometheus explorer
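
The same data can also be fetched programmatically through the Prometheus HTTP API. Here is a minimal sketch using only the standard library, assuming Prometheus is reachable on `localhost:9090` as mapped in the Docker setup. The function names and the example PromQL expression are illustrative:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the port mapping from the docker-compose file above.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL of an instant query against the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_query(promql: str, base_url: str = PROMETHEUS_URL) -> list:
    """Execute an instant query; needs a running Prometheus server."""
    with urlopen(build_query_url(promql, base_url), timeout=5) as response:
        body = json.load(response)
    return body["data"]["result"]


# Total requests grouped by HTTP method.
url = build_query_url("sum by (method) (app_requests_total)")
print(url)
```

Calling `run_query` with the same expression returns one result entry per HTTP method once the application has served some requests.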

The full power of metrics can be unlocked with Grafana. I've already prepared an example dashboard with various panels that aggregate metrics data. The dashboard is in the repository and is loaded automatically during setup, so the only thing left to do is log in to Grafana and go to Home, where it will be visible.

This is what the dashboard looks like:

Grafana dashboard

We have panels with:

Basic app information
Counters with total GET/POST/PATCH requests
Table with API endpoint calls with user and HTTP method
Graph with the coupon creation time
Table with the user’s last coupon creation status

At the top left of the dashboard, there are User, Method, and Endpoint switches, which allow you to filter the data in all panels by particular values (e.g. the HTTP request method).

dashboard - User, Method, and Endpoint switches

Those panels are just a simple example visualization of the provided metrics. It is possible to create other panels for metrics visualization, depending on our needs.

The last thing worth mentioning is configuring alerting rules. I have predefined one that monitors whether the app server (container) is alive. It is based on one of Prometheus's built-in metrics, called up, which is 1 when the monitoring target is reachable and 0 when it isn't.

Rules are defined in a separate rules.yml file, whose path is provided in prometheus.yml.

This is what the defined alerting rule looks like:


groups:
- name: nodes
 rules:
 - alert: App server is down
   expr: up{instance="app:8000"} == 0
   for: 1m
   annotations:
     description: App server is not running
     summary: App server is down and not running

After manually turning off the container with the coupons app server, the up metric for this target turned to 0:

app server up metric turned to 0

In the Alerts tab, all defined alerting rules are visible. I've created an example rule called "App server is down", which checks whether the up metric for the app server instance (label instance="app:8000") has been equal to 0 for more than one minute.

App server is down rule

For the first minute, when the app server container is unreachable, the alert is in a pending state (marked in yellow),

app server container pending state

and after a defined time, it turns into a firing state.

app server container firing state
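
The transition from pending to firing can be illustrated with a simplified model of how the `for: 1m` clause behaves. This is a sketch of the idea, not Prometheus's actual implementation, and the evaluation timestamps are made up:

```python
from typing import Iterable, List, Tuple

FOR_SECONDS = 60  # mirrors `for: 1m` in the alerting rule


def alert_states(
    samples: Iterable[Tuple[int, int]], for_seconds: int = FOR_SECONDS
) -> List[str]:
    """Return the alert state after each (timestamp, up_value) evaluation."""
    states = []
    active_since = None  # when `up == 0` started to hold without interruption
    for timestamp, up in samples:
        if up != 0:
            # Target is reachable again, so the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        elapsed = timestamp - active_since
        states.append("firing" if elapsed >= for_seconds else "pending")
    return states


# Evaluations every 15 seconds; the app goes down at t=15 and recovers at t=90.
samples = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0), (90, 1)]
states = alert_states(samples)
print(states)
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

The alert only fires once the condition has held for the full duration, which filters out short blips such as a container restart.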

Prometheus can be easily integrated with an extension called Alertmanager, which allows you to send notifications when alerts are firing using various channels (e.g. email, Slack), so we can be sure that we don't miss any critical event.

This is just the tip of the iceberg of what Prometheus offers. Besides the client libraries for various languages, there are tools called exporters that generate metrics for third-party software such as databases.

Because Grafana has a huge community, there are already thousands of dashboards for exporters, so you can build your own monitoring system using these resources as building blocks. Additionally, thanks to the Alertmanager extension and your own alerting rules, you can always stay up to date with events.

Rafał Kondziela