
PromQL (Prometheus & Grafana) Alerting Cheatsheet: The Complete Reference

By Jena

Prometheus and its query language, PromQL, have become the de facto standard for cloud-native metrics collection and alerting. PromQL is built specifically for time-series data: every expression is evaluated against labeled series at a point in time or over a time range, which lets site reliability engineers (SREs) track resource utilization and define alert thresholds.

This reference sheet covers vector selectors, subqueries, rate calculations, aggregation, multi-vector label matching, and YAML alerting rules.


- **Vector Selectors**: Extract metrics using exact label matches, negative matches, and regular expression filters.
- **Rates and Increases**: Calculate how counters change over time using functions like `rate` and `increase`.
- **Vector Math**: Aggregate results across series using grouping modifiers like `by` and `without`.
- **Vector Matching**: Join separate sets of time series on shared labels using `on`, `group_left`, and `group_right`.
- **Alerting Rules**: Define alert conditions and thresholds in Prometheus rule files.

Before diving into this cheatsheet, check out my previous deep-dive on Nginx Cheat Sheet: Routing, SSL & Performance Guide to see how we structured these patterns in practice.

Mastering Selectors & Instant Vectors

An instant vector selector returns the most recent sample of every matching time series at the evaluation time. A range vector selector returns all samples of each matching series within a lookback window (e.g. [5m]).

# 1. Exact matches
http_requests_total{job="api-server"}

# 2. Negative and Regular Expression matches
# Match series from the dev or staging environments, excluding the auth handler
http_requests_total{env=~"dev|staging", handler!="auth"}

# 3. Range vector (extracts 5 minutes of data points; cannot be graphed directly)
http_requests_total{job="api-server"}[5m]

# 4. Offset modifier (look at data points from 1 hour ago)
http_requests_total{job="api-server"} offset 1h
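
A few more selector patterns may be useful alongside the ones above. This is a sketch that reuses the `http_requests_total` metric from this section; the `handler` values are assumptions about how the service is instrumented. Regular expression matchers are fully anchored, so `"health.*"` only matches values that start with `health`.

```promql
# Negative regex match: drop every handler whose name starts with "health"
http_requests_total{job="api-server", handler!~"health.*"}

# Compare now against the past: current 5-minute request rate divided by
# the rate over the same-length window one hour earlier.
# The offset modifier must follow the selector directly, inside rate().
  sum(rate(http_requests_total{job="api-server"}[5m]))
/
  sum(rate(http_requests_total{job="api-server"}[5m] offset 1h))
```

A result near 1 from the second query suggests traffic is in line with an hour ago; values well above or below 1 point to a change worth a closer look.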

Rate Calculations & Aggregations

Calculating averages and changes per second is key to monitoring load.

# 1. Calculate the per-second average rate of increase over a 5-minute range vector
# Best for alerting and for slow-moving counters, because it averages over the whole window
rate(http_requests_total{job="api-server"}[5m])

# 2. Calculate the instantaneous per-second rate from the last two samples in the range
# Best for graphing volatile, fast-moving counters; usually too spiky for alerting
irate(http_requests_total{job="api-server"}[5m])

# 3. Sum the rates across all series, grouped by the status label
sum by (status) (rate(http_requests_total[5m]))

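# 4. Average the per-second idle CPU rate, dropping the instance and pod labels
# (wrap the counter in rate() first; averaging raw counter values is not meaningful)
avg without (instance, pod) (rate(node_cpu_seconds_total{mode="idle"}[5m]))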
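
The `increase` function and subqueries follow the same pattern as the rate examples above. The sketch below reuses `http_requests_total`; the `handler` label is an assumption about the instrumentation. Keep in mind that `increase` extrapolates to the edges of the window, so it can return non-integer values even for integer counters.

```promql
# Total requests handled over the last hour (extrapolated, may be fractional)
increase(http_requests_total{job="api-server"}[1h])

# Subquery: evaluate the 5-minute rate every minute over the past hour,
# then take the peak value seen in that hour
max_over_time(rate(http_requests_total{job="api-server"}[5m])[1h:1m])

# Top 3 handlers by current request rate
topk(3, sum by (handler) (rate(http_requests_total[5m])))
```

In the subquery, `[1h:1m]` reads as "range of 1 hour, resolution of 1 minute". Omitting the resolution (`[1h:]`) falls back to the default evaluation interval.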

Complex Multi-Vector Matching

When combining two distinct metrics in one expression (e.g., dividing container memory consumption by the container memory limit), use vector matching modifiers to control which series are paired.

# 1. One-to-One matching (requires identical label sets on both sides)
# Memory usage as a fraction of the limit (multiply by 100 for a percentage)
container_memory_usage_bytes / container_spec_memory_limit_bytes

# 2. Match on a subset of labels (only the listed labels are compared)
container_memory_usage_bytes / on(namespace, pod, container) container_spec_memory_limit_bytes

# 3. Many-to-One Matching using group_left
# Copies the version and env labels from an info-style metric (conventionally value 1) onto each request-rate series
# group_left marks the left side as the "many" side and the right side as the "one" side
rate(http_requests_total[5m]) * on(instance) group_left(version, env) node_meta_info
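
Two related matching forms are worth sketching here: `ignoring`, which is the inverse of `on`, and `group_right`, which mirrors `group_left`. The example assumes `http_requests_total` carries `job` and `status` labels, and it reuses the `node_meta_info` metric from the query above.

```promql
# ignoring(): match on every label except the listed ones.
# Share of each status code within its job's total request rate.
  sum by (job, status) (rate(http_requests_total[5m]))
/ ignoring(status) group_left
  sum by (job) (rate(http_requests_total[5m]))

# group_right: same join as the group_left example, with the sides swapped.
# The "many" side is now on the right; version and env are copied from the left.
node_meta_info * on(instance) group_right(version, env) rate(http_requests_total[5m])
```

Without `group_left` in the first query, Prometheus would reject the expression, because several left-hand series (one per status) match a single right-hand series.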

Common SRE PromQL Queries

Use these queries to monitor core system health.

# 1. Calculate CPU Utilization Percentage
# Treats every CPU mode except idle as busy time
100 * (1 - sum by(instance) (increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum by(instance) (increase(node_cpu_seconds_total[5m])))

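# 2. Calculate Available Disk Space Percentage
# node_filesystem_avail_bytes excludes blocks reserved for root, unlike node_filesystem_free_bytes
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100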

# 3. System Request Failure Rate (Percentage of 5xx codes)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
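
Latency and capacity forecasting often sit next to the queries above on an SRE dashboard. The following is a sketch, not a drop-in: `http_request_duration_seconds_bucket` is an assumed histogram metric name, and the 6-hour window and 4-hour horizon are illustrative values.

```promql
# 95th percentile request latency over the last 5 minutes.
# The le label must be kept in the aggregation for histogram_quantile to work.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Disk fill forecast: based on the last 6 hours, is the root filesystem
# predicted to run out of space within the next 4 hours (4 * 3600 seconds)?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0

# Memory utilization percentage per instance
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

The forecast query returns series only for filesystems predicted to hit zero, which makes it convenient to reuse as an alert expression.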

Structuring Prometheus Alerting Rules

Define alerting rules in separate YAML rule files and load them through the `rule_files` setting in your Prometheus server configuration.

groups:
  - name: api_alerts
    rules:
      # 1. Critical High Error Rate alert
      - alert: ApiHighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 2m # Condition must persist for 2 minutes before triggering
        labels:
          severity: critical
          team: sre
        annotations:
          # The sum() in expr drops all labels, so $labels.instance would be empty here
          summary: "High API error rate detected"
          description: "API 5xx errors represent {{ $value | printf \"%.2f\" }}% of total traffic over the last 5 minutes."

      # 2. Out of Memory alert
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is low on available memory"
          description: "Available memory has dropped below 10% (Current: {{ $value | printf \"%.2f\" }}%)."
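
Availability checks are a common complement to the threshold alerts above. The expressions below are a sketch of what could go in the `expr` field of further rules; the `api-server` job name is reused from earlier examples and is an assumption about your scrape configuration.

```promql
# Fires for each target of the job that Prometheus failed to scrape
up{job="api-server"} == 0

# Fires when no up series exists for the job at all
# (for example, the job was removed or service discovery returns nothing)
absent(up{job="api-server"})
```

The two cover different failures: `up == 0` needs a series to exist, while `absent()` catches the case where the series has disappeared entirely. Before reloading Prometheus, a rule file can be validated with `promtool check rules <file>`.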