PromQL Alerting Cheatsheet: Prometheus & Grafana

Prometheus and its query language, PromQL, have become the de facto standard for cloud-native metrics collection and alerting. Unlike general-purpose query languages, PromQL is built specifically for evaluating time-series data, which lets site reliability engineers (SREs) track resource utilization and define alert thresholds.

This reference sheet covers vector selectors, rate calculations, aggregation, multi-vector label matching, and alerting rules written in YAML.

Vector Selectors: Select series using exact label matches, negative matches, and regular expression matchers.
Rates and Increases: Measure how fast counters grow over a time window using functions such as rate, irate, and increase.
Vector Math: Aggregate across series with operators such as sum and avg, using the by and without modifiers to control grouping.
Vector Matching: Join separate sets of time series on shared labels using on, group_left, and group_right.
Alerting Rules: Set up alert conditions, durations, labels, and annotations in Prometheus rule files.

Mastering Selectors & Instant Vectors

An instant vector selector returns the most recent sample for every matching time series at the evaluation time. A range vector selector returns all samples that fall inside a lookback window (e.g. [5m]) for each matching series.

# 1. Exact matches
http_requests_total{job="api-server"}

# 2. Negative and Regular Expression matches
# Match series whose env is dev or staging, excluding the auth handler (regex matchers are fully anchored)
http_requests_total{env=~"dev|staging", handler!="auth"}

# 3. Range vector (extracts 5 minutes of data points; cannot be graphed directly)
http_requests_total{job="api-server"}[5m]

# 4. Offset modifier (look at data points from 1 hour ago)
http_requests_total{job="api-server"} offset 1h
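
The matcher semantics above can be sketched in a few lines of Python. This is an illustrative model, not Prometheus code: the `matches` helper and the sample label sets are hypothetical, and Python's `re` module stands in for the RE2 syntax that Prometheus uses. It relies on two documented behaviours: regular expression matchers are fully anchored, and an absent label behaves like an empty value.

```python
import re

def matches(labels, matchers):
    """Return True if a label set satisfies every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # an absent label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # whole value must match
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True

# Hypothetical label sets for http_requests_total
series = [
    {"env": "dev", "handler": "orders"},
    {"env": "staging", "handler": "auth"},
    {"env": "development", "handler": "orders"},
    {"env": "prod", "handler": "orders"},
]

# Same matchers as example 2: {env=~"dev|staging", handler!="auth"}
selector = [("env", "=~", "dev|staging"), ("handler", "!=", "auth")]
selected = [s for s in series if matches(s, selector)]
```

Only the first series is selected. The `development` series is rejected because the pattern must match the whole label value, not just a prefix.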

Rate Calculations & Aggregations

Counters only ever increase (apart from resets), so their raw values say little on their own. Per-second rates, aggregated across series, are what reveal load.

# 1. Calculate the per-second average rate of increase over a 5-minute range vector
# Smooths out short spikes; preferred for alerting and for slow-moving counters
rate(http_requests_total{job="api-server"}[5m])

# 2. Calculate the instant per-second rate from the last two samples in the range
# Reacts quickly to changes; best for graphing volatile, fast-moving counters
irate(http_requests_total{job="api-server"}[5m])

# 3. Sum all request rates across the application, grouped by the status label
sum by (status) (rate(http_requests_total[5m]))

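# 4. Average the per-second idle CPU rate, dropping the instance and pod labels
# (wrap counters in rate() first; averaging raw counter values is not meaningful)
avg without (instance, pod) (rate(node_cpu_seconds_total{mode="idle"}[5m]))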
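
The difference between rate and irate is easier to see with numbers. The following is a simplified sketch, not the Prometheus implementation: the helper names and the sample window are hypothetical, and the real rate() additionally extrapolates to the edges of the window, so its results differ slightly.

```python
def counter_increase(samples):
    """Total increase across (timestamp_seconds, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total

def simple_rate(samples):
    """Average per-second increase between the first and last sample in the window."""
    return counter_increase(samples) / (samples[-1][0] - samples[0][0])

def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])

# Hypothetical window scraped every 60 seconds, with a burst in the final interval
window = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 580)]
steady_rate = simple_rate(window)   # 480 requests over 240 seconds
burst_rate = simple_irate(window)   # 300 requests over the last 60 seconds
```

The averaged value stays at 2 requests per second while the instant value jumps to 5. That is why the averaged form suits alert thresholds and the instant form suits dashboards that should show spikes.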

Complex Multi-Vector Matching

When combining two distinct metrics (e.g., dividing container memory consumption by container memory limits), use vector matching modifiers to control which series on each side are paired.

# 1. One-to-one matching (requires identical label sets on both sides)
# Calculate memory usage as a percentage of the limit
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# 2. Match on a subset of labels (only the listed labels are compared and kept in the result)
container_memory_usage_bytes / on(pod, namespace) container_spec_memory_limit_bytes

# 3. Many-to-one matching using group_left
# Multiplying by an info-style metric (value 1) leaves the rate unchanged and attaches its metadata labels
# group_left marks the left side as the "many" side; the listed labels are copied from the right ("one") side
rate(http_requests_total[5m]) * on(instance) group_left(version, env) node_meta_info
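
The many-to-one join in example 3 can be modelled as a lookup keyed on the on(...) labels. The sketch below is a hypothetical illustration in Python, not Prometheus code: the function names and sample vectors are invented, and it assumes the right-hand metric has the value 1, as info-style metrics usually do.

```python
def match_key(labels, on):
    """Build the join key from the labels listed in on(...)."""
    return tuple(labels.get(name, "") for name in on)

def index_one_side(vector, on):
    """Index the 'one' side by key; duplicate keys would be ambiguous, so reject them."""
    index = {}
    for labels, value in vector:
        key = match_key(labels, on)
        if key in index:
            raise ValueError(f"multiple series match key {key}")
        index[key] = (labels, value)
    return index

def multiply_group_left(left, right, on, include):
    """Sketch of: left * on(<on>) group_left(<include>) right."""
    right_index = index_one_side(right, on)
    result = []
    for labels, value in left:
        hit = right_index.get(match_key(labels, on))
        if hit is None:
            continue  # series without a partner are dropped from the result
        right_labels, right_value = hit
        merged = dict(labels)  # the "many" side keeps all of its labels
        for name in include:
            if name in right_labels:
                merged[name] = right_labels[name]
        result.append((merged, value * right_value))
    return result

# Hypothetical request rates (many per instance) and an info metric (one per instance)
rates = [
    ({"instance": "a", "path": "/x"}, 2.0),
    ({"instance": "a", "path": "/y"}, 3.0),
    ({"instance": "b", "path": "/x"}, 1.0),
]
info = [({"instance": "a", "version": "1.2", "env": "prod"}, 1.0)]
joined = multiply_group_left(rates, info, on=["instance"], include=["version", "env"])
```

Both series for instance `a` come back with `version` and `env` attached and their values unchanged. The series for instance `b` disappears because nothing on the right shares its key.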

Common SRE PromQL Queries

Use these queries to monitor core system health.

# 1. Calculate CPU Utilization Percentage
# Subtracts only idle time, so every other mode (user, system, iowait, steal, ...) counts as utilization
100 * (1 - sum by(instance) (increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum by(instance) (increase(node_cpu_seconds_total[5m])))

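# 2. Calculate Available Disk Space Percentage
# avail_bytes is the space usable by non-root users; free_bytes also includes blocks reserved for root
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100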

# 3. System Request Failure Rate (Percentage of 5xx codes)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
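
The arithmetic behind the CPU utilization query can be checked by hand. This is a minimal sketch with hypothetical numbers for a single-core host. The function mirrors the shape of the query for one instance and is not part of any Prometheus tooling.

```python
def cpu_utilization_percent(seconds_by_mode):
    """Mirror 100 * (1 - idle / total) for one instance."""
    total = sum(seconds_by_mode.values())
    idle = seconds_by_mode.get("idle", 0.0)
    return 100 * (1 - idle / total)

# Hypothetical CPU seconds accumulated per mode over a 5-minute window (300 s, one core)
increase_5m = {"idle": 225.0, "user": 50.0, "system": 20.0, "iowait": 5.0}
utilization = cpu_utilization_percent(increase_5m)
```

Here 75 of the 300 CPU seconds are non-idle, so the result is 25 percent. Note that iowait counts as busy time under this formula.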

Structuring Prometheus Alerting Rules

Define alerting rules in separate rule files and load them through the rule_files setting in your Prometheus server configuration.

groups:
  - name: api_alerts
    rules:
      # 1. Critical High Error Rate alert
      - alert: ApiHighErrorRate
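        # Aggregate by instance so that {{ $labels.instance }} is populated in the annotations
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) * 100 > 5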
        for: 2m # Condition must persist for 2 minutes before triggering
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "High API Error Rate detected on {{ $labels.instance }}"
          description: "API 5xx errors represent {{ $value | printf \"%.2f\" }}% of total traffic over the last 5 minutes."

      # 2. Low available memory alert
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is low on available memory"
          description: "Available memory has dropped below 10% (Current: {{ $value | printf \"%.2f\" }}%)."
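
The effect of the `for` clause can be sketched as a small state machine: an alert is pending while its expression has been true for less than the `for` duration, and firing once it has been true for at least that long. The Python below is a simplified, hypothetical model. It assumes a fixed evaluation interval and ignores everything that happens after firing, such as Alertmanager routing.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states

# Hypothetical run: evaluate every 60 s with `for: 2m`; the condition drops once
timeline = alert_states([True, True, True, False, True], eval_interval_s=60, for_s=120)
```

The alert is pending for the first two evaluations, fires on the third, and starts over from pending after the single false evaluation. A short dip below the threshold therefore resets the `for` timer.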
