Prometheus and its query language, PromQL, have become the de facto standard for cloud-native metrics collection and alerting. Unlike general-purpose query languages, PromQL is built for evaluating time-series data, which lets site reliability engineers (SREs) track compute utilization and define alert thresholds.
This reference sheet covers time-series vector selectors, subqueries, rate calculations, aggregations, multi-vector label matching, and alerting rules written in YAML.
Mastering Selectors & Instant Vectors
An instant vector selector returns the most recent sample for every matching time series at the query evaluation time. A range vector selector returns all samples inside a lookback window (e.g. [5m]) for each matching series.
# 1. Exact matches
http_requests_total{job="api-server"}
# 2. Negative and Regular Expression matches
# Match all dev or staging jobs, excluding the auth service
http_requests_total{env=~"dev|staging", handler!="auth"}
# 3. Range vector (extracts 5 minutes of data points; cannot be graphed directly)
http_requests_total{job="api-server"}[5m]
# 4. Offset modifier (look at data points from 1 hour ago)
http_requests_total{job="api-server"} offset 1h
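As a small illustration of how these selector pieces combine, the sketch below pairs an instant vector with its own offset copy and shows the negative regex matcher. It reuses the metric and label names from the examples above; the "test-" job prefix is a hypothetical value chosen only for illustration.

```promql
# Growth of the counter over the last hour: current value minus the value 1h ago.
# This simple difference ignores counter resets.
  http_requests_total{job="api-server"}
-
  http_requests_total{job="api-server"} offset 1h

# On a range vector, the offset modifier goes after the range:
# five minutes of samples, taken from one hour ago.
http_requests_total{job="api-server"}[5m] offset 1h

# Negative regex matcher: every job except those whose name starts with "test-".
# Regex matchers are fully anchored, so the pattern must match the whole label value.
http_requests_total{job!~"test-.*"}
```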
Rate Calculations & Aggregations
Calculating averages and changes per second is key to monitoring load.
# 1. Calculate the per-second average rate of increase over a 5-minute range vector
# Best for alerting and for slow-moving counters, since it averages over the whole window
rate(http_requests_total{job="api-server"}[5m])
# 2. Calculate the per-second instant rate from only the last two samples in the range
# Best for graphing volatile, fast-moving counters
irate(http_requests_total{job="api-server"}[5m])
# 3. Sum up all rates across the application, grouped by custom labels
sum by (status_code) (rate(http_requests_total[5m]))
# 4. Calculate average metrics, omitting host/pod details
avg without (instance, pod) (node_cpu_seconds_total{mode="idle"})
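A subquery lets a range function run over the output of an instant query rather than over raw samples. The sketch below assumes the same http_requests_total counter used above; the one-hour range and one-minute resolution are illustrative values, not recommendations.

```promql
# Subquery syntax: <instant query>[<range>:<resolution>]
# Evaluate the 5-minute rate once per minute over the last hour,
# then take the peak of those per-minute results.
max_over_time(
  rate(http_requests_total{job="api-server"}[5m])[1h:1m]
)

# Aggregations compose: keep only the three status codes
# with the highest summed request rate.
topk(3, sum by (status_code) (rate(http_requests_total[5m])))
```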
Complex Multi-Vector Matching
When combining two distinct metrics (e.g., matching container memory limits with memory consumption), use the vector matching keywords (on, ignoring, group_left, group_right) to control which label sets are paired.
# 1. One-to-One matching (requires identical label sets on both sides)
# Calculate memory usage as a percentage of the limit
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
# 2. Match on a subset of labels (the result keeps only the pod and namespace labels)
container_memory_usage_bytes / on(pod, namespace) container_spec_memory_limit_bytes
# 3. Many-to-One Matching using group_left
# Multiplies request rate per path by the server's static metadata attributes
# group_left maps the left side (many) to the right side (one)
rate(http_requests_total[5m]) * on(instance) group_left(version, env) node_meta_info
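The ignoring keyword is the counterpart to on: instead of listing the labels to match on, it lists the labels to leave out of the match. A minimal sketch, assuming http_requests_total carries a status label:

```promql
# Share of requests answered with HTTP 500, per remaining label set.
# After aggregation the right-hand side has no status label, so the match
# has to ignore status on the left-hand side.
  rate(http_requests_total{status="500"}[5m])
/ ignoring(status)
  sum without (status) (rate(http_requests_total[5m]))
```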
Common SRE PromQL Queries
Use these query configurations to monitor core system health.
# 1. Calculate CPU Utilization Percentage
# Counts every non-idle mode as utilization (idle is the only mode excluded)
100 * (1 - sum by(instance) (increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum by(instance) (increase(node_cpu_seconds_total[5m])))
# 2. Calculate Free Disk Space Percentage (node_filesystem_avail_bytes would exclude root-reserved blocks)
node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
# 3. System Request Failure Rate (Percentage of 5xx codes)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
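A commonly used, roughly equivalent form of the CPU utilization query averages the idle rate across cores instead of dividing two sums. This is a sketch under the assumption that node_cpu_seconds_total comes from node_exporter, with one series per CPU core and mode.

```promql
# rate() of the idle counter is the fraction of each second a core spent idle,
# so the per-instance average across cores is the idle fraction of the host.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# The same counter broken down by mode, to see where CPU time goes
# (user, system, iowait, ...), in CPU-seconds per second.
sum by (instance, mode) (rate(node_cpu_seconds_total[5m]))
```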
Structuring Prometheus Alerting Rules
Define alerting rules in YAML rule files and load them through the rule_files setting in your Prometheus server configuration.
groups:
  - name: api_alerts
    rules:
      # 1. Critical High Error Rate alert
      # Aggregated per instance so that $labels.instance is available in the annotations
      - alert: ApiHighErrorRate
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) * 100 > 5
        for: 2m # Condition must persist for 2 minutes before the alert fires
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "High API Error Rate detected on {{ $labels.instance }}"
          description: "API 5xx errors represent {{ $value | printf \"%.2f\" }}% of total traffic over the last 5 minutes."
      # 2. Low available memory alert
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is low on available memory"
          description: "Available memory has dropped below 10% (Current: {{ $value | printf \"%.2f\" }}%)."