Prometheus 监控实战:构建现代化的监控体系
Prometheus 简介
Prometheus 是由 SoundCloud 开源的监控告警系统,2016 年加入 CNCF,是云原生监控的事实标准。
核心特性
- 多维数据模型:指标名称 + 标签(key/value)
- 灵活的查询语言:PromQL
- 不依赖分布式存储:单节点自治
- HTTP 拉取模型:主动抓取指标
- 支持推送网关:用于短生命周期任务
- 多种可视化方案:Grafana、内置表达式浏览器
- 高效存储:自定义的时序数据库
架构设计
┌─────────────────────────────────────────────────────────────┐
│ Prometheus Server │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Retrieval │ │ TSDB │ │ HTTP Server │ │
│ │ (抓取) │ │ (时序数据库) │ │ (API/UI) │ │
│ └──────┬──────┘ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ┌──────┴──────────────────────────────────────────────┐ │
│ │ PromQL Engine │ │
│ └──────────────────────────────────────────────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
↓ ↓ ↓
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Exporters │ │ Pushgateway │ │ Service │
│ (Node/Blackbox│ │ (短任务推送) │ │ Discovery │
│ /MySQL等) │ │ │ │ (服务发现) │
└───────────────┘ └───────────────┘ └───────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ AlertManager │
│ (告警分组、抑制、静默、路由) │
└─────────────────────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Grafana / UI │
│ (可视化展示) │
└─────────────────────────────────────────────────────────────┘核心概念
数据模型
指标名称{标签1="值1", 标签2="值2"} 样本值 时间戳
示例:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1027 1640000000000指标类型
1. Counter(计数器)
单调递增的累积值,适合记录请求数、错误数等。
go
package main
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
"net/http"
)
// 定义 Counter
var httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
func init() {
prometheus.MustRegister(httpRequestsTotal)
}
func handler(w http.ResponseWriter, r *http.Request) {
// 增加计数
httpRequestsTotal.WithLabelValues(
r.Method,
r.URL.Path,
"200",
).Inc()
w.Write([]byte("Hello"))
}
func main() {
http.HandleFunc("/", handler)
http.Handle("/metrics", promhttp.Handler())
http.ListenAndServe(":8080", nil)
}2. Gauge(仪表盘)
可增可减的瞬时值,适合记录温度、内存使用量、并发连接数等。
go
// 定义 Gauge
var activeConnections = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
func handleConnection() {
activeConnections.Inc() // 连接建立
defer activeConnections.Dec() // 连接关闭
// 处理连接...
}
// 设置特定值
var temperature = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "cpu_temperature_celsius",
Help: "Current CPU temperature",
},
)
temperature.Set(45.2)3. Histogram(直方图)
采样观测值并分桶计数,适合记录请求延迟、响应大小等。
go
// 定义 Histogram
var requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets, // 默认桶: .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10
},
[]string{"method", "endpoint"},
)
func handler(w http.ResponseWriter, r *http.Request) {
start := time.Now()
defer func() {
requestDuration.WithLabelValues(
r.Method,
r.URL.Path,
).Observe(time.Since(start).Seconds())
}()
// 处理请求...
}4. Summary(摘要)
类似 Histogram,但计算滑动时间窗口内的分位数。
go
// 定义 Summary
var requestLatency = prometheus.NewSummaryVec(
prometheus.SummaryOpts{
Name: "http_request_latency_seconds",
Help: "HTTP request latency",
Objectives: map[float64]float64{
0.5: 0.05, // 中位数,误差 5%
0.9: 0.01, // P90,误差 1%
0.99: 0.001, // P99,误差 0.1%
},
},
[]string{"method", "endpoint"},
)部署配置
Docker 部署
yaml
# docker-compose.yml
version: '3'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
restart: unless-stopped
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
restart: unless-stopped
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:配置文件
yaml
# prometheus.yml
global:
scrape_interval: 15s # 默认抓取间隔
evaluation_interval: 15s # 规则评估间隔
external_labels:
cluster: 'production'
replica: '{{.ExternalURL}}'
# 告警规则文件
rule_files:
- /etc/prometheus/rules/*.yml
# 抓取配置
scrape_configs:
# Prometheus 自身监控
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter - 主机监控
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
relabel_configs:
- source_labels: [__address__]
target_label: instance
# 应用监控
- job_name: 'my-app'
static_configs:
- targets: ['app:8080']
metrics_path: '/metrics'
scrape_interval: 5s
# Blackbox Exporter - 黑盒监控
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# 远程存储(可选)
remote_write:
- url: "http://cortex:9009/api/prom/push"
queue_config:
max_samples_per_send: 1000
max_shards: 200
remote_read:
- url: "http://cortex:9009/api/prom/api/v1/read"PromQL 查询语言
基础查询
promql
# 查询指标
http_requests_total
# 带标签过滤
http_requests_total{method="GET", status="200"}
# 正则匹配
http_requests_total{status=~"2.."}
# 范围查询
http_requests_total[5m] # 最近 5 分钟的数据
# 偏移查询
http_requests_total offset 1h # 1 小时前的数据聚合操作
promql
# 求和
sum(http_requests_total)
# 按标签分组求和
sum by (method) (http_requests_total)
# 计算平均值
avg(http_request_duration_seconds)
# 计算分位数
histogram_quantile(0.95,
sum by (le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# 计数
count(up)
# 去重计数
count by (job) (up)速率计算
promql
# 每秒增长率(Counter 专用)
rate(http_requests_total[5m])
# 每秒增长率(处理计数器重置)
increase(http_requests_total[1h])
# 平滑速率
irate(http_requests_total[5m]) # 瞬时速率数学运算
promql
# 计算错误率
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# 计算内存使用率
100 - (
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
)
# 预测磁盘将在 4 小时后满
predict_linear(
node_filesystem_avail_bytes{mountpoint="/"}[1h],
4 * 3600
) < 0告警配置
告警规则
yaml
# rules/alerts.yml
groups:
- name: node_alerts
rules:
# 实例宕机
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} down"
description: "{{ $labels.instance }} has been down for more than 1 minute."
# 高 CPU 使用率
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 5 minutes."
# 高内存使用率
- alert: HighMemoryUsage
expr: |
(
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 85% for more than 5 minutes."
# 磁盘即将满
- alert: DiskWillFillIn4Hours
expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
for: 5m
labels:
severity: critical
annotations:
summary: "Disk on {{ $labels.instance }} will fill soon"
description: "Disk is predicted to fill within 4 hours."
# 应用错误率高
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate"
description: "Error rate is above 5% for more than 2 minutes."
# P99 延迟高
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency"
description: "P99 latency is above 1 second."AlertManager 配置
yaml
# alertmanager.yml
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alert@example.com'
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'default'
routes:
# 严重告警立即通知
- match:
severity: critical
receiver: 'pagerduty'
continue: true
# 警告级别发送邮件
- match:
severity: warning
receiver: 'email'
inhibit_rules:
# 抑制规则:如果实例宕机,抑制该实例的其他告警
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['instance', 'cluster']
receivers:
- name: 'default'
slack_configs:
- api_url: '{{ .SlackURL }}'
channel: '#alerts'
title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'email'
email_configs:
- to: 'team@example.com'
subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
- name: 'pagerduty'
pagerduty_configs:
- service_key: '{{ .PagerDutyKey }}'
description: '{{ .GroupLabels.alertname }}'服务发现
Kubernetes 服务发现
yaml
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- default
- production
relabel_configs:
# 只抓取带注解的 Pod
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# 获取端口
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
# 设置指标路径
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true最佳实践
指标命名规范
# 格式:<namespace>_<subsystem>_<metric_name>_<unit>_<suffix>
# 好的命名
http_requests_total # 计数器
http_request_duration_seconds # 直方图
http_request_duration_seconds_bucket # 桶
http_request_duration_seconds_sum # 总和
http_request_duration_seconds_count # 计数
# 避免
num_requests # 缺少命名空间
request_time # 缺少单位标签设计
go
// 高基数标签(避免)
http_requests_total{user_id="12345"} // 用户 ID 会导致标签爆炸
// 低基数标签(推荐)
http_requests_total{method="GET", status="200", endpoint="/api/users"}性能优化
yaml
# 抓取配置优化
scrape_configs:
- job_name: 'my-app'
scrape_interval: 15s # 根据需求调整
scrape_timeout: 10s # 小于 scrape_interval
sample_limit: 10000 # 限制样本数
# 使用缓存
metric_relabel_configs:
- source_labels: [__name__]
regex: 'go_memstats.*'
action: drop # 丢弃不需要的指标总结
Prometheus 是构建现代化监控体系的强大工具:
| 组件 | 用途 |
|---|---|
| Prometheus Server | 指标抓取、存储、查询 |
| Exporters | 暴露系统/应用指标 |
| AlertManager | 告警管理 |
| Grafana | 可视化展示 |
| Pushgateway | 短任务指标推送 |
关键要点:
- 合理设计指标和标签,避免高基数
- 选择合适的指标类型(Counter/Gauge/Histogram/Summary)
- 配置合理的抓取间隔和保留策略
- 设置有效的告警规则,避免告警疲劳
- 使用 Grafana 创建直观的监控大盘
相关文章推荐:
- Grafana 可视化实战 - 监控可视化
- CI/CD 最佳实践 - 持续集成与交付
- Go 集成 Prometheus 监控 - 应用监控接入

