Lesson 5.6: Alerting and On-Call

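Learning Objectives

  • Design a sound alerting system: clear severity levels, low noise, and an action behind every alert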

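1. Alert Severity Levels

| Level | Response time | Examples |
|---|---|---|
| P0 (Critical) | Immediate | Service outage, data loss |
| P1 (High) | 15 min | P99 latency > 5s |
| P2 (Medium) | 1 hr | Error rate > 1% |
| P3 (Low) | 1 day | Disk usage > 80% |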
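
One way to put these levels to work is to carry them on each alert as a `severity` label and let Alertmanager route on it. The following is a minimal sketch, not a production config: it assumes the label values `critical`, `high`, `medium`, and `low` for P0 to P3, and the receiver names (`oncall-pager`, `team-chat`, `ticket-queue`) are hypothetical. The receivers are declared without integrations, so they would need real notification settings before use.

```yaml
# alertmanager.yml (sketch): route alerts by severity level.
# Label values and receiver names are illustrative assumptions.
route:
  receiver: ticket-queue            # default route, e.g. P3 (low)
  routes:
    - matchers:
        - severity=~"critical|high" # P0 and P1: page the on-call engineer
      receiver: oncall-pager
    - matchers:
        - severity="medium"         # P2: post to the team channel
      receiver: team-chat

receivers:
  - name: oncall-pager   # add e.g. pagerduty_configs here
  - name: team-chat      # add e.g. slack_configs here
  - name: ticket-queue   # add e.g. webhook_configs here
```

Routing the two most urgent levels to a pager and the rest to slower channels keeps the response-time expectations in the table aligned with how intrusive each notification is.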

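2. Alert Noise-Reduction Principles

  • Every alert must have an action item (otherwise it is noise)
  • Aggregate related alerts instead of alerting on each individual event
  • Set a silence period so the same alert is not re-sent continuously
  • Require alerts to be acknowledged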
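
Some of these principles map onto Alertmanager settings. The sketch below shows grouping (aggregation), a re-notification interval, and an inhibition rule. It assumes alerts carry a `service` label and `severity` values such as `critical`, `high`, `medium`, and `low`; the durations are placeholder values to tune, and the `default` receiver is hypothetical.

```yaml
# alertmanager.yml (sketch): grouping, re-notification interval, inhibition.
route:
  receiver: default
  group_by: [alertname, service]  # one notification per alert name and service
  group_wait: 30s       # collect alerts of a new group before the first notification
  group_interval: 5m    # wait before notifying about alerts added to an existing group
  repeat_interval: 4h   # wait before re-sending a notification for alerts still firing

inhibit_rules:
  # While a critical alert fires for a service, suppress its lower-severity alerts.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~"high|medium|low"
    equal: [service]

receivers:
  - name: default       # add notification integrations here
```

Ad-hoc silences are created at runtime (for example through the Alertmanager UI) rather than in this file. Acknowledgement is not tracked by Alertmanager itself; it is usually handled by the paging or on-call tool that receives the notification.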

Prometheus alerting rule example (firing alerts are sent to Alertmanager)

yaml
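groups:
- name: go-services
  rules:
  - alert: HighErrorRate
    # Ratio of 5xx responses to all responses over the last 5 minutes.
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m  # the condition must hold for 5 minutes before the alert fires
    labels:
      severity: critical
    annotations:
      summary: "Error rate {{ $value | humanizePercentage }}"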
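
The same pattern can express other rows of the severity table. The rules below are a sketch under stated assumptions: `http_request_duration_seconds_bucket` is a hypothetical histogram metric exposed by the service, the filesystem metrics are those exported by node_exporter, and the `severity` values `high` and `low` are assumed names for P1 and P3.

```yaml
groups:
- name: severity-examples
  rules:
  # P1: P99 latency above 5s, estimated from histogram buckets.
  - alert: HighP99Latency
    expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 5
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "P99 latency {{ $value | humanizeDuration }}"
  # P3: filesystem more than 80% full.
  - alert: DiskUsageHigh
    expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.8
    for: 30m
    labels:
      severity: low
    annotations:
      summary: "Disk usage {{ $value | humanizePercentage }} on {{ $labels.instance }}"
```

A longer `for` duration on the low-severity rule is a deliberate choice here: slow-moving conditions such as disk usage rarely need a fast trigger, and a longer window avoids flapping.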
