项目概述

通过alertmanager和PrometheusAlert实现飞书监控告警

端口使用

服务名称端口
Prometheus9090
alertmanager9093
PrometheusAlert8080
grafana3000

监控指标

监控项目阈值
内存占用超过80%告警
磁盘占用超过80%告警
cpu所有核的占用率超过80%告警
磁盘读写速度超过100mb/s告警

一、PrometheusAlert以及相关配置

版本: v4.8.1

1.下载PrometheusAlert

cd /opt/aaawget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.8.1/linux.zipunzip linux.zipmv linux ../PrometheusAlert

PrometheusAlert所在位置 /opt/PrometheusAlert

2.给启动脚本加x权限

cd /opt/PrometheusAlertchmod +x PrometheusAlert

3.修改配置文件

配置文件所在位置 /opt/PrometheusAlert/conf/app.conf

 74 #是否开启飞书告警通道,可同时开始多个通道0为关闭,1为开启 75 open-feishu=1   将飞书告警通道开启

4.通过systemd去控制PrometheusAlert

服务名为:PrometheusAlert.service

  • PrometheusAlert不能直接使用systemd管理,否则web端会报错,需要编写启动脚本,脚本内容如下:
vim /etc/init.d/PrometheusAlert#!/bin/bash# 作用:PrometheusAlert 启动脚本export PATH="$PATH"cd /opt/PrometheusAlertprometheusalert_dir="/opt/PrometheusAlert"prometheusalert_status=$(netstat -tunlp | grep PrometheusA | grep 8080 | wc -l)start() {    if [ "${prometheusalert_status}" -eq 0 ]; then        nohup ./PrometheusAlert > /dev/null 2>&1 &        return_value=$?        if [ "${return_value}" -eq 0 ]; then            echo "PrometheusAlert start successful"        else            echo "PrometheusAlert start fault, please check config file"        fi    else    echo "PrometheusAlert in running"    fi}stop() {    if [ ${prometheusalert_status} -gt 0 ]; then        prometheus_pid=$(netstat -tunlp | grep PrometheusA | grep 8080 | awk "NR==1" | awk '{print $7}' | awk -F "/" '{print $1}')    kill -9 ${prometheus_pid}    return_value=$?    if [ "${return_value}" -eq 0 ]; then        echo "PrometheusAlert stop successful"    else        echo "PrometheusAlert stop fault, please check redis status"    fi    else        echo "PrometheusAlert is not running, please execure $0 start"    fi}case "$1" instart)    start && exit 0    ;;stop)    stop || exit 2    ;;restart)    stop    sleep 10s     start && exit 0    ;;*)    echo $"Usage: $0 {start|stop|restart}"    exit 2    ;;esacexit $?---------------------------------------------启动方式为/etc/init.d/PrometheusAlert   {start|stop|restart}
  • 设置开机自启
编写配置文件cd  /usr/lib/systemd/systemcat >PrometheusAlert.service < EOF  [Unit]Description=Prometheus AlertAfter=network.target[Service]Type=forkingExecStart=/etc/init.d/PrometheusAlert startExecRestart=/etc/init.d/PrometheusAlert restartExecStop=/etc/init.d/PrometheusAlert stopRestart=on-failure[Install]WantedBy=multi-user.targetEOF设置开机自启并启动[root@m01-176 PrometheusAlert]# systemctl enable PrometheusAlert.serviceCreated symlink from /etc/systemd/system/multi-user.target.wants/PrometheusAlert.service to /usr/lib/systemd/system/PrometheusAlert.service.[root@m01-176 system]# systemctl start PrometheusAlert.service[root@m01-176 system]# systemctl status PrometheusAlert.service● PrometheusAlert.service - Prometheus   Loaded: loaded (/usr/lib/systemd/system/PrometheusAlert.service; enabled; vendor preset: disabled)   查看8080端口是否启动[root@m01-176 system]# ss -tnl|grep 8080LISTEN     0      128       [::]:18080                 [::]:*LISTEN     0      128       [::]:8080                  [::]:*

5.进入PrometheusAlert的webUI进行设置

webUI的初始密码可在PrometheusAlert配置文件中查看 /opt/PrometheusAlert/conf/app.conf

  3 #登录用户名  4 login_user=prometheusalert  5 #登录密码  6 login_password=prometheusalert
  • 创建自定义模板

  • 设置报警模板

  • 保存模板

  • 复制并保存模板的路径用于编写altermanager配置文件

二、altermanager部署及相关配置

版本: 0.24.0

1.下载altermanager

cd /optmkdir aaawget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gztar xf  alertmanager-0.24.0.linux-amd64.tar.gz -C /optmv alertmanager-0.24.0.linux-amd64/ alertmanager

2.修改配置文件

[root@m01-176 alertmanager]# cat alertmanager.ymlroute:  group_by: ['alertname']  group_wait: 30s  group_interval: 5m  repeat_interval: 1h  receiver: 'web.hook'receivers:  - name: 'web.hook'    webhook_configs:      - url: 'http://xxx.xxx.x.xxx:8080/prometheusalert?type=fs&tpl=test01&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'       #将其改为PrometheusAlert的模板路径inhibit_rules:  - source_match:      severity: 'critical'    target_match:      severity: 'warning'    equal: ['alertname', 'dev', 'instance']

3.修改Prometheus的配置文件连接alertmanager

[root@m01-176 prometheus]# vim prometheus.yml**************************  8 alerting:  9   alertmanagers: 10     - static_configs: 11         - targets: 12           # - alertmanager:9093 13           - 127.0.0.1:9093        #alertmanager地址**************************rule_files:  # - "first_rules.yml"  # - "second_rules.yml"  - "/opt/prometheus/rules/*.yml"         #报警规则路径必须为绝对路径

4.编写报警规则

报警规则配置文件路径为/opt/prometheus/rules

  • 进入Prometheus的WebUI页面使用PQL取值
# cpu在过去1分钟内的使用率(1 - sum by (instance) (increase(node_cpu_seconds_total{mode="idle"}[1m])) / sum by (instance) (increase(node_cpu_seconds_total[1m]))) * 100 # 内存使用率((node_memory_MemTotal_bytes-(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes))/node_memory_MemTotal_bytes)*100# 磁盘使用率(node_filesystem_size_bytes {mountpoint ="/"} - node_filesystem_free_bytes {mountpoint ="/"}) / node_filesystem_size_bytes {mountpoint ="/"} * 100#磁盘读速度超过100mb/s告警sum by (instance) (rate(node_disk_read_bytes_total[2m]))/1024/1024 >100#磁盘写速度超过100mb/s告警sum by (instance) (rate(node_disk_written_bytes_total[2m]))/1024 /1024 >100
  • 进入Prometheus的WebUI页面使用PQL取值对比grafana查看取值是否准确

  • 写入报警规则配置文件
[root@m01-176 rules]# ll总用量 20-rw-r--r--. 1 root root 510 10月  3 21:05 cpu_used_rules.yml      #cpu在过去1分钟内的使用率-rw-r--r--. 1 root root 467 10月  3 21:27 disk_read_rules.yml       #磁盘读速度超过100mb/s告警 -rw-r--r--. 1 root root 477 10月  3 21:26 disk_written_rules.yml      #磁盘写速度超过100mb/s告警-rw-r--r--. 1 root root 536 10月  3 21:12 filesystem_used_rules.yml    # 磁盘使用率-rw-r--r--. 1 root root 510 10月  3 21:13 memory_used_rules.yml      # 内存使用率
  • 文件内容如下
[root@m01-176 rules]# cat *groups:- name: Cpu  rules:    - alert: Cpu01      expr: "(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100 > 80"      for: 1m      labels:        severity: warning # 自定一个一个标签 info 级别      annotations:        summary: "服务 {{ $labels.instance }} cpu 使用率过高"        description: "{{ $labels.instance }} 的 cpu 在过去1分钟内使用过高,cpu 使用率 {{humanize $value}}."        ---------------------------------------------------------------groups:- name: Disk_read  rules:    - alert: disk_read      expr: "sum by (instance) (rate(node_disk_read_bytes_total[5m]))/1024/1024 > 100"      for: 1m      labels:        severity: warning # 自定一个一个标签 info 级别      annotations:        summary: "服务 {{ $labels.instance }}  磁盘读取速度状态异常"        description: "{{ $labels.instance }} 的磁盘读取速度状态异常,磁盘读取速度为 {{humanize $value}} MB/s"-------------------------------------------------------------------groups:- name: Disk_written  rules:    - alert: disk_written      expr: "sum by (instance) (rate(node_disk_written_bytes_total[2m]))/1024 /1024 >100"      for: 1m      labels:        severity: warning # 自定一个一个标签 info 级别      annotations:        summary: "服务 {{ $labels.instance }}  磁盘写入速度状态异常"        description: "{{ $labels.instance }} 的磁盘写入速度状态异常,磁盘写入速度为 {{humanize $value}} MB/s"---------------------------------------------------------------------groups:- name: Filesystem  rules:    - alert: filesystem      expr: '(node_filesystem_size_bytes {mountpoint ="/"} - node_filesystem_free_bytes {mountpoint ="/"}) / node_filesystem_size_bytes {mountpoint ="/"} * 100 > 10'      for: 1m      labels:        severity: warning # 自定一个一个标签 info 级别      annotations:        summary: "服务 {{ $labels.instance }} 磁盘使用率过高"        description: "{{ $labels.instance }} 的 磁盘在过去1分钟内使用过高,磁盘使用率 {{humanize $value}}."-----------------------------------------------------------------------groups:- name: Memory  rules:    - alert: memory      expr: "((node_memory_MemTotal_bytes-(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes))/node_memory_MemTotal_bytes)*100 > 80"      for: 1m      labels:        severity: warning # 自定一个一个标签 info 级别      annotations:        summary: "服务 {{ $labels.instance }}  内存使用率过高"        description: "{{ $labels.instance }} 的 内存使用率过高,内存使用率为 {{humanize $value}}."
  • 可以通过Prometheus的WebUI查看状态信息

报警阈值修改

修改配置文件expr字段末尾的数字

修改配置文件expr字段[root@m01-176 rules]# cat *groups:- name: Cpu  rules:    - alert: Cpu01      expr: "(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100 > 80"     #修改大于号后面的80,      for: 1m      labels:        severity: warning # 自定一个一个标签 info 级别      annotations:        summary: "服务 {{ $labels.instance }} cpu 使用率过高"        description: "{{ $labels.instance }} 的 cpu 在过去1分钟内使用过高,cpu 使用率 {{humanize $value}}."

相关技术文档

【PrometheusAlert入门】 – PrometheusAlert (gitbook.io)

PrometheusAlert: Prometheus Alert是开源的运维告警中心消息转发系统,支持主流的监控系统Prometheus,Zabbix,日志系统Graylog和数据可视化系统Grafana发出的预警消息,支持钉钉,微信,华为云短信,腾讯云短信,腾讯云电话,阿里云短信,阿里云电话等 (gitee.com)

项目
监控普罗米修斯的目标客户端是否存活
监控普罗米修斯的目标服务端是否存活
监控普罗米修斯的服务端在过去15分钟内重启了两次以上
[root@m01-176 rules]# cat prometheus_self-monitoring_rules.ymlgroups:- name: Prmetheus-self-monitoring  rules:  #普罗米修斯的目标已经消失了。导出程序可能已崩溃  - alert: PrometheusTargetMissing    expr: up == 0    for: 0m    labels:      severity: critical    annotations:      #summary: Prometheus target missing (instance {{ $labels.instance }})      summary: "普罗米修斯的目标客户端 {{ $labels.instance }} 已经消失了。导出程序可能已崩溃。"      description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  #普罗米修斯的工作消失了  - alert: PrometheusJobMissing    expr: absent(up{job="prometheus"})    for: 0m    labels:      severity: warning    annotations:      #summary: Prometheus job missing (instance {{ $labels.instance }})      summary: "{{ $labels.instance }} 普罗米修斯服务端状态异常。"      description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  #普罗米修斯在过去15分钟内重启了两次以上。它可能是崩溃  - alert: PrometheusTooManyRestarts    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2    for: 0m    labels:      severity: warning    annotations:      #summary: Prometheus too many restarts (instance {{ $labels.instance }})      summary: "{{ $labels.instance }} 普罗米修斯在过去15分钟内重启了两次以上。它可能是崩溃了."      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
项目阈值
主机网络接口可能接收到的数据过多(> 100 MB/s)
主机网络接口可能发送的数据过多(> 100 MB/s)
磁盘剩余的可用的 inode小于 10%
检测到 OOM 杀伤
[root@m01-176 rules]# cat basic_resource_monitoring.ymlgroups:- name: Basic_resource_monitoring  rules:  #主机网络接口可能接收到的数据过多(> 100 MB/s)  - alert: HostUnusualNetworkThroughputIn    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100    for: 5m    labels:      severity: warning    annotations:      #summary: Host unusual network throughput in (instance {{ $labels.instance }})      summary: "{{ $labels.instance }} 网络接口接收到的数据过多,超过100Mb/s"      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  #主机网络接口可能发送的数据过多(> 100 MB/s)  - alert: HostUnusualNetworkThroughputOut    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100    for: 5m    labels:      severity: warning    annotations:      #summary: Host unusual network throughput out (instance {{ $labels.instance }}) "\n"      summary: "{{ $labels.instance }} 网络接口可能发送的数据过多,超过100Mb/s"      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  #磁盘几乎用完了可用的 inode(<剩余 10%)  - alert: HostOutOfInodes    expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0    for: 2m    labels:      severity: warning    annotations:      #summary: Host out of inodes (instance {{ $labels.instance }})      summary:  "{{ $labels.instance }} 磁盘剩余的可用的 inode 小于 10%"      description: "Disk is almost running out of available inodes ( 0    for: 0m    labels:      severity: warning    annotations:      #summary: Host OOM kill detected (instance {{ $labels.instance }})      summary: "{{ $labels.instance }} 检测到存在内存杀伤。"      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"