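1. Component overview
prometheus: Prometheus, written in Go, is an open-source combination of monitoring system, alerting toolkit, and time-series database. Main advantages: no external dependencies, so it is very simple to install and use, and it integrates with many systems.
grafana: Grafana is an open-source application written in Go, used mainly to visualize large volumes of metric data. It is one of the most popular tools for displaying time-series data in network architecture and application analysis, and it supports most commonly used time-series databases. Main advantages: convenient dashboards, many data source types, built-in notifications.
alertmanager: Alertmanager is the alert-handling component of the Prometheus project. It receives the alerts that Prometheus fires, groups and deduplicates them, and routes them to receivers such as email or webhooks. Main advantage: many notification channels.
prometheus-webhook-dingtalk: prometheus-webhook-dingtalk is a webhook service that forwards Prometheus alert notifications to DingTalk groups. It integrates with DingTalk directly, so the monitoring team receives alerts promptly and can handle them together.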
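2. Basic preparation
Docker installation: this is simple enough that I won't cover it here; I may write a separate post on installing Docker later.
Server: since this is only a test, I used a single CentOS 7 machine with IP 10.10.30.34 (this address appears in the configuration files later).
Installation directory: it holds the configuration and data files of all the components.
Create the directory: mkdir -p /data/prometheus
Change into it: cd /data/prometheus/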
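3. Prometheus preparation
3.1 Prometheus configuration and data files
Data directory: mkdir -p prometheus/data
Set permissions: chmod 777 prometheus/data
chmod 777 is fine for a test; in real use, set the permissions to match your environment.
Configuration file: vim prometheus/prometheus.yml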
global:
  scrape_interval: 15s       # how often to scrape targets
  evaluation_interval: 15s   # how often to evaluate rules
  scrape_timeout: 10s        # timeout for each scrape

scrape_configs:              # list of scrape jobs
  - job_name: prometheus     # required and unique; attached to every sample as the job label
    static_configs:
      - targets: ['10.10.30.34:9090']   # Prometheus IP and port
        labels:
          instance: prometheus          # label
  - job_name: ehospital-exploit-database   # the monitored client (node-exporter)
    static_configs:
      - targets: ['10.10.30.34:9100']
        labels:
          instance: eehospital-exploit-database

alerting:                    # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - 10.10.30.34:9093   # Alertmanager address

rule_files:                  # rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"
3.2 Prometheus alerting configuration
Create the directory: mkdir rules
It holds the alerting and recording rule files.
Alerting rules: vim rules/alert-rules.yml
groups:
- name: prometheus-alert
  rules:
  - alert: prometheus-down
    expr: prometheus:up == 0
    for: 1m
    labels:
      severity: 'critical'
    annotations:
      summary: "instance: {{ $labels.instance }} is down"
      description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-cpu-high
    expr: prometheus:cpu:total:percent > 80
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} CPU usage is high, current value {{ $value }}"
      description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has been above 80% for three minutes."
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-cpu-iowait-high
    expr: prometheus:cpu:iowait:percent >= 12
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} CPU iowait is high, current value {{ $value }}"
      description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has been at or above 12% for three minutes."
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-load-load1-high
    # prometheus:cpu:count is not defined in record-rules.yml; add a recording rule for it or this alert never fires
    expr: (prometheus:load:load1) > (prometheus:cpu:count) * 1.2
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} load1 is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-memory-high
    expr: prometheus:memory:used:percent > 85
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} memory usage is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-disk-high
    expr: prometheus:disk:used:percent > 80
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} disk usage is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-disk-read-count-high
    expr: prometheus:disk:read:count:rate > 2000
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} read IOPS is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-disk-write-count-high
    expr: prometheus:disk:write:count:rate > 2000
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} write IOPS is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-disk-read-mb-high
    expr: prometheus:disk:read:mb:rate > 60
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} disk read throughput (MB/s) is high, current value {{ $value }}"
      description: ""
      instance: "{{ $labels.instance }}"
      value: "{{ $value }}"
  - alert: prometheus-disk-write-mb-high
    expr: prometheus:disk:write:mb:rate > 60
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} disk write throughput (MB/s) is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-filefd-allocated-percent-high
    expr: prometheus:filefd_allocated:percent > 80
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} open file descriptor usage is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-network-netin-error-rate-high
    expr: prometheus:network:netin:error:rate > 4
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} inbound packet error rate is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-network-netin-packet-rate-high
    expr: prometheus:network:netin:packet:rate > 35000
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} inbound packet rate is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-network-netout-packet-rate-high
    expr: prometheus:network:netout:packet:rate > 35000
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} outbound packet rate is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-network-tcp-total-count-high
    expr: prometheus:network:tcp:total:count > 40000
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} TCP connection count is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-process-zoom-total-count-high
    # prometheus:process:zoom:total:count is not defined in record-rules.yml
    expr: prometheus:process:zoom:total:count > 10
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} zombie process count is high, current value {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: prometheus-time-offset-high
    # prometheus:time:offset is not defined in record-rules.yml
    expr: prometheus:time:offset > 0.03
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
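To make the for: clause in these rules concrete, here is a minimal Python sketch of how an alert moves from inactive to pending to firing. It assumes rules are evaluated every 15 s, as set by evaluation_interval in prometheus.yml. The function name alert_states and the sample input are made up for illustration; this models the behaviour and is not Prometheus code.

```python
from typing import List, Optional


def alert_states(results: List[bool], for_seconds: int, interval: int = 15) -> List[str]:
    """Return the alert state after each rule evaluation.

    results[i] says whether the alert expression returned a sample at
    evaluation i; evaluations are `interval` seconds apart.
    """
    states = []
    active_since: Optional[int] = None  # time the expression first became true
    for i, is_true in enumerate(results):
        now = i * interval
        if not is_true:
            active_since = None  # any gap resets the `for` timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once it has been continuously active for `for_seconds`
        states.append("firing" if now - active_since >= for_seconds else "pending")
    return states


# prometheus-down uses `for: 1m`: with 15 s evaluations it fires on the
# fifth consecutive evaluation in which the target is down.
print(alert_states([True] * 6, for_seconds=60))
# ['pending', 'pending', 'pending', 'pending', 'firing', 'firing']
```

A single healthy evaluation in the middle resets the timer, which is why a short `for` helps against flapping targets at the cost of a later notification.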
Recording rules (the metrics the alerts are built on): vim rules/record-rules.yml
groups:
- name: prometheus-record
  rules:
  - expr: up{job!="prometheus"}
    record: prometheus:up
    labels:
      desc: "whether the node is online: 1 online, 0 offline"
      unit: " "
      job: "prometheus"
  - expr: time() - node_boot_time_seconds{}
    record: prometheus:node_uptime
    labels:
      desc: "node uptime"
      unit: "s"
      job: "prometheus"
  - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
    record: prometheus:cpu:total:percent
    labels:
      desc: "total CPU usage percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
    record: prometheus:cpu:idle:percent
    labels:
      desc: "CPU idle percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m]))) * 100
    record: prometheus:cpu:iowait:percent
    labels:
      desc: "CPU iowait percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m]))) * 100
    record: prometheus:cpu:system:percent
    labels:
      desc: "CPU system percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m]))) * 100
    record: prometheus:cpu:user:percent
    labels:
      desc: "CPU user percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
    record: prometheus:cpu:other:percent
    labels:
      desc: "CPU percentage of the node spent in other modes"
      unit: "%"
      job: "prometheus"
  - expr: node_memory_MemTotal_bytes{job!="prometheus"}
    record: prometheus:memory:total
    labels:
      desc: "total memory of the node"
      unit: byte
      job: "prometheus"
  - expr: node_memory_MemFree_bytes{job!="prometheus"}
    record: prometheus:memory:free
    labels:
      desc: "free memory of the node"
      unit: byte
      job: "prometheus"
  - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
    record: prometheus:memory:used
    labels:
      desc: "used memory of the node"
      unit: byte
      job: "prometheus"
  - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
    record: prometheus:memory:actualused
    labels:
      desc: "memory actually in use by user processes on the node"
      unit: byte
      job: "prometheus"
  - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
    record: prometheus:memory:used:percent
    labels:
      desc: "memory usage percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
    record: prometheus:memory:free:percent
    labels:
      desc: "available memory percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: sum by (instance) (node_load1{job!="prometheus"})
    record: prometheus:load:load1
    labels:
      desc: "1-minute system load"
      unit: " "
      job: "prometheus"
  - expr: sum by (instance) (node_load5{job!="prometheus"})
    record: prometheus:load:load5
    labels:
      desc: "5-minute system load"
      unit: " "
      job: "prometheus"
  - expr: sum by (instance) (node_load15{job!="prometheus"})
    record: prometheus:load:load15
    labels:
      desc: "15-minute system load"
      unit: " "
      job: "prometheus"
  - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}
    record: prometheus:disk:usage:total
    labels:
      desc: "total disk space of the node"
      unit: byte
      job: "prometheus"
  - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
    record: prometheus:disk:usage:free
    labels:
      desc: "free disk space of the node"
      unit: byte
      job: "prometheus"
  - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
    record: prometheus:disk:usage:used
    labels:
      desc: "used disk space of the node"
      unit: byte
      job: "prometheus"
  - expr: (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
    record: prometheus:disk:used:percent
    labels:
      desc: "disk usage percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
    record: prometheus:disk:read:count:rate
    labels:
      desc: "disk read operations per second"
      unit: "ops/s"
      job: "prometheus"
  - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
    record: prometheus:disk:write:count:rate
    labels:
      desc: "disk write operations per second"
      unit: "ops/s"
      job: "prometheus"
  # read rate comes from node_disk_read_bytes_total, write rate from node_disk_written_bytes_total
  - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
    record: prometheus:disk:read:mb:rate
    labels:
      desc: "device read rate in MB"
      unit: "MB/s"
      job: "prometheus"
  - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
    record: prometheus:disk:write:mb:rate
    labels:
      desc: "device write rate in MB"
      unit: "MB/s"
      job: "prometheus"
  - expr: (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
    record: prometheus:filesystem:used:percent
    labels:
      desc: "inode usage percentage of the node"
      unit: "%"
      job: "prometheus"
  - expr: node_filefd_allocated{job!="prometheus"}
    record: prometheus:filefd_allocated:count
    labels:
      desc: "number of open file descriptors on the node"
      unit: "count"
      job: "prometheus"
  - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
    record: prometheus:filefd_allocated:percent
    labels:
      desc: "percentage of file descriptors in use on the node"
      unit: "%"
      job: "prometheus"
  # the byte counters are multiplied by 8 so the value matches the bit/s unit
  - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
    record: prometheus:network:netin:bit:rate
    labels:
      desc: "bits received per second by the node's NIC"
      unit: "bit/s"
      job: "prometheus"
  - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
    record: prometheus:network:netout:bit:rate
    labels:
      desc: "bits sent per second by the node's NIC"
      unit: "bit/s"
      job: "prometheus"
  - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: prometheus:network:netin:packet:rate
    labels:
      desc: "packets received per second by the node's NIC"
      unit: "packets/s"
      job: "prometheus"
  - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: prometheus:network:netout:packet:rate
    labels:
      desc: "packets sent per second by the node's NIC"
      unit: "packets/s"
      job: "prometheus"
  - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: prometheus:network:netin:error:rate
    labels:
      desc: "receive errors per second detected by the device driver"
      unit: "packets/s"
      job: "prometheus"
  - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: prometheus:network:netout:error:rate
    labels:
      desc: "transmit errors per second detected by the device driver"
      unit: "packets/s"
      job: "prometheus"
  - expr: node_tcp_connection_states{job!="prometheus", state="established"}
    record: prometheus:network:tcp:established:count
    labels:
      desc: "number of established TCP connections on the node"
      unit: "count"
      job: "prometheus"
  - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
    record: prometheus:network:tcp:timewait:count
    labels:
      desc: "number of time_wait TCP connections on the node"
      unit: "count"
      job: "prometheus"
  - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
    record: prometheus:network:tcp:total:count
    labels:
      desc: "total number of TCP connections on the node"
      unit: "count"
      job: "prometheus"
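As a worked example of the prometheus:cpu:total:percent rule, the sketch below reproduces (1 - avg(irate(idle))) * 100 by hand. irate takes the last two samples in the window and divides the counter increase by the elapsed time. The helper names and the sample values are invented for illustration, and counter resets are ignored.

```python
from typing import Dict, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def irate(prev: Sample, last: Sample) -> float:
    """Per-second rate between the two most recent samples (no counter-reset handling)."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_total_percent(idle: Dict[str, Tuple[Sample, Sample]]) -> float:
    """Average the idle rate over all CPUs, then turn it into a usage percentage."""
    rates = [irate(prev, last) for prev, last in idle.values()]
    avg_idle = sum(rates) / len(rates)
    return (1 - avg_idle) * 100


# node_cpu_seconds_total{mode="idle"} for two cores, scraped 15 s apart
idle_samples = {
    "cpu0": ((0.0, 1000.0), (15.0, 1012.0)),  # idle for 12 of 15 s -> 0.8
    "cpu1": ((0.0, 2000.0), (15.0, 2006.0)),  # idle for 6 of 15 s  -> 0.4
}
print(round(cpu_total_percent(idle_samples), 1))  # 40.0
```

With these numbers the node sits at 40% CPU, well below the 80 threshold of prometheus-cpu-high, so that alert would stay inactive.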
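4. Grafana configuration
Create the directory: mkdir -p grafana/grafana-storage
Set permissions: chmod 777 grafana/grafana-storage
Prepare grafana.ini by copying the default file out of a temporary container.
Start the container: docker run -d --name=grafana -p 3000:3000 grafana/grafana
Copy the file (docker cp accepts the container name, so there is no need to look up the container ID): docker cp grafana:/etc/grafana/grafana.ini ./grafana/
Remove the temporary container so that its name and port 3000 are free for the docker-compose stack created in section 7: docker rm -f grafana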
5. Alertmanager configuration
Create the directory: mkdir alert
Configuration file: vim alert/alertmanager.yml
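route:
  # Labels to group alerts by. No alert here carries a "dingding" label, so all alerts
  # end up in one group; use ['alertname'] to group by rule name.
  group_by: ['dingding']
  # After a new group appears, wait 30s so that alerts arriving together go out in one notification
  group_wait: 30s
  # Wait before notifying about changes (new or resolved alerts) in a group that was already notified
  group_interval: 1h
  # If an alert is still firing after 1h, send the notification again
  repeat_interval: 1h
  receiver: 'dingding.webhook1'
  routes:
    - receiver: 'dingding.webhook1'
      match_re:
        alertname: ".*"

receivers:
  - name: 'dingding.webhook1'   # several receivers can be defined
    webhook_configs:
      - url: 'http://10.10.30.34:8060/dingtalk/webhook1/send'
        send_resolved: true     # also notify when the alert is resolved

inhibit_rules:
  # Inhibition: while a critical alert is firing, warning alerts with the same
  # alertname, dev and instance labels are muted. The rules in this article only use
  # critical and info, so this has no effect until some alerts are labelled warning.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']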
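To check the route and receiver without waiting for a real failure, a test alert can be posted straight to Alertmanager's v2 API (POST /api/v2/alerts). The sketch below only uses the Python standard library. The alert name, labels and helper names are made up for illustration, and the address is the one used throughout this post; adjust it to your host.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Dict, List

ALERTMANAGER = "http://10.10.30.34:9093"  # address used in this post


def build_test_alert(name: str, severity: str, instance: str, minutes: int = 5) -> List[Dict]:
    """Build a one-element alert list; the alert expires by itself after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity, "instance": instance},
        "annotations": {
            "summary": f"instance: {instance} test alert",
            "description": "manual test, safe to ignore",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts: List[Dict], base_url: str = ALERTMANAGER) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


payload = build_test_alert("manual-test", "info", "test-host")
print(json.dumps(payload, indent=2))
# On a machine that can reach Alertmanager:
# print(send_alerts(payload))
```

With the configuration above, the DingTalk message should arrive roughly 30 s after the POST, because a new group first waits for group_wait.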
6. Webhook configuration
Create the directory: mkdir webhook
Configuration file: vim webhook/config.yml
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
templates:
  # - contrib/templates/legacy/template.tmpl
  - /etc/prometheus-webhook-dingtalk/templates/default.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx   # DingTalk robot webhook URL
    # secret for signature
    secret: SEC74939daa62xxx.xxxxxx   # DingTalk robot signing secret
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      all: true
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      mobiles: ['156xxxx8827', '189xxxx8325']
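For context on the secret field: when a robot has signing enabled, each request must carry a timestamp and a signature. prometheus-webhook-dingtalk does this for you once secret is set, so the sketch below is only for understanding or for testing a robot by hand. It follows the scheme as I understand it from DingTalk's custom-robot documentation (HMAC-SHA256 over "timestamp\nsecret", Base64, then URL-encoding); the function names and the SECtest value are placeholders.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Return the URL-encoded signature for the given millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))


def signed_url(webhook_url: str, secret: str, timestamp_ms: Optional[int] = None) -> str:
    """Append the timestamp and sign query parameters to the robot URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder token and secret; use the values of your own robot
print(signed_url("https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx", "SECtest"))
```

Because the timestamp is part of the signature, the clock of the machine running the webhook container should be reasonably accurate.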
prometheus-webhook-dingtalk template
Create the directory: mkdir webhook/template
Create the template: vim webhook/template/default.tmpl
{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}

{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert summary**: {{ .Annotations.summary }}

**Alert name**: {{ .Labels.alertname }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert summary**: {{ .Annotations.summary }}

**Alert name**: {{ .Labels.alertname }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Resolved at**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}

{{ define "default.title" }}
{{ template "__subject" . }}
{{ end }}

{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**==== {{ .Alerts.Firing | len }} alert(s) firing ====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
**==== {{ .Alerts.Resolved | len }} alert(s) resolved ====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

{{/* Following names for compatibility */}}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
7. Creating the Docker containers
Create the compose file: vim docker-compose.yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus
    restart: "always"
    ports:
      - 9090:9090
    container_name: "prometheus"
    volumes:
      - "./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "./rules:/etc/prometheus/rules"
      - "./prometheus/data:/prometheus"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'   # config path, matches the mount above
      - '--storage.tsdb.path=/prometheus'                # data path, matches the mount above
  alertmanager:
    image: prom/alertmanager:latest
    restart: "always"
    ports:
      - 9093:9093
    container_name: "alertmanager"
    volumes:
      - "./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'   # config path, matches the mount above
  webhook:
    image: timonwong/prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - 8060:8060
    container_name: "webhook"   # the DingTalk token is set in config.yml
    volumes:
      - "./webhook/config.yml:/etc/prometheus-webhook-dingtalk/config.yml"
      - "./webhook/template/default.tmpl:/etc/prometheus-webhook-dingtalk/templates/default.tmpl"
    command:
      - '--config.file=/etc/prometheus-webhook-dingtalk/config.yml'   # config path, matches the mount above
  grafana:
    image: grafana/grafana
    restart: "always"
    ports:
      - 3000:3000
    container_name: "grafana"
    volumes:
      - "./grafana/grafana.ini:/etc/grafana/grafana.ini"   # the file copied out in section 4
      - "./grafana/grafana-storage:/var/lib/grafana"
Create the containers: docker-compose -f docker-compose.yml up -d
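Once the containers are up, a quick way to confirm that each service answers is to call its health endpoint. The sketch below assumes the ports from docker-compose.yml and uses the /-/healthy endpoints of Prometheus and Alertmanager and Grafana's /api/health. The helper names are made up, and the webhook container is left out because I have not verified its health endpoint.

```python
import urllib.request
from typing import Dict


def health_urls(host: str) -> Dict[str, str]:
    """Health endpoints for the services started by docker-compose.yml."""
    return {
        "prometheus": f"http://{host}:9090/-/healthy",
        "alertmanager": f"http://{host}:9093/-/healthy",
        "grafana": f"http://{host}:3000/api/health",
    }


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False


def report(host: str) -> None:
    """Print one line per service; call this on a machine that can reach the host."""
    for name, url in health_urls(host).items():
        print(f"{name:13s} {url} -> {'ok' if is_healthy(url) else 'UNREACHABLE'}")


# Example: report("10.10.30.34")
```

If a service shows as unreachable, docker logs with the container name (prometheus, alertmanager, grafana) is usually the fastest way to find a configuration error.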
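8. Adding a DingTalk robot
Create a group yourself (a group needs at least two people), then follow the screenshots below. It is only a few clicks, so I won't go into detail.
The remaining steps are the robot name, the webhook URL, and the signing secret. I didn't feel like blurring those out, so there are no screenshots for them.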
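9. Verification
With everything installed, it's time to verify. If you know the stack a little, you have probably spotted it already: I configured Prometheus to scrape the server's metrics on port 9100 but never started node-exporter, so an alert is bound to fire. Sure enough, the alert notification arrived.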
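The same state can be read from Prometheus itself: the HTTP API endpoint /api/v1/targets lists every scrape target with its health. The sketch below parses that response. The sample payload is hand-written and trimmed to the two fields used here, and the helper names are made up for illustration.

```python
import json
import urllib.request
from typing import Dict


def summarize_targets(payload: dict) -> Dict[str, str]:
    """Map each target's scrape URL to its health ("up", "down" or "unknown")."""
    return {t["scrapeUrl"]: t["health"] for t in payload["data"]["activeTargets"]}


def fetch_targets(base_url: str = "http://10.10.30.34:9090") -> dict:
    """Fetch the live target list; needs network access to Prometheus."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample of the situation before node-exporter is started
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://10.10.30.34:9090/metrics", "health": "up"},
            {"scrapeUrl": "http://10.10.30.34:9100/metrics", "health": "down"},
        ]
    },
}
print(summarize_targets(sample))
# {'http://10.10.30.34:9090/metrics': 'up', 'http://10.10.30.34:9100/metrics': 'down'}
```

On the real server, summarize_targets(fetch_targets()) should show the 9100 target as down until node-exporter is running.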
Install node-exporter: vim node-exporter-compose.yml
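version: '3.2'
services:
  node-exporter:
    image: prom/node-exporter
    restart: "always"
    ports:
      - 9100:9100
    container_name: "node-exporter"
    volumes:
      - "/proc:/host/proc:ro"
      - "/sys:/host/sys:ro"
      - "/:/rootfs:ro"
    command:
      # point node-exporter at the host paths mounted above; without these flags the mounts are not used
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'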
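Start node-exporter: docker-compose -f node-exporter-compose.yml up -d
Now that the node is up, a resolved notification should follow. Sure enough, it arrived, only slowly, because I set the intervals in alertmanager.yml to 1h. Adjust them to suit your needs.
PS: experts can go straight to the official documentation; this post is only meant as a reference. Good luck!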