- Monitor applications running in the cluster
- Monitor the cluster itself
  - Control-plane components (api-server, coredns, kube-scheduler)
  - Kubelet (cAdvisor): exposes container metrics
  - kube-state-metrics: cluster-level metrics (deployments, pods)
  - Node exporter: host-level metrics (CPU, memory, network)
Deployment
- Helm is the package manager for Kubernetes
prometheus-operator
Install helm
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
Install Prometheus Chart
kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
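The notes add the chart repository but do not show the install step itself. A minimal sketch of it follows. The release name `prometheus-stack`, the `default` namespace, and chart version `45.6.0` are taken from the annotations and labels on the Prometheus object shown later in these notes; the function name `install_stack` is just an illustrative wrapper so the snippet can be sourced without side effects.

```shell
# Sketch: install kube-prometheus-stack under the release name used in these notes.
install_stack() {
  # Release name and namespace mirror the meta.helm.sh/* annotations on the
  # Prometheus object; --version pins the chart seen in its "chart" label.
  # Drop --version to install the latest chart instead (an assumption, not
  # something the original notes state).
  helm install prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace default \
    --version 45.6.0
}

# Usage (needs a reachable cluster and the repo added above):
#   install_stack
```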
Service Monitors
controlplane ~ ➜ kubectl get crd
NAME CREATED AT
addons.k3s.cattle.io 2024-08-15T00:00:11Z
helmcharts.helm.cattle.io 2024-08-15T00:00:11Z
helmchartconfigs.helm.cattle.io 2024-08-15T00:00:11Z
traefikservices.traefik.containo.us 2024-08-15T00:01:07Z
ingressroutes.traefik.containo.us 2024-08-15T00:01:07Z
middlewaretcps.traefik.containo.us 2024-08-15T00:01:07Z
ingressrouteudps.traefik.containo.us 2024-08-15T00:01:07Z
serverstransports.traefik.containo.us 2024-08-15T00:01:07Z
tlsoptions.traefik.containo.us 2024-08-15T00:01:07Z
tlsstores.traefik.containo.us 2024-08-15T00:01:07Z
middlewares.traefik.containo.us 2024-08-15T00:01:07Z
ingressroutetcps.traefik.containo.us 2024-08-15T00:01:07Z
alertmanagerconfigs.monitoring.coreos.com 2024-08-15T09:10:15Z
alertmanagers.monitoring.coreos.com 2024-08-15T09:10:16Z
podmonitors.monitoring.coreos.com 2024-08-15T09:10:16Z
probes.monitoring.coreos.com 2024-08-15T09:10:16Z
prometheuses.monitoring.coreos.com 2024-08-15T09:10:16Z # defines Prometheus instances
prometheusrules.monitoring.coreos.com 2024-08-15T09:10:17Z
servicemonitors.monitoring.coreos.com 2024-08-15T09:10:17Z # adds targets for Prometheus to scrape
thanosrulers.monitoring.coreos.com 2024-08-15T09:10:17Z
- A ServiceMonitor defines the set of targets that Prometheus monitors and scrapes
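The CRD list above mixes k3s and Traefik resources with those installed by the Prometheus Operator. As a small sketch, the Operator's CRDs can be picked out by their API group, `monitoring.coreos.com`. The helper name `operator_crds` is hypothetical.

```shell
# Sketch: list only the CRDs that belong to the Prometheus Operator.
operator_crds() {
  # "-o name" prints one "customresourcedefinition.<group>/<crd-name>" entry per line,
  # so filtering on the API group keeps just the Operator's resources.
  kubectl get crd -o name | grep 'monitoring\.coreos\.com'
}

# Usage:
#   operator_crds
```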
Pod
# Create a long-running Rocky Linux pod
kubectl run my-host --image=rockylinux/rockylinux --command -- /bin/bash -c "while true; do sleep 3600; done"
pod/my-host created
# Enter the container
controlplane ~ ➜ kubectl exec my-host -it -- /bin/bash
[root@my-host /]# yum install wget -y
[root@my-host /]# wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
[root@my-host /]# tar -xzf node_exporter-1.8.2.linux-amd64.tar.gz
[root@my-host /]# cd node_exporter-1.8.2.linux-amd64
# Start node_exporter in the background
[root@my-host node_exporter-1.8.2.linux-amd64]# nohup ./node_exporter &
# Test
controlplane ~ ➜ kubectl get pod my-host -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-host 1/1 Running 0 13m 10.42.0.12 controlplane <none> <none>
# Try to access the service
controlplane ~ ✖ curl 10.42.0.12:9100
<html lang="en"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Node Exporter</title><style>body {font-family: -apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,Liberation Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji;margin: 0;
}
header {background-color: #e6522c;color: #fff;font-size: 1rem;padding: 1rem;
}
main {padding: 1rem;
}
label {display: inline-block;width: 0.5em;
}</style></head><body><header><h1>Node Exporter</h1></header><main><h2>Prometheus Node Exporter</h2><div>Version: (version=1.8.2, branch=HEAD, revision=f1e0e8360aa60b6cb5e5cc1560bed348fc2c1895)</div><div><ul><li><a href="/metrics">Metrics</a></li></ul></div></main></body>
</html>
controlplane ~ ➜ curl 10.42.0.12:9100/metrics
···
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
···
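The `/metrics` output interleaves `# HELP` and `# TYPE` comment lines with the actual samples. When checking a single metric by hand, a sketch like the following can help. The helper name `metric_samples` is hypothetical, and the match is a plain name prefix, so it also returns metrics whose names start with the same string.

```shell
# Sketch: print only the sample lines for one metric from an exporter.
metric_samples() {
  # $1 = host:port of the exporter, $2 = metric name (prefix match)
  # Drop the "# HELP" / "# TYPE" comment lines, then keep lines starting with the name.
  curl -s "http://$1/metrics" | grep -v '^#' | grep "^$2"
}

# Usage (pod IP taken from the transcript above):
#   metric_samples 10.42.0.12:9100 promhttp_metric_handler_errors_total
```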
Create a Service
# Inspect the Prometheus instance configuration
controlplane ~ ➜ kubectl get prometheuses.monitoring.coreos.com -o yaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    annotations:
      meta.helm.sh/release-name: prometheus-stack
      meta.helm.sh/release-namespace: default
    creationTimestamp: "2024-08-16T02:52:40Z"
    generation: 1
    labels:
      app: kube-prometheus-stack-prometheus
      app.kubernetes.io/instance: prometheus-stack
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/part-of: kube-prometheus-stack
      app.kubernetes.io/version: 45.6.0
      chart: kube-prometheus-stack-45.6.0
      heritage: Helm
      release: prometheus-stack
    name: prometheus-stack-kube-prom-prometheus
    namespace: default
    resourceVersion: "4393"
    uid: b1e782a0-c7d6-4f05-b5b3-20502623a9df
  spec:
    alerting:
      alertmanagers:
      - apiVersion: v2
        name: prometheus-stack-kube-prom-alertmanager
        namespace: default
        pathPrefix: /
        port: http-web
    enableAdminAPI: false
    evaluationInterval: 30s
    externalUrl: http://prometheus-stack-kube-prom-prometheus.default:9090
    hostNetwork: false
    image: quay.io/prometheus/prometheus:v2.42.0
    listenLocal: false
    logFormat: logfmt
    logLevel: info
    paused: false
    podMonitorNamespaceSelector: {}
    podMonitorSelector:
      matchLabels:
        release: prometheus-stack
    portName: http-web
    probeNamespaceSelector: {}
    probeSelector:
      matchLabels:
        release: prometheus-stack
    replicas: 1
    retention: 10d
    routePrefix: /
    ruleNamespaceSelector: {}
    ruleSelector:
      matchLabels:
        release: prometheus-stack  # a PrometheusRule must carry this label to be discovered by the Prometheus instance
    scrapeInterval: 30s
    securityContext:
      fsGroup: 2000
      runAsGroup: 2000
      runAsNonRoot: true
      runAsUser: 1000
    serviceAccountName: prometheus-stack-kube-prom-prometheus
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector:
      matchLabels:
        release: prometheus-stack  # a ServiceMonitor must also carry this label to be discovered by the Prometheus instance
    shards: 1
    version: v2.42.0
    walCompression: true
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2024-08-16T02:53:21Z"
      observedGeneration: 1
      status: "True"
      type: Available
    - lastTransitionTime: "2024-08-16T02:52:50Z"
      observedGeneration: 1
      status: "True"
      type: Reconciled
    paused: false
    replicas: 1
    shardStatuses:
    - availableReplicas: 1
      replicas: 1
      shardID: "0"
      unavailableReplicas: 0
      updatedReplicas: 1
    unavailableReplicas: 0
    updatedReplicas: 1
kind: List
metadata:
  resourceVersion: ""
# Get the pod's labels
controlplane ~ ➜ kubectl get pod my-host --show-labels
NAME READY STATUS RESTARTS AGE LABELS
my-host 1/1 Running 0 16m run=my-host
#### svc.yml
apiVersion: v1
kind: Service
metadata:
  name: my-host-exporter-svc
  labels:
    job: my-host-exporter
    app: my-host-exporter-svc
spec:
  selector:
    run: my-host
  ports:
  - name: exporter
    protocol: TCP
    port: 9100
    targetPort: 9100
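The manifest still has to be applied before the Service exists. A minimal sketch, assuming the manifest is saved as `svc.yml` in the current directory (the file name comes from the heading above; the helper name `apply_svc` is hypothetical):

```shell
# Sketch: create the Service and confirm it selected the my-host pod.
apply_svc() {
  kubectl apply -f svc.yml
  # An empty ENDPOINTS column here would mean the selector (run=my-host)
  # matched no running pod.
  kubectl get endpoints my-host-exporter-svc
}

# Usage:
#   apply_svc
```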
### Test
controlplane ~ ➜ kubectl get svc my-host-exporter-svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-host-exporter-svc ClusterIP 10.43.32.68 <none> 9100/TCP 9s
controlplane ~ ✖ curl 10.43.32.68:9100
<html lang="en"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Node Exporter</title><style>body {font-family: -apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,Liberation Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji;margin: 0;
}
header {background-color: #e6522c;color: #fff;font-size: 1rem;padding: 1rem;
}
main {padding: 1rem;
}
label {display: inline-block;width: 0.5em;
}</style></head><body><header><h1>Node Exporter</h1></header><main><h2>Prometheus Node Exporter</h2><div>Version: (version=1.8.2, branch=HEAD, revision=f1e0e8360aa60b6cb5e5cc1560bed348fc2c1895)</div><div><ul><li><a href="/metrics">Metrics</a></li></ul></div></main></body>
</html>
ServiceMonitor Template
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-host-svc-mon
  labels:
    release: prometheus-stack
spec:
  jobLabel: job
  endpoints:
  - port: exporter  # same as the port name in the Service
    interval: 30s
    path: /metrics  # metrics path
  selector:
    matchLabels:
      app: my-host-exporter-svc
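One way to apply the template and check that it carries the label the Prometheus instance selects on (`release: prometheus-stack`, from `serviceMonitorSelector` in the Prometheus object) is sketched below. The manifest path is passed as an argument because the notes do not name the file; `apply_servicemonitor` is a hypothetical helper name.

```shell
# Sketch: apply a ServiceMonitor manifest, then list the ServiceMonitors
# that match the Prometheus instance's label selector.
apply_servicemonitor() {
  # $1 = path to the ServiceMonitor manifest (file name is an assumption)
  kubectl apply -f "$1"
  # -l filters by label, mirroring serviceMonitorSelector.matchLabels
  kubectl get servicemonitors -l release=prometheus-stack
}

# Usage:
#   apply_servicemonitor svc-mon.yml
```

If `my-host-svc-mon` is missing from the listing, the `release` label is the first thing to check.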
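Once it is created, the ServiceMonitor we defined is discovered by the Prometheus instance, which can then scrape its metrics.
We can also run queries to observe some of the my-host Pod's metrics.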
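As a sketch of such a query from the command line, the Prometheus HTTP API can be called with `curl`. This assumes Prometheus is reachable on `localhost:9090`, for example through a port-forward to the `prometheus-stack-kube-prom-prometheus` Service (name inferred from the `externalUrl` in the Prometheus object). Because the ServiceMonitor sets `jobLabel: job`, the job name comes from the Service's `job` label, `my-host-exporter`. The helper name `query_prom` is hypothetical.

```shell
# Sketch: run an instant PromQL query against the Prometheus HTTP API.
query_prom() {
  # $1 = PromQL expression.
  # --get together with --data-urlencode sends it URL-encoded in the query string.
  curl -s --get "http://localhost:9090/api/v1/query" --data-urlencode "query=$1"
}

# Usage (in another terminal first):
#   kubectl port-forward svc/prometheus-stack-kube-prom-prometheus 9090:9090
# then:
#   query_prom 'up{job="my-host-exporter"}'
```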
Rules
To add rules, the Operator provides a PrometheusRule CRD, which registers new rules with the Prometheus instance.
Template examples
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    release: prometheus-stack  # ensures the Prometheus instance can find and register the rule
  name: my-host-rules
spec:
  groups:
  - name: api
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{$labels.instance}} down"
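After the PrometheusRule is applied with `kubectl apply -f`, one way to confirm that Prometheus actually loaded it is to look for the alert name in the rules API. This is a rough sketch: it assumes Prometheus is reachable on `localhost:9090` (for example via a port-forward) and does a plain substring match on the JSON response rather than parsing it. The helper name `rule_loaded` is hypothetical.

```shell
# Sketch: succeed if an alert name shows up in Prometheus's loaded rules.
rule_loaded() {
  # $1 = alert name, e.g. InstanceDown
  # Exit status 0 if the name appears anywhere in the /api/v1/rules response.
  curl -s "http://localhost:9090/api/v1/rules" | grep -q "$1"
}

# Usage:
#   rule_loaded InstanceDown && echo "rule is registered"
```

It can take a short while after applying the manifest before the Operator reloads Prometheus and the rule appears.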