提升 Prometheus 的可用性:一文读懂Thanos 多集群监控
介绍
-
https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos
-
https://github.com/particuleio/terraform-kubernetes-addons/tree/main/modules/aws
Kubernetes技术栈
-
Prometheus:收集度量标准
-
告警管理器:根据指标查询向各种提供者发送警报
-
Grafana:可视化豪华仪表板
Thanos,它来了
-
Thanos Store
-
Thanos Sidecar
-
Thanos Query
多集群架构
-
一个观察者集群[3]
-
一个被观察集群[4]
.
├── env_tags.yaml
├── eu-west-1
│ ├── clusters
│ │ └── observer
│ │ ├── eks
│ │ │ ├── kubeconfig
│ │ │ └── terragrunt.hcl
│ │ ├── eks-addons
│ │ │ └── terragrunt.hcl
│ │ └── vpc
│ │ └── terragrunt.hcl
│ └── region_values.yaml
└── eu-west-3
├── clusters
│ └── observee
│ ├── cluster_values.yaml
│ ├── eks
│ │ ├── kubeconfig
│ │ └── terragrunt.hcl
│ ├── eks-addons
│ │ └── terragrunt.hcl
│ └── vpc
│ └── terragrunt.hcl
└── region_values.yaml
-
Grafana启用
-
Thanos边车上传到特定的桶
kube-prometheus-stack = {
enabled = true
allowed_cidrs = dependency.vpc.outputs.private_subnets_cidr_blocks
thanos_sidecar_enabled = true
thanos_bucket_force_destroy = true
extra_values = <<-EXTRA_VALUES
grafana:
deploymentStrategy:
type: Recreate
ingress:
enabled: true
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: "letsencrypt"
hosts:
- grafana.${local.default_domain_suffix}
tls:
- secretName: grafana.${local.default_domain_suffix}
hosts:
- grafana.${local.default_domain_suffix}
persistence:
enabled: true
storageClassName: ebs-sc
accessModes:
- ReadWriteOnce
size: 1Gi
prometheus:
prometheusSpec:
replicas: 1
retention: 2d
retentionSize: "10GB"
ruleSelectorNilUsesHelmValues: false
serviceMonitorSelectorNilUsesHelmValues: false
podMonitorSelectorNilUsesHelmValues: false
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: ebs-sc
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
EXTRA_VALUES
-
这个CA将被进入sidecar的被观察集群所信任
-
为Thanos querier组件生成TLS证书,这些组件将查询被观察集群
-
Thanos组件全部部署完成
-
查询前端,作为Grafana的数据源端点
-
存储网关用于查询观察者桶
-
Query将对存储网关和其他查询器执行查询
-
配置了TLS的Thanos查询器对每个被观察集群进行查询
thanos-tls-querier = {
"observee" = {
enabled = true
default_global_requests = true
default_global_limits = false
stores = [
"thanos-sidecar.${local.default_domain_suffix}:443"
]
}
}
thanos-storegateway = {
"observee" = {
enabled = true
default_global_requests = true
default_global_limits = false
bucket = "thanos-store-pio-thanos-observee"
region = "eu-west-3"
}
-
Thanos这边就是上传给观察者特定的桶
-
Thanos边车与TLS客户端认证的入口对象一起发布,并信任观察者集群CA
kube-prometheus-stack = {
enabled = true
allowed_cidrs = dependency.vpc.outputs.private_subnets_cidr_blocks
thanos_sidecar_enabled = true
thanos_bucket_force_destroy = true
extra_values = <<-EXTRA_VALUES
grafana:
enabled: false
prometheus:
thanosIngress:
enabled: true
ingressClassName: nginx
annotations:
cert-manager.io/cluster-issuer: "letsencrypt"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
nginx.ingress.kubernetes.io/auth-tls-secret: "monitoring/thanos-ca"
hosts:
- thanos-sidecar.${local.default_domain_suffix}
paths:
- /
tls:
- secretName: thanos-sidecar.${local.default_domain_suffix}
hosts:
- thanos-sidecar.${local.default_domain_suffix}
prometheusSpec:
replicas: 1
retention: 2d
retentionSize: "6GB"
ruleSelectorNilUsesHelmValues: false
serviceMonitorSelectorNilUsesHelmValues: false
podMonitorSelectorNilUsesHelmValues: false
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: ebs-sc
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
EXTRA_VALUES
-
Thanos压缩器来管理这个特定集群的下采样
thanos = {
enabled = true
bucket_force_destroy = true
trusted_ca_content = dependency.thanos-ca.outputs.thanos_ca
extra_values = <<-EXTRA_VALUES
compactor:
retentionResolution5m: 90d
query:
enabled: false
queryFrontend:
enabled: false
storegateway:
enabled: false
EXTRA_VALUES
}
再深入一点
kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 120m
kube-prometheus-stack-grafana-c8768466b-rd8wm 2/2 Running 0 120m
kube-prometheus-stack-kube-state-metrics-5cf575d8f8-x59rd 1/1 Running 0 120m
kube-prometheus-stack-operator-6856b9bb58-hdrb2 1/1 Running 0 119m
kube-prometheus-stack-prometheus-node-exporter-8hvmv 1/1 Running 0 117m
kube-prometheus-stack-prometheus-node-exporter-cwlfd 1/1 Running 0 120m
kube-prometheus-stack-prometheus-node-exporter-rsss5 1/1 Running 0 120m
kube-prometheus-stack-prometheus-node-exporter-rzgr9 1/1 Running 0 120m
prometheus-kube-prometheus-stack-prometheus-0 3/3 Running 1 120m
thanos-compactor-74784bd59d-vmvps 1/1 Running 0 119m
thanos-query-7c74db546c-d7bp8 1/1 Running 0 12m
thanos-query-7c74db546c-ndnx2 1/1 Running 0 12m
thanos-query-frontend-5cbcb65b57-5sx8z 1/1 Running 0 119m
thanos-query-frontend-5cbcb65b57-qjhxg 1/1 Running 0 119m
thanos-storegateway-0 1/1 Running 0 119m
thanos-storegateway-1 1/1 Running 0 118m
thanos-storegateway-observee-storegateway-0 1/1 Running 0 12m
thanos-storegateway-observee-storegateway-1 1/1 Running 0 11m
thanos-tls-querier-observee-query-dfb9f79f9-4str8 1/1 Running 0 29m
thanos-tls-querier-observee-query-dfb9f79f9-xsq24 1/1 Running 0 29m
kubectl -n monitoring get ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
kube-prometheus-stack-grafana <none> grafana.thanos.teks-tg.clusterfrak-dynamics.io k8s-ingressn-ingressn-afa0a48374-f507283b6cd101c5.elb.eu-west-1.amazonaws.com 80, 443 123m
kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 39m
kube-prometheus-stack-kube-state-metrics-5cf575d8f8-ct292 1/1 Running 0 39m
kube-prometheus-stack-operator-6856b9bb58-4cngc 1/1 Running 0 39m
kube-prometheus-stack-prometheus-node-exporter-bs4wp 1/1 Running 0 39m
kube-prometheus-stack-prometheus-node-exporter-c57ss 1/1 Running 0 39m
kube-prometheus-stack-prometheus-node-exporter-cp5ch 1/1 Running 0 39m
kube-prometheus-stack-prometheus-node-exporter-tnqvq 1/1 Running 0 39m
kube-prometheus-stack-prometheus-node-exporter-z2p49 1/1 Running 0 39m
kube-prometheus-stack-prometheus-node-exporter-zzqp7 1/1 Running 0 39m
prometheus-kube-prometheus-stack-prometheus-0 3/3 Running 1 39m
thanos-compactor-7576dcbcfc-6pd4v 1/1 Running 0 38m
kubectl -n monitoring get ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
kube-prometheus-stack-thanos-gateway nginx thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io k8s-ingressn-ingressn-95903f6102-d2ce9013ac068b9e.elb.eu-west-3.amazonaws.com 80, 443 40m
k -n monitoring logs -f thanos-tls-querier-observee-query-687dd88ff5-nzpdh
level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io:443 extLset="{cluster="pio-thanos-observee", prometheus="monitoring/kube-prometheus-stack-prometheus", prometheus_replica="prometheus-kube-prometheus-stack-prometheus-0"}"
kubectl -n monitoring port-forward thanos-tls-querier-observee-query-687dd88ff5-nzpdh 10902
kubectl -n monitoring port-forward thanos-query-7c74db546c-d7bp8 10902
-
观察者把本地Thanos聚集
-
我们的存储网关(一个用于远程观测者集群,一个用于本地观测者集群)
-
本地TLS查询器,它可以查询被观察的sidecar
在Grafana可视化
总结
-
https://docs.google.com/document/d/1H47v7WfyKkSLMrR8_iku6u9VB73WrVzBHb2SB6dL9_g/edit#heading=h.2v27snv0lsur
-
https://github.com/particuleio/teks
-
https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos/eu-west-1/clusters/observer
-
https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos/eu-west-3/clusters/observee
-
https://thanos.io/tip/operating/cross-cluster-tls-communication.md/
发表评论