Interview question: A Prometheus metrics-monitoring strategy for complex service dependencies in a microservice architecture

1. Exposing metrics from each service

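Integrate a Prometheus client library in every microservice: add a Prometheus client library (such as the Prometheus Java client or the Prometheus Go client) to A, B, C, D and the other microservices, so that each service exposes its own performance metrics (response time, error rate, etc.) in a format Prometheus can scrape. In a Java application, for example, Micrometer can be used with its Prometheus registry to define a response-time metric as shown below. The timer is exported as service_response_time_seconds_count, _sum and _max, timer.record(...) wraps the code being measured, and registry.scrape() returns the text that the /metrics endpoint should serve. (From Micrometer 1.13 on, the Prometheus classes live in io.micrometer.prometheusmetrics instead of io.micrometer.prometheus.)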

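import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
Timer timer = Timer.builder("service_response_time")
   .description("Response time of the service")
   .register(registry);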
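The Java snippet above only registers the timer. As a rough Python counterpart (an assumption: it uses the official prometheus_client library rather than Micrometer), the sketch below times a block of work with a Histogram and renders the text exposition format that Prometheus scrapes. handle_request is a hypothetical stand-in for real request handling.

```python
import time

from prometheus_client import CollectorRegistry, Histogram, generate_latest

# Private registry so the example does not touch the library's global default registry.
registry = CollectorRegistry()

# Rough counterpart of the Micrometer timer; "_seconds" follows Prometheus naming conventions.
response_time = Histogram(
    "service_response_time_seconds",
    "Response time of the service",
    registry=registry,
)


def handle_request() -> str:
    """Hypothetical request handler; its duration is recorded by the histogram."""
    with response_time.time():
        time.sleep(0.01)  # stand-in for real work
        return "ok"


if __name__ == "__main__":
    handle_request()
    # This text is what the service's /metrics endpoint would return to Prometheus.
    print(generate_latest(registry).decode())
```

The output contains one `_bucket` line per bucket boundary plus `_count` and `_sum`, which are the series the PromQL queries later in this answer work with.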

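Define metric labels: attach meaningful labels to every metric so that it identifies both the services and the dependency between them. For example, give the response-time metric source_service and destination_service labels, so each series records which service called which. In a Python application: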

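from prometheus_client import Counter, Histogram

# Latency of outbound calls, labelled with caller and callee
response_time_histogram = Histogram(
    'http_response_time_seconds',
    'Response time of HTTP requests',
    ['source_service', 'destination_service']
)

# Request/error counters for the error-rate query (exported with a _total suffix)
request_counter = Counter('service_request', 'Outbound requests', ['source_service', 'destination_service'])
error_counter = Counter('service_error', 'Failed outbound requests', ['source_service', 'destination_service'])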
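Defining the histogram is only half the job; each outbound call also has to be recorded under the right label pair. The sketch below wraps a downstream call and observes its duration even when the call raises. timed_call is a hypothetical helper (in a real service it would wrap the HTTP or RPC client), and a private registry is used so the example is self-contained.

```python
import time
from typing import Callable, TypeVar

from prometheus_client import CollectorRegistry, Histogram

T = TypeVar("T")

registry = CollectorRegistry()

response_time_histogram = Histogram(
    "http_response_time_seconds",
    "Response time of HTTP requests",
    ["source_service", "destination_service"],
    registry=registry,
)


def timed_call(source: str, destination: str, call: Callable[[], T]) -> T:
    """Run `call` and record its duration under the (caller, callee) label pair."""
    child = response_time_histogram.labels(
        source_service=source, destination_service=destination
    )
    start = time.perf_counter()
    try:
        return call()
    finally:
        # Record even if the downstream call raises, so failures are timed too.
        child.observe(time.perf_counter() - start)


# Example: service B calls service C (the lambda stands in for the real request).
result = timed_call("service_b", "service_c", lambda: "response from C")
```

Label values should come from a small, bounded set such as service names; putting request IDs or user IDs into labels creates one time series per value and can overwhelm Prometheus.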

2. Prometheus configuration

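Scrape configuration: in Prometheus's prometheus.yml, define a scrape job for each microservice's metrics endpoint (Prometheus requests the /metrics path unless metrics_path says otherwise). For example: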

scrape_configs:
  - job_name: 'service_a'
    static_configs:
      - targets: ['service-a:9090']
  - job_name: 'service_b'
    static_configs:
      - targets: ['service-b:9091']
  - job_name: 'service_c'
    static_configs:
      - targets: ['service-c:9092']
  - job_name: 'service_d'
    static_configs:
      - targets: ['service-d:9093']
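Each target in scrape_configs must actually answer HTTP requests for /metrics on the configured port. A minimal sketch with prometheus_client and the standard-library WSGI server follows; the library's own start_http_server(port) is the usual shortcut. The counter name and the use of port 0 (pick any free port) are illustrative assumptions. A real service would pass the port from its scrape target, such as 9090 for service-a, and bind to an address Prometheus can reach (e.g. 0.0.0.0).

```python
import threading
from urllib.request import urlopen
from wsgiref.simple_server import WSGIRequestHandler, WSGIServer, make_server

from prometheus_client import CollectorRegistry, Counter, make_wsgi_app

registry = CollectorRegistry()
requests_total = Counter(
    "service_request", "Requests handled by this service", registry=registry
)


class QuietHandler(WSGIRequestHandler):
    """Suppress the per-request access log that wsgiref writes to stderr."""

    def log_message(self, format, *args):
        pass


def start_metrics_server(port: int, host: str = "127.0.0.1") -> WSGIServer:
    """Serve the registry over HTTP in a background thread (port 0 = any free port)."""
    server = make_server(host, port, make_wsgi_app(registry), handler_class=QuietHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    requests_total.inc()
    server = start_metrics_server(0)
    body = urlopen(f"http://127.0.0.1:{server.server_port}/metrics").read().decode()
    print(body)
    server.shutdown()
    server.server_close()
```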

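Correlated metrics: use PromQL, Prometheus's query language, to derive per-dependency metrics. For example, the average response time of calls from service B to service C is the rate of the histogram's _sum series divided by the rate of its _count series (the _bucket series are for quantiles via histogram_quantile, not for averages):

sum by (source_service, destination_service) (rate(http_response_time_seconds_sum{source_service="service_b", destination_service="service_c"}[5m])) / sum by (source_service, destination_service) (rate(http_response_time_seconds_count{source_service="service_b", destination_service="service_c"}[5m]))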
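A histogram's average latency over a window is the increase of its _sum divided by the increase of its _count. The sketch below does that arithmetic by hand on two made-up snapshots; rate() additionally handles counter resets and extrapolation, which are ignored here. It also reimplements, in simplified form, the linear interpolation that PromQL's histogram_quantile applies to cumulative _bucket counts, as used in a query such as histogram_quantile(0.95, sum by (le, source_service, destination_service) (rate(http_response_time_seconds_bucket[5m]))). All numbers are invented for illustration.

```python
import math
from typing import List, Tuple


def average_latency(sum_t0: float, sum_t1: float, count_t0: float, count_t1: float) -> float:
    """Average duration over a window: increase of _sum / increase of _count.

    rate(x_sum[5m]) / rate(x_count[5m]) reduces to this ratio because the
    window length cancels out (counter resets and extrapolation ignored).
    """
    requests = count_t1 - count_t0
    if requests <= 0:
        return math.nan  # no traffic in the window; PromQL's 0/0 is also NaN
    return (sum_t1 - sum_t0) / requests


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Simplified model of PromQL's histogram_quantile.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper bound
    and ending with the +Inf bucket, like the le-labelled _bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] in this simplified version")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First finite bucket whose cumulative count reaches the rank.
    b = next((i for i, (_, c) in enumerate(buckets[:-1]) if c >= rank), len(buckets) - 1)
    if b == len(buckets) - 1:
        return buckets[-2][0]  # rank lies in the +Inf bucket: highest finite bound
    start, before = (0.0, 0.0) if b == 0 else buckets[b - 1]
    end, cumulative = buckets[b]
    # Assume observations are spread evenly within the bucket and interpolate.
    return start + (end - start) * (rank - before) / (cumulative - before)


print(average_latency(120.0, 150.0, 1000, 1200))  # 30 s of latency over 200 requests
print(histogram_quantile(0.9, [(0.1, 50), (0.25, 80), (0.5, 95), (math.inf, 100)]))
```

The interpolation is why quantile accuracy depends on the bucket boundaries: a p90 that lands in the 0.25 to 0.5 s bucket can only be estimated, not measured exactly.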

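Error rate of service B's calls to services C and D combined (service_error_total and service_request_total are counters carrying the same source_service and destination_service labels; adding by (destination_service) to both sums gives one ratio per callee instead):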

sum(rate(service_error_total{source_service="service_b", destination_service=~"service_c|service_d"}[5m])) / sum(rate(service_request_total{source_service="service_b", destination_service=~"service_c|service_d"}[5m]))
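Dividing the summed error rate by the summed request rate weights each callee by its traffic, which differs from averaging per-callee ratios. The sketch below uses invented counter increases over one 5-minute window to compare the per-destination ratios with the combined ratio the query returns.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (source_service, destination_service)

# Invented counter increases over one 5-minute window.
errors: Dict[Key, float] = {
    ("service_b", "service_c"): 5,
    ("service_b", "service_d"): 40,
}
requests: Dict[Key, float] = {
    ("service_b", "service_c"): 1000,
    ("service_b", "service_d"): 200,
}


def per_destination_error_rate(errs: Dict[Key, float], reqs: Dict[Key, float],
                               source: str) -> Dict[str, float]:
    """Like adding by (destination_service) to both sums: one ratio per callee."""
    return {
        dst: errs.get((src, dst), 0) / n
        for (src, dst), n in reqs.items()
        if src == source and n > 0
    }


def combined_error_rate(errs: Dict[Key, float], reqs: Dict[Key, float],
                        source: str, destinations: List[str]) -> float:
    """Like the query above: total errors / total requests across the callees."""
    keys = [(source, d) for d in destinations]
    total_requests = sum(reqs.get(k, 0) for k in keys)
    total_errors = sum(errs.get(k, 0) for k in keys)
    return total_errors / total_requests if total_requests else float("nan")


print(per_destination_error_rate(errors, requests, "service_b"))
print(combined_error_rate(errors, requests, "service_b", ["service_c", "service_d"]))
```

Here C fails 0.5% of calls and D fails 20%, yet the combined rate is 3.75% because C carries most of the traffic. A naive average of the two ratios would report about 10%. Both views are useful: the combined rate reflects what B experiences, and the per-callee rates point at the failing dependency.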

3. Grafana visualization

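Dashboards: import a ready-made dashboard suited to microservice monitoring, or build a custom one. First add Prometheus as a data source in Grafana, then use the PromQL queries above in dashboard panels to show performance between services: for example, one panel for the response-time trend of A calling B, and another for the error rate of B calling C and D.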
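Dependency graph: Prometheus has no built-in notion of topology, but the source_service and destination_service labels define the edges of the call graph. A graph-style panel (for example Grafana's Node Graph panel, or a third-party diagram plugin) can render those edges from Prometheus data, with suitable transformations, and overlay performance metrics on them, making the call relationships and their health visible at a glance.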
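One way to obtain the edges for such a graph (a sketch, not a feature of any particular Grafana plugin) is to query the Prometheus HTTP API endpoint /api/v1/query with an expression that aggregates request rate per caller/callee pair and to convert the vector result into edges. The sketch parses a hard-coded response in the API's JSON shape (the values are invented) and emits Graphviz DOT text. Fetching from a live server is left out, and edges_from_vector and to_dot are hypothetical helper names.

```python
import json
from typing import Dict, Tuple

# Invented /api/v1/query response for:
#   sum by (source_service, destination_service) (rate(http_response_time_seconds_count[5m]))
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"source_service": "service_a", "destination_service": "service_b"},
             "value": [1700000000.0, "12.5"]},
            {"metric": {"source_service": "service_b", "destination_service": "service_c"},
             "value": [1700000000.0, "8"]},
            {"metric": {"source_service": "service_b", "destination_service": "service_d"},
             "value": [1700000000.0, "4"]},
        ],
    },
})


def edges_from_vector(body: str) -> Dict[Tuple[str, str], float]:
    """Map (caller, callee) to requests per second from an instant-vector result."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return {
        (s["metric"]["source_service"], s["metric"]["destination_service"]): float(s["value"][1])
        for s in payload["data"]["result"]
    }


def to_dot(edges: Dict[Tuple[str, str], float]) -> str:
    """Render the edges as a Graphviz digraph labelled with request rates."""
    lines = ["digraph services {"]
    for (src, dst), rps in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{rps:.1f} req/s"];')
    lines.append("}")
    return "\n".join(lines)


print(to_dot(edges_from_vector(SAMPLE_RESPONSE)))
```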

4. Alerting rules

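Define alerting rules in Prometheus: write the rules in a rules file such as rules.yml and list that file under rule_files in prometheus.yml. For example, the following rule fires when the average response time of B calling C stays above 500 ms (0.5, since latency metrics are conventionally recorded in seconds):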

groups:
- name: service_dependencies_alerts
  rules:
  - alert: HighResponseTime
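    expr: sum by (source_service, destination_service) (rate(http_response_time_seconds_sum{source_service="service_b", destination_service="service_c"}[5m])) / sum by (source_service, destination_service) (rate(http_response_time_seconds_count{source_service="service_b", destination_service="service_c"}[5m])) > 0.5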
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High response time in service B -> C"
      description: "The average response time from service B to service C is over 500ms"
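The for: 2m clause means the expression must stay true for two minutes of rule evaluations before the alert fires; until then it is only pending, and a single evaluation where it is false resets it. The sketch below is a simplified model of that state machine for one alert series, assuming a 60-second evaluation interval (the real interval comes from evaluation_interval); alert_states is a hypothetical helper, not Prometheus code.

```python
from typing import List, Optional


def alert_states(values: List[float], threshold: float,
                 for_seconds: float, interval: float) -> List[str]:
    """Simplified pending/firing model for one alert series.

    values[i] is the expression value at evaluation i, taken every `interval`
    seconds. The alert is "pending" while the condition holds, becomes
    "firing" once it has held for at least `for_seconds`, and returns to
    "inactive" at any evaluation where the condition is false.
    """
    states: List[str] = []
    active_since: Optional[float] = None
    for i, value in enumerate(values):
        now = i * interval
        if value > threshold:
            if active_since is None:
                active_since = now
            states.append("firing" if now - active_since >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# Average latency (seconds) at successive evaluations against the 0.5 s threshold.
print(alert_states([0.6, 0.6, 0.6, 0.4], threshold=0.5, for_seconds=120, interval=60))
```

A brief latency spike that clears within two minutes therefore never reaches Alertmanager, which is the main reason to set for on latency alerts.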

Integrate with Alertmanager: configure Prometheus to send its alerts to Alertmanager, which routes them to the chosen channels (email, Slack, etc.). Add the Alertmanager configuration to prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']

Then configure the concrete receivers (notification channels) in Alertmanager itself.
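Once the receivers are configured, one way to check routing end to end without waiting for a real incident is to post a synthetic alert to Alertmanager's v2 API (POST /api/v2/alerts, a JSON list of alerts with labels, annotations and optional startsAt/endsAt). The sketch builds such a payload for the HighResponseTime alert above. The alertmanager:9093 address is taken from the configuration above, build_test_alert and make_request are hypothetical helpers, and the actual send is left as a comment.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request


def build_test_alert(minutes: int = 5) -> list:
    """One synthetic alert that resolves by itself after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "HighResponseTime",
            "severity": "warning",
            "source_service": "service_b",
            "destination_service": "service_c",
        },
        "annotations": {
            "summary": "High response time in service B -> C",
            "description": "Synthetic test alert, safe to ignore.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def make_request(alerts: list, base_url: str = "http://alertmanager:9093") -> Request:
    """Prepare (but do not send) the POST to Alertmanager's v2 alerts endpoint."""
    return Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it for real: urllib.request.urlopen(make_request(build_test_alert()))
```

Because the synthetic alert carries the same labels as the real rule, it exercises the same Alertmanager routes and receivers.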
