An Analysis of the Jaeger Tracing Sampling Algorithms

2019-04-23  0d1b415a365b

Let's start with the interface.

public interface Sampler {
  /**
   * @param operation The operation name set on the span
   * @param id The traceId on the span
   * @return whether or not the new trace should be sampled
   */
  SamplingStatus sample(String operation, long id);

  /**
   * Release any resources used by the sampler.
   */
  void close();
}

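Two methods, and the comments already say it all, so I won't belabor them. close is almost always an empty implementation, because there is really nothing that needs closing.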


There are six official implementations. Let's go through them one by one.

ConstSampler

  public static final String TYPE = "const";
  public ConstSampler(boolean decision) {
    this.decision = decision;
    ...
  }

Constructing a ConstSampler requires a decision argument.

  public SamplingStatus sample(String operation, long id) {
    return SamplingStatus.of(decision, tags);
  }

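The sample method simply returns this decision. It is a boolean, so it is all or nothing: either every trace is sampled or none is. It is only really used in test environments.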

RateLimitingSampler

  public static final String TYPE = "ratelimiting";
  public RateLimitingSampler(double maxTracesPerSecond) {
    this.maxTracesPerSecond = maxTracesPerSecond;
    double maxBalance = maxTracesPerSecond < 1.0 ? 1.0 : maxTracesPerSecond;
    this.rateLimiter = new RateLimiter(maxTracesPerSecond, maxBalance);
    ...
  }

One parameter, maxTracesPerSecond: the maximum number of traces sampled per second.

  public SamplingStatus sample(String operation, long id) {
    return SamplingStatus.of(this.rateLimiter.checkCredit(1.0), tags);
  }

Whether sample decides to sample depends on whether the leaky bucket still holds at least 1.0 of credit.

  private final double creditsPerNanosecond;
  private final Clock clock;
  private double balance;
  private double maxBalance;
  private long lastTick;
  public RateLimiter(double creditsPerSecond, double maxBalance, Clock clock) {
    this.clock = clock;
    this.balance = maxBalance;
    this.maxBalance = maxBalance;
    this.creditsPerNanosecond = creditsPerSecond / 1.0e9;
  }

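Field notes:
-- clock: wraps the System time calls
-- balance: the amount of resource in the bucket (the analysis below calls the resource "water" to make it easier to follow)
-- maxBalance: the maximum capacity of the bucket
-- creditsPerNanosecond: the rate at which requests are let through
-- lastTick: the time of the last refill
The bucket's capacity is the per-second sample count (raised to 1.0 when maxTracesPerSecond is below 1.0, as the RateLimitingSampler constructor above shows), and requests are then let through evenly at a per-nanosecond rate. "Evenly" here does not mean strictly one sampled request every 1 / creditsPerNanosecond nanoseconds; see the analysis below.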

  public boolean checkCredit(double itemCost) {
    long currentTime = clock.currentNanoTicks();
    // time elapsed since the last refill
    double elapsedTime = currentTime - lastTick;
    // record this refill time
    lastTick = currentTime;
    // refill: remaining water + the credit accrued over the elapsed time
    balance += elapsedTime * creditsPerNanosecond;
    // if the refill would overflow the bucket, cap it at full
    if (balance > maxBalance) {
      balance = maxBalance;
    }
    // enough water left to cover the cost: let it out, i.e. admit the request
    if (balance >= itemCost) {
      balance -= itemCost;
      return true;
    }
    // otherwise nothing leaks out and the request is rejected
    return false;
  }

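This leaky bucket is a bit odd; it actually looks more like a token bucket. Water is added over and over, and a request only gets through once enough has accumulated for one release. Also note that this RateLimiter is not thread-safe. Jaeger gets away with using it directly because it is called from inside a synchronized method. If you want to lift it out for some other purpose, keep that in mind.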
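To make the refill-then-spend behavior concrete, here is a minimal sketch of the same logic. It is not Jaeger's class: TokenBucketSketch is a made-up name, the current time is passed in by hand instead of coming from a Clock, lastTick is seeded explicitly, and checkCredit is marked synchronized on the assumption that you want to reuse it outside an already synchronized caller.

```java
// Sketch of the refill-then-spend logic with a hand-driven clock (hypothetical names).
class TokenBucketSketch {
  private final double creditsPerNanosecond;
  private final double maxBalance;
  private double balance;
  private long lastTick;

  TokenBucketSketch(double creditsPerSecond, double maxBalance, long startNanos) {
    this.creditsPerNanosecond = creditsPerSecond / 1.0e9;
    this.maxBalance = maxBalance;
    this.balance = maxBalance; // start full, so an initial burst is allowed
    this.lastTick = startNanos;
  }

  // synchronized so that the sketch can be shared between threads
  synchronized boolean checkCredit(double itemCost, long nowNanos) {
    double elapsed = nowNanos - lastTick;
    lastTick = nowNanos;
    // refill with the credit accrued since the last call, capped at capacity
    balance = Math.min(maxBalance, balance + elapsed * creditsPerNanosecond);
    if (balance >= itemCost) {
      balance -= itemCost;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // 2 credits per second, capacity 2
    TokenBucketSketch bucket = new TokenBucketSketch(2.0, 2.0, 0L);
    System.out.println(bucket.checkCredit(1.0, 0L));           // burst: admitted
    System.out.println(bucket.checkCredit(1.0, 0L));           // burst: admitted
    System.out.println(bucket.checkCredit(1.0, 0L));           // bucket empty: rejected
    System.out.println(bucket.checkCredit(1.0, 250_000_000L)); // about 0.5 credit: rejected
    System.out.println(bucket.checkCredit(1.0, 750_000_000L)); // about 1.5 credits: admitted
  }
}
```

The run shows why "evenly" is only approximate: a full bucket lets a burst through at once, and after that requests are admitted only as fast as credit accrues.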

ProbabilisticSampler

Random sampling: every request has a certain probability of being sampled, like tossing a coin.

  public ProbabilisticSampler(double samplingRate) {
    if (samplingRate < 0.0 || samplingRate > 1.0) {
      throw new IllegalArgumentException(
          "The sampling rate must be greater than 0.0 and less than 1.0");
    }

    this.samplingRate = samplingRate;
    this.positiveSamplingBoundary = (long) (((1L << 63) - 1) * samplingRate);
    this.negativeSamplingBoundary = (long) ((1L << 63) * samplingRate);
    ...
  }

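The constructor takes the sampling rate. The positive boundary is the largest long times the sampling rate ((1L << 63) - 1 is Long.MAX_VALUE), and the negative boundary is the smallest long times the sampling rate (1L << 63 is Long.MIN_VALUE). What are they for?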

  public SamplingStatus sample(String operation, long id) {
    if (id > 0) {
      return SamplingStatus.of(id <= this.positiveSamplingBoundary, tags);
    } else {
      return SamplingStatus.of(id >= this.negativeSamplingBoundary, tags);
    }
  }

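The id is compared against the sampling boundary to decide whether to sample. As the interface Javadoc says, the id passed to sample is the trace ID, and it is a random long generated via ThreadLocalRandom, so the comparison amounts to a coin toss, which gives random sampling.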
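The boundary arithmetic is easy to check in isolation. The sketch below copies the two boundary expressions and the comparison from the snippets above into a made-up class, BoundarySketch, and counts how many pseudo-random ids fall inside the boundaries. It uses a seeded java.util.Random only to keep the run repeatable; that is an assumption of the sketch, not how Jaeger generates ids.

```java
import java.util.Random;

// Sketch of the boundary comparison; class and method names are made up for illustration.
class BoundarySketch {
  // (1L << 63) - 1 wraps around to Long.MAX_VALUE
  static long positiveBoundary(double rate) {
    return (long) (((1L << 63) - 1) * rate);
  }

  // 1L << 63 is Long.MIN_VALUE
  static long negativeBoundary(double rate) {
    return (long) ((1L << 63) * rate);
  }

  static boolean isSampled(long id, double rate) {
    if (id > 0) {
      return id <= positiveBoundary(rate);
    }
    return id >= negativeBoundary(rate);
  }

  // Fraction of `count` pseudo-random ids that would be sampled at `rate`.
  static double sampledFraction(double rate, int count, long seed) {
    Random random = new Random(seed);
    int sampled = 0;
    for (int i = 0; i < count; i++) {
      if (isSampled(random.nextLong(), rate)) {
        sampled++;
      }
    }
    return (double) sampled / count;
  }

  public static void main(String[] args) {
    System.out.println(positiveBoundary(0.5)); // 2^62
    System.out.println(negativeBoundary(0.5)); // -2^62
    System.out.println(sampledFraction(0.1, 100_000, 42L)); // should land close to 0.1
  }
}
```

Because the ids are spread uniformly over the whole long range, the share that falls between the two boundaries is roughly the sampling rate, which is all the "coin toss" needs.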

GuaranteedThroughputSampler

As the name suggests, it samples while guaranteeing a certain throughput. What does that mean? Read on.

  public static final String TYPE = "lowerbound";

  private ProbabilisticSampler probabilisticSampler;
  private RateLimitingSampler lowerBoundSampler;
  private Map<String, Object> tags;

  public GuaranteedThroughputSampler(double samplingRate, double lowerBound) {
    ....
    probabilisticSampler = new ProbabilisticSampler(samplingRate);
    lowerBoundSampler = new RateLimitingSampler(lowerBound);
  }

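Clearly this is a composite sampler. It holds both a probabilistic sampler and a leaky-bucket (rate-limiting) sampler, whose behavior was covered above.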

  public synchronized SamplingStatus sample(String operation, long id) {
    SamplingStatus probabilisticSamplingStatus = probabilisticSampler.sample(operation, id);
    SamplingStatus lowerBoundSamplingStatus = lowerBoundSampler.sample(operation, id);

    if (probabilisticSamplingStatus.isSampled()) {
      return probabilisticSamplingStatus;
    }

    return SamplingStatus.of(lowerBoundSamplingStatus.isSampled(), tags);
  }

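Looking at sample: both samplers are consulted on every call. If the probabilistic sampler picks the trace, its result is returned. Traces it did not pick fall through to the leaky-bucket sampler, which admits them at up to lowerBound traces per second. This way every kind of operation (a class of calls, usually an endpoint or method name) is pretty much guaranteed to get sampled.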
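The combination can be sketched in a few lines. LowerBoundSketch below is a hypothetical, self-contained stand-in: it inlines the probability check and the credit bucket, takes the current time as an argument, and drops the tags, so it only illustrates the decision logic and not Jaeger's actual classes.

```java
// Sketch of "probabilistic OR lower bound"; all names here are hypothetical.
class LowerBoundSketch {
  private final double samplingRate;
  private final double creditsPerNanosecond;
  private final double maxBalance;
  private double balance;
  private long lastTick;

  LowerBoundSketch(double samplingRate, double lowerBoundPerSecond, long startNanos) {
    this.samplingRate = samplingRate;
    this.creditsPerNanosecond = lowerBoundPerSecond / 1.0e9;
    this.maxBalance = Math.max(1.0, lowerBoundPerSecond);
    this.balance = maxBalance;
    this.lastTick = startNanos;
  }

  synchronized boolean sample(long id, long nowNanos) {
    boolean byProbability = id > 0
        ? id <= (long) (Long.MAX_VALUE * samplingRate)
        : id >= (long) (Long.MIN_VALUE * samplingRate);
    // Evaluated unconditionally, so a credit is spent even when byProbability is true.
    boolean byLowerBound = spendOneCredit(nowNanos);
    return byProbability || byLowerBound;
  }

  private boolean spendOneCredit(long nowNanos) {
    double elapsed = nowNanos - lastTick;
    lastTick = nowNanos;
    balance = Math.min(maxBalance, balance + elapsed * creditsPerNanosecond);
    if (balance >= 1.0) {
      balance -= 1.0;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // Probability 0.0: only the lower bound of 1 trace per second can admit anything.
    LowerBoundSketch sampler = new LowerBoundSketch(0.0, 1.0, 0L);
    System.out.println(sampler.sample(12345L, 0L));             // true: the bucket starts full
    System.out.println(sampler.sample(12345L, 0L));             // false: no credit left
    System.out.println(sampler.sample(12345L, 2_000_000_000L)); // true: refilled after 2 seconds
  }
}
```

With the probability set to zero, the rate limiter alone still lets about one trace per second through, which is the "guaranteed" part of the name.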

PerOperationSampler

Again the name says it all: one sampler per operation?

  private final int maxOperations;
  private final HashMap<String, GuaranteedThroughputSampler> operationNameToSampler;
  private ProbabilisticSampler defaultSampler;
  private double lowerBound;

  public PerOperationSampler(int maxOperations, OperationSamplingParameters strategies) {
    this(maxOperations,
         new HashMap<String, GuaranteedThroughputSampler>(),
         new ProbabilisticSampler(strategies.getDefaultSamplingProbability()),
         strategies.getDefaultLowerBoundTracesPerSecond());
    update(strategies);
  }

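Fields:
-- maxOperations: the maximum number of operations supported
-- operationNameToSampler: the operation-to-sampler cache map; note that every value ends up being the composite GuaranteedThroughputSampler
-- defaultSampler: the default sampler
-- lowerBound: the lowerBound of GuaranteedThroughputSampler, that is, the traces-per-second setting of the leaky-bucket sampler inside the composite sampler
Constructor parameters:
-- maxOperations: the maximum number of operations supported
-- strategies: the per-operation sampling strategies, which must be built in advance
The constructor first initializes the operation-to-sampler map and the default probabilistic sampler.
The update method then initializes everything from the data passed in through strategies. Each entry ends up as a GuaranteedThroughput composite sampler and is put into the map. I won't go into the details here.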

  public synchronized SamplingStatus sample(String operation, long id) {
    GuaranteedThroughputSampler sampler = operationNameToSampler.get(operation);
    if (sampler != null) {
      return sampler.sample(operation, id);
    }

    if (operationNameToSampler.size() < maxOperations) {
      sampler = new GuaranteedThroughputSampler(defaultSampler.getSamplingRate(), lowerBound);
      operationNameToSampler.put(operation, sampler);
      return sampler.sample(operation, id);
    }

    return defaultSampler.sample(operation, id);
  }

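The sampler is looked up in the map by operation, and it makes the sampling decision. If no sampler was created for the operation in advance, a new one is created and cached, as long as the map stays under the maximum size. Beyond that, nothing is cached and no exception is thrown; the default sampler decides instead.
You will not need this sampler in your own code.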
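The lookup-create-fallback flow can be tried out without any Jaeger types. In the sketch below, PerOperationSketch is a made-up class and plain LongPredicate functions stand in for the real samplers; the factory and the default sampler are parameters invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongPredicate;
import java.util.function.Supplier;

// Sketch of the capped per-operation cache; LongPredicate stands in for a real sampler.
class PerOperationSketch {
  private final int maxOperations;
  private final Supplier<LongPredicate> perOperationFactory; // hypothetical factory
  private final LongPredicate defaultSampler;
  private final Map<String, LongPredicate> samplers = new HashMap<>();

  PerOperationSketch(int maxOperations,
                     Supplier<LongPredicate> perOperationFactory,
                     LongPredicate defaultSampler) {
    this.maxOperations = maxOperations;
    this.perOperationFactory = perOperationFactory;
    this.defaultSampler = defaultSampler;
  }

  synchronized boolean sample(String operation, long id) {
    LongPredicate sampler = samplers.get(operation);
    if (sampler == null) {
      if (samplers.size() >= maxOperations) {
        // Cache is full: do not cache, do not throw, just use the default.
        return defaultSampler.test(id);
      }
      sampler = perOperationFactory.get();
      samplers.put(operation, sampler);
    }
    return sampler.test(id);
  }

  synchronized int trackedOperations() {
    return samplers.size();
  }

  public static void main(String[] args) {
    // Per-operation samplers always sample; the default never does, to make the fallback visible.
    PerOperationSketch sampler = new PerOperationSketch(2, () -> id -> true, id -> false);
    System.out.println(sampler.sample("GET /users", 1L));  // true: new sampler cached
    System.out.println(sampler.sample("GET /orders", 1L)); // true: new sampler cached
    System.out.println(sampler.sample("GET /items", 1L));  // false: cache full, default used
    System.out.println(sampler.trackedOperations());       // 2
  }
}
```

The cap matters because operation names come from the application: without it, a service that generates unbounded operation names would grow the map without limit.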

RemoteControlledSampler

The remotely controlled sampler. This is the default sampler. It is configured through strategies.json on jaeger-collector, pushed down to jaeger-agent, and then pulled from jaeger-agent by jaeger-client.

  void updateSampler() {
    SamplingStrategyResponse response;
    try {
      response = manager.getSamplingStrategy(serviceName);
      metrics.samplerRetrieved.inc(1);
    } catch (SamplingStrategyErrorException e) {
      metrics.samplerQueryFailure.inc(1);
      return;
    }

    if (response.getOperationSampling() != null) {
      updatePerOperationSampler(response.getOperationSampling());
    } else {
      updateRateLimitingOrProbabilisticSampler(response);
    }
  }

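The updateSampler method is the main thing to look at. Once a RemoteControlledSampler is started, it periodically pulls the configuration from jaeger-agent and updates its sampler, and the configured sampler rules then decide whether to sample.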
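The poll-and-swap pattern can be sketched with standard JDK classes. Everything in PollingSamplerSketch below is hypothetical: a Supplier stands in for the call to jaeger-agent, a LongPredicate stands in for the configured sampler, and a ScheduledExecutorService drives the polling. It is only meant to show the shape of the idea, including keeping the previous sampler when a fetch fails, as updateSampler does by returning early.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongPredicate;
import java.util.function.Supplier;

// Sketch of a sampler that is replaced periodically; all names are made up.
class PollingSamplerSketch implements AutoCloseable {
  private final Supplier<LongPredicate> strategySource; // stand-in for the agent call
  private final ScheduledExecutorService poller;
  private volatile LongPredicate current;

  PollingSamplerSketch(Supplier<LongPredicate> strategySource,
                       LongPredicate initial,
                       long periodMillis) {
    this.strategySource = strategySource;
    this.current = initial;
    this.poller = Executors.newSingleThreadScheduledExecutor(runnable -> {
      Thread thread = new Thread(runnable, "sampler-poller");
      thread.setDaemon(true); // do not keep the JVM alive just for polling
      return thread;
    });
    this.poller.scheduleAtFixedRate(this::refresh, periodMillis, periodMillis,
        TimeUnit.MILLISECONDS);
  }

  // Fetch the latest strategy; on failure keep whatever sampler is already in place.
  void refresh() {
    try {
      current = strategySource.get();
    } catch (RuntimeException e) {
      // a real client would count the failure in a metric here
    }
  }

  boolean sample(long id) {
    return current.test(id);
  }

  @Override
  public void close() {
    poller.shutdownNow(); // this sampler owns a thread, so close has real work to do
  }

  public static void main(String[] args) {
    try (PollingSamplerSketch sampler =
             new PollingSamplerSketch(() -> id -> true, id -> false, 60_000L)) {
      System.out.println(sampler.sample(1L)); // false: still on the initial sampler
      sampler.refresh();                      // simulate one poll
      System.out.println(sampler.sample(1L)); // true: the fetched sampler took over
    }
  }
}
```

The volatile field lets the polling thread swap the sampler while request threads keep calling sample, without any of them blocking on the fetch.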
