JVM源码分析之线程局部缓存TLAB

2017-05-04 本文已影响4156人美团Java

简书占小狼
转载请注明原创出处，谢谢！

背景

介绍TLAB之前先思考一个问题：
创建对象时，需要在堆上申请指定大小的内存，如果同时有大量线程申请内存的话，可以通过锁机制或者指针碰撞的方式确保不会申请到同一块内存，在JVM运行中，内存分配是一个极其频繁的动作，这种方式势必会降低性能。

因此，在Hotspot 1.6的实现中引入了TLAB技术。

什么是TLAB

TLAB全称ThreadLocalAllocBuffer，是线程的一块私有内存，如果设置了虚拟机参数 -XX:UseTLAB，在线程初始化时，同时也会申请一块指定大小的内存，只给当前线程使用，这样每个线程都单独拥有一个Buffer，如果需要分配内存，就在自己的Buffer上分配，这样就不存在竞争的情况，可以大大提升分配效率，当Buffer容量不够的时候，再重新从Eden区域申请一块继续使用，这个申请动作还是需要原子操作的。

TLAB的目的是在为新对象分配内存空间时，让每个Java应用线程能在使用自己专属的分配指针来分配空间，均摊对GC堆（eden区）里共享的分配指针做更新而带来的同步开销。

TLAB只是让每个线程有私有的分配指针，但底下存对象的内存空间还是给所有线程访问的，只是其它线程无法在这个区域分配而已。当一个TLAB用满（分配指针top撞上分配极限end了），就新申请一个TLAB，而在老TLAB里的对象还留在原地什么都不用管——它们无法感知自己是否是曾经从TLAB分配出来的，而只关心自己是在eden里分配的。

TLAB实现

实现位于/Users/zhanjun/openjdk/hotspot/src/share/vm/memory/threadLocalAllocBuffer.hpp

// ThreadLocalAllocBuffer: a descriptor for thread-local storage used by
// the threads for allocation.
//            It is thread-private at any time, but maybe multiplexed over
//            time across multiple threads. The park()/unpark() pair is
//            used to make it avaiable for such multiplexing.
class ThreadLocalAllocBuffer: public CHeapObj<mtThread> {
  friend class VMStructs;
private:
  HeapWord* _start;                              // address of TLAB
  HeapWord* _top;                                // address after last allocation
  HeapWord* _pf_top;                             // allocation prefetch watermark
  HeapWord* _end;                                // allocation end (excluding alignment_reserve)
  size_t    _desired_size;                       // desired size   (including alignment_reserve)
  size_t    _refill_waste_limit;                 // hold onto tlab if free() is larger than this

TLAB简单来说本质上就是三个指针：start，top 和 end，每个线程都会从Eden分配一大块空间，例如说100KB，作为自己的TLAB，其中 start 和 end 是占位用的，标识出 eden 里被这个 TLAB 所管理的区域，卡住eden里的一块空间不让其它线程来这里分配。而 top 就是里面的分配指针，一开始指向跟 start 同样的位置，然后逐渐分配，直到再要分配下一个对象就会撞上 end 的时候就会触发一次 TLAB refill，refill过程后续会解释。

_desired_size 是指TLAB的内存大小。

_refill_waste_limit 是指最大的浪费空间，假设为5KB，通俗一点讲就是：
1、假如当前TLAB已经分配96KB，还剩下4KB，但是现在new了一个对象需要6KB的空间，显然TLAB的内存不够了，这时可以简单的重新申请一个TLAB，原先的TLAB交给Eden管理，这时只浪费4KB的空间，在_refill_waste_limit 之内。
2、假如当前TLAB已经分配90KB，还剩下10KB，现在new了一个对象需要11KB，显然TLAB的内存不够了，这时就不能简单的抛弃当前TLAB，这11KB会被安排到Eden区进行申请。

在Java代码中执行new Thread()的时候，会触发以下代码


// The first routine called by a new Java thread
void JavaThread::run() {
  // initialize thread-local alloc buffer related fields
  this->initialize_tlab();
  // used to test validitity of stack trace backs
  this->record_base_of_stack_pointer();
  // Record real stack base and size.
  this->record_stack_base_and_size();
  // Initialize thread local storage; set before calling MutexLocker
  this->initialize_thread_local_storage();
  this->create_stack_guard_pages();
  this->cache_global_variables();

JavaThread的run方法中，第一步就是调用this->initialize_tlab();方法初始化TLAB，initialize_tlab实现如下：

void initialize_tlab() {
    if (UseTLAB) {
      tlab().initialize();
    }
  }

其中tlab()返回的就是一个ThreadLocalAllocBuffer对象，调用initialize()初始化TLAB，实现如下：

void ThreadLocalAllocBuffer::initialize() {
  initialize(NULL,                    // start
             NULL,                    // top
             NULL);                   // end

  set_desired_size(initial_desired_size());

  // Following check is needed because at startup the main (primordial)
  // thread is initialized before the heap is.  The initialization for
  // this thread is redone in startup_initialization below.
  if (Universe::heap() != NULL) {
    size_t capacity   = Universe::heap()->tlab_capacity(myThread()) / HeapWordSize;
    double alloc_frac = desired_size() * target_refills() / (double) capacity;
    _allocation_fraction.sample(alloc_frac);
  }

  set_refill_waste_limit(initial_refill_waste_limit());

  initialize_statistics();
}

1、设置当前TLAB的_desired_size，该值通过initial_desired_size()方法计算；
2、设置当前TLAB的_refill_waste_limit，该值通过initial_refill_waste_limit()方法计算；
3、初始化一些统计字段，如_number_of_refills、_fast_refill_waste、_slow_refill_waste、_gc_waste和_slow_allocations；

字段_desired_size的计算过程分析

size_t ThreadLocalAllocBuffer::initial_desired_size() {
  size_t init_sz;

  if (TLABSize > 0) {
    init_sz = MIN2(TLABSize / HeapWordSize, max_size());
  } else if (global_stats() == NULL) {
    // Startup issue - main thread initialized before heap initialized.
    init_sz = min_size();
  } else {
    // Initial size is a function of the average number of allocating threads.
    unsigned nof_threads = global_stats()->allocating_threads_avg();

    init_sz  = (Universe::heap()->tlab_capacity(myThread()) / HeapWordSize) /
                      (nof_threads * target_refills());
    init_sz = align_object_size(init_sz);
    init_sz = MIN2(MAX2(init_sz, min_size()), max_size());
  }
  return init_sz;
}

TLABSize在argument模块中默认会设置大小为 256 * K，也可以通过JVM参数选择进行设置，不过即使设置了也会和一个最大值max_size进行比较，然后取一个较小值，其中max_size计算如下：

const size_t ThreadLocalAllocBuffer::max_size() {
  // TLABs can't be bigger than we can fill with a int[Integer.MAX_VALUE].
  // This restriction could be removed by enabling filling with multiple arrays.
  // If we compute that the reasonable way as
  //    header_size + ((sizeof(jint) * max_jint) / HeapWordSize)
  // we'll overflow on the multiply, so we do the divide first.
  // We actually lose a little by dividing first,
  // but that just makes the TLAB  somewhat smaller than the biggest array,
  // which is fine, since we'll be able to fill that.

  size_t unaligned_max_size = typeArrayOopDesc::header_size(T_INT) +
                              sizeof(jint) *
                              ((juint) max_jint / (size_t) HeapWordSize);
  return align_size_down(unaligned_max_size, MinObjAlignment);
}

这里明确说明了TLAB的大小不能超过可以容纳 int[Integer.MAX_VALUE]，有点疑惑，why?

字段_refill_waste_limit计算分析

size_t initial_refill_waste_limit()  { 
    return desired_size() / TLABRefillWasteFraction; 
}

计算逻辑很简单，其中TLABRefillWasteFraction默认 64

内存分配

new一个对象，假设需要1K的大小，我们一步一步看看是如何分配的。

instanceOop instanceKlass::allocate_instance(TRAPS) {
  assert(!oop_is_instanceMirror(), "wrong allocation path");
  bool has_finalizer_flag = has_finalizer(); // Query before possible GC
  int size = size_helper();  // Query before forming handle.
  KlassHandle h_k(THREAD, as_klassOop());
  instanceOop i;
  i = (instanceOop)CollectedHeap::obj_allocate(h_k, size, CHECK_NULL);
  if (has_finalizer_flag && !RegisterFinalizersAtInit) {
    i = register_finalizer(i, CHECK_NULL);
  }
  return i;
}

对象的内存分配入口为instanceKlass::allocate_instance()，通过CollectedHeap::obj_allocate()方法在堆内存上进行分配

oop CollectedHeap::obj_allocate(KlassHandle klass, int size, TRAPS) {
  debug_only(check_for_valid_allocation_state());
  assert(!Universe::heap()->is_gc_active(), "Allocation during gc not allowed");
  assert(size >= 0, "int won't convert to size_t");
  HeapWord* obj = common_mem_allocate_init(klass, size, CHECK_NULL);
  post_allocation_setup_obj(klass, obj);
  NOT_PRODUCT(Universe::heap()->check_for_bad_heap_word_value(obj, size));
  return (oop)obj;
}

其中common_mem_allocate_init()方法最终会调用CollectedHeap::common_mem_allocate_noinit()方法，实现如下：

HeapWord* CollectedHeap::common_mem_allocate_noinit(KlassHandle klass, size_t size, TRAPS) {

  // Clear unhandled oops for memory allocation.  Memory allocation might
  // not take out a lock if from tlab, so clear here.
  CHECK_UNHANDLED_OOPS_ONLY(THREAD->clear_unhandled_oops();)

  if (HAS_PENDING_EXCEPTION) {
    NOT_PRODUCT(guarantee(false, "Should not allocate with exception pending"));
    return NULL;  // caller does a CHECK_0 too
  }

  HeapWord* result = NULL;
  if (UseTLAB) {
    result = allocate_from_tlab(klass, THREAD, size);
    if (result != NULL) {
      assert(!HAS_PENDING_EXCEPTION,
             "Unexpected exception, will result in uninitialized storage");
      return result;
    }
  }
  bool gc_overhead_limit_was_exceeded = false;
  result = Universe::heap()->mem_allocate(size,
                                          &gc_overhead_limit_was_exceeded);

根据UseTLAB的值，决定是否在TLAB上进行内存分配，如果JVM参数中没有手动取消UseTLAB，会调用allocate_from_tlab()在TLAB上尝试分配，因为可能存在分配失败的情况，比如TLAB容量不足，看下allocate_from_tlab()的实现：

HeapWord* CollectedHeap::allocate_from_tlab(KlassHandle klass, Thread* thread, size_t size) {
  assert(UseTLAB, "should use UseTLAB");

  HeapWord* obj = thread->tlab().allocate(size);
  if (obj != NULL) {
    return obj;
  }
  // Otherwise...
  return allocate_from_tlab_slow(klass, thread, size);

从上述实现可以看出，先会尝试调用ThreadLocalAllocBuffer 的 allocate 方法，如果返回为空，再执行allocate_from_tlab_slow()进行分配，从这个方法命名可以看出这是比较慢的分配路径。

ThreadLocalAllocBuffer 的 allocate 方法实现如下：

inline HeapWord* ThreadLocalAllocBuffer::allocate(size_t size) {
  invariants();
  HeapWord* obj = top();
  if (pointer_delta(end(), obj) >= size) {
    // successful thread-local allocation
#ifdef ASSERT
    // Skip mangling the space corresponding to the object header to
    // ensure that the returned space is not considered parsable by
    // any concurrent GC thread.
    size_t hdr_size = oopDesc::header_size();
    Copy::fill_to_words(obj + hdr_size, size - hdr_size, badHeapWordVal);
#endif // ASSERT
    // This addition is safe because we know that top is
    // at least size below end, so the add can't wrap.
    set_top(obj + size);

    invariants();
    return obj;
  }
  return NULL;
}

通过判断当前TLAB的剩余容量是否大于需要分配的大小，来决定分配结果，如果当前剩余容量不够，就返回NULL，表示分配失败。

慢分配allocate_from_tlab_slow()实现如下：

HeapWord* CollectedHeap::allocate_from_tlab_slow(KlassHandle klass, Thread* thread, size_t size) {

  // Retain tlab and allocate object in shared space if
  // the amount free in the tlab is too large to discard.
  if (thread->tlab().free() > thread->tlab().refill_waste_limit()) {
    thread->tlab().record_slow_allocation(size);
    return NULL;
  }

  // Discard tlab and allocate a new one.
  // To minimize fragmentation, the last TLAB may be smaller than the rest.
  size_t new_tlab_size = thread->tlab().compute_size(size);

  thread->tlab().clear_before_allocation();

  if (new_tlab_size == 0) {
    return NULL;
  }

  // Allocate a new TLAB...
  HeapWord* obj = Universe::heap()->allocate_new_tlab(new_tlab_size);
  if (obj == NULL) {
    return NULL;
  }

 // 删除了一些代码
  thread->tlab().fill(obj, obj + size, new_tlab_size);
  return obj;
}

1、如果当前TLAB的剩余容量大于浪费阈值，就不在当前TLAB分配，直接在共享的Eden区进行分配，并且记录慢分配的内存大小；
2、如果剩余容量小于浪费阈值，说明可以丢弃当前TLAB了；
3、通过allocate_new_tlab()方法，从eden新分配一块裸的空间出来（这一步可能会失败），如果失败说明eden没有足够空间来分配这个新TLAB，就会触发YGC。

申请好新的TLAB内存之后，执行TLAB的fill()方法，实现如下：

void ThreadLocalAllocBuffer::fill(HeapWord* start,
                                  HeapWord* top,
                                  size_t    new_size) {
  _number_of_refills++;
  if (PrintTLAB && Verbose) {
    print_stats("fill");
  }
  assert(top <= start + new_size - alignment_reserve(), "size too small");
  initialize(start, top, start + new_size - alignment_reserve());

  // Reset amount of internal fragmentation
  set_refill_waste_limit(initial_refill_waste_limit());
}

包括下述几个动作：
1、统计refill的次数
2、初始化重新申请到的内存块
3、将当前TLAB抛弃（retire）掉，这个过程中最重要的动作是将TLAB末尾尚未分配给Java对象的空间（浪费掉的空间）分配成一个假的“filler object”（目前是用int[]作为filler object）。这是为了保持GC堆可以线性parse（heap parseability）用的。