The Key to Flash-Sale Sniping Tools: Optimizing Java's ThreadLocalRandom for High Concurrency - alidg

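In this article, we explore various tricks for optimizing a Java random number generator for high throughput and low latency. The tricks include more efficient object allocation, more efficient memory access, eliminating unnecessary indirection, and mechanical sympathy. (This matters a great deal for flash-sale sniping in distributed environments.)
Java 7 introduced ThreadLocalRandom to improve the throughput of random number generation in highly contended environments.
The rationale behind ThreadLocalRandom is simple: instead of sharing one global Random instance, each thread maintains its own version of Random. This, in turn, reduces contention and therefore improves throughput.
Since this is such a simple idea, we should be able to roll up our sleeves and implement something like ThreadLocalRandom with similar performance, right?
Let's see.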

First Attempt
In our first attempt, we'll use something as simple as a ThreadLocal<Random>:

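import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;

import static org.openjdk.jmh.annotations.Mode.Throughput;

// A few annotations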
public class RandomBenchmark {

    private final Random random = new Random();
    private final ThreadLocal<Random> simpleThreadLocal = ThreadLocal.withInitial(Random::new);

    @Benchmark
    @BenchmarkMode(Throughput)
    public int regularRandom() {
        return random.nextInt();
    }

    @Benchmark
    @BenchmarkMode(Throughput)
    public int simpleThreadLocal() {
        return simpleThreadLocal.get().nextInt();
    }

    @Benchmark
    @BenchmarkMode(Throughput)
    public int builtinThreadLocal() {
        return ThreadLocalRandom.current().nextInt();
    }

    // omitted
}

In this benchmark, we compare a shared Random, our own simple ThreadLocal<Random>, and the built-in ThreadLocalRandom:

Benchmark                             Mode  Cnt           Score          Error  Units
RandomBenchmark.builtinThreadLocal   thrpt   40  1023676193.004 ± 26617584.814  ops/s
RandomBenchmark.regularRandom        thrpt   40     7487301.035 ±   244268.309  ops/s
RandomBenchmark.simpleThreadLocal    thrpt   40   382674281.696 ± 13197821.344  ops/s

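ThreadLocalRandom generates about 1 billion random numbers per second. In the results above, that is roughly 2.7 times the throughput of the simple ThreadLocal<Random> and about 137 times that of the shared Random.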

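Linear Congruential Method
By far the most widely used random number generators today are linear congruential pseudorandom number generators, introduced by D. H. Lehmer in 1949.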
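For reference, the linear congruential recurrence is usually written as shown below. The constants listed are the ones documented for java.util.Random; they are not stated in this article and are included here only to connect the formula to the code that follows.

```latex
X_{n+1} = (a X_n + c) \bmod m,
\qquad a = \mathtt{0x5DEECE66D},\quad c = \mathtt{0xB},\quad m = 2^{48}
```

Because m is a power of two, the modulo can be computed with a bit mask, which is what the `& mask` in the Java code does.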
(See the original article for the details of the algorithm.) The Java implementation looks like this:

protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
}

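Since multiple threads can potentially update the seed value concurrently, we need some form of synchronization to coordinate concurrent access. Here, Java takes a lock-free approach with the help of atomics.
Basically, each thread tries to change the seed to a new value atomically via compareAndSet. If a thread fails to do so, it retries the same process until it can successfully commit the update.

When contention is high, the number of CAS failures increases. This is the main reason for the poor performance of Random in concurrent environments.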
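The retry behaviour described above can be observed with a small sketch. The class `SharedLcg`, its `failedCas` counter, and the `hammer` helper are hypothetical names made up for this illustration; the constants are the ones documented for java.util.Random. It is a minimal demonstration under those assumptions, not the JDK's code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class: one LCG state shared by all threads, updated via CAS.
class SharedLcg {
    // Constants documented for java.util.Random.
    static final long MULTIPLIER = 0x5DEECE66DL;
    static final long ADDEND = 0xBL;
    static final long MASK = (1L << 48) - 1;

    final AtomicLong seed;
    final LongAdder failedCas = new LongAdder(); // counts lost CAS races

    SharedLcg(long initialSeed) {
        this.seed = new AtomicLong(initialSeed & MASK);
    }

    // One step of the linear congruential recurrence.
    static long step(long s) {
        return (s * MULTIPLIER + ADDEND) & MASK;
    }

    int next(int bits) {
        while (true) {
            long oldSeed = seed.get();
            long nextSeed = step(oldSeed);
            if (seed.compareAndSet(oldSeed, nextSeed)) {
                return (int) (nextSeed >>> (48 - bits));
            }
            failedCas.increment(); // another thread won the race, so retry
        }
    }

    // Runs `threads` threads that each call next(32) `callsPerThread` times
    // and returns the final seed once all of them have finished.
    static long hammer(SharedLcg lcg, int threads, int callsPerThread) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < callsPerThread; j++) {
                    lcg.next(32);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return lcg.seed.get();
    }

    public static void main(String[] args) {
        SharedLcg lcg = new SharedLcg(42);
        hammer(lcg, 4, 1_000_000);
        System.out.println("failed CAS attempts: " + lcg.failedCas.sum());
    }
}
```

Every successful CAS advances the state by exactly one step, so the final seed is the same regardless of how the threads interleave. Only the number of wasted retries varies from run to run, and it tends to grow with the number of competing threads.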

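No More CAS
In ThreadLocal-based implementations, the seed value is confined to each thread. Since there is no shared mutable state, we don't need atomics or any other form of synchronization: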

public class MyThreadLocalRandom extends Random {

    // omitted
    private static final ThreadLocal<MyThreadLocalRandom> threadLocal = 
        ThreadLocal.withInitial(MyThreadLocalRandom::new);

    private MyThreadLocalRandom() {}

    public static MyThreadLocalRandom current() {
        return threadLocal.get();
    }

    @Override
    protected int next(int bits) {
        seed = (seed * multiplier + addend) & mask;
        return (int) (seed >>> (48 - bits));
    }
}

If we run the same benchmark again:

Benchmark                             Mode  Cnt           Score          Error  Units
RandomBenchmark.builtinThreadLocal   thrpt   40  1023676193.004 ± 26617584.814  ops/s
RandomBenchmark.lockFreeThreadLocal  thrpt   40   695217843.076 ± 17455041.160  ops/s
RandomBenchmark.regularRandom        thrpt   40     7487301.035 ±   244268.309  ops/s
RandomBenchmark.simpleThreadLocal    thrpt   40   382674281.696 ± 13197821.344  ops/s

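MyThreadLocalRandom has almost twice the throughput of the simple ThreadLocal<Random> (about 1.8 times in the results above).
compareAndSet provides atomicity and memory-ordering guarantees that we simply don't need in a thread-confined context. Since those guarantees are costly and unnecessary here, removing them increases throughput substantially.
However, we're still behind the built-in ThreadLocalRandom!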
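The listing above omits the fields it relies on. One possible completion is sketched below under stated assumptions: the class name `ConfinedRandom`, the `SEEDER` counter used to give each thread a different starting point, and the seed scrambling in the constructor are all choices made for this illustration, not details taken from the article.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical completion of the thread-confined generator; names are assumptions.
class ConfinedRandom extends Random {
    private static final long serialVersionUID = 1L;

    // Constants documented for java.util.Random.
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    // Assumed seeding scheme: each new instance gets a different initial seed.
    private static final AtomicLong SEEDER = new AtomicLong(System.nanoTime());

    private static final ThreadLocal<ConfinedRandom> LOCAL =
        ThreadLocal.withInitial(
            () -> new ConfinedRandom(SEEDER.getAndAdd(0x9E3779B97F4A7C15L)));

    // Plain field: only the owning thread touches it, so no atomics are needed.
    private long seed;

    ConfinedRandom(long initialSeed) {
        // Same initial scrambling that java.util.Random applies to its seed.
        this.seed = (initialSeed ^ MULTIPLIER) & MASK;
    }

    static ConfinedRandom current() {
        return LOCAL.get();
    }

    @Override
    protected int next(int bits) {
        seed = (seed * MULTIPLIER + ADDEND) & MASK;
        return (int) (seed >>> (48 - bits));
    }

    public static void main(String[] args) {
        System.out.println(ConfinedRandom.current().nextInt());
    }
}
```

Because the recurrence and the scrambling match those of java.util.Random, an instance created with a given seed should produce the same sequence as `new Random(seed)`, only without the CAS loop. An instance must never be handed to another thread, since nothing protects the plain `seed` field.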

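Removing Indirection

As it turns out, each thread doesn't need its own separate and complete copy of Random. It only needs the latest seed value.
As of Java 8, these values have been added to the Thread class itself: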

/** The current seed for a ThreadLocalRandom */
@jdk.internal.vm.annotation.Contended("tlr")
long threadLocalRandomSeed;

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomProbe;

/** Secondary seed isolated from public ThreadLocalRandom sequence */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomSecondarySeed;

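Instead of holding something like a MyThreadLocalRandom, each Thread instance now maintains its current seed in the threadLocalRandomSeed field. As a result, the ThreadLocalRandom class becomes a singleton: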

/** The common ThreadLocalRandom */
static final ThreadLocalRandom instance = new ThreadLocalRandom();

Each time we call ThreadLocalRandom.current(), it lazily initializes threadLocalRandomSeed and then returns the singleton:

public static ThreadLocalRandom current() {
    if (U.getInt(Thread.currentThread(), PROBE) == 0)
        localInit();
    return instance;
}

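With MyThreadLocalRandom, every call to the current() factory method translates into a hash computation for the ThreadLocal instance and a lookup in the underlying hash table.
In contrast, with the new Java 8+ approach, all we have to do is read the threadLocalRandomSeed value directly and then update it.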
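User code cannot add fields to java.lang.Thread, but the idea can be imitated with a Thread subclass. In the sketch below, `SeededThread`, `ThreadFieldRandom`, and `generateOn` are hypothetical names, and the generator reuses the linear congruential step from the earlier sections purely for continuity. It only illustrates "state on the thread, stateless singleton generator" and is not how the JDK is written.

```java
// Hypothetical Thread subclass standing in for the seed field on java.lang.Thread.
class SeededThread extends Thread {
    long seed; // per-thread generator state, reached without any ThreadLocal lookup

    SeededThread(Runnable task, long initialSeed) {
        super(task);
        this.seed = initialSeed;
    }
}

// Stateless singleton: all mutable state lives on the current SeededThread.
final class ThreadFieldRandom {
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    private static final ThreadFieldRandom INSTANCE = new ThreadFieldRandom();

    private ThreadFieldRandom() {}

    static ThreadFieldRandom current() {
        return INSTANCE; // no hash computation, no map lookup
    }

    int nextInt() {
        // Assumption: the caller runs on a SeededThread; other threads fail the cast.
        SeededThread t = (SeededThread) Thread.currentThread();
        long next = (t.seed * MULTIPLIER + ADDEND) & MASK;
        t.seed = next;
        return (int) (next >>> 16);
    }

    // Helper for the demo: generates `count` values on a fresh SeededThread.
    static int[] generateOn(long initialSeed, int count) {
        int[] out = new int[count];
        SeededThread t = new SeededThread(() -> {
            for (int i = 0; i < count; i++) {
                out[i] = current().nextInt();
            }
        }, initialSeed);
        t.start();
        try {
            t.join(); // join() makes the worker's writes to `out` visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(generateOn(42L, 3)));
    }
}
```

The generator object carries no state of its own, so a single shared instance is safe. Each call reads and writes only the field of the thread it runs on.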

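Efficient Memory Access
To update the seed value, java.util.concurrent.ThreadLocalRandom needs to change the threadLocalRandomSeed state in the java.lang.Thread class. If we made that state public, then anyone could update threadLocalRandomSeed, which is not so good.
We could use reflection to update the non-public state, but just because we can doesn't mean we should!
As it turns out, ThreadLocalRandom can use the native method Unsafe.putLong to update the threadLocalRandomSeed state efficiently: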

/**
* The seed increment.
*/
private static final long GAMMA = 0x9e3779b97f4a7c15L;
private static final Unsafe U = Unsafe.getUnsafe();
private static final long SEED = U.objectFieldOffset(Thread.class, "threadLocalRandomSeed");

final long nextSeed() {
    Thread t; 
    long r; // read and update per-thread seed
    U.putLong(t = Thread.currentThread(), SEED, r = U.getLong(t, SEED) + GAMMA);

    return r;
}

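The putLong method writes the value r to a memory address relative to the current thread object. The memory offset has already been computed by calling another native method, Unsafe.objectFieldOffset.
Unlike reflection, all of these methods have native implementations and are very efficient.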
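The internal Unsafe class is not available to ordinary application code. A comparable read-then-write of a non-public field can be sketched with the supported java.lang.invoke.VarHandle API instead. This swaps in a different mechanism from the one the JDK uses, and `SeedHolder` and `VarHandleSeed` are hypothetical names standing in for Thread and ThreadLocalRandom.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical holder standing in for java.lang.Thread.
class SeedHolder {
    long seed; // non-public state we want to update from another class
}

final class VarHandleSeed {
    // Same increment as in the listing above.
    static final long GAMMA = 0x9e3779b97f4a7c15L;

    // Resolved once, much like the SEED offset is computed once in the JDK code.
    private static final VarHandle SEED;

    static {
        try {
            SEED = MethodHandles.lookup()
                .findVarHandle(SeedHolder.class, "seed", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private VarHandleSeed() {}

    // Plain (non-volatile) read and write, mirroring getLong/putLong.
    static long nextSeed(SeedHolder holder) {
        long r = (long) SEED.get(holder) + GAMMA;
        SEED.set(holder, r);
        return r;
    }

    public static void main(String[] args) {
        SeedHolder holder = new SeedHolder();
        System.out.println(Long.toHexString(nextSeed(holder)));
    }
}
```

The handle is looked up once in the static initializer, so the per-call cost is a plain field read and write rather than a reflective lookup.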

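False Sharing
CPU caches work in terms of cache lines. That is, the cache line is the unit of transfer between CPU caches and main memory.
Basically, processors tend to cache a few neighbouring values along with the requested one. This spatial-locality optimization usually improves throughput and memory access latency.
However, when two or more threads compete for the same cache line, multithreading can have a counterproductive effect.
To better understand this, let's suppose the following variables reside in the same cache line: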

public class Thread implements Runnable {
    private final long tid;
    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;

    // omitted
}

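Some threads are using tid, the thread ID, for some unknown purpose. Now, if we update threadLocalRandomSeed in our thread to generate random numbers, nothing bad should happen, right? It doesn't sound like a big deal, since some threads are reading tid while another thread is writing to an entirely different memory location.
Despite what we might think, because all of these values are in the same cache line, the reading threads will encounter cache misses, and the writer needs to flush its store buffer. This phenomenon, known as false sharing, degrades the performance of our multithreaded application.
To avoid false sharing, we can add some padding around the contended values. That way, each heavily contended value resides in its own cache line: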

public class Thread implements Runnable {
    private final long tid;
    private long p11, p12, p13, p14, p15, p16, p17 = 0; // one 64 bit long + 7 more => 64 Bytes

    long threadLocalRandomSeed;
    private long p21, p22, p23, p24, p25, p26, p27 = 0;

    int threadLocalRandomProbe;
    private long p31, p32, p33, p34, p35, p36, p37 = 0;

    int threadLocalRandomSecondarySeed;
    private long p41, p42, p43, p44, p45, p46, p47 = 0;

    // omitted
}

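In most modern processors, the cache line size is usually 64 or 128 bytes. On my machine it is 64 bytes, so I added 7 more dummy long values after the tid declaration.
Usually, these threadLocal* variables are updated from the same thread. So it's better to isolate only tid: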

public class Thread implements Runnable {
    private final long tid;
    private long p11, p12, p13, p14, p15, p16, p17 = 0;

    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;

    // omitted
}

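The reader threads won't encounter cache misses, and the writer doesn't need to flush its store buffer immediately, since these thread-local variables are not volatile.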
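The effect of padding can be explored with a rough experiment like the one below. `FalseSharingDemo` and its nested classes are hypothetical, the 64-byte line size is an assumption, and the JVM is free to lay out fields differently from their declaration order, so the padding may or may not take effect on a given JVM. Timings from a naive loop like this are only indicative; a harness such as JMH would be needed for trustworthy numbers.

```java
// Hypothetical experiment: two threads each increment their own counter.
class FalseSharingDemo {
    // Both counters are likely to share one cache line.
    static class Adjacent {
        volatile long a;
        volatile long b;
    }

    // Seven longs (56 bytes) between the counters, assuming 64-byte cache lines
    // and assuming the JVM keeps the declared field order.
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long b;
    }

    // Runs both tasks concurrently and returns the elapsed wall-clock time in nanoseconds.
    static long timeNanos(Runnable first, Runnable second) {
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    // Each field has a single writer, so the non-atomic ++ is safe here.
    static long runAdjacent(Adjacent c, int n) {
        return timeNanos(
            () -> { for (int i = 0; i < n; i++) c.a++; },
            () -> { for (int i = 0; i < n; i++) c.b++; });
    }

    static long runPadded(Padded c, int n) {
        return timeNanos(
            () -> { for (int i = 0; i < n; i++) c.a++; },
            () -> { for (int i = 0; i < n; i++) c.b++; });
    }

    public static void main(String[] args) {
        int n = 50_000_000;
        System.out.println("adjacent ns: " + runAdjacent(new Adjacent(), n));
        System.out.println("padded   ns: " + runPadded(new Padded(), n));
    }
}
```

The counters are declared volatile so that every increment is actually written to memory, which makes any cache-line ping-pong visible in the timings.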

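The Contended Annotation
The jdk.internal.vm.annotation.Contended annotation (sun.misc.Contended if you're on Java 8) is a hint for the JVM to isolate the annotated fields in order to avoid false sharing. So the following should now make more sense: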

/** The current seed for a ThreadLocalRandom */
@jdk.internal.vm.annotation.Contended("tlr")
long threadLocalRandomSeed;

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomProbe;

/** Secondary seed isolated from public ThreadLocalRandom sequence */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomSecondarySeed;

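With the help of the ContendedPaddingWidth tuning flag (-XX:ContendedPaddingWidth), we can control the padding width.
threadLocalRandomSecondarySeed is a seed used internally by the likes of ForkJoinPool and ConcurrentSkipListMap. Also, threadLocalRandomProbe indicates whether the current thread has initialized its seed.
In this article, we explored various tricks for optimizing an RNG for high throughput and low latency. The tricks include more efficient object allocation, more efficient memory access, eliminating unnecessary indirection, and mechanical sympathy.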
