Why Is 31 the Multiplier in Java's String Hash?

Published by 二進位制之路 on 2018-09-23

Original URL: https://juejin.im/post/5ba75d165188255c6a043b96

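The previous post gave a brief introduction to [the classic Times 33 hash algorithm]. This post continues the discussion of multiplier selection by analyzing the hash algorithm of the String class in Java 1.8.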

Source code of String.hashCode()

/** Cache the hash code for the string */
private int hash;

/** 
Returns a hash code for this string. The hash code for a String object is computed as 
 s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, 
n is the length of the string, and ^ indicates exponentiation. 
(The hash value of the empty string is zero.) 
*/
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

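As you can see, String's hash algorithm follows the same idea as Times 33, except that the multiplier is 31.

Here:

hash defaults to 0.
The h == 0 check implements caching: a nonzero hash field means the value has already been computed and can be returned directly.
The value.length > 0 check is there because the hash of the empty string is 0, so there is nothing to compute.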
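To make the recurrence concrete, here is a minimal sketch that recomputes the hash outside of String and compares it with the built-in result. The class name PolyHashDemo and the method polyHash are made up for this illustration; only String.hashCode() comes from the JDK.

```java
class PolyHashDemo {
    // Same recurrence as the loop in String.hashCode(), with the multiplier
    // passed in as a parameter (hypothetical helper, not JDK code).
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "abc";
        // Expanded form of the Javadoc formula for n = 3:
        // s[0]*31^2 + s[1]*31^1 + s[2]
        int expanded = 'a' * 31 * 31 + 'b' * 31 + 'c';
        System.out.println(polyHash(s, 31) == s.hashCode()); // true
        System.out.println(expanded == s.hashCode());        // true
    }
}
```

Passing a different multiplier to polyHash gives the family of hash functions that the experiment below compares.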

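Let the Data Speak

In the previous post we mentioned:

No one has given a sufficiently complete explanation of why this magic number 33 hashes more effectively than many other constants (prime or not). So Ralf S. Engelschall tried to explain it with his own approach. Testing every number from 1 to 256, he found that even numbers hash very poorly and are simply unusable. The remaining 128 odd numbers, except for 1, all perform about the same. These odd numbers all distribute well, filling the hash table to roughly 86%.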

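Judging by the hashing results (Chi^2 presumably refers to the chi-squared distribution), 33 is not necessarily the best value. But 17, 31, 33, 63, 127 and 129 have one clear advantage over the other odd numbers: because they differ from 16, 32, 64 and 128 by exactly 1, the multiplication can be replaced by a shift (e.g. 1 << 4 = 16) plus one addition or subtraction, which is faster.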
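The shift trick can be checked directly. The sketch below uses hypothetical helper names (times31, times33, times127) and only shows that the shifted forms give the same int result as the multiplication, including when the value overflows. It makes no claim about what a particular JIT compiler emits or how large the speed difference is on a given CPU.

```java
class ShiftMultiplyDemo {
    // 31 * h == 32 * h - h == (h << 5) - h
    static int times31(int h) { return (h << 5) - h; }

    // 33 * h == 32 * h + h == (h << 5) + h
    static int times33(int h) { return (h << 5) + h; }

    // 127 * h == 128 * h - h == (h << 7) - h
    static int times127(int h) { return (h << 7) - h; }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            // Both sides wrap around modulo 2^32, so they agree even on overflow.
            boolean same = times31(h) == 31 * h
                    && times33(h) == 33 * h
                    && times127(h) == 127 * h;
            System.out.println(h + " -> " + same);
        }
    }
}
```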

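Next, we use experimental data to see how even numbers, odd numbers, and the magic numbers 17, 31, 33, 63, 127 and 129 perform as hash multipliers, and to check Ralf S. Engelschall's claims.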

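Environment Setup

Personal laptop, Windows 7, dual-core 64-bit Intel Core i5 CPU.

Test data: all the words in the /usr/share/dict/words dictionary file from CentOS Linux release 7.5.1804.

The dictionary file was not present on the CentOS system, so it was installed with yum -y install words.

/usr/share/dict/words contains 479828 words; the file is a link whose target is linux.words.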

Computing Collision Rate and Hashing Time

Test code

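// Imports assumed for the methods of the HashConflictTester test class
// (JUnit 4 and Apache Commons IO are assumptions; the original omits them)
import java.io.IOException;
import java.io.InputStream;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.junit.Test;

/**
 * For every multiplier from 1 to 256, computes the hash collision rate and the
 * total hashing time over all words in /usr/share/dict/words.
 * 
 * @throws IOException if the word list cannot be read
 */
@Test
public void testHash() throws IOException {
	List<String> words = getWords();
	
	System.out.println();
	// CSV header for the report printed by computeConflictRate
	System.out.println("multiplier, conflictSize, conflictRate, timeCost, listSize, minHash, maxHash");
	for (int i = 1; i <= 256; i++) {
		computeConflictRate(words, i);
	}
}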

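/**
 * Reads all words from /usr/share/dict/words (loaded from the classpath as linux.words).
 * 
 * @return one list entry per line of the file
 * @throws IOException if the resource cannot be read
 */
private List<String> getWords() throws IOException {
	// Read the file; try-with-resources closes the stream when done
	try (InputStream is = HashConflictTester.class.getClassLoader().getResourceAsStream("linux.words")) {
		return IOUtils.readLines(is, "UTF-8");
	}
}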

/**
 * Computes the collision rate for one multiplier and prints one report line.
 * 
 * @param lines the words to hash
 * @param multiplier the multiplier used by the hash function
 */
private void computeConflictRate(List<String> lines, int multiplier) {
	// Compute the hashes and measure the elapsed time in milliseconds
	long startTime = System.currentTimeMillis();
	List<Integer> hashList = computeHashes(lines, multiplier);
	long timeCost = System.currentTimeMillis() - startTime;
	
	// Find the largest and smallest hash values
	Comparator<Integer> comparator = Integer::compare;
	int maxHash = hashList.parallelStream().max(comparator).get();
	int minHash = hashList.parallelStream().min(comparator).get();
	
	// Collect the distinct hash values
	Set<Integer> hashSet = hashList.parallelStream().collect(Collectors.toSet());
	
	// Collisions = number of words minus number of distinct hash values
	int conflictSize = lines.size() - hashSet.size();
	float conflictRate = conflictSize * 1.0f / lines.size();
	System.out.println(String.format("%s, %s, %s, %s, %s, %s, %s", multiplier, conflictSize, conflictRate, timeCost, lines.size(), minHash, maxHash));
}

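/**
 * Computes the hash value of every word using the given multiplier.
 * 
 * @param lines the words to hash
 * @param multiplier the multiplier used in place of 31
 * @return the hash values, in the same order as the input
 */
private List<Integer> computeHashes(List<String> lines, int multiplier) {
	// Same recurrence as String.hashCode(), with a configurable multiplier
	Function<String, Integer> hashFunction = x -> {
		int hash = 0;
		for (int i = 0; i < x.length(); i++) {
			hash = (multiplier * hash) + x.charAt(i);
		}
		return hash;
	};
	return lines.parallelStream().map(hashFunction).collect(Collectors.toList());
}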

Run the test method testHash() and, after a short wait, we get a test report.
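The collision count used in the report (number of words minus number of distinct hash values) can be seen on a tiny input. This is a minimal sketch with a made-up three-word list and hypothetical names (CollisionCountDemo, hash, collisions), not the article's test harness. "Aa" and "BB" are a well-known pair that collides under multiplier 31.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollisionCountDemo {
    // Polynomial hash with a configurable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Collisions = number of words minus number of distinct hash values.
    static int collisions(List<String> words, int multiplier) {
        Set<Integer> distinct = new HashSet<>();
        for (String w : words) {
            distinct.add(hash(w, multiplier));
        }
        return words.size() - distinct.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Aa", "BB", "C");
        // With 31: "Aa" -> 65*31 + 97 = 2112 and "BB" -> 66*31 + 66 = 2112
        System.out.println(collisions(words, 31)); // 1
        // With 33: "Aa" -> 2242 and "BB" -> 2244, so no collision
        System.out.println(collisions(words, 33)); // 0
    }
}
```

A single colliding pair says nothing about overall quality, which is why the test measures the rate over the whole dictionary.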

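Sorted by hash collision rate, descending

Sorting by hash collision rate in descending order gives the results below.

Analysis of results

Even multipliers almost all have high collision rates, with only a few exceptions.
Small multipliers, such as 1 to 20, also have fairly high collision rates.
Multipliers 1, 2 and 256 give uneven distributions. A Java hash value is a 32-bit int, so its range is [-2147483648, 2147483647].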
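The measurements do not say why 1 and 256 behave so badly. One plausible contributing factor, offered here as an assumption rather than as the author's analysis, follows from int arithmetic itself: with multiplier 1 the hash is just the sum of the character codes, and with multiplier 256 every character more than four positions from the end is multiplied by a multiple of 2^32 and vanishes on overflow. The sketch below, with a hypothetical class name and hand-picked words, shows both effects.

```java
class DegenerateMultiplierDemo {
    // Polynomial hash with a configurable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Multiplier 1: the hash is the plain sum of char codes,
        // so any two anagrams collide.
        System.out.println(hash("listen", 1) == hash("silent", 1)); // true

        // Multiplier 256: 256^4 = 2^32 wraps to 0 in int arithmetic, so only
        // the last four characters ("ding" in both words) affect the result.
        System.out.println(hash("understanding", 256) == hash("withstanding", 256)); // true

        // Multiplier 31 keeps the anagram pair apart.
        System.out.println(hash("listen", 31) == hash("silent", 31)); // false
    }
}
```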

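Sorted by hashing time, descending

Next we look at the multipliers that produced fewer than 1000 collisions. Sorting them by execution time in descending order gives the results below.

Analyzing 17, 31, 33, 63, 127 and 129

17 was already eliminated in the previous round.
63 takes comparatively long to compute.
31 and 33 have collision rates of 0.13% and 0.14% and execution times of 10 ms and 11 ms respectively, so in practice they are roughly on par.
127 and 129 have collision rates of 0.01% and 0.004% and execution times of 9 ms and 10 ms respectively.

Overall, 129 has a low execution time and the lowest collision rate, so it would seem to be the better choice?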

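Hash distribution

Divide the whole hash space [-2147483648, 2147483647] into 128 partitions and count the number of hash values that fall into each one, in order to observe how each multiplier distributes its values. Each partition covers 2^32 / 128 = 33554432 hash values.

The reason for counting by partition is that there are too many words: plotting every value produced a chart so dense that the multipliers could not be compared visually.

Code for computing the hash distribution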

@Test
public void testHashDistribution() throws IOException {
	int[] multipliers = {2, 17, 31, 33, 63, 127, 73, 133, 237, 161};
	List<String> words = getWords();
	for (int multiplier : multipliers) {
		List<Integer> hashList = computeHashes(words, multiplier);
		Map<Integer, Integer> hashMap = partition(hashList);
		System.out.println("\n" + multiplier + "\n,count");
		hashMap.forEach((x, y) -> System.out.println(x + "," + y));
	}
}

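/**
 * Splits the whole hash space into 128 equal partitions and counts the hash values in each.
 * 
 * @param hashs the hash values to count
 * @return partition index (0 to 127) mapped to the number of hash values in that partition
 */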
public static Map<Integer, Integer> partition(List<Integer> hashs) {
	// step = 2^32 / 128 = 33554432
	final int step = 33554432;
	List<Integer> nums = new ArrayList<>();
	Map<Integer, Integer> statistics = new LinkedHashMap<>();
	int start = 0;
	for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i += step) {
		final long min = i;
		final long max = min + step;
		int num = (int) hashs.parallelStream().filter(x -> x >= min && x < max).count();

		statistics.put(start++, num);
		nums.add(num);
	}

	// Sanity check against counting errors (only runs with assertions enabled, -ea)
	int hashNum = nums.stream().reduce((x, y) -> x + y).get();
	assert hashNum == hashs.size();

	return statistics;
}
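The method above scans the full list once per partition. As an alternative sketch, not the author's code, the partition index can be computed directly from each hash so that one pass is enough. The names SinglePassPartitionDemo, partitionIndex and the int[] return type are choices made for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class SinglePassPartitionDemo {
    static final int PARTITIONS = 128;
    // Width of one partition: 2^32 / 128 = 33554432
    static final long STEP = (1L << 32) / PARTITIONS;

    // Shift the hash into [0, 2^32) using long arithmetic, then divide by the
    // partition width to get an index in [0, 127].
    static int partitionIndex(int hash) {
        return (int) (((long) hash - Integer.MIN_VALUE) / STEP);
    }

    // Count the hashes per partition in a single pass over the list.
    static int[] partition(List<Integer> hashes) {
        int[] counts = new int[PARTITIONS];
        for (int h : hashes) {
            counts[partitionIndex(h)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> hashes = Arrays.asList(Integer.MIN_VALUE, -1, 0, Integer.MAX_VALUE);
        int[] counts = partition(hashes);
        // The four boundary values land in partitions 0, 63, 64 and 127.
        System.out.println(counts[0] + " " + counts[63] + " " + counts[64] + " " + counts[127]); // 1 1 1 1
    }
}
```

The counts match what the filter-based version produces, because both use half-open ranges of width 33554432 starting at Integer.MIN_VALUE.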