Analysis of hash value computation for Greenplum distribution keys

Posted by 努力爬呀爬 on 2021-11-01

Original URL: https://www.cnblogs.com/payapa/p/15493507.html

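Greenplum data distribution policies

Greenplum is an MPP (massively parallel processing) database made up of one master and multiple segments (a standby master can optionally be configured). Its data is spread across the segments according to the distribution policy set for each table.

Greenplum 6 offers three policies: random distribution, replicated distribution, and hash distribution.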

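Random distribution

Use the "DISTRIBUTED RANDOMLY" clause when creating the table.

With this policy, rows are distributed to the segments at random; even two completely identical rows may land on different segments. Random distribution spreads the data evenly across all segments (no data skew), but a join on such a table still has to redistribute the data by the join key, so this policy is rarely used in production.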

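Replicated distribution

Use the "DISTRIBUTED REPLICATED" clause when creating the table.

This policy sends the data to every segment, so each segment holds a full copy of the table. Joins against such a table need less data redistribution, but because every segment stores all rows, a large amount of duplicated data is produced. The policy is therefore suited to small tables.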

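Hash distribution

Use the "DISTRIBUTED BY (column, [...])" clause when creating the table.

This policy requires the user to specify which columns form the distribution key; if the table has a primary key, the distribution key must be a subset of the primary key columns. Greenplum computes a hash key from the distribution key values and then uses that hash key to decide which segment the row is assigned to. Users can pick a different distribution key for each table, based on the characteristics of the data and the expected analysis patterns, to get good storage balance and query performance.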

Hash flow

The call stack is pasted directly below; the analysis focuses on the directDispatchCalculateHash function:

Call stack


#0  cdbhashinit (h=0x2e4e738) at cdbhash.c:161
#1  0x0000000000b05017 in directDispatchCalculateHash (plan=0x2e4dce8, targetPolicy=0x2e4e178, hashfuncs=0x2e4e6b8)
    at cdbmutate.c:197
#2  0x0000000000b0a989 in sri_optimize_for_result (root=0x2e4cf18, plan=0x2e4dce8, rte=0x2e4cd88,
    targetPolicy=0x7ffe5fca0ec0, hashExprs_p=0x7ffe5fca0ed0, hashOpfamilies_p=0x7ffe5fca0ec8) at cdbmutate.c:3560
#3  0x0000000000810d6e in adjust_modifytable_flow (root=0x2e4cf18, node=0x2e4e068, is_split_updates=0x2e4d9b8)
    at createplan.c:6608
#4  0x00000000008108bd in make_modifytable (root=0x2e4cf18, operation=CMD_INSERT, canSetTag=1 '\001',
    resultRelations=0x2e4e038, subplans=0x2e4dfe8, withCheckOptionLists=0x0, returningLists=0x0,
    is_split_updates=0x2e4d9b8, rowMarks=0x0, epqParam=0) at createplan.c:6471
#5  0x0000000000817e24 in subquery_planner (glob=0x2cbcf70, parse=0x2d7cd80, parent_root=0x0, hasRecursion=0 '\000',
    tuple_fraction=0, subroot=0x7ffe5fca11b8, config=0x2e4cee8) at planner.c:907
#6  0x0000000000816d1d in standard_planner (parse=0x2d7cd80, cursorOptions=0, boundParams=0x0) at planner.c:345
#7  0x0000000000816904 in planner (parse=0x2cbd080, cursorOptions=0, boundParams=0x0) at planner.c:200
#8  0x00000000008e8f4a in pg_plan_query (querytree=0x2cbd080, cursorOptions=0, boundParams=0x0) at postgres.c:959
#9  0x00000000008e8ffd in pg_plan_queries (querytrees=0x2d7b458, cursorOptions=0, boundParams=0x0) at postgres.c:1018
#10 0x00000000008ea3e8 in exec_simple_query (
    query_string=0x2cbc0d8 "insert INTO hash values (1,'asdf','fdsa','qwer');") at postgres.c:1748
#11 0x00000000008ef189 in PostgresMain (argc=1, argv=0x2c9bc10, dbname=0x2c9bac0 "postgres",
    username=0x2c9baa8 "gpadmin") at postgres.c:5242
#12 0x000000000086db12 in BackendRun (port=0x2cc5830) at postmaster.c:4811
#13 0x000000000086d1da in BackendStartup (port=0x2cc5830) at postmaster.c:4468
#14 0x0000000000869424 in ServerLoop () at postmaster.c:1948
#15 0x00000000008689c3 in PostmasterMain (argc=6, argv=0x2c99c20) at postmaster.c:1518
#16 0x0000000000774e33 in main (argc=6, argv=0x2c99c20) at main.c:245

directDispatchCalculateHash

Only the key code of directDispatchCalculateHash is shown here, with comments:


static void
directDispatchCalculateHash(Plan *plan, GpPolicy *targetPolicy, Oid *hashfuncs)
{
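	// ..... code above omitted

		// Create a cdbHash context for the row being inserted in this session.
		// It mainly holds:
		// 1. the number of segments in the current Greenplum cluster
		// 2. the reduce function that maps a hash key to a segment
		// 3. the table's distribution keys, and the hash function for each key's type
		h = makeCdbHash(targetPolicy->numsegments, targetPolicy->nattrs, hashfuncs);

		// Initialize the cdbHash, mainly the hash key value
		cdbhashinit(h);

		// Iterate over all distribution keys;
		// nattrs is the number of distribution keys
		for (i = 0; i < targetPolicy->nattrs; i++)
		{
			// Fold this key into the hash key
			cdbhash(h, i + 1, values[i], nulls[i]);
		}

		// From the hash key computed above,
		// work out which segment the row should map to
		hashcode = cdbhashreduce(h);

	// ...... code below omitted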
}

cdbhash

void
cdbhash(CdbHash *h, int attno, Datum datum, bool isnull)
{
	uint32		hashkey = h->hash;

	// ...... some non-essential code omitted

		/* rotate hashkey left 1 bit at each step */
		hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);

		if (!isnull)
		{
			FunctionCallInfoData fcinfo;
			uint32		hkey;

			InitFunctionCallInfoData(fcinfo, &h->hashfuncs[attno - 1], 1,
									 InvalidOid,
									 NULL, NULL);

			fcinfo.arg[0] = datum;
			fcinfo.argnull[0] = false;

			hkey = DatumGetUInt32(FunctionCallInvoke(&fcinfo));

			/* Check for null result, since caller is clearly not expecting one */
			if (fcinfo.isnull)
				elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid);

			hashkey ^= hkey;
		}
	
	// ...... some non-essential code omitted
	
	h->hash = hashkey;
}
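The rotate-then-XOR step in cdbhash can be isolated into a small standalone sketch. It is based only on the excerpt above: combine_column is a hypothetical helper name, not a Greenplum function, and column_hash stands in for whatever the per-type hash callback would return.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by one bit, as cdbhash does before mixing in each key. */
static uint32_t
rotate_left_1(uint32_t x)
{
	return (x << 1) | ((x & 0x80000000u) ? 1u : 0u);
}

/*
 * Hypothetical helper (not part of Greenplum): fold one distribution-key
 * column into the running hash. column_hash is an assumed stand-in for the
 * value returned by the per-type hash callback; it is ignored for NULLs.
 */
static uint32_t
combine_column(uint32_t running, uint32_t column_hash, bool isnull)
{
	running = rotate_left_1(running);
	if (!isnull)
		running ^= column_hash;
	return running;
}
```

Because of the rotation, the column order matters: starting from 0 with made-up column hashes 1 and 2, the order (1, 2) gives 0 while (2, 1) gives 5. A NULL column still rotates the running value but mixes nothing in.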

Analysis:

1. The InitFunctionCallInfoData macro is defined as:

#define InitFunctionCallInfoData(Fcinfo, Flinfo, Nargs, Collation, Context, Resultinfo) \
	do { \
		(Fcinfo).flinfo = (Flinfo); \
		(Fcinfo).context = (Context); \
		(Fcinfo).resultinfo = (Resultinfo); \
		(Fcinfo).fncollation = (Collation); \
		(Fcinfo).isnull = false; \
		(Fcinfo).nargs = (Nargs); \
	} while (0)

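This macro mainly initializes the fcinfo structure. fcinfo is of type FunctionCallInfoData, and a pointer to it (FunctionCallInfo) is the single parameter of every function of type PGFunction, which is declared as: typedef Datum (*PGFunction) (FunctionCallInfo fcinfo);.

FunctionCallInfoData is a general-purpose structure for passing arguments to a callback function.

In it:

a. The flinfo field points to a structure of type FmgrInfo. The most important member of that structure is fn_addr, which holds the address of the hash callback that is actually called later.

b. The nargs field is the number of arguments passed to the callback. It is fixed at 1 here, which means every hash function takes exactly one argument.

2. The arg field of FunctionCallInfoData holds the callback's argument list. Only arg[0] is assigned here, from datum; the outer function shows that this is the value of the current column.

So the arguments of the hash callback used for a distribution key are passed through the FunctionCallInfoData wrapper structure, and the hash function ultimately receives a single argument: the value of the distribution key column.

3. FunctionCallInvoke expands to ((* (fcinfo)->flinfo->fn_addr) (fcinfo)), so this is where the hash callback is actually called, with the fcinfo populated above as its argument.

4. Finally, the return value of the hash callback is converted to uint32 and XORed with the hash key computed so far (already rotated left by one bit). The result is stored as the new hash key in the hash field of the current cdbHash context, which is the final assignment: h->hash = hashkey.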
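To make the calling convention concrete, here is a minimal sketch of the same pattern: fill an argument structure, then call through a function pointer stored in it. All Sketch-prefixed types are simplified, hypothetical stand-ins for the PostgreSQL fmgr types, and toy_hash is a made-up callback, not a real hash function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for Datum, PGFunction, FmgrInfo and FunctionCallInfoData. */
typedef uintptr_t SketchDatum;
struct SketchCallInfo;
typedef SketchDatum (*SketchFunction) (struct SketchCallInfo *fcinfo);

typedef struct SketchFmgrInfo
{
	SketchFunction fn_addr;		/* address of the hash callback */
} SketchFmgrInfo;

typedef struct SketchCallInfo
{
	SketchFmgrInfo *flinfo;		/* which function to call */
	int			isnull;			/* set by the callee if its result is NULL */
	short		nargs;			/* number of arguments actually passed */
	SketchDatum arg[1];			/* argument values */
	int			argnull[1];		/* per-argument NULL flags */
} SketchCallInfo;

/* Toy callback (assumption for illustration): returns its argument plus one. */
static SketchDatum
toy_hash(SketchCallInfo *fcinfo)
{
	return fcinfo->arg[0] + 1;
}

/* Mirrors the shape of the call in cdbhash: populate fcinfo, then call via fn_addr. */
static uint32_t
call_hash_callback(SketchFmgrInfo *flinfo, SketchDatum value)
{
	SketchCallInfo fcinfo;

	assert(flinfo != NULL && flinfo->fn_addr != NULL);
	fcinfo.flinfo = flinfo;
	fcinfo.isnull = 0;
	fcinfo.nargs = 1;
	fcinfo.arg[0] = value;
	fcinfo.argnull[0] = 0;

	return (uint32_t) (*fcinfo.flinfo->fn_addr) (&fcinfo);
}
```

The caller never names the hash function directly; swapping the pointer in SketchFmgrInfo is enough to change which callback runs, which is how one code path can serve every distribution key type.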

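Summary

Outer layer: a hash context is created for the current session, one hash computation is performed for each distribution key, and the final hash key is reduced once to obtain the segment id.

Inner layer: the generic callback argument structure is initialized, the callback is called, and its result is XORed with the previous hash key to produce the current hash key.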
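The two layers can be put together in a rough end-to-end sketch. Two things here are assumptions, because the excerpts do not show them: the initial hash value set by cdbhashinit (0 is used below) and the reduce step (a plain modulo is used as a stand-in for cdbhashreduce). The Sketch and sketch_ names are hypothetical, and the per-column hash values are passed in precomputed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CdbHash context. */
typedef struct SketchHash
{
	uint32_t	hash;			/* running hash key */
	int			numsegments;	/* number of segments to map onto */
} SketchHash;

static void
sketch_init(SketchHash *h, int numsegments)
{
	assert(numsegments > 0);
	h->hash = 0;				/* assumption: the real initial value is not shown */
	h->numsegments = numsegments;
}

/* Inner layer: rotate left by one bit, then XOR in the column hash unless NULL. */
static void
sketch_add(SketchHash *h, uint32_t column_hash, bool isnull)
{
	uint32_t	key = h->hash;

	key = (key << 1) | ((key & 0x80000000u) ? 1u : 0u);
	if (!isnull)
		key ^= column_hash;
	h->hash = key;
}

/* Assumption: plain modulo stands in for the real reduce function. */
static int
sketch_reduce(const SketchHash *h)
{
	return (int) (h->hash % (uint32_t) h->numsegments);
}

/* Outer layer: init, fold in every distribution key, reduce to a segment id. */
static int
sketch_target_segment(const uint32_t *column_hashes, const bool *nulls,
					  int nattrs, int numsegments)
{
	SketchHash	h;
	int			i;

	sketch_init(&h, numsegments);
	for (i = 0; i < nattrs; i++)
		sketch_add(&h, column_hashes[i], nulls[i]);
	return sketch_reduce(&h);
}
```

With made-up column hashes 7 and 3 and four segments, the running key goes 7, then 13, and the modulo stand-in picks segment 1; if the second column is NULL, the key ends at 14 and the result is segment 2.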

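Analysis of the hash callback functions

smallint / int / bigint types

For smallint, the corresponding hash function is hashint2,

for int, the corresponding hash function is hashint4,

for bigint, the corresponding hash function is hashint8.

The implementations are as follows: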

#define PG_GETARG_DATUM(n)	 (fcinfo->arg[n])
#define PG_GETARG_INT16(n)	 DatumGetInt16(PG_GETARG_DATUM(n))
#define PG_GETARG_INT32(n)	 DatumGetInt32(PG_GETARG_DATUM(n))
#define PG_GETARG_INT64(n)	 DatumGetInt64(PG_GETARG_DATUM(n))

Datum
hashint2(PG_FUNCTION_ARGS)
{
	return hash_uint32((int32) PG_GETARG_INT16(0));
}

Datum
hashint4(PG_FUNCTION_ARGS)
{
	return hash_uint32(PG_GETARG_INT32(0));
}

Datum
hashint8(PG_FUNCTION_ARGS)
{
	/*
	 * The idea here is to produce a hash value compatible with the values
	 * produced by hashint4 and hashint2 for logically equal inputs; this is
	 * necessary to support cross-type hash joins across these input types.
	 * Since all three types are signed, we can xor the high half of the int8
	 * value if the sign is positive, or the complement of the high half when
	 * the sign is negative.
	 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32(lohalf);
}

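After expanding the macros, we can see that smallint, int, and bigint all end up calling the same underlying hash function, hash_uint32; the only difference is the argument passed to it.

For smallint or int, the argument is the value itself. A bigint is 8 bytes long, so it has to be folded to 4 bytes first: when the value is greater than or equal to 0, the high 4 bytes XORed with the low 4 bytes are hashed; when the value is less than 0, the bitwise complement of the high 4 bytes XORed with the low 4 bytes is hashed.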
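The folding rule is easier to check with concrete values. The sketch below copies the folding logic from hashint8 into a hypothetical helper, fold_int64, and compares it with the 32-bit value that hashint2 and hashint4 would hand to hash_uint32. The final hash_uint32 call is left out, since equal inputs to it give equal hashes. The high half is taken through an unsigned cast so the shift is well defined for negative values.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the folding done in hashint8. */
static uint32_t
fold_int64(int64_t val)
{
	uint32_t	lohalf = (uint32_t) (uint64_t) val;
	uint32_t	hihalf = (uint32_t) ((uint64_t) val >> 32);

	/* Non-negative: XOR with the high half; negative: XOR with its complement. */
	lohalf ^= (val >= 0) ? hihalf : ~hihalf;
	return lohalf;
}

/* The 32-bit pattern that hashint2/hashint4 pass on for the same logical value. */
static uint32_t
int32_hash_input(int32_t val)
{
	return (uint32_t) val;
}
```

For any value that fits in 32 bits, the high half is all zeros (non-negative) or all ones (negative), so the XOR leaves the low half unchanged: fold_int64(5) and fold_int64(-5) match int32_hash_input(5) and int32_hash_input(-5). That is what lets equal smallint, int, and bigint values hash identically, as the comment in hashint8 requires. A value outside that range, such as 1 << 32, folds to 1.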

char / varchar / text types

For char, the corresponding hash function is hashbpchar,

for text / varchar, the corresponding hash function is hashtext,