Hands-on Notes on Amazon's DRKG

Posted by Bo_hemian on 2020-09-09

Based on the article "Exploring the shortest path for drug repurposing: Amazon AI Lab open-sources DRKG, a large-scale drug repurposing knowledge graph", this post records my actual deployment and exploration of the project, for reference.

1. Introduction to DRKG

The Drug Repurposing Knowledge Graph (DRKG) is a comprehensive biological knowledge graph covering genes, compounds, diseases, biological processes, side effects, and symptoms. DRKG integrates information from six existing databases (DrugBank, Hetionet, GNBR, String, IntAct, and DGIdb), as well as data collected from recent publications, in particular data related to Covid-19. It contains 97,238 entities belonging to 13 entity types and 5,874,261 triplets belonging to 107 relation types. The project also includes a set of examples showing how to use DRKG for exploratory statistics and for machine-learning tasks such as knowledge graph embedding.

2. Development Environment

  • python==3.6.1
  • torch==1.2.0+cpu
  • dgl==0.4.3
  • dglke==0.1.1

3. Data Sources

DRKG incorporates information from six biological databases, including Hetionet, plus datasets from recent publications; however, the exact data collection process has not been made public. Brief introductions to some of the source databases follow.

3.1 Hetionet

URL: https://het.io/about/

3.2 DrugBank 

URL: https://www.drugbank.ca/

 

3.3 String

URL: https://string-db.org/

4. Knowledge Graph Contents

DRKG.tsv contains all triplets of the knowledge graph, covering 13 entity types, as listed below:

 

The relations span 17 entity-pair types.
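Each line of the TSV file is a tab-separated (head, relation, tail) triplet. A minimal sketch of the layout, using hypothetical example rows (entity IDs follow DRKG's `Type::ID` convention; the relation names match those used later in this post):

```python
# Hypothetical sample rows in DRKG's triplet format: head \t relation \t tail,
# where the prefix before "::" in an entity ID is its entity type.
sample = [
    "Compound::DB00014\tHetionet::CtD::Compound:Disease\tDisease::DOID:10283",
    "Compound::DB00945\tGNBR::N::Compound:Gene\tGene::5468",
]
triplets = [line.split("\t") for line in sample]

# Collect the entity-type prefixes appearing among heads and tails.
entity_types = {e.split("::")[0] for h, r, t in triplets for e in (h, t)}
print(sorted(entity_types))
```

Running this over the real DRKG.tsv would recover the 13 entity types mentioned above.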

5. Preliminaries

The embedding vectors for entities and relations (edges) are constructed with the TransE algorithm.

For background on knowledge graph embedding models, see: A Summary of Translate-Family Knowledge Graph Embedding Models (TransE, TransH, TransR, TransD)
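The core TransE idea fits in a few lines: a plausible triplet (h, r, t) should satisfy head + relation ≈ tail, so the TransE_l2 score used later in this post is gamma − ||h + r − t||, with higher meaning more plausible. A toy sketch (vectors chosen by hand, not real embeddings):

```python
import numpy as np

# TransE_l2 score: gamma - ||h + r - t||, higher is better.
def transe_l2_score(h, r, t, gamma=12.0):
    return gamma - np.linalg.norm(h + r - t)

# Toy vectors constructed so that h + r equals t exactly,
# i.e. the "perfect" triplet scores gamma.
h = np.array([0.1, 0.2, 0.3])
r = np.array([0.4, 0.1, -0.1])
t = np.array([0.5, 0.3, 0.2])
score = transe_l2_score(h, r, t)  # equals gamma here, since h + r == t
```

The gamma value 12.0 matches the `--gamma 12.0` setting in the training command below.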

6. Using DRKG

6.1 Pretraining the Knowledge Graph Embeddings

Amazon's DRKG project ships a ready-made script for training the knowledge graph embedding model.

In the original code, the training command is:

 

Since we are running on Windows and have no GPU cluster, the command needs to be changed to:

!dglke_train --dataset DRKG --data_files DRKG_train.tsv DRKG_valid.tsv DRKG_test.tsv --format raw_udd_hrt --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --neg_sample_size_eval 10000 --async_update

The optional arguments of dglke_train are:

  • --model_name {TransE,TransE_l1,TransE_l2,TransR,RESCAL,DistMult,ComplEx,RotatE}. The models provided by DGL-KE.
  • --data_path DATA_PATH. The path of the directory where DGL-KE loads knowledge graph data.
  • --dataset DATASET. The name of the builtin knowledge graph. Currently, the builtin knowledge graphs include FB15k, FB15k-237, wn18, wn18rr and Freebase. DGL-KE automatically downloads the knowledge graph and keeps it under data_path.
  • --format FORMAT. The format of the dataset. For builtin knowledge graphs, the format should be built_in. For users' own knowledge graphs, it needs to be raw_udd_{htr} or udd_{htr}.
  • --data_files DATA_FILES [DATA_FILES ...]. A list of data file names. This is used if users want to train KGE on their own datasets. If the format is raw_udd_{htr}, users need to provide train_file [valid_file] [test_file]. If the format is udd_{htr}, users need to provide entity_file relation_file train_file [valid_file] [test_file]. In both cases, valid_file and test_file are optional.
  • --delimiter DELIMITER. Delimiter used in data files. Note all files should use the same delimiter.
  • --save_path SAVE_PATH. The path of the directory where models and logs are saved.
  • --no_save_emb. Disable saving the embeddings under save_path.
  • --max_step MAX_STEP. The maximal number of steps to train the model. A step trains the model with a batch of data.
  • --batch_size BATCH_SIZE. The batch size for training.
  • --batch_size_eval BATCH_SIZE_EVAL. The batch size used for validation and test.
  • --neg_sample_size NEG_SAMPLE_SIZE. The number of negative samples we use for each positive sample in the training.
  • --neg_deg_sample. Construct negative samples proportional to vertex degree in the training. When this option is turned on, the number of negative samples per positive edge will be doubled. Half of the negative samples are generated uniformly while the other half are generated proportional to vertex degree.
  • --neg_deg_sample_eval. Construct negative samples proportional to vertex degree in the evaluation.
  • --neg_sample_size_eval NEG_SAMPLE_SIZE_EVAL. The number of negative samples we use to evaluate a positive sample.
  • --eval_percent EVAL_PERCENT. Randomly sample some percentage of edges for evaluation.
  • --no_eval_filter. Disable filtering positive edges from randomly constructed negative edges for evaluation.
  • --log LOG_INTERVAL, --log_interval LOG_INTERVAL. Print runtime of different components every x steps.
  • --eval_interval EVAL_INTERVAL. Print evaluation results on the validation dataset every x steps if validation is turned on.
  • --test Evaluate the model on the test set after the model is trained.
  • --num_proc NUM_PROC. The number of processes to train the model in parallel. In multi-GPU training, the number of processes by default is set to match the number of GPUs. If set explicitly, the number of processes needs to be divisible by the number of GPUs.
  • --num_thread NUM_THREAD. The number of CPU threads to train the model in each process. This argument is used for multiprocessing training.
  • --force_sync_interval FORCE_SYNC_INTERVAL. We force a synchronization between processes every x steps for multiprocessing training. This potentially stabilizes the training process to get a better performance. For multiprocessing training, it is set to 1000 by default.
  • --hidden_dim HIDDEN_DIM. The embedding size of relation and entity.
  • --lr LR. The learning rate. DGL-KE uses Adagrad to optimize the model parameters.
  • -g GAMMA, --gamma GAMMA. The margin value in the score function. It is used by TransX and RotatE.
  • -de, --double_ent. Double the entity dim for complex numbers. It is used by RotatE.
  • -dr, --double_rel. Double the relation dim for complex numbers.
  • -adv, --neg_adversarial_sampling. Indicate whether to use negative adversarial sampling. It will weight negative samples with higher scores more.
  • -a ADVERSARIAL_TEMPERATURE, --adversarial_temperature ADVERSARIAL_TEMPERATURE. The temperature used for negative adversarial sampling.
  • -rc REGULARIZATION_COEF, --regularization_coef REGULARIZATION_COEF. The coefficient for regularization.
  • -rn REGULARIZATION_NORM, --regularization_norm REGULARIZATION_NORM. The norm used in regularization.
  • --gpu GPU [GPU ...] A list of gpu ids, e.g. 0 1 2 4
  • --mix_cpu_gpu. Training a knowledge graph embedding model with both CPUs and GPUs. The embeddings are stored in CPU memory and the training is performed on GPUs. This is usually used for training large knowledge graph embeddings.
  • --valid Evaluate the model on the validation set in the training.
  • --rel_part Enable relation partitioning for multi-GPU training.
  • --async_update. Allow asynchronous update on node embeddings for multi-GPU training. This overlaps CPU and GPU computation to speed up training.
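Once training finishes, the learned embeddings are written to disk as NumPy `.npy` files and can be loaded directly. The file name below assumes the default `DATASET_MODEL_entity.npy` layout under the save path (check your own `--save_path`); a toy file is fabricated first so the sketch is self-contained:

```python
import numpy as np

# Fabricate a toy embedding file standing in for the one dglke_train
# writes out (assumed name: DATASET_MODEL_entity.npy).
np.save("DRKG_TransE_l2_entity.npy", np.random.rand(5, 400).astype(np.float32))

# Loading gives an array of shape (num_entities, hidden_dim);
# hidden_dim is 400 per the --hidden_dim setting above.
entity_emb = np.load("DRKG_TransE_l2_entity.npy")
print(entity_emb.shape)
```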

In addition, the code at the path below needs to be modified; specifically, the data type must be changed from int32 to int64.
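A toy illustration of the kind of change required (the project's actual file and variable names differ; on Windows, NumPy integer arrays default to 32-bit, which breaks later tensor indexing):

```python
import numpy as np

# What the original code effectively produces on Windows: int32 ID arrays.
heads = np.array([0, 1, 2], dtype=np.int32)

# The required modification: force 64-bit integers where the arrays are built.
heads = heads.astype(np.int64)
```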

 

6.2 Applications of the Knowledge Graph Embeddings

  • Entity similarity analysis
  • Relation similarity analysis
  • Entity-relation score evaluation

Entity-relation score evaluation can be used for link prediction. Link prediction is the umbrella term for a family of problems including friend recommendation on social networks, predicting protein-protein interactions, predicting relationships between criminal suspects, and product recommendation.
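Link prediction can be framed as ranking: given a (head, relation) pair, score every candidate tail entity and keep the highest-scoring ones. A sketch under the TransE_l2 score, with random arrays standing in for the pretrained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(10, 8))   # 10 toy entities, dim 8
h = entity_emb[0]                        # a chosen head entity
r = rng.normal(size=8)                   # a toy relation vector

# Score every entity as a candidate tail: gamma - ||h + r - t||.
scores = 12.0 - np.linalg.norm(entity_emb - (h + r), axis=1)

# Predicted links, best candidate tails first.
ranked = np.argsort(-scores)
```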

6.3 Link Prediction

DRKG provides a link prediction demo, but the code needs minor modifications.

Running the following code raises an error: IndexError: tensors used as indices must be long, byte or bool tensors

 

The tensor needs to be cast to the right dtype; add:
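A minimal sketch of the cast that resolves the error (toy tensors below; the demo's actual variable names differ):

```python
import torch

# The demo's index tensors arrive as int32, which torch refuses to use
# for indexing; casting with .long() (int64) resolves the IndexError.
ids = torch.tensor([0, 1, 2], dtype=torch.int32)
ids = ids.long()              # indices must be long, byte, or bool

emb = torch.randn(5, 4)
selected = emb[ids]           # indexing now succeeds
```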

6.4 Drug Repurposing

DRKG provides a Covid-19 drug screening demo, which covers drug screening in two directions:

1. Drug screening based on disease-compound relations

2. Drug screening based on gene-compound relations

6.4.1 Drug Screening Based on Disease-Compound Relations

First, collect all coronavirus (COV) diseases in DRKG and map them to the corresponding entity (disease) embedding vectors in the graph;

Next, take the FDA-approved drugs in DrugBank as candidate drugs, about 8,104 in total, and likewise map them to the corresponding entity (drug) embedding vectors;

Finally, restrict the relations to the two types ['Hetionet::CtD::Compound:Disease', 'GNBR::T::Compound:Disease'] and likewise map them to the corresponding relation embedding vectors.

Using the TransE algorithm that trained the graph embeddings, define the following score function to measure how well each candidate drug matches each coronavirus (COV) disease under the above two relations; the higher the score, the better.

Take the 100 highest-scoring drugs and compare them with the 32 drugs already in Covid-19 clinical trials, to see how many of those appear in the top 100.
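The three steps above can be sketched as a scoring loop under the TransE_l2 score gamma − ||h + r − t||: every candidate drug is scored against every COV disease for each treatment relation, and drugs are ranked by their best score. Random arrays stand in for the pretrained embeddings, and the sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
drug_emb = rng.normal(size=(100, 16))    # candidate drugs (8,104 in the demo)
disease_emb = rng.normal(size=(3, 16))   # toy stand-ins for COV disease entities
rel_emb = rng.normal(size=(2, 16))       # the two Compound:Disease relations
gamma = 12.0

# Best score each drug achieves over all (relation, disease) pairs.
best = np.full(len(drug_emb), -np.inf)
for r in rel_emb:
    for d in disease_emb:
        best = np.maximum(best, gamma - np.linalg.norm(drug_emb + r - d, axis=1))

# Highest-scoring drugs first; the demo keeps the top 100.
top_drugs = np.argsort(-best)[:100]
```

The gene-compound screening in 6.4.2 follows the same pattern with gene embeddings and the single inhibition relation.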

6.4.2 Drug Screening Based on Gene-Compound Relations

First, collect all coronavirus (COV)-related genes in DRKG and map them to the corresponding entity (gene) embedding vectors in the graph;

Next, take the FDA-approved drugs in DrugBank as candidate drugs, about 8,104 in total, and likewise map them to the corresponding entity (drug) embedding vectors;

Finally, restrict the relation to the inhibition relation ['GNBR::N::Compound:Gene'] and likewise map it to the corresponding relation embedding vector.

Using the TransE algorithm that trained the graph embeddings, define the following score function to measure how well each candidate drug matches each coronavirus (COV) gene under the above inhibition relation; the higher the score, the better.

 

6.4.3 Summary

Looking back at the project's drug screening implementation, the method is actually not complicated; at its core it is an application of the pretrained knowledge graph embedding model. GNN-based drug molecule design, which has appeared in other work, is not reflected in the code this project provides, so that line of work still needs further study.

 
