1 Introduction

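In our earlier post "Federated Learning: Multi-Source Knowledge Graph Embedding in the Federated Setting" (《联邦学习:联邦场景下的多源知识图谱嵌入》) we introduced knowledge graph embedding in the federated setting. Here we revisit the details of the data side. In the federated setting, \(C\) knowledge graphs \(\left\{\mathcal{G}_c\right\}_{c=1}^C=\left\{\left\{\mathcal{E}_c, \mathcal{R}_c, \mathcal{T}_c\right\}\right\}_{c=1}^C\) reside on different clients. The entity sets \(\mathcal{E}_c\) of these knowledge graphs may overlap, whereas their relation sets \(\mathcal{R}_c\) and triple sets \(\mathcal{T}_c\) do not overlap [1]. This is a reasonable assumption in practice: if the clients are different banks, for example, each bank has its own business processes, so the relation sets do not overlap.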
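To make the setting concrete, here is a minimal toy sketch. The two clients, the entity names, and the relation names are all made up for illustration and do not come from any real dataset; triples are written as (h, t, r) to match the storage order used later in this post.

```python
# Toy illustration with hypothetical data: two clients, triples stored as (h, t, r).
client_triples = {
    0: [("alice", "acme", "has_account_at"), ("alice", "bob", "transfers_to")],
    1: [("alice", "loan_42", "applies_for"), ("carol", "loan_42", "guarantees")],
}


def entities(triples):
    """Collect every head and tail entity that appears in the triples."""
    return {e for h, t, _ in triples for e in (h, t)}


def relations(triples):
    """Collect every relation type that appears in the triples."""
    return {r for _, _, r in triples}


# Entity sets may overlap across clients; relation sets do not.
shared_entities = entities(client_triples[0]) & entities(client_triples[1])
shared_relations = relations(client_triples[0]) & relations(client_triples[1])
```

Here `shared_entities` contains only `"alice"`, while `shared_relations` is empty, which is exactly the pattern of overlap described above.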


2 Partitioning federated heterogeneous knowledge graphs


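Because the relation set \(\mathcal{R}_c\) (i.e., the set of edge types) differs between the local knowledge graphs \(\{\mathcal{E}_c, \mathcal{R}_c, \mathcal{T}_c\}\), we first partition the relations and then partition the triples according to that relation partition. Once the triples have been assigned to the clients, the original entity and relation indices are mapped to local indices. Finally, each client's triples are split locally into training, validation, and test sets. The overall data partitioning flow is shown in the figure below: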

2.1 Partitioning the relations


    import random

    import numpy as np

    # `triples` is a list of arrays with 3 columns; each triple is stored as (h, t, r).
    random.shuffle(triples)
    triples = np.concatenate(triples)  # shape (n_triples, 3)

    # Map each edge type (i.e., relation type) to a client id.
    edge_types = list(set(triples[:, 2]))
    random.shuffle(edge_types)
    edge_type_to_cid = {}
    # Assumes len(edge_types) >= n_clients; otherwise the integer division below yields 0.
    n_edge_types_per_client = len(edge_types) // n_clients
    for idx, edge_type in enumerate(edge_types):
        # Leftover edge types (when the division is not exact) all go to the last client.
        edge_type_to_cid[edge_type] = min(idx // n_edge_types_per_client, n_clients - 1)
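The snippet above gives the whole remainder to the last client: with 11 relation types and 3 clients, the clients receive 3, 3, and 5 types. If a more even spread is preferred, one possible variant (our own substitution, not part of the original pipeline) uses `np.array_split`. The function name `partition_relations` and the `seed` parameter are hypothetical and only serve this sketch.

```python
import numpy as np


def partition_relations(edge_types, n_clients, seed=0):
    """Map each relation type to a client id, spreading the remainder evenly.

    Hypothetical helper for illustration; it is not the original pipeline.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(edge_types))
    edge_type_to_cid = {}
    # np.array_split tolerates lengths that are not divisible by n_clients.
    for c_id, chunk in enumerate(np.array_split(shuffled, n_clients)):
        for edge_type in chunk:
            edge_type_to_cid[int(edge_type)] = c_id
    return edge_type_to_cid


mapping = partition_relations(list(range(11)), n_clients=3)
# Number of relation types assigned to each client.
sizes = np.bincount(list(mapping.values()), minlength=3)
```

With 11 relation types and 3 clients this yields group sizes of 4, 4, and 3. It also avoids the division-by-zero case when there are fewer relation types than clients, although some clients then receive no relations at all.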

2.2 Assigning the triples


    # Assign each triple to a client according to the edge-type-to-client mapping.
    c_id_triples = [[] for _ in range(n_clients)]
    for triple in triples:
        edge_type = triple[2]
        c_id = edge_type_to_cid[edge_type]
        c_id_triples[c_id].append(triple.reshape(1, -1))
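The same assignment can also be written with a boolean mask per client instead of a Python loop over triples. The following is a minimal sketch under the assumption that `triples` is already a single `(n_triples, 3)` array; the function name `split_triples_by_client` and the toy data are ours.

```python
import numpy as np


def split_triples_by_client(triples, edge_type_to_cid, n_clients):
    """Return a list with one (n_c, 3) array of triples per client."""
    # Look up the client id of every triple from its relation column.
    cids = np.array([edge_type_to_cid[r] for r in triples[:, 2]])
    # A boolean mask keeps the original triple order within each client.
    return [triples[cids == c_id] for c_id in range(n_clients)]


# Toy triples stored as (h, t, r); relation 0 goes to client 0, relations 1 and 2 to client 1.
triples = np.array([[0, 1, 0], [1, 2, 1], [2, 3, 0], [3, 0, 2]])
parts = split_triples_by_client(triples, {0: 0, 1: 1, 2: 1}, n_clients=2)
```

Unlike the list-based version, a client that receives no triples ends up with an empty `(0, 3)` array here, so a later `np.concatenate` on an empty list cannot fail.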

2.3 Index mapping



    # Map global indices to local indices.
    c_id_triples_ori = [[] for _ in range(n_clients)]
    for c_id in range(n_clients):
        triples = np.concatenate(c_id_triples[c_id])
        c_id_triples_ori[c_id] = triples  # keep the original (global) indices
        edge_index = triples[:, :2]
        edge_type = triples[:, 2]

        # Map global entity indices to local entity indices.
        entities = list(set(edge_index.flatten()))
        random.shuffle(entities)
        entity_mapping = {entity: index for index, entity in enumerate(entities)}
        client_entity_local_index = np.vectorize(lambda x: entity_mapping[x])(edge_index)

        # Map global relation (edge type) indices to local relation indices.
        relations = list(set(edge_type))
        random.shuffle(relations)
        relation_mapping = {relation: index for index, relation in enumerate(relations)}
        client_edge_local_index = np.vectorize(lambda x: relation_mapping[x])(edge_type)

        c_id_triples[c_id] = np.concatenate(
            [client_entity_local_index, client_edge_local_index.reshape(-1, 1)], axis=1
        )
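If a random local ordering is not required, the relabeling can be sketched more compactly with `np.unique(..., return_inverse=True)`. Note that this is a different technique from the dictionary-based mapping above: local ids follow the sorted order of the global ids instead of a shuffled order. The function name `to_local_indices` and the toy triples are hypothetical.

```python
import numpy as np


def to_local_indices(triples):
    """Relabel the entities and relations of one client to 0..n-1.

    Returns the relabeled triples together with the sorted global ids, which
    double as local-to-global lookup tables.
    """
    entities, local_edge_index = np.unique(triples[:, :2], return_inverse=True)
    relations, local_edge_type = np.unique(triples[:, 2], return_inverse=True)
    # Reshape explicitly so the code does not depend on the shape np.unique returns.
    local_edge_index = local_edge_index.reshape(-1, 2)
    local = np.concatenate([local_edge_index, local_edge_type.reshape(-1, 1)], axis=1)
    return local, entities, relations


# Toy triples stored as (h, t, r) with global ids.
triples = np.array([[10, 42, 7], [42, 99, 3], [10, 99, 7]])
local, entities, relations = to_local_indices(triples)
```

Indexing with the returned arrays recovers the global ids, for example `entities[local[:, :2]]` gives back the original head and tail columns.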

2.4 Train/validation/test split


    # Split each client's triples into train/valid/test sets (roughly 80% / 10% / 10%).
    # Assumed container (not defined in the original snippet): one dict per client and mode.
    client_data = [{mode: {} for mode in ("train", "valid", "test")} for _ in range(n_clients)]
    for c_id in range(n_clients):
        n_triples = c_id_triples[c_id].shape[0]
        n_train = int(n_triples * 0.8)
        n_val = int((n_triples - n_train) * 0.5)
        mode_to_slice = {
            "train": slice(0, n_train),
            "valid": slice(n_train, n_train + n_val),
            # Explicit start index; slice(-n_test, n_triples) would select everything if n_test were 0.
            "test": slice(n_train + n_val, n_triples),
        }
        for mode, s in mode_to_slice.items():
            # Global ("ori") and local indices, stored with shape (2, n) for edge_index.
            client_data[c_id][mode]["edge_index_ori"] = c_id_triples_ori[c_id][s, :2].T
            client_data[c_id][mode]["edge_index"] = c_id_triples[c_id][s, :2].T
            client_data[c_id][mode]["edge_type_ori"] = c_id_triples_ori[c_id][s, 2]
            client_data[c_id][mode]["edge_type"] = c_id_triples[c_id][s, 2]
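Because the sizes are computed with `int(...)` truncation, the split is only approximately 80/10/10, and the test set absorbs any rounding remainder. A small sketch of the arithmetic follows; the helper name `split_sizes` and the `train_frac` parameter are ours.

```python
def split_sizes(n_triples, train_frac=0.8):
    """Return (n_train, n_val, n_test) using the same truncation as the split above."""
    n_train = int(n_triples * train_frac)
    # Half of what is left goes to validation, rounded down.
    n_val = int((n_triples - n_train) * 0.5)
    # The test set takes whatever remains, so the three sizes always sum to n_triples.
    n_test = n_triples - n_train - n_val
    return n_train, n_val, n_test


sizes_103 = split_sizes(103)
sizes_10 = split_sizes(10)
```

For 103 triples this gives 82, 10, and 11; for 10 triples it gives 8, 1, and 1.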

3 Analyzing and addressing heterogeneity






