Support Node2Vec graph sampling algorithm #829

Merged — 16 commits, Dec 25, 2024
3 changes: 2 additions & 1 deletion docs/en-US/source/9.olap&procedure/3.learn/1.tutorial.md
@@ -5,7 +5,7 @@
## 1. Introduction to TuGraph Graph Learning Module
Graph learning is a machine learning method that utilizes the topological information of a graph structure to analyze and model data. Unlike traditional machine learning methods, graph learning uses graph structures where vertices represent entities in data and edges represent relationships between entities. By extracting features and patterns from these vertices and edges, deep associations and patterns can be revealed in data that can be used in various practical applications.

The TuGraph Graph Learning Module is a graph learning module based on a graph database that provides four sampling operators: Neighbor Sampling, Edge Sampling, Random Walk Sampling, and Negative Sampling. These operators can be used to sample vertices and edges in a graph to generate training data. The sampling process is performed in a parallel computing environment, providing high efficiency and scalability.
The TuGraph Graph Learning Module is a graph learning module based on a graph database that provides five sampling operators: Neighbor Sampling, Edge Sampling, Random Walk Sampling, Negative Sampling, and Node2Vec Sampling. These operators can be used to sample vertices and edges in a graph to generate training data. The sampling process is performed in a parallel computing environment, providing high efficiency and scalability.

After sampling, the obtained training data can be used to train a model that can be used for various graph learning tasks such as prediction and classification. Through training, the model can learn the relationships between vertices and edges in the graph, allowing for prediction and classification of new vertices and edges. In practical applications, this module can be used to handle large-scale graph data such as social networks, recommendation systems, and bioinformatics.

@@ -61,6 +61,7 @@ TuGraph implements an operator for obtaining the full graph data and four sampling operators
|Edge Sampling | Sample the edges in the graph according to the sampling rate to obtain the sampling subgraph |
|Random Walk Sampling | Conduct a random walk based on the given node to obtain the sampling subgraph |
|Negative Sampling | Generate a subgraph of non-existent edges|
|Node2Vec Sampling | Perform biased random walks using the Node2Vec algorithm to generate node sequences and node embeddings |

### 6.2.Compilation
Skip if TuGraph has been compiled.
20 changes: 20 additions & 0 deletions docs/en-US/source/9.olap&procedure/3.learn/2.sampling_api.md
@@ -98,5 +98,25 @@ NodeInfo: A list of dictionaries containing metadata information for the nodes.
EdgeInfo: A list of dictionaries containing metadata information for the edges.
Return value: This function does not return anything.

### 3.6 Node2Vec Sampling Operator
Node2Vec is a graph embedding algorithm that generates node sequences through random walks and embeds the nodes into a low-dimensional space using the Word2Vec algorithm. It combines the ideas of DFS and BFS, using the parameters `p` and `q` to control the transition probabilities of the random walk, thereby balancing the capture of community-structure (local) and structural-equivalence (global) information. The resulting embedding vectors can be used for downstream tasks such as node classification, clustering, or link prediction.

```python
def Process(db_: lgraph_db_python.PyGraphDB, olapondb: lgraph_db_python.PyOlapOnDB, feature_num: size_t, p: cython.double, q: cython.double, walk_length: size_t, num_walks: size_t, sample_node: list, NodeInfo: list, EdgeInfo: list):
```
Parameter list:

db_: An instance of the graph database.
olapondb: Graph analysis class.
feature_num: A size_t value specifying the length of the generated node vectors.
p: A double parameter for the Node2Vec algorithm that controls the likelihood of returning to the previous node.
q: A double parameter for the Node2Vec algorithm that controls the likelihood of exploring outward nodes.
walk_length: A size_t value specifying the length of each random walk.
num_walks: A size_t value specifying the number of random walks to perform.
sample_node: A list specifying the nodes to sample from.
NodeInfo: A list of dictionaries containing metadata information for the nodes.
EdgeInfo: A list of dictionaries containing metadata information for the edges.
Return value: This function does not return anything.
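The operator itself runs inside TuGraph's cython layer, but the biased-walk rule that `p` and `q` control can be sketched in plain Python. This is an illustrative sketch, not the module's implementation; the toy graph and helper name are assumptions:

```python
import random

def node2vec_walk(adj, start, walk_length, p, q, rng):
    """One Node2Vec-style biased walk (illustrative sketch, not TuGraph's code).

    adj: dict mapping node -> set of neighbors (undirected, unweighted).
    p:   return parameter -- larger p makes revisiting the previous node less likely.
    q:   in-out parameter -- larger q keeps the walk close to home (BFS-like).
    """
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous node, so transition uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:            # distance 0 from prev: weight 1/p
                weights.append(1.0 / p)
            elif nxt in adj[prev]:     # distance 1 from prev: weight 1
                weights.append(1.0)
            else:                      # distance 2 from prev: weight 1/q
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Toy 4-node graph; with p=10 the walk rarely steps back to the previous node.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
rng = random.Random(42)
walks = [node2vec_walk(adj, n, walk_length=5, p=10, q=1, rng=rng) for n in adj]
print(walks)
```

Each walk is a list of node ids whose consecutive elements are always neighbors; `num_walks` in the operator simply repeats this procedure per start node.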

## 4. User-Defined Sampling Algorithm
Users can also implement a custom sampling algorithm through the TuGraph Olap interface. For the interface document, see [here](../2.olap/5.python-api.md). This document mainly introduces the interface of related functions used by the graph sampling algorithm design.
2 changes: 1 addition & 1 deletion docs/en-US/source/9.olap&procedure/3.learn/3.training.md
@@ -7,7 +7,7 @@ When using TuGraph's graph learning module for training, it can be divided into
Full graph training involves loading the entire graph from the TuGraph db into memory and training the GNN. Mini-batch training uses the sampling operator of the TuGraph graph learning module to sample the entire graph data and then input it into the training framework for training.

## 2. Mini-Batch Training
Mini-batch training requires the use of TuGraph's graph learning module's sampling operators, which currently support Neighbor Sampling, Edge Sampling, Random Walk Sampling, and Negative Sampling. The sampling result of the TuGraph graph learning module's sampling operator is returned in the form of a List.
Mini-batch training requires the use of TuGraph's graph learning module's sampling operators, which currently support Neighbor Sampling, Edge Sampling, Random Walk Sampling, Negative Sampling, and Node2Vec Sampling. The sampling result of the TuGraph graph learning module's sampling operator is returned in the form of a List.

The following takes Neighbor Sampling as an example to introduce how to convert the sampled results into the training framework for format conversion.
The user needs to provide a Sample class:
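The full class lives in learn/examples/train_cora.py; its shape can be sketched as below. This is a simplified skeleton: the `process`/`sample` signatures match the example script, but `ToyNeighborAlgo` and the in-memory adjacency dict are stand-ins for the compiled sampling operator and the database:

```python
class Sample:
    """Skeleton of the Sample class mini-batch training expects (illustrative)."""

    def __init__(self, algo):
        self.algo = algo  # a sampling operator exposing Process(...)

    def process(self, db, olapondb, seed_nodes, NodeInfo, EdgeInfo):
        # Invoke the chosen sampling operator; it fills NodeInfo/EdgeInfo in place.
        self.algo.Process(db, olapondb, seed_nodes, NodeInfo, EdgeInfo)

    def sample(self, g, seed_nodes):
        # Run the operator, then hand its List output to the training framework.
        NodeInfo, EdgeInfo = [], []
        self.process(g, None, seed_nodes, NodeInfo, EdgeInfo)
        return NodeInfo, EdgeInfo

class ToyNeighborAlgo:
    """Stand-in for a compiled operator: records seeds and their out-edges."""

    def __init__(self, adj):
        self.adj = adj

    def Process(self, db, olapondb, seed_nodes, NodeInfo, EdgeInfo):
        for s in seed_nodes:
            NodeInfo.append({"id": s})
            for d in self.adj.get(s, []):
                EdgeInfo.append({"src": s, "dst": d})

sampler = Sample(ToyNeighborAlgo({0: [1, 2], 1: [2]}))
nodes, edges = sampler.sample(None, [0, 1])
print(nodes, edges)
```

The real class additionally converts `NodeInfo`/`EdgeInfo` into the tensor format of the chosen training framework, as shown in the collapsed diff below.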
3 changes: 2 additions & 1 deletion docs/zh-CN/source/9.olap&procedure/3.learn/1.tutorial.md
@@ -5,7 +5,7 @@
## 1. Introduction to the TuGraph Graph Learning Module
Graph learning is a machine learning method whose core idea is to use the topological information in a graph structure, analyzing and modeling data through the connections and patterns between vertices. Unlike traditional machine learning methods, graph learning operates on graph-structured data, where vertices represent entities and edges represent the relationships between them. By extracting features and mining patterns from these vertices and edges, deep associations and regularities hidden in the data can be revealed and applied in a variety of practical scenarios.

This module is a graph learning module built on a graph database that mainly provides four sampling operators: Neighbor Sampling, Edge Sampling, Random Walk Sampling, and Negative Sampling. These operators can be used to sample the vertices and edges of a graph to generate training data. The sampling process runs in a parallel computing environment, offering high efficiency and scalability.
This module is a graph learning module built on a graph database that mainly provides five sampling operators: Neighbor Sampling, Edge Sampling, Random Walk Sampling, Negative Sampling, and Node2Vec Sampling. These operators can be used to sample the vertices and edges of a graph to generate training data. The sampling process runs in a parallel computing environment, offering high efficiency and scalability.

After sampling, the obtained training data can be used to train a model, which can then serve various graph learning tasks such as prediction and classification. Through training, the model learns the relationships between vertices and edges in the graph, enabling it to predict and classify new vertices and edges. In practice, this module can be used to process large-scale graph data such as social networks, recommendation systems, and bioinformatics.

@@ -52,6 +52,7 @@ TuGraph implements an operator for obtaining the full graph data and 4 sampling operators at the cython layer
| Edge Sampling | Sample the edges in the graph according to the sampling rate to obtain a sampled subgraph |
| Random Walk Sampling | Perform a random walk from the given vertices to obtain a sampled subgraph |
| Negative Sampling | Generate a subgraph of non-existent edges. |
| Node2Vec Sampling | Perform biased random walks with the Node2Vec algorithm to generate node sequences and node vectors that capture the network structure. |

### 6.2. Compilation
Skip this step if TuGraph has already been compiled.
20 changes: 19 additions & 1 deletion docs/zh-CN/source/9.olap&procedure/3.learn/2.sampling_api.md
@@ -20,7 +20,7 @@ del db
del galaxy
```
## 3. Introduction to Graph Sampling Operators
The graph sampling operators are implemented at the cython layer and sample the input graph. The generated NodeInfo holds vertex information such as feature and label attributes, while EdgeInfo holds edge information; this metadata can be used for tasks such as feature extraction and network embedding. Currently the TuGraph graph learning module supports five sampling operators: GetDB, NeighborSampling, EdgeSampling, RandomWalkSampling, and NegativeSampling.
The graph sampling operators are implemented at the cython layer and sample the input graph. The generated NodeInfo holds vertex information such as feature and label attributes, while EdgeInfo holds edge information; this metadata can be used for tasks such as feature extraction and network embedding. Currently the TuGraph graph learning module supports six sampling operators: GetDB, NeighborSampling, EdgeSampling, RandomWalkSampling, NegativeSampling, and Node2VecSampling.
### 3.1. RandomWalk Operator
Performs a specified number of random walks around the given sampling vertices to obtain a sampled subgraph.

@@ -93,5 +93,23 @@ NodeInfo: a list of dictionaries containing node metadata.
EdgeInfo: a list of dictionaries containing edge metadata.
Return value: none.

### 3.6. Node2VecSampling Operator
Node2Vec is a graph embedding algorithm that generates node sequences through random walks and embeds the nodes into a low-dimensional space using the Word2Vec algorithm. It combines the ideas of DFS and BFS, using the parameters `p` and `q` to control the transition probabilities of the random walk, thereby balancing the capture of community-structure (local) and structural-equivalence (global) information. The embedding vectors can be used for downstream tasks such as node classification, clustering, or link prediction.
```python
def Process(db_: lgraph_db_python.PyGraphDB, olapondb: lgraph_db_python.PyOlapOnDB, feature_num: size_t, p: cython.double, q: cython.double, walk_length: size_t, num_walks: size_t, sample_node: list, NodeInfo: list, EdgeInfo: list):
```
Parameter list:
db_: an instance of the graph database.
olapondb: the graph analysis class.
feature_num: size_t, the length of the generated node vectors.
p: double, a Node2Vec parameter controlling the preference for returning to the previous node.
q: double, a Node2Vec parameter controlling the preference for exploring outward nodes.
walk_length: size_t, the length of each random walk.
num_walks: size_t, the number of random walks to perform.
sample_node: list, the list of vertices to sample from.
NodeInfo: a list of dictionaries containing node metadata.
EdgeInfo: a list of dictionaries containing edge metadata.
Return value: none.

## 4. User-Defined Sampling Algorithm
Users can also implement a custom sampling algorithm through the TuGraph Olap interface; see the interface documentation [here](../2.olap/5.python-api.md). That document mainly introduces the interface design of the functions used by graph sampling algorithms.
2 changes: 1 addition & 1 deletion docs/zh-CN/source/9.olap&procedure/3.learn/3.training.md
@@ -6,7 +6,7 @@
Training with the TuGraph graph learning module can be divided into full-graph training and mini-batch training.
Full-graph training loads the whole graph from the TuGraph db into memory and then trains the GNN, while mini-batch training uses the sampling operators of the TuGraph graph learning module mentioned above to sample the full graph data before feeding it into the training framework.
## 2. Mini-Batch Training
Mini-batch training requires the sampling operators of the TuGraph graph learning module, which currently support Neighbor Sampling, Edge Sampling, Random Walk Sampling, and Negative Sampling.
Mini-batch training requires the sampling operators of the TuGraph graph learning module, which currently support Neighbor Sampling, Edge Sampling, Random Walk Sampling, Negative Sampling, and Node2Vec Sampling.
The sampling result of these operators is returned as a List.
The following uses Neighbor Sampling as an example to show how to convert the sampled result into the format required by the training framework.
The user needs to provide a Sample class:
1 change: 1 addition & 0 deletions learn/CMakeLists.txt
@@ -21,3 +21,4 @@ add_extension2(getdb)
add_extension2(negative_sampling)
add_extension2(neighbors_sampling)
add_extension2(random_walk)
add_extension2(node2vec_sampling)
2 changes: 1 addition & 1 deletion learn/docker/dockerfile
@@ -57,7 +57,7 @@ RUN wget https://tugraph-web.oss-cn-beijing.aliyuncs.com/tugraph/deps/Python-3.6
&& tar xf Python-3.6.9.tgz && cd Python-3.6.9 && ./configure --prefix=/usr/local \
&& make -j16 && make install \
&& python3 -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn \
&& python3 -m pip install pexpect requests pytest httpx cython==3.0.0a11 -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn \
&& python3 -m pip install pexpect requests pytest httpx cython==3.0.0a11 smart-open==6.4.0 gensim -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn \
&& rm -rf /Python*

# install cmake
11 changes: 10 additions & 1 deletion learn/examples/train_cora.py
@@ -23,6 +23,12 @@

rw_len = 2

p = 10
q = 1
walk_length = 10
num_walks = 10


def construct_graph():
src_ids = [0, 2, 3, 4]
dst_ids = [1, 1, 2, 3]
@@ -64,6 +70,9 @@ def process(self, db, olapondb, seed_nodes, NodeInfo, EdgeInfo):
elif args.method == "negative_sampling":
self.algo.Process(db, olapondb, feature_len,
len(seed_nodes), NodeInfo, EdgeInfo)
elif args.method == "node2vec_sampling":
self.algo.Process(db, olapondb, feature_len, p, q, walk_length, num_walks,
seed_nodes, NodeInfo, EdgeInfo)


def sample(self, g, seed_nodes):
@@ -264,7 +273,7 @@ def main(args):
help="the path to store vertex info")
parser.add_argument('--method', type=str, default='neighbors_sampling',
help='sample method:\
neighbors_sampling, edge_sampling, random_walk, negative_sampling')
neighbors_sampling, edge_sampling, random_walk, negative_sampling, node2vec_sampling')
parser.add_argument('--neighbor_sample_num', type=int, default=20,
help='neighbor sampling number.')
parser.add_argument('--randomwalk_length', type=int, default=20,