Loading and streaming back data with Apache Arrow

Follow along with a notebook in Google Colab.

The Enterprise Edition of GDS installed on AuraDS includes an Arrow Flight server, configured and running by default. The Arrow Flight server speeds up data-intensive processes such as:

  • Creating a graph directly from in-memory data.

  • Streaming node and relationship properties.

  • Streaming the relationship topology of a graph.

There are two ways to use the Arrow Flight server with GDS:

  1. By using the GDS Python client, which includes an Arrow Flight client.

  2. By implementing a custom Arrow Flight client as explained in the GDS manual.

The following examples use the GDS Python client, as it is the most convenient option. All the loading and streaming methods can be used without Arrow, but they are more efficient if Arrow is available.

Setup

%pip install 'graphdatascience>=1.7'

from graphdatascience import GraphDataScience

# Replace with the actual connection URI and credentials
AURA_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = ""

# When initialized, the client tries to use Arrow if it is available on the server.
# This behaviour is controlled by the `arrow` parameter, which is set to `True` by default.
gds = GraphDataScience(AURA_CONNECTION_URI, auth=(AURA_USERNAME, AURA_PASSWORD), aura_ds=True)

# Necessary if Arrow is enabled (as it is by default on Aura)
gds.set_database("neo4j")

You can call the gds.debug.arrow() method to verify that Arrow is enabled and running:

gds.debug.arrow()
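
The status returned by this call can also be checked programmatically. The following is a minimal sketch, assuming the result exposes `enabled` and `running` fields; these field names are an assumption that should be verified against your GDS version, and `arrow_ready` is a hypothetical helper, not part of the client.

```python
from typing import Any, Mapping


def arrow_ready(info: Mapping[str, Any]) -> bool:
    """Return True only if the Arrow server is both enabled and running.

    `info` is any mapping-like object with a `.get` method, for example the
    value returned by gds.debug.arrow(). The field names `enabled` and
    `running` are assumed here, not taken from this page.
    """
    return bool(info.get("enabled", False)) and bool(info.get("running", False))
```

With a connected client, this could be called as `arrow_ready(gds.debug.arrow())`, assuming the returned object supports `.get` lookups by field name.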

Loading data

You can load data directly into a graph using the gds.graph.construct client method.

The data must be a Pandas DataFrame, so we need to install and import the pandas library.

%pip install pandas

import pandas as pd

We can then create a graph as in the following example. The format of each DataFrame with the required columns is specified in the GDS manual.

nodes = pd.DataFrame(
    {
        "nodeId": [0, 1, 2],
        "labels": ["Article", "Article", "Article"],
        "pages": [3, 7, 12],
    }
)

relationships = pd.DataFrame(
    {
        "sourceNodeId": [0, 1],
        "targetNodeId": [1, 2],
        "relationshipType": ["CITES", "CITES"],
        "times": [2, 1]
    }
)

article_graph = gds.graph.construct(
    "article-graph",
    nodes,
    relationships
)

Now we can check that the graph has been created:

gds.graph.list()
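
A DataFrame with a missing or misspelled column is an easy mistake to make before calling gds.graph.construct. The sketch below checks column names up front. It assumes that `nodeId` is the only required node column and that `sourceNodeId` and `targetNodeId` are the required relationship columns, with everything else treated as optional or as a property; confirm this against the GDS manual. The helper `missing_columns` and both constants are hypothetical names introduced for illustration.

```python
from typing import Iterable, List

# Assumed required columns; any other column is treated as optional
# (labels, relationshipType) or as a property (pages, times).
REQUIRED_NODE_COLUMNS = ("nodeId",)
REQUIRED_RELATIONSHIP_COLUMNS = ("sourceNodeId", "targetNodeId")


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Column names taken from the example DataFrames above.
node_columns = ["nodeId", "labels", "pages"]
relationship_columns = ["sourceNodeId", "targetNodeId", "relationshipType", "times"]

print(missing_columns(node_columns, REQUIRED_NODE_COLUMNS))
print(missing_columns(relationship_columns, REQUIRED_RELATIONSHIP_COLUMNS))
```

Both calls print an empty list for the example data. With real DataFrames, the column names can be passed directly, for example `missing_columns(nodes.columns, REQUIRED_NODE_COLUMNS)`.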

Streaming node and relationship properties

After creating the graph, you can read the node and relationship properties as streams.

# Read all the values for the node property `pages`
gds.graph.nodeProperties.stream(article_graph, "pages")
# Read all the values for the relationship property `times`
gds.graph.relationshipProperties.stream(article_graph, "times")

Performance

To see the difference in performance when Arrow is available, we can measure the time needed to load a dataset into a graph. In this example we use a built-in OGBN dataset, so we need to install the ogb extra.

%pip install 'graphdatascience[ogb]>=1.7'

# Load and immediately drop the dataset to download and cache the data
ogbn_arxiv = gds.graph.ogbn.load("ogbn-arxiv")
ogbn_arxiv.drop()

We can then time the loading process. On an 8 GB AuraDS instance, this should take less than 30 s.

%%timeit -n 1 -r 1

# This call uses the cached dataset, so only the actual loading is timed
ogbn_arxiv = gds.graph.ogbn.load("ogbn-arxiv")

With Arrow disabled (by passing arrow=False to the GraphDataScience constructor), the same loading process takes more than 1 minute. With this dataset, Arrow therefore provides at least a 2x speedup.
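
Outside a notebook, where the %%timeit magic is not available, the same comparison can be made with a small timing helper. This is a minimal sketch using only the standard library; `timed` and `speedup` are hypothetical helper names, and the 60 s and 30 s figures are the boundary values quoted in the text, not new measurements.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run `fn` once and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def speedup(baseline_seconds: float, improved_seconds: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if improved_seconds <= 0:
        raise ValueError("improved_seconds must be positive")
    return baseline_seconds / improved_seconds


# Boundary figures from the text: 60 s without Arrow, 30 s with Arrow.
print(speedup(60.0, 30.0))  # 2.0
```

With a connected client, the load could be timed as `graph, seconds = timed(lambda: gds.graph.ogbn.load("ogbn-arxiv"))`, once with the default client and once with a client created with arrow=False.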

Cleanup

article_graph.drop()
ogbn_arxiv.drop()

gds.close()