Create and store embeddings in a Neo4j database

Neo4j’s vector indexes and vector functions allow you to calculate the similarity between node and relationship properties in a graph. A prerequisite for using these features is that vector embeddings have been set as properties of these entities. This page shows how these embeddings can be created and stored as properties on nodes and relationships in a Neo4j database using the GenAI plugin.

For a hands-on guide on how to use the GenAI plugin on a Neo4j database, see Embeddings & Vector Indexes Tutorial → Create embeddings with cloud AI providers.

To learn more about using embeddings in combination with vector indexes, see Cypher → Vector indexes → Vectors and embeddings in Neo4j.

Example graph

The examples on this page use the Neo4j movie recommendations dataset, focusing on the plot and title properties of Movie nodes. There are 9083 Movie nodes with a plot and title property.

To recreate the graph, download and import this dump file into an empty Neo4j database. Dump files can be imported for both Aura and self-managed instances.

The embeddings on this page are generated using the OpenAI model text-embedding-ada-002 (1536-dimensional vectors).

Generate a single embedding and store it

Use the genai.vector.encode() function to generate a vector embedding for a single value.

Signature for genai.vector.encode() Function

genai.vector.encode(resource :: STRING, provider :: STRING, configuration :: MAP = {}) :: LIST<FLOAT>

The resource (a STRING) is the object to transform into an embedding, such as a chunk of text or a node/relationship property.
The provider (a STRING) is the case-insensitive identifier of the provider to use. See identifiers under AI providers for supported options.
The configuration (a MAP) contains provider-specific settings, such as which model to invoke, as well as any required API credentials. See AI providers for details of each supported provider. Note that because this argument may contain sensitive data, it is obfuscated in the query.log. However, if the function call is misspelled or the query is otherwise malformed, it will be logged without being obfuscated.

This function sends one API request every time it is called, which may result in a lot of overhead in terms of both network traffic and latency. If you want to generate many embeddings at once, use Generate a batch of embeddings and store them.
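As a minimal sketch of calling the function on its own, the query below embeds a literal string and returns only the vector's length and first few entries. The sample sentence is an illustrative placeholder, and `$openaiToken` is assumed to be a query parameter holding a valid OpenAI API key.

```cypher
// The input string is an arbitrary placeholder, not taken from the dataset.
// $openaiToken is assumed to be supplied as a query parameter.
WITH genai.vector.encode(
       'A mafia patriarch hands control of his empire to his son.',
       'OpenAI',
       { token: $openaiToken }
     ) AS vector
RETURN size(vector) AS dimensions,   // length of the returned LIST<FLOAT>
       vector[0..4] AS firstEntries  // first four entries only
```

Returning `size(vector)` rather than the full list keeps the output readable while still confirming that the provider responded.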

Use the db.create.setNodeVectorProperty procedure to store an embedding in a node property.

Signature for db.create.setNodeVectorProperty Procedure

db.create.setNodeVectorProperty(node :: NODE, key :: STRING, vector :: ANY)

Use the db.create.setRelationshipVectorProperty procedure to store an embedding in a relationship property.

Signature for db.create.setRelationshipVectorProperty Procedure

db.create.setRelationshipVectorProperty(relationship :: RELATIONSHIP, key :: STRING, vector :: ANY)

node or relationship is the entity in which the new property will be stored.
key (a STRING) is the name of the new property containing the embedding.
vector is the object containing the embedding.

The embeddings are stored as properties on nodes or relationships with the type LIST<INTEGER | FLOAT>.
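For relationships, the call has the same shape as for nodes. The following is only a sketch under stated assumptions: the `REVIEWED` relationship type and its `text` property are hypothetical and are not part of the example dataset, and `$openaiToken` is assumed to be a query parameter.

```cypher
// Hypothetical schema for illustration only:
// (:User)-[:REVIEWED {text: '...'}]->(:Movie) does not exist in the example graph.
MATCH (:User)-[r:REVIEWED]->(:Movie)
WHERE r.text IS NOT NULL
WITH r
LIMIT 1
// Embed the relationship's text property
WITH r, genai.vector.encode(r.text, 'OpenAI', { token: $openaiToken }) AS reviewVector
// Store the embedding on the relationship itself
CALL db.create.setRelationshipVectorProperty(r, 'embedding', reviewVector)
RETURN size(r.embedding) AS dimensions
```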

Example 1. Create an embedding from a single property and store it

Create an embedding property for The Godfather

MATCH (m:Movie {title:'Godfather, The'})
WHERE m.plot IS NOT NULL AND m.title IS NOT NULL
WITH m, m.title || ' ' || m.plot AS titleAndPlot // (1)
WITH m, genai.vector.encode(titleAndPlot, 'OpenAI', { token: $openaiToken }) AS propertyVector // (2)
CALL db.create.setNodeVectorProperty(m, 'embedding', propertyVector) // (3)
RETURN m.embedding AS embedding

1	Concatenate the `title` and `plot` of the `Movie` into a single `STRING`.
2	Create a 1536-dimensional embedding from `titleAndPlot`.
3	Store the `propertyVector` as a new `embedding` property on The Godfather node.

Result (output capped after 4 entries)

+----------------------------------------------------------------------------------------------------+
| embedding                                                                                          |
+----------------------------------------------------------------------------------------------------+
| [0.005239539314061403, -0.039358530193567276, -0.0005175105179660022, -0.038706034421920776, ... ] |
+----------------------------------------------------------------------------------------------------+
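One quick way to confirm the write, sketched here on the assumption that Example 1 ran successfully, is to check the length of the stored list instead of printing it. With text-embedding-ada-002, the value should match the 1536 dimensions noted earlier.

```cypher
// Check that the property exists and has the expected number of dimensions
MATCH (m:Movie {title: 'Godfather, The'})
RETURN m.title AS title,
       size(m.embedding) AS dimensions
```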

Generate a batch of embeddings and store them

Use the genai.vector.encodeBatch procedure to generate many vector embeddings with a single API request. This procedure takes a list of resources as an input, and returns the same number of result rows.

This procedure attempts to generate embeddings for all supplied resources in a single API request. Check the respective provider’s documentation for limits such as the maximum number of embeddings that can be generated per request.

Signature for genai.vector.encodeBatch Procedure

genai.vector.encodeBatch(resources :: LIST<STRING>, provider :: STRING, configuration :: MAP = {}) :: (index :: INTEGER, resource :: STRING, vector :: LIST<FLOAT>)

The resources (a LIST<STRING>) parameter is the list of objects to transform into embeddings, such as chunks of text.
The provider (a STRING) is the case-insensitive identifier of the provider to use. See AI providers for supported options.
The configuration (a MAP) specifies provider-specific settings such as which model to invoke, as well as any required API credentials. See AI providers for details of each supported provider. Note that because this argument may contain sensitive data, it is obfuscated in the query.log. However, if the procedure call is misspelled or the query is otherwise malformed, it will be logged without being obfuscated.

Each returned row contains the following columns:

The index (an INTEGER) is the index of the corresponding element in the input list, to aid in correlating results back to inputs.
The resource (a STRING) is the name of the input resource.
The vector (a LIST<FLOAT>) is the generated vector embedding for this resource.
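To see the three columns side by side, the procedure can be called on a small literal list. This is a minimal sketch: the three input strings are arbitrary placeholders, and `$openaiToken` is assumed to be a query parameter.

```cypher
// Three placeholder strings are sent to the provider in a single request
CALL genai.vector.encodeBatch(
  ['Toy Story', 'Jumanji', 'Heat'],
  'OpenAI',
  { token: $openaiToken }
) YIELD index, resource, vector
RETURN index,                          // position of the input in the list
       resource,
       size(vector) AS dimensions      // length of each embedding
ORDER BY index
```

One row is returned per input, so `index` can be used to look up the matching element in the original list.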

Example 2. Create embeddings from a limited number of nodes and store them

MATCH (m:Movie WHERE m.plot IS NOT NULL)
WITH m
LIMIT 20
WITH collect(m) AS moviesList // (1)
WITH moviesList, [movie IN moviesList | movie.title || ': ' || movie.plot] AS batch // (2)
CALL genai.vector.encodeBatch(batch, 'OpenAI', { token: $openaiToken }) YIELD index, vector
WITH moviesList, index, vector
CALL db.create.setNodeVectorProperty(moviesList[index], 'embedding', vector) // (3)

1	Collect all 20 `Movie` nodes into a `LIST<NODE>`.
2	The list comprehension (`[]`) concatenates the `title` and `plot` of each movie in `moviesList` into a single `STRING`, producing a new `LIST<STRING>`.
3	`db.create.setNodeVectorProperty` is run for each `vector` returned by `genai.vector.encodeBatch`, and stores that vector as a property named `embedding` on the corresponding node.
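To check progress after a batch run, a query along the following lines counts how many `Movie` nodes with a plot already carry an `embedding` property and how many still lack one. It is a sketch that relies only on property names used on this page.

```cypher
// count(m.embedding) counts only rows where the property is not null
MATCH (m:Movie)
WHERE m.plot IS NOT NULL
RETURN count(m.embedding) AS withEmbedding,
       count(*) - count(m.embedding) AS withoutEmbedding
```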

Example 3. Create embeddings from a large number of nodes and store them

MATCH (m:Movie WHERE m.plot IS NOT NULL)
WITH collect(m) AS moviesList, // (1)
     count(*) AS total,
     100 AS batchSize // (2)
UNWIND range(0, total-1, batchSize) AS batchStart // (3)
CALL (moviesList, batchStart, batchSize) { // (4)
    WITH [movie IN moviesList[batchStart .. batchStart + batchSize] | movie.title || ': ' || movie.plot] AS batch // (5)
    CALL genai.vector.encodeBatch(batch, 'OpenAI', { token: $openaiToken }) YIELD index, vector
    CALL db.create.setNodeVectorProperty(moviesList[batchStart + index], 'embedding', vector) // (6)
} IN CONCURRENT TRANSACTIONS OF 1 ROW // (7)

1	Collect all returned `Movie` nodes into a `LIST<NODE>`.
2	`batchSize` defines the number of nodes in `moviesList` to be processed at once. Because vector embeddings can be very large, a larger batch size may require significantly more memory on the Neo4j server. Too large a batch size may also exceed the provider’s per-request limit.
3	Process `Movie` nodes in increments of `batchSize`. The end range `total-1` is due to `range` being inclusive on both ends.
4	A `CALL` subquery executes a separate transaction for each batch. Note that this `CALL` subquery uses a variable scope clause to import variables. If you are using an older version of Neo4j, use an importing `WITH` clause instead.
5	`batch` is a list of strings, each being the concatenation of `title` and `plot` of one movie.
6	The procedure sets `vector` as the value of the property named `embedding` on the node at position `batchStart + index` in `moviesList`.
7	`OF 1 ROW` sets the number of batches processed per transaction to `1`, so each batch is committed separately (see `CALL` subqueries → Concurrent transactions).
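The batching arithmetic can be tried without calling any provider. The sketch below replaces the `Movie` nodes with ten placeholder integers and uses a batch size of 4, both chosen purely for illustration.

```cypher
// 10 placeholder items stand in for the collected Movie nodes
WITH range(0, 9) AS items, 4 AS batchSize
// range() is inclusive on both ends, hence size(items) - 1
UNWIND range(0, size(items) - 1, batchSize) AS batchStart
RETURN batchStart,
       items[batchStart .. batchStart + batchSize] AS batch
```

This returns three rows with `batchStart` values 0, 4 and 8. The last batch is `[8, 9]`, because a list slice that runs past the end of the list is truncated and does not raise an error. That is why the final, shorter batch needs no special handling.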

This example may not scale to larger datasets, as collect(m) requires the whole result set to be loaded in memory. For an alternative method more suitable to processing large amounts of data, see GenAI documentation - Embeddings & Vector Indexes Tutorial → Create embeddings with cloud AI providers.