Segment Based on Customer Behavior - Not Characteristics - with Louvain

Every meaningful marketing decision starts with segmentation—who to target, what to recommend, and how to personalize an experience. The quality of those decisions depends entirely on how customers are grouped.

There are two fundamentally different ways to think about customer segmentation. We can group customers by who they are, or by what they do.

Overview

What Is Neo4j Graph Analytics For Snowflake?

Neo4j helps organizations find hidden relationships and patterns across billions of data connections deeply, easily, and quickly. Neo4j Graph Analytics for Snowflake brings to the power of graph directly to Snowflake, allowing users to run 65+ ready-to-use algorithms on their data, all without leaving Snowflake!

Prerequisites

What You Will Need

  • A Snowflake account with appropriate access to databases and schemas.

  • Neo4j Graph Analytics application installed from the Snowflake marketplace. Access the marketplace via the menu bar on the left hand side of your screen, as seen below: image

What You Will Build

  • A method to identify communities that are at high risk of fraud in P2P networks

What You Will Learn

  • How to prepare and project your data for graph analytics

  • How to read and write directly from and to your snowflake tables

Loading The Data

Dataset overview : This dataset is a subset of instakart data and can be found here.

Let’s name our database RETAIL_RECS. We are going to add five new tables:

  • One called aisles based on the aisles.csv

  • One called baskets based on baskets.csv

  • One called departments based on departments.csv

  • One called order_history based on order_history.csv

  • One called products based on products.csv

Follow the steps found here to load in your data.

Then switch to that database:

USE DATABASE RETAIL_RECS;
USE SCHEMA PUBLIC;

Segmentation by Characteristics

In a traditional approach like K-means, each customer is first reduced to a small set of characteristics. These characteristics are fixed numeric attributes — things like average basket size or reorder tendency — that at best, these approaches cluster customers by summarized behavior and at worst, by characteristics that entirely miss how they are actually behaving.

Once every customer is represented as a point in feature space, K-means groups together customers who look numerically similar.

Let’s first prep our data for this approach:

CREATE OR REPLACE TABLE ORDER_ITEM_COUNTS AS
SELECT
  order_id,
  COUNT(*) AS items_in_order
FROM BASKETS
GROUP BY 1;

We are going to create two features: reorder rate and average items per order. These features do get at a given customer’s behavior, specifically how many things they buy each time and how often they reorder a item.

CREATE OR REPLACE TABLE USER_KMEANS_FEATURES AS
WITH ORDER_STATS AS (
  SELECT
    order_id,
    COUNT(*) AS items_in_order,
    AVG(reordered::FLOAT) AS reorder_rate_in_order
  FROM BASKETS
  GROUP BY 1
)
SELECT
  o.user_id,
  AVG(s.items_in_order::FLOAT)       AS avg_items_per_order,
  AVG(s.reorder_rate_in_order)       AS reorder_rate
FROM ORDERS o
JOIN ORDER_STATS s
  ON o.order_id = s.order_id
GROUP BY 1;
select * from user_kmeans_features
USER_ID AVG_ITEMS_PER_ORDER REORDER_RATE

202279

9

0.6666666666666666

205970

8

1

178520

13

0.9230769230769231

156122

26

0.8076923076923077

142903

2

0

Now that we have our features, we will take them out of SQL and load them into python where we will perform our analysis. With Graph Analytics for Snowflake, you can avoid this data movement and run your algorithm directly on SQL tables.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

df = session.table("USER_KMEANS_FEATURES").select(
    "USER_ID", "AVG_ITEMS_PER_ORDER", "REORDER_RATE"
).dropna()

pdf = df.to_pandas()
pdf.head()
USER_ID AVG_ITEMS_PER_ORDER REORDER_RATE

202279

9

0.6666666666666666

205970

8

1

178520

13

0.9230769230769231

156122

26

0.8076923076923077

142903

2

0

import os
os.environ["LOKY_MAX_CPU_COUNT"] = "1"
os.environ["JOBLIB_MULTIPROCESSING"] = "0"
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

X = pdf[["AVG_ITEMS_PER_ORDER", "REORDER_RATE"]].to_numpy()

X_scaled = MinMaxScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=42, algorithm="lloyd")
pdf["CLUSTER"] = km.fit_predict(X_scaled)

Let’s take a look at our final result. We sorted our customers into three distinct groups based on how much they buy when they go to the store and how many of their items are reorders.

But these are summaries of customer’s behavior, and if we pause and consider what exactly we are modeling, we will be able to poke some holes. Does a teacher reordering pencils form a meaningful group with a new father reordering baby formula? From a marketing standpoint, likely not.

When customer behavior is flattened into these summaries, we necessarily lose information. That is the fundamental challenge in modeling based on characteristics rather than the behavior itself.

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(
    pdf["AVG_ITEMS_PER_ORDER"],
    pdf["REORDER_RATE"],
    c=pdf["CLUSTER"]
)
plt.xlabel("avg_items_per_order")
plt.ylabel("reorder_rate")
plt.title("K-means customer clusters")
plt.show()

image

Model Based on Behavior with Louvain

With a graph-based approach, we don’t summarize behavior into attributes. Instead, we model behavior directly.

Customer segmentation in graph analytics mirrors how students choose tables in a cafeteria. Students who interact frequently sit together; likewise, customers who interact with the same items naturally form groups. Thats how graph algorithms, like louvain, segment customers based on directly on their behavior

In our example, customers are connected to the products they purchase, forming a network of interactions.

Louvain finds communities in this network — groups of customers who are tightly connected through shared purchasing behavior.

We will create three tables. Two are node tables for the customers and the products. The third is an edge list showing customer purchase events.

CREATE OR REPLACE VIEW CUSTOMERS AS
SELECT DISTINCT
  user_id AS nodeId
FROM ORDERS;

CREATE OR REPLACE VIEW PRODUCTS_NODES AS
SELECT
  product_id AS nodeId
FROM PRODUCTS;

CREATE OR REPLACE VIEW PURCHASES AS
SELECT
  o.user_id    AS sourceNodeId,
  b.product_id AS targetNodeId
FROM ORDERS o
JOIN BASKETS b
  ON o.order_id = b.order_id;

Granting Permissions

Next we grant the necessary permissions:

USE ROLE ACCOUNTADMIN;
-- Create a consumer role for users and admins of the GDS application
CREATE ROLE IF NOT EXISTS gds_user_role;
CREATE ROLE IF NOT EXISTS gds_admin_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_user TO ROLE gds_user_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_admin TO ROLE gds_admin_role;

CREATE DATABASE ROLE IF NOT EXISTS gds_db_role;
GRANT DATABASE ROLE gds_db_role TO ROLE gds_user_role;
GRANT DATABASE ROLE gds_db_role TO APPLICATION neo4j_graph_analytics;

-- Grant access to consumer data
GRANT USAGE ON DATABASE RETAIL_RECS TO ROLE gds_user_role;
GRANT USAGE ON SCHEMA RETAIL_RECS.PUBLIC TO ROLE gds_user_role;

-- Required to read tabular data into a graph
GRANT SELECT ON ALL TABLES IN DATABASE RETAIL_RECS TO DATABASE ROLE gds_db_role;

-- Ensure the consumer role has access to created tables/views
GRANT ALL PRIVILEGES ON FUTURE TABLES IN SCHEMA RETAIL_RECS.PUBLIC TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA RETAIL_RECS.PUBLIC TO DATABASE ROLE gds_db_role;
GRANT CREATE TABLE ON SCHEMA RETAIL_RECS.PUBLIC TO DATABASE ROLE gds_db_role;
GRANT CREATE VIEW ON SCHEMA RETAIL_RECS.PUBLIC TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON FUTURE VIEWS IN SCHEMA RETAIL_RECS.PUBLIC TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL VIEWS IN SCHEMA RETAIL_RECS.PUBLIC TO DATABASE ROLE gds_db_role;

-- Compute and warehouse access
GRANT USAGE ON WAREHOUSE NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE TO APPLICATION neo4j_graph_analytics;
use role gds_role;

Running our Algorithms

We apply Louvain to a customer–item purchase graph to identify communities of customers with similar buying behavior. These communities emerge from shared purchase patterns, rather than predefined customer attributes.

CALL Neo4j_Graph_Analytics.graph.louvain('CPU_X64_XS', {
    'defaultTablePrefix': 'retail_recs.public',
    'project': {
        'nodeTables': [ 'CUSTOMERS', 'PRODUCTS_NODES'],
        'relationshipTables': {
            'PURCHASES': {
                'sourceTable': 'CUSTOMERS',
                'targetTable': 'PRODUCTS_NODES',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'mutateProperty': 'community_id'
    },
    'write': [{
        'nodeLabel': 'CUSTOMERS',
        'outputTable': 'CUSTOMERS_GROUP',
        'nodeProperty': 'community_id'
    }]
});

And now we have a different customer segments derived directly from those customer’s behavior rather than the customer’s characteristics:

select  community_id, count(*) as node_count from customers_group group by community_id
COMMUNITY_ID NODE_COUNT

4970

91

4952

43

5277

47

5380

30