Skip to main content

Neo4j

This document consists of the following sections:

Intro

The following document describes the current Neo4j setup. Neo4j is a Graph database that we're using to build Developer Graph V1. Full PRD can be found here. It also describes first use case that will be implemented.

Exporting data from PostgreSQL

The data inside graph DB is represented using 3 structures: nodes, node properties and relationships between nodes. In order to have data for the first use case we need repositories and users who contributed to them.

Using this SQL script we can get all repos (without forks) with more than 100 stars:

SELECT id, github_id, name, full_name, stars_count
FROM public.github_repo
WHERE stars_count > 100
AND fork = false;

Using this SQL script we can get users who contributed to those repos:

SELECT gc.repo_id, gc.user_id, gc.contributions, gu.github_id, gu.login, gu.owner_type
FROM public.github_contributor as gc
INNER JOIN public.github_repo as gr on gc.repo_id = gr.id
INNER JOIN public.github_user as gu on gc.user_id = gu.id
WHERE gr.stars_count > 100
AND gr.fork = false
AND gu.owner_type = 'User';

Using pgAdmin, export results of these SQL queries into CSV files.

Setup of new Neo4j AuraDB instance inside the archipelo-dev project on GCP

To create new Neo4j AuraDB instance, follow next steps:

  1. Go to Neo4j section in GCP console
  2. Make sure that Neo4j API is enabled. See here for details
  3. Click on "Manage on Provider" button
  4. Neo4j AuraDB Console will open up
  5. Go to Instances
  6. Click "New Instance"
  7. Select 2GB/1CPU instance, type name for it, select Neo4j version and GCP region (or leave all as default)
  8. Check "I understand" checkbox and click "Create Instance"
  9. Window with user/password and connection file will appear - save all of it (here is the link to the credentials file for existing Neo4j DB in dev environment)
  10. Wait until the instance will be in the "Running" state

How to connect to Neo4j DB

There are 2 ways of how you can connect using Neo4j Browser via Web Browser or via Neo4j Desktop.

Via Browser:

  1. Go to Neo4j console
  2. Open Instances
  3. Click on "Query" button for the instance you want to connect to
  4. Neo4j Browser will open in new tab - wait until it will connect to the instance

Via Neo4j Desktop:

  1. Download Neo4j Desktop from here
  2. Install it following provided instructions
  3. Open Neo4j Desktop
  4. Create new Project or use pre-created one
  5. In front of Project name click "Add" button and select "Remote connection"
  6. Follow the wizard to connect to the Neo4j AuraDB instance in GCP (use connection file that you've downloaded for details)
  7. After first successful connection, it is recommended to reset the password
  8. Run the next command in the query window: :server change-password
  9. Change password in the opened window.

How to import data into Neo4j DB using Cypher

The current dataset consist of GitHub repositories, users and contributions from the users to repositories. Here are the list of steps that we need to do in order to import dataset into the Neo4j DB:

  1. Upload CSV files to the public web storage (GCP bucket, Google Drive, Dropbox, etc)
  2. Create indexes (constraints) in Neo4j DB before importing data - this will increase performance tremendously.
  3. Import repositories data
  4. Import users data
  5. Import contributions data

Upload CSV files to the public web storage

Use any kind of web storage that provide public access without any authentication. Upload 2 CSV files from "exporting data" steps into the storage and copy links to them.

Create indexes

In Neo4j Browser run the next query - first two line will create indexes and last one will wait until all indexes are created - this in just a precaution to make sure that creation process is complete and indexes are ready to use:

CREATE CONSTRAINT repo_id ON (r:Repo) ASSERT r.repoID IS UNIQUE;
CREATE CONSTRAINT dev_id ON (d:Developer) ASSERT d.devID IS UNIQUE;
CALL db.awaitIndexes();

There is no expected output from this query.

Import repositories data

Replace REPLACE_THIS_WITH_URL with URL to the CSV file with repositories and run the next query:

:auto LOAD CSV WITH HEADERS FROM 'REPLACE_THIS_WITH_URL' AS row
CALL {
WITH row
MERGE (repo:Repo {repoID: row.id})
ON CREATE SET repo.githubID = row.github_id,
repo.name = row.name,
repo.fullName = row.full_name,
repo.starsCount = toInteger(row.stars_count)
} IN TRANSACTIONS OF 10000 ROWS;

Example output:

Added 46924 labels, created 46924 nodes, set 234620 properties, completed after 3965 ms.

Import users data

Replace REPLACE_THIS_WITH_URL with URL to the CSV file with users and contributions and run the next query:

:auto LOAD CSV WITH HEADERS FROM 'REPLACE_THIS_WITH_URL' AS row
CALL {
WITH row
MERGE (dev:Developer {devID: row.user_id})
ON CREATE SET dev.githubID = row.github_id,
dev.githubLogin = row.login
} IN TRANSACTIONS OF 10000 ROWS;

Example output:

Added 46924 labels, created 46924 nodes, set 234620 properties, completed after 3965 ms.

Import contributions data

Replace REPLACE_THIS_WITH_URL with URL to the CSV file with users and contributions and run the next query:

:auto LOAD CSV WITH HEADERS FROM 'REPLACE_THIS_WITH_URL' AS row
CALL {
WITH row
MATCH (repo:Repo {repoID: row.repo_id})
MATCH (dev:Developer {devID: row.user_id})
MERGE (dev)-[dr:CONTRIBUTED_TO]->(repo)
ON CREATE SET dr.contributions = toInteger(row.contributions)
} IN TRANSACTIONS OF 10000 ROWS;

Example output:

Set 952456 properties, created 952456 relationships, completed after 51063 ms.

Query examples using Cypher

Here is the query that implements the first use case that we derived:

MATCH (d:Developer {devID: "2def4d9c-5ff9-4b75-98c8-123cfbe9267b"})
MATCH (r:Repo {repoID: "51e8aacc-9d39-48c3-b324-af16ab67e020"})
MATCH (d)-[od:CONTRIBUTED_TO]->(r)<-[od2:CONTRIBUTED_TO]-(d2:Developer)
MATCH (r2:Repo {repoID: "f4b76241-4b79-41d6-a23b-bb1e0cb8dd1b"})
MATCH (d2)-[od3:CONTRIBUTED_TO]->(r2)
RETURN r,d,od,d2,od2,od3,r2;

Get all developers who contributed to one repository:

MATCH (r:Repo {repoID: "51e8aacc-9d39-48c3-b324-af16ab67e020"})
MATCH (d:Developer)-[od:CONTRIBUTED_TO]->(r)
RETURN r,d,od;