Getting Started with JanusGraph and Python

Introduction

The following introduction is from the official JanusGraph website:

JanusGraph can be queried from all languages for which a TinkerPop driver exists. Drivers allow sending of Gremlin traversals to a Gremlin Server like the JanusGraph Server. A list of TinkerPop drivers is available on TinkerPop’s homepage.

In addition to drivers, there exist query languages for TinkerPop that make it easier to use Gremlin in different programming languages like Java, Python, or C#. Some of these languages even construct Gremlin traversals from completely different query languages like Cypher or SPARQL. Since JanusGraph implements TinkerPop, all of these languages can be used together with JanusGraph.

Install JanusGraph

We will install JanusGraph and configure it to use Cassandra as the storage backend and Elasticsearch for searching and indexing. For development we can use local instances of Cassandra and Elasticsearch, which conveniently ship in the JanusGraph installation package. In production, we will reconfigure JanusGraph to use production Cassandra and Elasticsearch servers.

First, we need to download JanusGraph from its releases page. We will use the newest version at the time of writing, 0.4.0.

In [2]:
!curl -L -O https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip
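A download of this size can easily be interrupted, so it may be worth checking the archive before unzipping it. The sketch below uses only the standard library; the helper name `archive_is_complete` is our own and not part of JanusGraph.

```python
import zipfile


def archive_is_complete(path):
    """Return True if `path` is a readable zip whose members all pass their CRC check."""
    # A truncated download has no end-of-central-directory record,
    # so is_zipfile() reports False for it (and for a missing file).
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None


# Example call in the notebook (file name taken from the download step):
# archive_is_complete("janusgraph-0.4.0-hadoop2.zip")
```

If the check returns False, the simplest fix is to run the download again.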

Unzip the downloaded package and move into the unzipped directory.

In [ ]:
import os

!unzip janusgraph-0.4.0-hadoop2.zip
os.chdir('janusgraph-0.4.0-hadoop2/')
!ls

We are almost ready to start JanusGraph. First we need to change its configuration file.

We will back up the original configuration file and replace it with a new one.

In [5]:
!cp conf/gremlin-server/gremlin-server.yaml conf/gremlin-server/gremlin-server.yaml.orig
!cp conf/gremlin-server/gremlin-server-configuration.yaml conf/gremlin-server/gremlin-server.yaml
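If this cell is run a second time, the first `cp` would overwrite the backup with the already replaced file. A minimal sketch of a backup that is only taken once, assuming we are happy to do it from Python instead of the shell (the helper `backup_once` is hypothetical):

```python
import shutil
from pathlib import Path


def backup_once(config_path, suffix=".orig"):
    """Copy `config_path` to `config_path + suffix` unless that backup already exists."""
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + suffix)
    if not backup_path.exists():
        # copy2 also preserves timestamps and permissions.
        shutil.copy2(config_path, backup_path)
    return backup_path


# Example call (path taken from the cell above):
# backup_once("conf/gremlin-server/gremlin-server.yaml")
```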

Now we can start JanusGraph.

In [7]:
!bin/janusgraph.sh start
Forking Cassandra...
Running `nodetool statusthrift`.. OK (returned exit status 0 and printed string "running").
Forking Elasticsearch...
Connecting to Elasticsearch (127.0.0.1:9200)..... OK (connected to 127.0.0.1:9200).
Forking Gremlin-Server...
Connecting to Gremlin-Server (127.0.0.1:8182)...... OK (connected to 127.0.0.1:8182).
Run gremlin.sh to connect.

As we can see, JanusGraph automatically starts Cassandra and Elasticsearch for us, so there is very little to do in this step.
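The start script checks the ports itself, but when we drive everything from Python it can be handy to repeat the check, for example before opening a connection. The following is a small sketch using plain sockets; the function names are our own, and 127.0.0.1:8182 is the Gremlin Server address printed in the output above.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=30, delay=1.0):
    """Poll until the port accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


# Example call for the Gremlin Server started above:
# wait_for_port("127.0.0.1", 8182)
```

An open port only shows that the process is listening; it does not prove that the server has finished loading its graphs.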

Some useful commands of the janusgraph.sh script:

  • janusgraph.sh start: Start JanusGraph.
  • janusgraph.sh stop: Stop JanusGraph.
  • janusgraph.sh clean: Delete everything in the storage backend and the search engine, giving us a fresh start.

To interact with the JanusGraph server, or the Gremlin Server to be exact, we can use the Gremlin Console, the Python package gremlinpython, or other clients. For now we will use the Gremlin Console, which also comes with JanusGraph in the same installation package.

In the terminal inside the unzipped directory, issue the following command:

bin/gremlin.sh

This command starts the Gremlin Console. We have to use the Gremlin language to interact with the Gremlin Server. Issue the following commands inside the Gremlin Console to connect it to the Gremlin Server (the JanusGraph server):

:remote connect tinkerpop.server conf/remote.yaml session
:remote console

That's it: JanusGraph is now installed and configured for development, and we can use Gremlin to interact with it. In the next section, we will load a graph from a GraphML file into JanusGraph.

Importing a graph into JanusGraph

For this part, we will use the airport route data set from Kelvin Lawrence’s book Practical Gremlin. You can download the GraphML file here.

Download the file into the /tmp folder for easy reference.

In [8]:
# -L follows GitHub's redirect to the raw file; without it only a short redirect page is saved
!curl -L -o /tmp/air-routes.graphml https://github.com/krlawrence/graph/raw/master/sample-data/air-routes.graphml
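Because GitHub answers this URL with a redirect, it is worth confirming that the saved file really is GraphML and not a short HTML redirect page. A rough check, assuming the `<graphml` root tag appears within the first mebibyte of the file (the helper name is ours):

```python
def looks_like_graphml(path, probe_bytes=1024 * 1024):
    """Return True if the start of the file contains a <graphml root tag."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    return b"<graphml" in head


# Example call (path taken from the download step):
# looks_like_graphml("/tmp/air-routes.graphml")
```

A file of only a few hundred bytes that fails this check is most likely the redirect page.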

In the Gremlin Console, which is connected to the Gremlin Server, issue the following commands to create a configuration for the graph we are about to load into our database:

map = new HashMap<String, Object>();
map.put("storage.backend", "cql");
map.put("storage.hostname", "127.0.0.1");
map.put("graph.graphname", "airroutes");
ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map));

graph=ConfiguredGraphFactory.open("airroutes");

This creates a graph named "airroutes" that is stored through the cql storage backend (an adapter for Cassandra; the older cassandrathrift adapter, which will soon be deprecated, is not used here) at the local address 127.0.0.1.

With the graph created, we can load its content from the downloaded GraphML file into JanusGraph. To do so, issue the following in the Gremlin Console:

graph.io(graphml()).readGraph('/tmp/air-routes.graphml')
graph.tx().commit()

To get a list of graphs we have created use the command below:

gremlin> ConfiguredGraphFactory.getGraphNames()
==>airroutes

Now we have a graph named "airroutes" in our graph database. The next time we connect to the Gremlin Server, we only need to issue the following command to open the "airroutes" graph:

gremlin> graph = ConfiguredGraphFactory.open('airroutes')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]

Let's try some traversals on the graph:

gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().values('code').count()
==>3619
gremlin> departure_airport="SFO"
==>SFO
gremlin> arrival_airport="JFK"
==>JFK
gremlin> g.V().has('code', departure_airport).repeat(out('route').simplePath()).times(2).has('code', arrival_airport).path().by('code').limit(5)
==>[SFO, ATL, JFK]
==>[SFO, DFW, JFK]
==>[SFO, DCA, JFK]
==>[SFO, TPA, JFK]
==>[SFO, LGB, JFK]
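To make clear what `repeat(out('route').simplePath()).times(2)` computes, here is a plain-Python sketch of the same idea on a toy adjacency map. The routes in `TOY_ROUTES` are invented for illustration and are not taken from the data set; the function only mirrors the logic and does not talk to JanusGraph.

```python
# Hypothetical toy data: airport code -> codes reachable by one 'route' edge.
TOY_ROUTES = {
    "SFO": ["ATL", "DFW", "JFK"],
    "ATL": ["JFK", "SFO"],
    "DFW": ["JFK"],
    "JFK": ["SFO"],
}


def simple_paths(routes, start, end, hops=2, limit=5):
    """Paths of exactly `hops` edges from start to end that never revisit an airport."""
    paths = [[start]]
    for _ in range(hops):                      # times(hops)
        extended = []
        for path in paths:
            for neighbour in routes.get(path[-1], []):   # out('route')
                if neighbour not in path:                # simplePath()
                    extended.append(path + [neighbour])
        paths = extended
    # has('code', end) followed by limit(limit)
    return [path for path in paths if path[-1] == end][:limit]


print(simple_paths(TOY_ROUTES, "SFO", "JFK"))
# prints [['SFO', 'ATL', 'JFK'], ['SFO', 'DFW', 'JFK']]
```

Note that the direct SFO to JFK route is not returned, because the traversal asks for exactly two hops.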

Connect to Gremlin Server using Python

Gremlin Server doesn't support Python out of the box, so we need to install gremlin-python on the Gremlin Server. Install the version that is compatible with the version of JanusGraph you are using. For JanusGraph 0.4.0 we have:

Tested Compatibility:

  • Apache Cassandra 2.2.10, 3.0.14, 3.11.0
  • Apache HBase 1.2.6, 1.3.1, 1.4.10, 2.1.5
  • Google Bigtable 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0
  • Oracle BerkeleyJE 7.5.11
  • Elasticsearch 5.6.14, 6.0.1, 6.6.0
  • Apache Lucene 7.0.0
  • Apache Solr 7.0.0
  • Apache TinkerPop 3.4.1
  • Java 1.8

So we need gremlin-python version 3.4.1, the same version as TinkerPop.

In [ ]:
!bin/gremlin-server.sh install org.apache.tinkerpop gremlin-python 3.4.1

We also need a pip package named gremlinpython that provides Python APIs for us to access the gremlin server.

In [25]:
!pip install gremlinpython==3.4.1
Requirement already satisfied: gremlinpython==3.4.1 in /Volumes/Data/ethan/.env_tf20_p37/lib/python3.7/site-packages (3.4.1)
Requirement already satisfied: aenum>=1.4.5 in /Volumes/Data/ethan/.env_tf20_p37/lib/python3.7/site-packages (from gremlinpython==3.4.1) (2.2.1)
Requirement already satisfied: isodate>=0.6.0 in /Volumes/Data/ethan/.env_tf20_p37/lib/python3.7/site-packages (from gremlinpython==3.4.1) (0.6.0)
Collecting tornado<5.0,>=4.4.1 (from gremlinpython==3.4.1)
Requirement already satisfied: six>=1.10.0 in /Volumes/Data/ethan/.env_tf20_p37/lib/python3.7/site-packages (from gremlinpython==3.4.1) (1.12.0)
ERROR: notebook 6.0.1 has requirement tornado>=5.0, but you'll have tornado 4.5.3 which is incompatible.
Installing collected packages: tornado
  Found existing installation: tornado 6.0.3
    Uninstalling tornado-6.0.3:
      Successfully uninstalled tornado-6.0.3
Successfully installed tornado-4.5.3

Note that there is a conflict over the tornado version. With tornado 5.0 or newer, we cannot run gremlinpython inside a Jupyter notebook. With an older tornado, however, the notebook itself might not run properly (see the pip error above), so be careful here.
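The pip output above shows the constraint gremlinpython 3.4.1 places on tornado: `tornado<5.0,>=4.4.1`. A small sketch of that check, assuming purely numeric version strings such as "4.5.3" (the helper names are ours, and pre-release suffixes like "rc1" are not handled):

```python
def parse_version(text):
    """Turn '4.5.3' into (4, 5, 3); assumes dot-separated integers only."""
    return tuple(int(part) for part in text.split(".")[:3])


def tornado_ok_for_gremlinpython(version):
    """True if `version` satisfies tornado>=4.4.1,<5.0 (the range pip reported)."""
    return (4, 4, 1) <= parse_version(version) < (5, 0)


print(tornado_ok_for_gremlinpython("4.5.3"))  # prints True
print(tornado_ok_for_gremlinpython("6.0.3"))  # prints False
```

In a live environment the installed version string is available as `tornado.version`.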

We also need to change some of the Gremlin Server configuration so that we can access it from gremlinpython.

First we need to create a groovy script which will be evaluated when the gremlin server starts.

In [14]:
%%writefile scripts/init.groovy
def globals = [:]

graph = ConfiguredGraphFactory.open("airroutes")

globals << [g : graph.traversal()]
Writing scripts/init.groovy

This tells the Gremlin Server to open the graph named airroutes when it starts and to create a traversal source named g that can be accessed globally. To make the Gremlin Server aware of this file, we need to alter the configuration file slightly.

In [22]:
!sed -i -e 's/files: \[\]/files: \[scripts\/init.groovy\]/g' conf/gremlin-server/gremlin-server.yaml
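The behaviour of `sed -i` differs between GNU and BSD/macOS sed, so the same substitution can also be done from Python. This is a minimal sketch, assuming the configuration contains the literal text `files: []` exactly as the sed pattern expects; the function name is our own.

```python
def register_init_script(yaml_text, script="scripts/init.groovy"):
    """Replace every empty 'files: []' list with one that names the init script."""
    entry = "files: [{}]".format(script)
    if entry in yaml_text:
        # Already registered; leave the text untouched so reruns are harmless.
        return yaml_text
    return yaml_text.replace("files: []", entry)


# Example use on the real file (path taken from the cell above):
# path = "conf/gremlin-server/gremlin-server.yaml"
# with open(path) as fh:
#     text = fh.read()
# with open(path, "w") as fh:
#     fh.write(register_init_script(text))
```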

Please note that after this step, the Gremlin Server cannot start properly if the airroutes graph does not exist in the database.

Restart the Gremlin Server and try running the following Python code:

In [1]:
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.process.traversal import Barrier
from gremlin_python.process.traversal import Cardinality
from gremlin_python.process.traversal import Column
from gremlin_python.process.traversal import Direction
from gremlin_python.process.traversal import Operator
from gremlin_python.process.traversal import Order
from gremlin_python.process.traversal import P
from gremlin_python.process.traversal import Pick
from gremlin_python.process.traversal import Pop
from gremlin_python.process.traversal import Scope
from gremlin_python.process.traversal import T

In [3]:
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))

hkVertexId = g.V().has('airport', 'code', 'HKG').id().next()
hkVertexId
Out[3]:
28792

That's it: we can now interact with the Gremlin Server from Python and run queries against it. To learn more about using Gremlin and JanusGraph from Python, you can check out this GitHub repository.
