Getting Started with JanusGraph and Python
Introduction
The following introduction is taken from the official JanusGraph website:
JanusGraph can be queried from all languages for which a TinkerPop driver exists. Drivers allow sending of Gremlin traversals to a Gremlin Server like the JanusGraph Server. A list of TinkerPop drivers is available on TinkerPop’s homepage.
In addition to drivers, there exist query languages for TinkerPop that make it easier to use Gremlin in different programming languages like Java, Python, or C#. Some of these languages even construct Gremlin traversals from completely different query languages like Cypher or SPARQL. Since JanusGraph implements TinkerPop, all of these languages can be used together with JanusGraph.
Install JanusGraph
We will install JanusGraph and configure it to use Cassandra as the storage backend and Elasticsearch for searching and indexing. For development purposes we can use the local instances of Cassandra and Elasticsearch that conveniently ship with the JanusGraph installation package. In production, we will reconfigure JanusGraph to use production Cassandra and Elasticsearch servers.
First, we need to download JanusGraph from its releases page. We will use the newest version at the time of writing, 0.4.0.
!curl -L -O https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip
Unzip the downloaded package and move into the unzipped directory:
import os
!unzip janusgraph-0.4.0-hadoop2.zip
os.chdir('janusgraph-0.4.0-hadoop2/')
!ls
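We are almost ready to start JanusGraph, but first we need to change its server configuration file.
We will back up the original configuration file and replace it with gremlin-server-configuration.yaml, the variant shipped in the same package that enables the ConfiguredGraphFactory used later in this tutorial.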
!cp conf/gremlin-server/gremlin-server.yaml conf/gremlin-server/gremlin-server.yaml.orig
!cp conf/gremlin-server/gremlin-server-configuration.yaml conf/gremlin-server/gremlin-server.yaml
Now we can start JanusGraph:
!bin/janusgraph.sh start
As we can see, JanusGraph automatically starts Cassandra and Elasticsearch for us, so there is almost nothing to do for this part.
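The server can take a while to accept connections after the start script returns. The sketch below is one way to wait for it from Python. It assumes the Gremlin Server listens on localhost:8182 (the address used later in this tutorial); the helper name `wait_for_port` and its parameters are made up for illustration. An open port only shows that something is listening, not that any graph has been loaded.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (assumed address of the local Gremlin Server):
# wait_for_port("localhost", 8182)
```

Polling with a deadline avoids a fixed sleep, which would be either too short on a slow machine or needlessly long on a fast one.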
Some useful commands of the janusgraph.sh script:
- janusgraph.sh start: Start JanusGraph.
- janusgraph.sh stop: Stop JanusGraph.
- janusgraph.sh clean: Clean everything in the storage and search engine, giving us a fresh start.
To interact with the JanusGraph server, or the Gremlin Server to be exact, we can use the Gremlin console, the Python package gremlinpython, or other clients. For now we will use the Gremlin console, which also comes with JanusGraph in the same installation package.
In a terminal inside the unzipped directory, issue the following command:
bin/gremlin.sh
This command starts the Gremlin console. We have to use the Gremlin language to interact with the Gremlin Server. Issue the following commands inside the Gremlin console to connect it to the Gremlin Server (the JanusGraph server):
:remote connect tinkerpop.server conf/remote.yaml session
:remote console
That's it, that's how we install and configure JanusGraph for development. We can now use Gremlin to interact with JanusGraph. In the next section, we will look at how to load a graph from a GraphML file into JanusGraph.
Importing a graph into JanusGraph
For this part, we will use the airport route data set from Kelvin Lawrence’s book Practical Gremlin. You can download the GraphML file here.
Download the file into the /tmp folder for easy reference. The -L flag makes curl follow the redirect that GitHub issues for raw file URLs:
!curl -L -o /tmp/air-routes.graphml https://github.com/krlawrence/graph/raw/master/sample-data/air-routes.graphml
In the Gremlin console, which is connected to the Gremlin Server, issue the following commands to create a configuration for the graph we are about to load into our database:
map = new HashMap<String, Object>();
map.put("storage.backend", "cql");
map.put("storage.hostname", "127.0.0.1");
map.put("graph.graphname", "airroutes");
ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map));
graph=ConfiguredGraphFactory.open("airroutes");
We basically create a graph named "airroutes" which will be stored using the cql storage backend at the local address (127.0.0.1). The cql backend is an adapter for Cassandra; the older adapter, cassandrathrift, will soon be deprecated and is not used here.
With the graph created, we can load the contents of the downloaded GraphML file into JanusGraph. To do so, issue the following in the Gremlin console:
graph.io(graphml()).readGraph('/tmp/air-routes.graphml')
graph.tx().commit()
To get a list of graphs we have created use the command below:
gremlin> ConfiguredGraphFactory.getGraphNames()
==>airroutes
Now we have a graph named "airroutes" in our graph database. Next time after connecting to the gremlin server we just need to issue the following command to open the "airroutes" graph:
gremlin> graph = ConfiguredGraphFactory.open('airroutes')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
Let's try some traversals on the graph:
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().values('code').count()
==>3619
gremlin> departure_airport="SFO"
==>SFO
gremlin> arrival_airport="JFK"
==>JFK
gremlin> g.V().has('code', departure_airport).repeat(out('route').simplePath()).times(2).has('code', arrival_airport).path().by('code').limit(5)
==>[SFO, ATL, JFK]
==>[SFO, DFW, JFK]
==>[SFO, DCA, JFK]
==>[SFO, TPA, JFK]
==>[SFO, LGB, JFK]
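To make the last traversal easier to read, here is a plain-Python sketch of what repeat(out('route').simplePath()).times(2) computes: all paths of exactly two hops that never visit the same vertex twice. It does not talk to JanusGraph. The function name `fixed_hop_paths` and the tiny `toy_routes` dictionary are invented for illustration and are not taken from the air-routes data set.

```python
def fixed_hop_paths(routes, start, goal, hops, limit=None):
    """Return simple paths of exactly `hops` edges from start to goal."""
    frontier = [[start]]
    for _ in range(hops):
        next_frontier = []
        for path in frontier:
            # out('route'): follow every outgoing edge of the last vertex
            for neighbour in routes.get(path[-1], []):
                # simplePath(): drop any path that would revisit a vertex
                if neighbour not in path:
                    next_frontier.append(path + [neighbour])
        frontier = next_frontier

    # has('code', arrival_airport): keep only paths ending at the goal
    matches = [path for path in frontier if path[-1] == goal]
    # limit(n): keep at most n results
    return matches if limit is None else matches[:limit]


# Hypothetical toy data: airport code -> codes reachable by one route
toy_routes = {
    "SFO": ["ATL", "DFW", "JFK"],
    "ATL": ["JFK", "SFO"],
    "DFW": ["JFK"],
    "JFK": ["SFO"],
}

paths = fixed_hop_paths(toy_routes, "SFO", "JFK", hops=2, limit=5)
print(paths)
```

Note that the direct SFO to JFK route in the toy data is not returned, because times(2) asks for exactly two hops rather than at most two.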
Connect to Gremlin Server using Python
The Gremlin Server doesn't support Python out of the box, so we need to install gremlin-python on the server. Please install the version that is compatible with the version of JanusGraph you are using. For JanusGraph 0.4.0 we have:
Tested Compatibility:
- Apache Cassandra 2.2.10, 3.0.14, 3.11.0
- Apache HBase 1.2.6, 1.3.1, 1.4.10, 2.1.5
- Google Bigtable 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0
- Oracle BerkeleyJE 7.5.11
- Elasticsearch 5.6.14, 6.0.1, 6.6.0
- Apache Lucene 7.0.0
- Apache Solr 7.0.0
- Apache TinkerPop 3.4.1
- Java 1.8
So we will need gremlin-python version 3.4.1, the same as the TinkerPop version:
!bin/gremlin-server.sh install org.apache.tinkerpop gremlin-python 3.4.1
We also need a pip package named gremlinpython that provides Python APIs for accessing the Gremlin Server.
!pip install gremlinpython==3.4.1
Note that gremlinpython can conflict with the installed tornado version. With tornado 5.0 or newer, we cannot run gremlinpython in a Jupyter notebook. With an older tornado, however, Jupyter itself might not run properly, so be careful when choosing the tornado version.
We also need to change some of the Gremlin Server configuration to allow access from gremlinpython.
First we need to create a Groovy script which will be evaluated when the Gremlin Server starts.
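Before debugging a notebook, it can help to see which tornado version is actually installed. The sketch below uses the standard library's importlib.metadata (Python 3.8 or newer); the helper names `major_version` and `installed_version` are made up for illustration, and the 5.0 threshold is simply the one mentioned above.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '4.5.3'."""
    return int(version_string.split(".")[0])


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


tornado_version = installed_version("tornado")
if tornado_version is None:
    print("tornado is not installed")
elif major_version(tornado_version) >= 5:
    print(f"tornado {tornado_version}: gremlinpython 3.4.1 may not run inside Jupyter")
else:
    print(f"tornado {tornado_version}: older than 5.0, Jupyter itself may misbehave")
```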
%%writefile scripts/init.groovy
def globals = [:]
graph = ConfiguredGraphFactory.open("airroutes")
globals << [g : graph.traversal()]
We basically tell the Gremlin Server that, when it starts, it should open the graph named airroutes and create a traversal source named g which can be accessed globally. To tell the Gremlin Server about this file, we need to alter the configuration file a little.
!sed -i -e 's/files: \[\]/files: \[scripts\/init.groovy\]/g' conf/gremlin-server/gremlin-server.yaml
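The in-place flag of sed behaves differently across platforms, so the same substitution can also be sketched in Python. The function name `enable_init_script` is hypothetical, and the sample line below is only an illustrative fragment, not a copy of the real configuration file.

```python
def enable_init_script(config_text, script_path="scripts/init.groovy"):
    """Replace every empty `files: []` list with one that names the init script."""
    old = "files: []"
    new = f"files: [{script_path}]"
    if old not in config_text:
        # Fail loudly instead of silently leaving the file unchanged
        raise ValueError("no 'files: []' entry found in the configuration")
    return config_text.replace(old, new)


# Illustrative fragment only; the real file contains many more settings
sample = "ScriptFileGremlinPlugin: {files: []}"
print(enable_init_script(sample))

# To patch the real file (path taken from the sed command above):
# path = "conf/gremlin-server/gremlin-server.yaml"
# with open(path) as f:
#     text = f.read()
# with open(path, "w") as f:
#     f.write(enable_init_script(text))
```

Raising an error when the pattern is missing makes it obvious if the file was already patched or has a different layout.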
Please note that after this step, if the airroutes graph does not exist in the database, the Gremlin Server cannot start properly.
Restart the Gremlin Server and try running the following Python code:
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.process.traversal import Barrier
from gremlin_python.process.traversal import Cardinality
from gremlin_python.process.traversal import Column
from gremlin_python.process.traversal import Direction
from gremlin_python.process.traversal import Operator
from gremlin_python.process.traversal import Order
from gremlin_python.process.traversal import P
from gremlin_python.process.traversal import Pick
from gremlin_python.process.traversal import Pop
from gremlin_python.process.traversal import Scope
from gremlin_python.process.traversal import T
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))
hkVertexId = g.V().has('airport', 'code', 'HKG').id().next()
hkVertexId
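The two-hop route query from the Gremlin console section can be written with gremlinpython as well. The sketch below assumes the server set up above is running at ws://localhost:8182/gremlin with the traversal source g bound to the airroutes graph. The function name `two_hop_routes` is made up for illustration. In Python, anonymous traversals start from `__`, so out('route') becomes `__.out('route')`.

```python
def two_hop_routes(departure, arrival, url="ws://localhost:8182/gremlin", limit=5):
    """Return up to `limit` two-hop paths (by airport code) between two airports."""
    # Imported here so that defining the function does not require gremlinpython
    from gremlin_python.structure.graph import Graph
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.graph_traversal import __

    connection = DriverRemoteConnection(url, "g")
    try:
        g = Graph().traversal().withRemote(connection)
        return (
            g.V().has("code", departure)
            .repeat(__.out("route").simplePath())
            .times(2)
            .has("code", arrival)
            .path().by("code")
            .limit(limit)
            .toList()
        )
    finally:
        # Always release the websocket, even if the query fails
        connection.close()


# Example call (requires the running server described above):
# for path in two_hop_routes("SFO", "JFK"):
#     print(path)
```

Closing the connection in a finally block keeps repeated notebook runs from leaving websockets open.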
That's it, we can now interact with the Gremlin Server from Python and run queries against it. To learn more about Gremlin and JanusGraph in Python, you can check out this GitHub repository.