Recently we had a few days of freestyle development and research at work. As a big-data enthusiast, I instantly chose a task about investigating a graph database solution and trying to model a subset of our data with it.
The task is not easy: when seeking a new technology, one wants to minimise the trade-offs. So, at first glance, what we are looking for is a graph database that
- is fast
- has no single point of failure
- and of course scales to the skies
It might sound like sci-fi stuff if you are familiar with big-data applications, but let's get started; we have nothing to lose.
After a couple of hours of research I found the Titan Graph Database by Thinkaurelius to be the closest to my expectations, and at this point I have to say that I'm pleased.
Graph Database Crash Course
If you are not familiar with graph databases, here's a quick introduction. A graph database models data sets using graph theory. There are two main kinds of objects in the schema:
- nodes (representing entities such as users, accounts, etc)
- edges (representing relations between nodes, such as ownership etc)
Nodes and edges can be decorated with labels and attributes so that they can carry additional metadata. For example, a very simple database that contains persons and models friendships could look like this:
- a node type, labeled "person" with an attribute "name"
- an edge type, labeled "friend"
A friendship between two persons would be represented by a "friend" edge between their two nodes. Even with this simple model, a great deal of information can be extracted by doing graph traversals (this is essentially what Facebook does, for example).
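To make this concrete, here is a sketch of the person/friend model in a Gremlin shell backed by Titan. The paths and names are illustrative, and the exact API differs between Titan versions, so treat this as a rough outline rather than copy-paste code:

```groovy
// open a local Titan graph (BerkeleyDB backend, for illustration only)
g = TitanFactory.open('/tmp/friends')

// two "person" nodes, each with a "name" attribute
alice = g.addVertex(null)
alice.name = 'Alice'
bob = g.addVertex(null)
bob.name = 'Bob'

// a "friend" edge representing their friendship
alice.addEdge('friend', bob)
g.commit()

// traversal: who are Alice's friends?
g.V('name', 'Alice').out('friend').name
```

The interesting part is the last line: a question about relationships becomes a short traversal instead of a join.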
> Titan is a scalable graph database optimised for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.
That's pretty much the key; it sounds like exactly what we need. Let's have a brief run-through of its features:
- elastic and linear scalability for a growing data and user base
- data distribution and replication for performance and fault tolerance
- multi-datacenter high availability and hot backups
- support for ACID and eventual consistency
- support for various storage backends (Cassandra, HBase, etc.)
- support for global graph data analytics through Hadoop integration
- support for geo, numeric range, and fulltext search via Elasticsearch, Solr or Lucene
- open source with the liberal Apache 2 license
It sounds to me as if Christmas had come early :).
Titan itself is not a database in the traditional sense. It's a graph database engine that integrates existing solutions as building blocks to form a complete system. The main building blocks are
- the storage backend
- the index/search backend (optional)
- the TinkerPop graph stack
One can use Titan with various storage backends, such as BerkeleyDB, Cassandra, and HBase. The Titan database instance will inherit the characteristics of the chosen storage backend.
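Choosing a backend is essentially a matter of configuration. A minimal, hypothetical properties file for a Cassandra storage backend with Elasticsearch indexing could look roughly like this (the exact keys differ between Titan versions, so check the configuration reference for yours):

```properties
# storage backend: a Cassandra cluster
storage.backend=cassandra
storage.hostname=127.0.0.1

# optional index backend: Elasticsearch
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
```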
Benefits of Titan with Cassandra
- continuously available with no single point of failure
- no read/write bottlenecks to the graph as there is no master/slave architecture
- elastic scalability allows for the introduction and removal of machines
- caching layer ensures that continuously accessed data is available in memory
- increase the size of the cache by adding more machines to the cluster
- integration with Hadoop
Benefits of Titan with HBase
- tight integration with the Hadoop ecosystem
- native support for strong consistency
- linear scalability with the addition of more machines
- strictly consistent reads and writes
- convenient base classes for backing Hadoop MapReduce jobs with HBase tables
- support for exporting metrics via JMX
The image below shows how the characteristics change with different storage backends. Their trade-offs with respect to the CAP theorem are worth noting.
About the TinkerPop Graph Stack
TinkerPop provides a set of interfaces that Titan implements to gain the features that come with the stack. Each part of the stack implements a specific function in supporting graph-based application development.
The foundation of TinkerPop is the Blueprints API, which "is a property graph model interface with provided implementations. Databases that implement the Blueprints interfaces automatically support Blueprints enabled applications."
By implementing the Blueprints API, Titan inherits a large set of predefined features and components from TinkerPop, such as a query language (Gremlin) and a graph server (Rexster) that can expose any Blueprints graph through several mechanisms, with a general focus on REST APIs.
The Juicy Part
Titan queries are optimal for relatively small subgraphs and neighbourhoods, i.e. for processing small portions of the whole graph. For global, long-running queries Titan provides a component called Titan-Hadoop (formerly Faunus). It compiles Gremlin queries into Hadoop MapReduce jobs and runs them on the cluster. It can compute graph derivations, graph statistics, and graph data mappings (input/output mappings) on massive-scale graphs represented across a multi-machine cluster.
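As a sketch, a global question such as "how many friend edges exist in the whole graph?" is still written in Gremlin syntax, but submitted through the Titan-Hadoop console so it gets compiled into MapReduce. The factory name and config path below are illustrative and depend on your Titan/Hadoop setup:

```groovy
// open the Hadoop-backed graph (names here are illustrative)
g = HadoopFactory.open('conf/titan-hadoop.properties')

// same Gremlin syntax, but now compiled into a MapReduce job
// that scans the entire distributed graph
g.V.out('friend').count()
```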
The Gremlin Query Language
Gremlin comes from the TinkerPop stack. It's a graph traversal language: each statement describes a traversal over the graph. It is hosted in Groovy by default, but can be used natively from various JVM languages such as Java and Scala.
Although graph database schemas are quite different from relational ones, some Gremlin queries turn out to be surprisingly similar to their traditional SQL counterparts. Here's a brief comparison.
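As one illustration, using the person/friend model from earlier (with a hypothetical relational schema on the SQL side), "names of Alice's friends" looks like this in both worlds:

```groovy
// SQL, with person and friendship tables (hypothetical schema):
//   SELECT p2.name
//   FROM person p1
//   JOIN friendship f ON f.person1_id = p1.id
//   JOIN person p2 ON p2.id = f.person2_id
//   WHERE p1.name = 'Alice';

// Gremlin, as a traversal starting from Alice's vertex:
g.V('name', 'Alice').out('friend').name
```

The traversal form sidesteps the join table entirely: edges are first-class, so "friends of friends" is just one more `.out('friend')` step rather than another join.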
The Bottom Line
Titan is a very nice piece of software. It beautifully integrates existing technologies into something new, which can really delight an enthusiastic engineer. Seriously, this is the way things are meant to be done. However, to form a complete picture, many more things are left to try out, such as
- deploying it on a multi-machine cluster
- doing benchmarks using various kinds of data-sets
- doing benchmarks using various storage backends
- doing failure tests
Here are the usual pros and cons; please note that these are more like first impressions. As mentioned before, a more extensive investigation would be needed for a fully informed judgement.
Pros
- large-scale, robust architecture
- flexible configuration options
- very easy installation (on a single machine)
- friendly community
- wide range of tools (coming from TinkerPop)
- has commercial support from Hortonworks
Cons
- not as widely used as Neo4j
- documentation is out of date in a few places