COLUMN DATABASES
Rows are organized into individual tables
Columns are represented as rows in those tables
Designed to reduce I/O and seek times when accessing data
CASSANDRA
Massively distributed
  Can support massive clusters with 75,000+ machines
CQL (Cassandra Query Language)
  No joins or subqueries
  SELECT * FROM users WHERE last_name = 'smith';
MapReduce
  Hadoop is all you get
CASSANDRA - PERFORMANCE
"In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies." (University of Toronto)
Absolutely amazing throughput
Not so amazing response times for each individual query
TITAN
Runs on Cassandra, HBase, or Oracle Berkeley DB
Can store massive graphs
Doesn't support the Cypher query language
Point 1
CASSANDRA: Cassandra is a non-relational data store that stores data in tables. It organizes columns into rows and rows into tables.
NEO4J: Neo is a graph database that organizes data into nodes and arcs.
Point 2
CASSANDRA: Because columns are stored as rows, a table can have a huge number of columns (up to 2 billion).
NEO4J: A single Neo instance can house at most 34 billion nodes, 34 billion relationships, and 68 billion properties in total.
Point 3
CASSANDRA: Every table must have a primary key (index), which is used as the basis for sharding the data.
NEO4J: Indexes can be added and removed wherever desired.
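To make the idea of key-based sharding concrete, here is a minimal Python sketch. It is not Cassandra's actual partitioner: the three node names, the MD5-based `token` function, and the `owner_of` helper are all assumptions chosen for illustration. It only shows the principle that hashing the key decides which node stores the row.

```python
import hashlib
from bisect import bisect_left

# Hypothetical three-node ring (illustrative names, not a real cluster).
NODES = ["node-a", "node-b", "node-c"]
RING_SIZE = 2 ** 32

# Each node gets one evenly spaced token and owns every key whose
# token is less than or equal to its own.
NODE_TOKENS = [(i + 1) * RING_SIZE // len(NODES) - 1 for i in range(len(NODES))]


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is used here purely for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def owner_of(partition_key: str) -> str:
    """Return the node responsible for the given key."""
    idx = bisect_left(NODE_TOKENS, token(partition_key))
    return NODES[idx % len(NODES)]
```

Because the owner is derived from the key alone, any machine can work out where a row lives without consulting a central directory, for example `owner_of("smith")`.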
Point 4
CASSANDRA: Cassandra has impressive HA capabilities that can span multiple data centers with little effort.
NEO4J: Neo uses master-slave replication.
Point 5
CASSANDRA: Cassandra can elegantly run on huge clusters that replicate and shard data effortlessly.
NEO4J: Neo doesn't shard your data.
Point 6
CASSANDRA: Cassandra scales linearly by adding more hardware. There is pretty much no limit to the hardware that you can add.
NEO4J: Neo read throughput scales linearly with the number of servers, but the number of servers in a cluster has to stay relatively small.
Point 7
CASSANDRA: The dataset can grow virtually endlessly with the same performance.
NEO4J: The dataset size is limited to at most 34 billion nodes, 34 billion relationships, and 68 billion properties in total.
Point 8
CASSANDRA: Cassandra does not use a master/slave paradigm, so there is no downtime when a machine dies.
NEO4J: There is a brief window of downtime while a new master is elected.
Point 9
CASSANDRA: Cannot do traversal queries.
NEO4J: Traversal queries that have exponential cost on a traditional RDBMS have linear cost on Neo.
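As a rough illustration of what a traversal query does, the following Python sketch walks an in-memory adjacency list breadth-first. It is not how Neo stores or traverses data; the `neighbors_within` helper and the dictionary-based graph are assumptions made for this example. The point is that each hop follows direct references from the current node rather than joining whole tables.

```python
from collections import deque


def neighbors_within(graph, start, max_hops):
    """Return {node: hops} for every node reachable from start in at most max_hops steps."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]  # the start node is not its own neighbor
    return seen
```

The work done is proportional to the nodes and edges actually visited, which is why this style of query stays cheap as long as the neighborhood explored is small.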
Point 10
CASSANDRA: Write performance is just as good as read performance.
NEO4J: Write performance is slower than read performance.
Point 11
CASSANDRA: Every query has additional latency due to cluster overhead.
NEO4J: Individual queries can be serviced much faster with far less latency.
Point 12
CASSANDRA: ACID transactions are mostly supported, but with tunable consistency.
NEO4J: ACID transactions are fully supported and completely consistent, but there is a performance hit for the consistency.
Point 13
CASSANDRA: Cassandra can perform operations completely synchronously or at various levels of consistency, with corresponding performance, on an operation-by-operation basis.
NEO4J: Consistency is not tunable.
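The trade-off behind tunable consistency can be sketched with a little arithmetic. The Python below is an illustration only: the function names are hypothetical, and it models just the usual overlap rule that a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when the replicas written and the replicas read must overlap in at least one replica."""
    return write_acks + read_acks > replication_factor
```

With three replicas, writing and reading at quorum (2 + 2 > 3) gives overlapping replica sets, while writing and reading a single replica (1 + 1) does not, which is the faster but weaker setting.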
Point 14
CASSANDRA: Cassandra uses its own query language (CQL), which has a syntax similar to SQL but no joins.
NEO4J: Neo uses Cypher and also supports Gremlin.
Point 15
CASSANDRA: Instead of performing joins at runtime, data must be denormalized beforehand.
NEO4J: Graphs are normalized and highly connected. Traversals are very fast.
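Denormalizing beforehand means writing each row once per query pattern, so that every read is a single key lookup. The Python sketch below imitates this with two dictionaries standing in for two tables; the table names, the `add_user` helper, and the sample fields are assumptions made for illustration, not a Cassandra API.

```python
from collections import defaultdict

# Two hypothetical "tables" holding the same users, each keyed for one query.
users_by_id = {}
users_by_last_name = defaultdict(list)


def add_user(user_id, first_name, last_name):
    """Write the row to every table that a query will later read from."""
    row = {"user_id": user_id, "first_name": first_name, "last_name": last_name}
    users_by_id[user_id] = row
    users_by_last_name[last_name].append(row)
```

After a few calls to `add_user`, a lookup such as `users_by_last_name["smith"]` answers the last-name query directly, at the cost of storing the data twice and keeping both copies in sync on every write.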
NEO4J
"A single instance of Neo4j can house at most 34 billion nodes, 34 billion relationships, and 68 billion properties, in total. Businesses like Google obviously push these limits, but in general, this does not pose a limitation in practice. It is also important to understand that these limits were chosen purely as a storage optimization, and do not indicate any particular shortcoming of the product. They are easily, and are in fact being, increased."
http://info.neotechnology.com/rs/neotechnology/images/Understanding%20Neo4j%20Scalability(2).pdf
Gremlin
g.V('customerId','ALFKI').as('customer')
  .out('ordered').out('contains').out('is').as('products')
  .in('is').in('contains').in('ordered').except('customer')
  .out('ordered').out('contains').out('is').except('products')
  .groupCount().cap().orderMap(T.decr)[0..<5].productName

Cypher
MATCH (c1)-[:ordered]->(o1)-[:contains]->(p1)<-[:contains]-(o2)<-[:ordered]-(c2)-[:ordered]->(o3)-[:contains]->(p2)
WHERE c1.customerId = "ALFKI" AND c1 <> c2 AND p1 <> p2
RETURN p2.productName, count(p2) AS num
SQL
SELECT TOP (5) [t14].[ProductName]
FROM (
    SELECT COUNT(*) AS [value], [t13].[ProductName]
    FROM [customers] AS [t0]
    CROSS APPLY (
        SELECT [t9].[ProductName]
        FROM [orders] AS [t1]
        CROSS JOIN [order details] AS [t2]
        INNER JOIN [products] AS [t3] ON [t3].[ProductID] = [t2].[ProductID]
        CROSS JOIN [order details] AS [t4]
        INNER JOIN [orders] AS [t5] ON [t5].[OrderID] = [t4].[OrderID]
        LEFT JOIN [customers] AS [t6] ON [t6].[CustomerID] = [t5].[CustomerID]
        CROSS JOIN ([orders] AS [t7]
            CROSS JOIN [order details] AS [t8]
            INNER JOIN [products] AS [t9] ON [t9].[ProductID] = [t8].[ProductID])
        WHERE NOT EXISTS(
                SELECT NULL AS [EMPTY]
                FROM [orders] AS [t10]
                CROSS JOIN [order details] AS [t11]
                INNER JOIN [products] AS [t12] ON [t12].[ProductID] = [t11].[ProductID]
                WHERE [t9].[ProductID] = [t12].[ProductID]
                  AND [t10].[CustomerID] = [t0].[CustomerID]
                  AND [t11].[OrderID] = [t10].[OrderID])
          AND [t6].[CustomerID] <> [t0].[CustomerID]
          AND [t1].[CustomerID] = [t0].[CustomerID]
          AND [t2].[OrderID] = [t1].[OrderID]
          AND [t4].[ProductID] = [t3].[ProductID]
          AND [t7].[CustomerID] = [t6].[CustomerID]
          AND [t8].[OrderID] = [t7].[OrderID]) AS [t13]
    WHERE [t0].[CustomerID] = N'ALFKI'
    GROUP BY [t13].[ProductName]) AS [t14]
ORDER BY [t14].[value] DESC
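The intent shared by the three queries (recommend products bought by customers who share purchases with a given customer, excluding what that customer already has) can be sketched in a few lines of Python. This is a simplification under stated assumptions: orders are collapsed into one set of products per customer, the `recommend` function and the sample data shape are hypothetical, and products are ranked by the number of co-purchasing customers rather than by the number of graph paths that the Gremlin query counts.

```python
from collections import Counter


def recommend(purchases, customer_id, limit=5):
    """Rank products bought by customers who share at least one product with customer_id."""
    mine = purchases[customer_id]
    counts = Counter()
    for other, theirs in purchases.items():
        if other == customer_id or not (mine & theirs):
            continue  # skip the customer themselves and anyone with no overlap
        counts.update(theirs - mine)  # only count products the customer lacks
    return [product for product, _ in counts.most_common(limit)]
```

Here `purchases` is assumed to map a customer ID such as "ALFKI" to the set of product names that customer has ordered.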
CONCLUSION
If you plan to write recommendation queries, a graph database is a more elegant fit than a relational database.
Only use Titan if you need it
  You have an insanely large graph
  Or you expect an insanely high load
Neo4j
  Faster queries
  A more straightforward and powerful query language