NoSQL, HyperGraphDB and Neo4J


posted by Kobrix Software at Kobrix Software, Official Blog - 3 weeks ago
NoSQL has picked up a lot of steam lately. HyperGraphDB being a NoSQL DB *par excellence*, we will be joining the upcoming conference organized by 10gen, the maker of MongoDB: "NoSQL Live from Boston...
These are the comments I have made and the replies from Boris.
I gather them here for any further comment I may want to make;
see after the last comment.


3 comments:

semanticC said...
As it's the 12th I guess the conference has taken place; links I will be following up. There are a few questions I have.

1. Can you point me towards any comparison of Neo4j and HyperGraphDB? Do they cover the same ground? How do they differ?

2. The relationship between graph databases and:
2.1. OWL: how would OWL be consumed, or would it?
2.2. More generally RDF, and then XML; after all, there are XML databases that parse in the XML. How do they compare? I'm sure I have missed something(s), but what?

3. One of the problems I have encountered is keeping various .properties files aligned. One approach is to use something like magic lenses, such as the Augeas implementation. But at the same time, I have wanted to rewrite these properties out of their Ant context into a Maven POM context. A job for HyperGraphDB? Ideas?

4. Moving on, I have noticed the fascinating post about using HyperGraphDB to create a neural net.
4.1. Would you agree that what is happening here is in line with Rickard Öberg? See http://www.qi4j.org/ for background and http://www.qi4j.org/qi4j/351.html where he discusses the relationship between algorithms and OOP. BTW, he also arrives at the need for atoms and mentions the same focus, the business case, that you emphasise in your background paper, Rapid Software Evolution.
4.2. I notice that Neo4j has an example of a spreading activation algorithm (token passing), http://wiki.github.com/tinkerpop/gremlin/pagerank - I expect this means that either DB could also be used to implement Random Indexing (sparse matrices) as developed by P. Kanerva and M. Sahlgren. Some of this may be touched on in the Disko project. Again, ideas?

Sorry for such a long comment, but not sure how/if to email privately.
Kobrix Software said...
Hi semanticC,

A good place to discuss HyperGraphDB would be the discussion forum: http://groups.google.com/group/hypergraphdb?hl=en

This is a long list of topics raised indeed :) Let me try to cover them one by one, perhaps in separate responses:

1) Such a comparison should ideally be done independently and I am not aware of any. For starters, HyperGraphDB has a much more general data model than Neo. In fact, the name is maybe a bit misleading from a functionality perspective, because now it's being labeled as "another graph database", which it is, but it is also an OO database, a relational database (albeit non-SQL), etc. In HyperGraphDB, edges point to an arbitrary number of things, including nodes and other edges. Neo is a classical graph of nodes and directed edges between any two nodes. In addition, HGDB has a type system while Neo doesn't, so HGDB has in effect a dynamic schema that you can introspect, reason about and change.

Besides the data models, the storage models are quite different: HyperGraphDB has a general two-layered architecture where a big part of the storage layout can be customized. Neo uses linked lists to store its graph and claims that this makes for faster traversals (probably true) and that this is all you need to do with a graph - you don't need indices, pattern mining etc. (here, I disagree). HGDB relies heavily on a lot of indexing for more complicated graph-related queries & algorithms.

In sum, HyperGraphDB has pretty much the most versatile data model I know of, and subsumes Neo and others easily. Whether that sort of generality comes at the expense of performance remains to be seen. As you've probably realized from the neural net post, HGDB gives you more representational choices, so performance has to be measured more globally, at an application level, through a design that makes intelligent use of what HGDB has to offer.

More on the others later... perhaps at the end I'll sum up my responses in a separate blog post.
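
To make the hyperedge point above concrete, here is a minimal sketch against the HyperGraphDB Java API, assuming the HyperGraph, HGPlainLink and add() calls of the releases of the time; the storage path and atom values are invented for illustration:

    import org.hypergraphdb.HGHandle;
    import org.hypergraphdb.HGPlainLink;
    import org.hypergraphdb.HyperGraph;

    public class HyperedgeSketch {
        public static void main(String[] args) {
            // Open (or create) a HyperGraphDB instance at a hypothetical path.
            HyperGraph graph = new HyperGraph("/tmp/hgdb-demo");
            try {
                // Plain Java objects become typed atoms; the type system
                // derives and stores a type for each class it encounters.
                HGHandle alice = graph.add("Alice");
                HGHandle bob = graph.add("Bob");
                HGHandle carol = graph.add("Carol");

                // A hyperedge points to any number of atoms at once...
                HGHandle meeting = graph.add(new HGPlainLink(alice, bob, carol));

                // ...including other edges: this link annotates the link above,
                // something a binary node-to-node graph cannot express directly.
                graph.add(new HGPlainLink(meeting, alice));
            } finally {
                graph.close();
            }
        }
    }
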
semanticC said...
Hi Boris,

Thanks so much for your reply. It would be great if the other questions inspire a blog post. If anyone is interested, the NoSQL conference is previewed and will be written up here: http://radar.oreilly.com/2010/02/nosql-conference-coming-to-bos.html - and it is a good discussion. Boris contributes too!

There are still many things I cannot get my head around. I can see the 'representational choices': the ability to define functions directly working on the data using the HGDB API. I expect this is a good thing in the way that, for example, annotations are better than XML; everything is in the place where it will be used, which facilitates concentrating on the task. But other benefits? Here I cannot see.

Moving on again, I am reminded of the efforts of Henry Story to create a framework to import RDF, inspired by Active Record. I am very unclear about all of this. Did I read somewhere that there is a standardisation of the syntax for the import statements of RDF namespaces? Anyway, the idea would be to make the referenced ontology available in code; presumably it would already be in Sesame as the graph db backend? All of this seems relevant to HGDB. First, you have mentioned the type system, so how to model the types? I had thought that OWL was a good way of both modelling and sharing those models. But if so, what of the other aspect of HGDB, its ability to deal with semi-structured data? How to fit the two together? I am thinking about Collective Entity Resolution as perhaps one sort of solution, and simply in code, how they might interact, as another area.

Moving up towards the goal of evolutionary software, I have long thought that it must be possible to describe software using OWL. I assumed that reasoning would take the place of a lot of code when there is a well constructed model. Of course that brings me back to what role reasoning plays in NoSQL. I know it is built in to AllegroGraph. As I say, many thoughts, but I don't really understand the ramifications of NoSQL at the moment. Perhaps I am missing the point altogether?
  1. To create a data source or data sources for Node.js. This is just to demonstrate the principle. My page calls into Node.js, which responds with some appropriate data.
  2. Determine the nature of the data that Node.js needs, as this is not clear at this point. I need to determine the relationship with JSON in my application. It seems to me that JSON is more like piping in my application, formatting user-specific data and returning it to the user, so not a data feed. More as I discover more ... 
  3. At this point, determine if HGDB or Neo4J or ... (and?) can be the data backend for feeds. I think it can. This is quite difficult, as the level of support in different implementations differs and I may need one with greater support, e.g. more flora around. But I hardly have the time and knowledge to evaluate one, let alone compare them, and all I need is a working demo of something. Something I am calling Knowledge Combinatorics.
  4. Experiment with functions such as forms of graph traversal, spreading activation (sketched after this list), some implementation of collective entity resolution, and random projection, e.g. semantic vectors. This is all very ambitious for me and I have no idea how far I will get, if anywhere.
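
A toy illustration of the spreading activation idea from point 4, in plain Java so it carries no database API assumptions; in a real experiment the neighbour lookups would be traversal calls into HGDB or Neo4j, and the graph, decay factor and hop count here are invented for the example:

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SpreadingActivation {
        // Toy in-memory adjacency list standing in for a graph database.
        static Map<String, List<String>> edges = new HashMap<String, List<String>>();

        // Pump activation outward from a seed, decaying at each hop; nodes
        // reachable along several paths accumulate more activation.
        static Map<String, Double> activate(String seed, double decay, int hops) {
            Map<String, Double> act = new HashMap<String, Double>();
            act.put(seed, 1.0);
            Deque<String> frontier = new ArrayDeque<String>();
            frontier.add(seed);
            for (int i = 0; i < hops; i++) {
                Deque<String> next = new ArrayDeque<String>();
                for (String node : frontier) {
                    double out = act.get(node) * decay;
                    List<String> nbs = edges.get(node);
                    if (nbs == null)
                        continue;
                    for (String nb : nbs) {
                        Double cur = act.get(nb);
                        act.put(nb, (cur == null ? 0.0 : cur) + out);
                        next.add(nb);
                    }
                }
                frontier = next;
            }
            return act;
        }

        public static void main(String[] args) {
            edges.put("cat", Arrays.asList("mammal", "pet"));
            edges.put("pet", Arrays.asList("dog"));
            // Prints {cat=1.0, mammal=0.5, pet=0.5, dog=0.25}.
            System.out.println(activate("cat", 0.5, 2));
        }
    }
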

3 comments:

Unknown said...

Hi SemanticC,
Boris and I briefly tried to get a grip on the main differences/similarities at NoSQL Live.
While there is no comparison available, just briefly:

- HyperGraphDB has a very generic model that lets any ID point to any other ID on top of a Berkeley DB (or probably any other) K/V store, which permits any upper model layer, amongst others a graph, hypergraph or relational model.

- Neo4j is a bit more pragmatic and implements a Property Graph model (well, academically speaking a "multi-labeled directed graph") with an optimized storage layer where the references between nodes, relationships and properties actually exist as persistent references, leading to very good performance in real-world scenarios, since there is no storage impedance penalty for traversing the graph besides disk IO. Traversal is a constant-time operation independent of the size of the graph. An article on that is in the making.
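
As an illustration of the property graph model Peter describes, a minimal sketch against the Neo4j 1.x embedded Java API of the time (EmbeddedGraphDatabase, DynamicRelationshipType and the beginTx/success/finish idiom); the path, properties and relationship type are invented:

    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class PropertyGraphSketch {
        static final RelationshipType KNOWS =
                DynamicRelationshipType.withName("KNOWS");

        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("/tmp/neo4j-demo");
            Transaction tx = db.beginTx();
            try {
                // Nodes and relationships both carry key/value properties.
                Node alice = db.createNode();
                alice.setProperty("name", "Alice");
                Node bob = db.createNode();
                bob.setProperty("name", "Bob");

                // A relationship is always binary and directed: node -> node.
                Relationship knows = alice.createRelationshipTo(bob, KNOWS);
                knows.setProperty("since", 2010);

                // Traversal chases persistent references; no index lookup needed.
                for (Relationship r : alice.getRelationships(KNOWS))
                    System.out.println(r.getOtherNode(alice).getProperty("name"));

                tx.success();
            } finally {
                tx.finish();
            }
            db.shutdown();
        }
    }
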

You can use both HGDB and Neo4j for modeling RDF triple- and quadstores. While HGDB will let you build anything on nested K/V store lookups, Neo4j projects any model onto the property graph, so traversal speeds of > 1M nodes/relationships per second on normal HW can help solve interesting problems.
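
On the triple/quadstore point, one way this might look in HyperGraphDB, assuming the HGValueLink class of the releases of the time (a link that carries a payload value, here used for the predicate); the URIs are invented. In Neo4j the same triple would instead become two property-carrying nodes joined by a relationship whose type names the predicate:

    import org.hypergraphdb.HGHandle;
    import org.hypergraphdb.HGValueLink;
    import org.hypergraphdb.HyperGraph;

    public class TripleSketch {
        public static void main(String[] args) {
            HyperGraph graph = new HyperGraph("/tmp/hgdb-rdf");
            try {
                // Subject and object as atoms; the predicate rides on the
                // link itself as its payload value.
                HGHandle alice = graph.add("ex:Alice");
                HGHandle bob = graph.add("ex:Bob");
                graph.add(new HGValueLink("foaf:knows", alice, bob));

                // A quad needs no extra machinery: the named graph simply
                // becomes a fourth target of the same link.
                HGHandle g = graph.add("ex:graph1");
                graph.add(new HGValueLink("foaf:knows", alice, bob, g));
            } finally {
                graph.close();
            }
        }
    }
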

For a deeper discussion, please contact me off-list or Skype!

Unknown said...

Peter,
Thank you very much for these comments. It looks like I have inadvertently found myself more things to do! I will have to follow up over the next few days by implementing some of the tutorial-type examples for both projects to see how that helps my understanding and comprehension of the respective APIs. Please bear in mind that at this stage I am 'playing'. I do want to get a web site up that demonstrates something along the lines mentioned in my post 'Knowledge Combinatorics'. This makes heavy use of Tapestry, AJAX, Node.js and an EJB server for legacy, which I mock out for demo purposes. It is my intention that AJAX requests return data from a NoSQL backend; the high performance seems a good match, and the ability to define functions or methods against the datastore is what I am looking for. I very much doubt I will have time to be thorough in my investigation at this stage. I will just plump for the store that I have most success in programming against. The web site is a demonstrator project. I will inform anyone I have contact with along the way when it is up. (Wish me luck!)
Adam

Borislav Iordanov said...

Hi Peter and SemanticC,

Very long, pure graph traversals are probably going to be faster in Neo since it uses linked lists. For shorter ones, where most of the data is already in the cache, performance is going to be comparable.

However, I object to the "more pragmatic" bit :) I think it's quite the opposite. If you need a graph where all you do is massive traversals, then Neo is more "pragmatic" and I would highly recommend it. But if you need a database that is also designed to cover other common aspects of "real world" applications, such as typing, an evolving schema, or n-ary relations as in SQL; if you need to store objects and the ability to customize that storage to your needs (e.g. blobs or value graphs); if you need more general querying capabilities beyond traversals, well, then I would say HGDB is much more "pragmatic" :)

Best,
Boris
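
To make Boris's typing point concrete, a sketch of querying by dynamically derived type in HyperGraphDB, assuming the HGQuery.hg condition helpers (hg.type, hg.and, hg.eq, hg.getAll) of the releases of the time; the Person bean and the path are invented:

    import java.util.List;
    import org.hypergraphdb.HGQuery.hg;
    import org.hypergraphdb.HyperGraph;

    public class TypedQuerySketch {
        // A plain bean; HGDB derives and stores a record type for it,
        // giving the database an introspectable, evolving schema.
        public static class Person {
            private String name;
            public Person() {}
            public Person(String name) { this.name = name; }
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
        }

        public static void main(String[] args) {
            HyperGraph graph = new HyperGraph("/tmp/hgdb-demo");
            try {
                graph.add(new Person("Alice"));
                // Query by type plus a property condition; no traversal involved.
                List<Person> hits = hg.getAll(graph,
                        hg.and(hg.type(Person.class), hg.eq("name", "Alice")));
                System.out.println(hits.size());
            } finally {
                graph.close();
            }
        }
    }
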
