I recently worked on a proof of concept in Neo4j to model a set of unstructured data, sourced from various systems and stored in a NoSQL HBase data lake. I’m excited to share what I learned about graph database technology and Neo4j with you here.
What is a graph database?
Very simply, a graph database is a database which treats the relationships between data with the same importance as the data itself. Graph databases typically do not force a predefined data model upon the data. Rather, they allow the data to be ingested and queried together with its relationships. Native graph processing (a.k.a. “index-free adjacency”) is the most efficient means of processing graph data, since connected nodes physically “point” to each other in the database and a traversal can follow those pointers instead of looking relationships up in an index.
With the property graph model, the data is organized into the following:
- Nodes: These are similar to the entities in a relational data model. They can hold attributes relating to the node, which are typically held as Key-Value Pairs (KVPs). Nodes are typically tagged with labels.
- Edges (Relationships): These connect the nodes. Relationships need to be semantically relevant connections between the data and should therefore be named appropriately. A relationship always has a direction, a start node and an end node. Relationships can also have properties which pertain to the relationship itself.
- Properties: Properties are KVPs which can reside on both Nodes and Edges.
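To make the three building blocks concrete, here is a minimal sketch of the property graph model in plain Python. It is only an illustration of the concepts, not how Neo4j stores data internally; the class names `Node` and `Relationship` and the sample people are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    """A graph node: a set of labels plus key-value properties."""
    labels: Set[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed edge from `start` to `end` with its own properties."""
    start: Node
    type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example data: two people and one directed relationship.
alice = Node({"Person"}, {"name": "Alice", "born": 1985})
bob = Node({"Person"}, {"name": "Bob"})
knows = Relationship(alice, "KNOWS", bob, {"since": 2010})

# Print the relationship in a Cypher-like arrow notation.
start_name = knows.start.properties["name"]
end_name = knows.end.properties["name"]
print(f"({start_name})-[:{knows.type}]->({end_name})")
```

Note that the relationship carries its own `since` property, separate from the properties of the two nodes it connects.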
Why graph database technology?
As with any technology, graph databases have some drawbacks. However, these are minor compared to the benefits.
- Performance: Your data volume will almost certainly increase in the future, but what’s likely to increase at an even faster pace is the number of connections (or relationships) between your data. With traditional relational databases, relationship queries slow down sharply as the number and depth of relationships (and therefore joins) increase. In contrast, graph traversal performance tends to stay close to constant as your data grows year over year, because a query only touches the part of the graph it actually traverses.
- Flexibility: With graph databases, your IT and data architecture teams move at the speed of business because the structure and schema of a graph data model flex as your solutions and industry change. Your team doesn’t have to exhaustively model your domain ahead of time (and then exhaustively remodel and migrate the DB after someone asks for a change). Instead, you can add to the existing structure without endangering current functionality. With the graph database model, you are the one dictating changes and taking charge; whereas the RDBMS data model dictates its requirements to you, forcing you to adapt to its tabular way of seeing the world.
- Agility: Developing with graph technology aligns perfectly with today’s agile, test-driven development practices, allowing your graph-database-backed application to evolve with your changing business requirements.
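The performance point can be illustrated with a small, deliberately simplified sketch. It compares a "friends of friends" lookup done by scanning a join-table-like list of rows with the same lookup done by following adjacency references. The data and function names are hypothetical, and a real relational database would use indexes rather than full scans, so treat this as an intuition aid rather than a benchmark.

```python
from collections import defaultdict

# Hypothetical data: (person, friend) pairs, as they would sit in a join table.
FRIENDSHIPS = [("Ann", "Bob"), ("Bob", "Cy"), ("Bob", "Dee"), ("Cy", "Eve")]


def friends_of_friends_scan(rows, person):
    """Join-table style: each hop scans every row, so cost grows with table size."""
    direct = {b for a, b in rows if a == person}
    return {b for a, b in rows if a in direct} - {person}


def build_adjacency(rows):
    """Graph style: each node keeps direct references to its neighbours."""
    adjacency = defaultdict(set)
    for a, b in rows:
        adjacency[a].add(b)
    return adjacency


def friends_of_friends_traverse(adjacency, person):
    """Each hop only follows the neighbours of nodes already reached."""
    result = set()
    for friend in adjacency.get(person, ()):
        result |= adjacency.get(friend, set())
    return result - {person}


adjacency = build_adjacency(FRIENDSHIPS)
print(friends_of_friends_scan(FRIENDSHIPS, "Ann"))
print(friends_of_friends_traverse(adjacency, "Ann"))
```

Both functions return the same set, but the traversal version never looks at rows unrelated to the starting person, which is the idea behind index-free adjacency.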
What is Neo4j/Cypher?
Neo4j is one of the most popular graph databases. It is implemented in Java and developed by Neo Technology. It’s still a fairly young technology, but it is open source and very powerful.
In addition, Neo4j is schema free: the data does not have to adhere to any predefined convention. It’s easy to get started with, very well documented, and has a large developer community. Furthermore, it is ACID compliant and can be used from a wide variety of languages, including Java, Python, Perl, and Scala.
Cypher is the query language for Neo4j. It makes queries based on relationships easy to formulate, and many of its features address pain points in SQL, such as join tables.
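To show what the join-table point looks like in practice, here is a rough side-by-side sketch of the same "who are this person's friends?" question in SQL and in Cypher, held as Python strings. The tables `person` and `friendship`, the label `Person`, and the relationship type `FRIEND_OF` are all hypothetical names invented for this example. The helper `friend_names` assumes only that it is given an object with a `run(query, **params)` method, which is the call shape of a py2neo `Graph` and of a session from the official `neo4j` Python driver.

```python
# Relational version: a many-to-many friendship needs a join table
# (hypothetical tables `person` and `friendship`).
SQL_FRIENDS = """
SELECT friend.name
FROM person AS p
JOIN friendship AS f ON f.person_id = p.id
JOIN person AS friend ON friend.id = f.friend_id
WHERE p.name = :name
"""

# Cypher version: the relationship is matched directly as a pattern
# (hypothetical label `Person` and relationship type `FRIEND_OF`).
CYPHER_FRIENDS = """
MATCH (p:Person {name: $name})-[:FRIEND_OF]->(friend:Person)
RETURN friend.name AS name
"""


def friend_names(runner, name):
    """Run the Cypher query through any object exposing run(query, **params).

    Each returned record is read by the `name` key from the RETURN clause.
    """
    return [record["name"] for record in runner.run(CYPHER_FRIENDS, name=name)]
```

The Cypher pattern reads almost like a drawing of the graph: a node, an arrow, and another node, with no intermediate table to join through.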
An example of modeling in Neo4j using a Python library
For the PoC, I used Python and the py2neo library to connect to Neo4j and import the data. Due to the sensitive nature of the project, I can’t share the PoC with you. However, I have implemented something similar that models family relationships to give you an example. Below is the graph model implemented in Neo4j.
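A minimal sketch of how a family model like this could be loaded with py2neo is shown here. It is not the code from the PoC: the family members, the relationship types, and the function names `people` and `load_family` are invented for illustration, and the loader assumes py2neo is installed and a Neo4j server is reachable.

```python
# Hypothetical family: (start person, relationship type, end person).
FAMILY = [
    ("Maria", "MARRIED_TO", "Jon"),
    ("Maria", "PARENT_OF", "Lena"),
    ("Jon", "PARENT_OF", "Lena"),
    ("Lena", "SIBLING_OF", "Tom"),
]


def people(triples):
    """Collect the distinct person names that appear in the triples."""
    return sorted({name for start, _, end in triples for name in (start, end)})


def load_family(uri, user, password, triples=FAMILY):
    """Write the family graph to Neo4j with py2neo (needs a running server)."""
    # Imported lazily so the data above can be inspected without py2neo.
    from py2neo import Graph, Node, Relationship

    graph = Graph(uri, auth=(user, password))
    nodes = {name: Node("Person", name=name) for name in people(triples)}
    for start, rel_type, end in triples:
        # merge() matches nodes on the Person label and the name key,
        # so loading the same data again should not create duplicates.
        graph.merge(Relationship(nodes[start], rel_type, nodes[end]), "Person", "name")
    return graph
```

A call such as `load_family("bolt://localhost:7687", "neo4j", "password")` would then write the graph; the URI and credentials here are placeholders for a local default installation, not values from the project.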
Project benefits from using Neo4j:
- It enabled us to provide a visual representation of the data stored in HBase very quickly.
- Neo4j made it easy to make sense of the data and understand the relationships.
- The development team had a clear view of how to handle the data received from the pre-existing data APIs.
- It unblocked the test team, who could start building the test scenarios.
Overall, I would say that graph database technology is supremely useful when dealing with unstructured data. I hope you learned something new about modeling unstructured data and how Neo4j might help your business.