This post describes how the Metagraph is used to store the schema in the Neo4j database.
Introduction
InterMine, as a data warehousing system, stores biological data loaded from various data sources. The loaded data must conform to the existing schema (the data model) so that data integrity is maintained.
Presently the data model in InterMine is stored in an external XML file. For each entity, this file lists its attributes, references and collections. For example, one part of the file defines the BioEntity model.
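As a rough illustration of the attribute/reference/collection layout, the sketch below parses a small XML fragment with Python's standard library. The fragment is a simplified stand-in written for this example, not a copy of the real model file. Its element names and the specific reference and collection entries (`organism`, `synonyms`) are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of a model file.
# The element and attribute names here are illustrative assumptions.
MODEL_XML = """
<model name="genomic">
  <class name="BioEntity">
    <attribute name="primaryIdentifier" type="java.lang.String"/>
    <attribute name="secondaryIdentifier" type="java.lang.String"/>
    <attribute name="symbol" type="java.lang.String"/>
    <reference name="organism" referenced-type="Organism"/>
    <collection name="synonyms" referenced-type="Synonym"/>
  </class>
</model>
"""


def describe_class(model_xml, class_name):
    """Return the attribute, reference and collection names of one class."""
    root = ET.fromstring(model_xml)
    for cls in root.iter("class"):
        if cls.get("name") == class_name:
            return {
                "attributes": [a.get("name") for a in cls.findall("attribute")],
                "references": [r.get("name") for r in cls.findall("reference")],
                "collections": [c.get("name") for c in cls.findall("collection")],
            }
    raise KeyError(class_name)


bio_entity = describe_class(MODEL_XML, "BioEntity")
print(bio_entity)
```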
While developing the Neo4j prototype of InterMine, it was decided to store the schema in the database itself, to reduce the dependency on external files. Neo4j, being a graph database, stores data in the form of Nodes and directed Relationships. This raised two requirements:
- The data model must itself be in the form of a graph so that it can be stored in the Neo4j database.
- The data model should represent all the existing Nodes in the database and the Relationships among them.
Since the data model is a graph and it stores information about the InterMine graph, it can be called a metagraph.
Metagraph Structure
Each node in the metagraph is assigned the :Metagraph label. The metagraph nodes are further classified into two types, NodeType and RelType. As the names suggest, each NodeType node represents a specific type of node and each RelType node represents a specific type of relationship in the IM graph. So each metagraph node also carries either the :NodeType or the :RelType label, depending on which kind of entity it represents.
NodeType
A metagraph node labelled :NodeType contains the following two properties.
- labels: A list containing the labels of the nodes that are represented by the :NodeType node. For example, ["Gene", "SequenceFeature", "BioEntity"].
- keys: A list containing the keys of all the properties that exist among the nodes represented by the :NodeType node. For example, ["primaryIdentifier", "secondaryIdentifier", "symbol"].
Each :NodeType node is uniquely identified by its labels property. So, for all the nodes in the IM graph which are labelled :Gene, :SequenceFeature and :BioEntity, there exists one :NodeType node in the metagraph which has its labels property set to ["Gene", "SequenceFeature", "BioEntity"].
RelType
A metagraph node labelled :RelType contains the following two properties.
- type: A string denoting the type of the relationships that are represented by the :RelType node. For example, "HOMOLOGUE_OF".
- keys: A list containing the keys of all the properties that exist among the relationships represented by the :RelType node. For example, ["DataSet"].
Each :RelType node is uniquely identified by its type property. So, for all the relationships of type HOMOLOGUE_OF in the IM graph, there exists one :RelType node in the metagraph which has its type property set to "HOMOLOGUE_OF".
Relationships in MetaGraph
The metagraph should not only describe the properties of the various entities in the IM graph, it should also record how different types of nodes are connected to each other. To represent this information, we make use of Neo4j relationships.
We know that each :RelType node represents one type of relationship that exists in the IM graph. We create two outgoing relationships (edges) from each :RelType node, :StartNodeType and :EndNodeType. Each of these edges ends on a :NodeType node.
Thus the metagraph path (a:RelType)-[:StartNodeType]->(b:NodeType) shows that the relationships represented by node a start from the nodes represented by node b. The same holds for :EndNodeType, which identifies the nodes on which those relationships end.
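To make the direction of these edges concrete, here is a minimal in-memory sketch in Python. It is not the Neo4j representation itself. The `RelTypeEdges` tuple, the `is_allowed` helper, and the `Protein` label and `ENCODES` relationship type are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Tuple


class RelTypeEdges(NamedTuple):
    """The two NodeTypes a RelType points at, identified by their labels."""
    start_node_type: Tuple[str, ...]  # reached via :StartNodeType
    end_node_type: Tuple[str, ...]    # reached via :EndNodeType


GENE = ("Gene", "SequenceFeature", "BioEntity")
PROTEIN = ("Protein", "BioEntity")  # made-up label set for the example

# Toy metagraph: relationship type -> its start and end NodeType.
METAGRAPH = {
    "HOMOLOGUE_OF": RelTypeEdges(GENE, GENE),
    "ENCODES": RelTypeEdges(GENE, PROTEIN),  # hypothetical relationship type
}


def is_allowed(rel_type, start_labels, end_labels):
    """Check whether a relationship of rel_type may join the two label sets."""
    edges = METAGRAPH.get(rel_type)
    if edges is None:
        return False  # relationship type not present in the metagraph
    return (tuple(start_labels) == edges.start_node_type
            and tuple(end_labels) == edges.end_node_type)


print(is_allowed("ENCODES", GENE, PROTEIN))
print(is_allowed("ENCODES", PROTEIN, GENE))
```

Swapping the start and end label sets changes the answer, which is the point of keeping :StartNodeType and :EndNodeType as two separate edges.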
Generating MetaGraph
Representing Nodes
A Cypher query creates :NodeType nodes for all the nodes that exist in the IM graph.
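The grouping involved can be sketched in plain Python on a toy in-memory graph. This is not the Cypher query itself, only an illustration of the idea: nodes that share the same label list collapse into one NodeType whose keys are the union of their property keys. The sample nodes and the `build_node_types` helper are assumptions made for this sketch, and it assumes every node lists its labels in the same order.

```python
# Toy stand-in for the IM graph: each node has labels and properties.
IM_NODES = [
    {"labels": ["Gene", "SequenceFeature", "BioEntity"],
     "properties": {"primaryIdentifier": "G1", "symbol": "abc"}},
    {"labels": ["Gene", "SequenceFeature", "BioEntity"],
     "properties": {"primaryIdentifier": "G2", "secondaryIdentifier": "X2"}},
    {"labels": ["Protein", "BioEntity"],  # made-up label set
     "properties": {"primaryIdentifier": "P1"}},
]


def build_node_types(nodes):
    """Group nodes by label list and union their property keys."""
    keys_by_labels = {}
    for node in nodes:
        labels = tuple(node["labels"])  # the label list identifies the NodeType
        keys_by_labels.setdefault(labels, set()).update(node["properties"])
    return [
        {"labels": list(labels), "keys": sorted(keys)}
        for labels, keys in keys_by_labels.items()
    ]


node_types = build_node_types(IM_NODES)
for node_type in node_types:
    print(node_type)
```

The two Gene nodes carry different property keys, yet they produce a single NodeType whose keys list covers both.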
Representing Relationships
A second Cypher query creates :RelType nodes for all the relationships that exist in the IM graph. It also connects them to the :NodeType nodes with :StartNodeType and :EndNodeType relationships.
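The same idea can be sketched for relationships, again in plain Python on toy data rather than in Cypher. The sample graph, the `ENCODES` type and the `build_rel_types` helper are assumptions for illustration. The sketch collects start and end NodeTypes into sets in case one relationship type joins more than one kind of node, a case this post does not cover.

```python
# Toy IM graph: node id -> labels, plus a list of relationships.
IM_NODES = {
    1: ["Gene", "SequenceFeature", "BioEntity"],
    2: ["Gene", "SequenceFeature", "BioEntity"],
    3: ["Protein", "BioEntity"],  # made-up label set
}
IM_RELS = [
    {"type": "HOMOLOGUE_OF", "start": 1, "end": 2,
     "properties": {"DataSet": "demo set"}},
    {"type": "HOMOLOGUE_OF", "start": 2, "end": 1, "properties": {}},
    {"type": "ENCODES", "start": 1, "end": 3, "properties": {}},  # hypothetical
]


def build_rel_types(nodes, rels):
    """Create one RelType entry per relationship type found in rels."""
    rel_types = {}
    for rel in rels:
        entry = rel_types.setdefault(rel["type"], {
            "type": rel["type"],
            "keys": set(),
            "StartNodeType": set(),  # label tuples of start NodeTypes
            "EndNodeType": set(),    # label tuples of end NodeTypes
        })
        entry["keys"].update(rel["properties"])
        entry["StartNodeType"].add(tuple(nodes[rel["start"]]))
        entry["EndNodeType"].add(tuple(nodes[rel["end"]]))
    return rel_types


rel_types = build_rel_types(IM_NODES, IM_RELS)
for rel_type in rel_types.values():
    print(rel_type)
```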
Once the schema is generated and stored in the database, it can be accessed by the data loader at runtime. By enforcing the rules of the schema, the loader maintains data integrity while loading new data into the database. The code for generating and querying metadata in Neo4j is developed in the org.intermine.neo4j.metadata package in the InterMine/Neo4j repository.
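This post does not spell out which rules the loader enforces. One plausible reading, sketched below in plain Python, is that a new node is accepted only if its label list matches a known NodeType and all of its property keys appear in that NodeType's keys. The `SCHEMA` dictionary and the `check_node` helper are hypothetical and are not part of the org.intermine.neo4j.metadata package.

```python
# Hypothetical in-memory copy of the NodeType part of the metagraph:
# label tuple -> allowed property keys.
SCHEMA = {
    ("Gene", "SequenceFeature", "BioEntity"):
        {"primaryIdentifier", "secondaryIdentifier", "symbol"},
}


def check_node(labels, properties):
    """Return a list of problems; an empty list means the node conforms."""
    allowed_keys = SCHEMA.get(tuple(labels))
    if allowed_keys is None:
        return ["unknown label combination: %s" % list(labels)]
    unknown = sorted(set(properties) - allowed_keys)
    return ["unknown property key: %s" % key for key in unknown]


gene_labels = ["Gene", "SequenceFeature", "BioEntity"]
print(check_node(gene_labels, {"symbol": "abc"}))
print(check_node(gene_labels, {"symbol": "abc", "length": 10}))
print(check_node(["Exon"], {"symbol": "abc"}))
```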
Perhaps in a later post, I will discuss the org.intermine.neo4j.metadata package and write a small tutorial on how you can use it to generate and query your own Neo4j data model. So, stay tuned!