PathQuery To Cypher - Part 2


In this post I’ll present the approach that I am using to convert a Path Query to Cypher. The basics of Path Query to Cypher conversion were discussed in my previous post.

Observations

A Path Query consists of many paths. Each path represents some data in the database. For example, consider the following Path Query.

<query model="genomic" view="Gene.length Gene.homologues.dataSets.publication.title" constraintLogic="(A and B)">
	<constraint path="Gene.homologues.evidence.publications.abstractText" value="a" op="CONTAINS" code="A"/>
	<constraint path="Gene.homologues.dataSets.name" value="a" op="CONTAINS" code="B"/>
</query>

It contains the following four paths.

  • Gene.length
  • Gene.homologues.dataSets.name
  • Gene.homologues.dataSets.publication.title
  • Gene.homologues.evidence.publications.abstractText
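As a rough illustration, the four paths can be pulled out of the query XML with a few lines of Python. This is only a sketch and not the project's code: the helper name collect_paths is hypothetical, and the handling of a sortOrder attribute (alternating path and direction) is an assumption about the XML format, since the example query has no sort order.

```python
import xml.etree.ElementTree as ET

QUERY_XML = """
<query model="genomic" view="Gene.length Gene.homologues.dataSets.publication.title" constraintLogic="(A and B)">
    <constraint path="Gene.homologues.evidence.publications.abstractText" value="a" op="CONTAINS" code="A"/>
    <constraint path="Gene.homologues.dataSets.name" value="a" op="CONTAINS" code="B"/>
</query>
"""


def collect_paths(query_xml):
    """Return the sorted, de-duplicated paths used by views, constraints and sort order."""
    root = ET.fromstring(query_xml.strip())
    # Views are a space-separated list of paths.
    paths = set(root.get("view", "").split())
    # Every constraint element carries its own path.
    paths.update(c.get("path") for c in root.iter("constraint") if c.get("path"))
    # Assumption: sortOrder alternates path and direction, e.g. "Gene.length ASC".
    paths.update(root.get("sortOrder", "").split()[0::2])
    return sorted(paths)


PATHS = collect_paths(QUERY_XML)
for path in PATHS:
    print(path)
```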

One observation that we can make by looking at these paths is that they share common prefixes. For example, Gene.homologues is common to the last three, Gene.homologues.dataSets is common to the second & third, and the Gene prefix is present in all of them.

In fact, while building queries in the query builder, we start from a model class and add its attributes, references & collections to our query. Then we move on to any of the references/collections and add its attributes, references & collections to our Path Query as required, and so on. Thus, all the paths have some prefix in common and there is a hierarchy associated with the different components of a path.

With a bit of thought, I came to the conclusion that a tree would be a natural representation for all the paths of a Path Query. A tree is a popular hierarchical data structure with a single root node, in which every other node has exactly one parent and may have child subtrees of its own.

Tree Data Structure

PathTree Representation

Let us generate a Tree using the four paths of the example Path Query shown above. Since it is a Tree made up of paths, we can call it a Path Tree.

A Path Tree

A Path Tree is made up of many TreeNodes. Each TreeNode represents a component of a path. For example, the path Gene.homologues.dataSets.name is represented by four TreeNodes: Gene, homologues, dataSets & name. Each TreeNode in turn corresponds to an entity in the InterMine Neo4j graph: a Node, a Relationship or a Property. All paths that share a prefix share the corresponding ancestor TreeNodes in the Path Tree. This way we can represent the hierarchy among the various components of the paths and avoid storing redundant information.
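A minimal Python sketch of this idea is shown below. It is not the project's implementation: the TreeNode class and the helper names build_path_tree, count_nodes and render are illustrative. The point it demonstrates is that shared prefixes map to shared TreeNodes, so the four example paths, which have 16 components in total, need only 10 TreeNodes.

```python
class TreeNode:
    """One component of a path; children are keyed by component name."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}

    def child(self, name):
        # Reuse an existing child so that common prefixes share TreeNodes.
        if name not in self.children:
            self.children[name] = TreeNode(name, self)
        return self.children[name]


def build_path_tree(paths):
    """Insert every dot-separated path into a single tree and return its root."""
    root = None
    for path in paths:
        components = path.split(".")
        if root is None:
            root = TreeNode(components[0])
        elif root.name != components[0]:
            raise ValueError("all paths must start at the same root class")
        node = root
        for component in components[1:]:
            node = node.child(component)
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


def render(node, depth=0):
    """Return the tree as indented lines, one TreeNode per line."""
    lines = ["  " * depth + node.name]
    for child in node.children.values():
        lines.extend(render(child, depth + 1))
    return lines


tree = build_path_tree([
    "Gene.length",
    "Gene.homologues.dataSets.name",
    "Gene.homologues.dataSets.publication.title",
    "Gene.homologues.evidence.publications.abstractText",
])
print("\n".join(render(tree)))
```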

Generating Cypher using Path Tree

In Cypher, we can assign Nodes, Relationships & even Paths to variables. These variables can then be used in place of those Nodes/Relationships/Paths in the rest of the query. For example, consider the following Cypher query.

MATCH (n:Gene), (n)-[r:locatedOn]-(c:Chromosome)
WHERE n.length > 12345
RETURN c.symbol

In the example, we first match all the Genes and assign them to the variable n. Then n is used to MATCH a relationship from Genes to Chromosomes, and it is used again in the WHERE clause to compare the length of the Genes.

In the same way, while converting a Path Query to Cypher, we can assign a unique variable name to each TreeNode in the Path Tree and then use it in the MATCH, WHERE, RETURN, ORDER BY & OPTIONAL MATCH clauses. The high-level approach is presented below in algorithmic form.

1. Take a PathQuery object as input.
2. Retrieve all the Paths from the Views, Constraints & Sort Order.
3. Using all the Paths of the PathQuery, create a PathTree such that:
	1. Each TreeNode represents a component of the path. For example, the path "Gene.pathways.identifier" forms three TreeNodes, i.e. Gene, pathways & identifier.
	2. Paths with common prefix have the same common ancestor.
	3. Root TreeNode represents a Graph Node.
	4. All Leaves of the PathTree always represent Graph Properties.
	5. All other Internal TreeNodes can represent either Graph Nodes or Graph Relationships.
4. Generate & store a unique variable name for each Internal node of the PathTree.
	1. This variable name will be used for referring to that TreeNode in the Cypher query.
	2. For generating the variable name, we can separate each component of the path using underscores. For example, the variable name for "Gene.pathways.identifier" will be gene_pathways_identifier.
5. Use the PathTree & PathQuery to generate the Cypher query.
	1. For creating the MATCH clause, starting with the Root, recursively match each TreeNode of the PathTree,
		1. If the TreeNode is Root, match the node itself. e.g. (gene).
		2. If current TreeNode is a NODE,
			1. If its parent is also a NODE, then fetch the Relationship Type from the XML data model file and create the match as (parentNode)-[relationshipFromXml]-(currentNode).
			2. If the parent is a RELATIONSHIP, then fetch the grandparent from the PathTree and create the match as (grandParentNode)-[parentNode]-(currentNode).
		3. If current TreeNode is a RELATIONSHIP,
			1. If current node does not have any children, then add a match with an empty node as (parentNode)-[currentNode]-().
			2. If current node has any children, then do nothing (they will be matched when recursion reaches the children).
	2. For creating the WHERE clause,
		1. For each constraint in the PathQuery, generate an equivalent Cypher constraint.
		2. In the constraint logic of the PathQuery, replace the constraint code of each constraint with its equivalent Cypher constraint.
	3. For creating the RETURN clause
		1. For each view, get its path
		2. Get the variableName of the last TreeNode of the path
		3. Add variableNames separated by commas for each such variable
	4. For creating ORDER BY clause,
		1. For each Sort Order, get its path
		2. Get the variableName of the last TreeNode of the path
		3. Add variableName ASC/DESC separated by commas for each such variable
	5. For handling JOIN operations in the PathQuery,
		1. Add OPTIONAL MATCH clause in the query for corresponding paths.
6. Return the generated query.
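To make steps 4 and 5.1 more concrete, here is a small Python sketch of the variable naming and the recursive MATCH generation. It rests on several assumptions that are not part of the actual code: the kind of each TreeNode (NODE, RELATIONSHIP or PROPERTY) is assigned by hand, the RELATIONSHIP_TYPES dictionary is a made-up stand-in for the lookup in the XML data model file, variable names are fully lower-cased, the root pattern carries a label taken from the root component, "does not have any children" is read as "has no child that is a graph Node", and treating chromosomeLocation as a relationship is purely illustrative.

```python
class TreeNode:
    def __init__(self, name, kind, parent=None):
        self.name, self.kind, self.parent = name, kind, parent
        self.children = []

    def add(self, name, kind):
        child = TreeNode(name, kind, self)
        self.children.append(child)
        return child

    @property
    def variable(self):
        # Step 4: path components joined by underscores (lower-casing is an assumption).
        prefix = self.parent.variable + "_" if self.parent is not None else ""
        return prefix + self.name.lower()


# Hypothetical stand-in for the data model lookup: (parent, child) -> relationship type.
RELATIONSHIP_TYPES = {
    ("Gene", "homologues"): "homologues",
    ("homologues", "dataSets"): "dataSets",
}


def match_patterns(node):
    """Step 5.1: walk the tree from the root and collect one pattern per matched TreeNode."""
    patterns = []
    parent = node.parent
    if parent is None:
        # Root: match the node itself (the label is an assumption).
        patterns.append(f"({node.variable}:{node.name})")
    elif node.kind == "NODE" and parent.kind == "NODE":
        rel = RELATIONSHIP_TYPES[(parent.name, node.name)]
        patterns.append(f"({parent.variable})-[:{rel}]-({node.variable})")
    elif node.kind == "NODE" and parent.kind == "RELATIONSHIP":
        grand = parent.parent
        patterns.append(f"({grand.variable})-[{parent.variable}]-({node.variable})")
    elif node.kind == "RELATIONSHIP":
        # Without a Node child nothing else will match this relationship,
        # so close the pattern with an empty node.
        if not any(c.kind == "NODE" for c in node.children):
            patterns.append(f"({parent.variable})-[{node.variable}]-()")
    # PROPERTY TreeNodes add nothing to MATCH.
    for child in node.children:
        patterns.extend(match_patterns(child))
    return patterns


gene = TreeNode("Gene", "NODE")
gene.add("length", "PROPERTY")
data_sets = gene.add("homologues", "NODE").add("dataSets", "NODE")
data_sets.add("name", "PROPERTY")
gene.add("chromosomeLocation", "RELATIONSHIP").add("start", "PROPERTY")

patterns = match_patterns(gene)
print("MATCH " + ",\n      ".join(patterns))
```

The WHERE, RETURN and ORDER BY clauses would then reuse the same variable names, which is why they are derived from the path rather than generated arbitrarily.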

In the next post, I’ll explain the generation of each clause (MATCH, RETURN, ORDER BY, WHERE & OPTIONAL MATCH) separately. Meanwhile, you can have a look at the Path Query to Cypher conversion code in the org.intermine.neo4j.cypher package.
