Creating Match, Return, Order By clauses From a PathTree
Jul 2, 2017
In this post I’ll describe in detail how I created the Match, Return and Order By clauses of Cypher using a PathTree. The PathTree was discussed in the previous post.
Description of Clauses
Match is usually the first clause in a Cypher query. It allows you to specify the patterns Neo4j will search for in the database. The patterns must be written so that the data matching them is exactly the data you wish to operate on (or query). Consider an example Match clause which matches three Genes such that two of them INTERACTS_WITH the third one.
A pattern like this is two-dimensional, and we often have to deal with such patterns in Cypher. Although it is easier to picture a pattern in 2-D, we generally need to break it down into multiple 1-D patterns when writing Cypher queries. 1-D patterns are also easier to generate programmatically than their 2-D counterparts. For example, the Match clause described above can be written equivalently as two 1-D patterns separated by a comma, each stating that one Gene INTERACTS_WITH the shared third Gene.
Note that once the shared Gene has been given a variable name, say b, in the first pattern, the second pattern can refer to it simply as b without specifying the label Gene again. It does not seem very significant here, but it makes generating queries very convenient when dealing with complicated paths. If you need to refer to the same Path/Node/Relationship again and again in a query, writing just its variable name is far more comfortable.
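To illustrate label reuse, here is a minimal Python sketch (not the project's actual code) that turns a list of edges into a Match clause made of 1-D patterns, writing a node's label only the first time its variable appears. The variable names a and c, the direction of the relationships, and the helpers `render_node` and `match_from_edges` are assumptions made for this example.

```python
def render_node(var, label, seen):
    """Render a node, adding its label only on the first appearance."""
    if var in seen:
        return f"({var})"
    seen.add(var)
    return f"({var}:{label})"


def match_from_edges(edges):
    """Build a MATCH clause from (start, relationship, end) triples.

    Each start/end is a (variable, label) pair. Hypothetical helper.
    """
    seen = set()
    patterns = []
    for (s_var, s_label), rel, (e_var, e_label) in edges:
        left = render_node(s_var, s_label, seen)
        right = render_node(e_var, e_label, seen)
        patterns.append(f"{left}-[:{rel}]->{right}")
    return "MATCH " + ", ".join(patterns)


# Two Genes (a and c) that both interact with a third Gene (b).
edges = [
    (("a", "Gene"), "INTERACTS_WITH", ("b", "Gene")),
    (("c", "Gene"), "INTERACTS_WITH", ("b", "Gene")),
]
print(match_from_edges(edges))
```

In the second pattern, b appears without its label because it was already introduced in the first one.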
In the RETURN part of the Cypher query, we define the parts of the pattern in which we are interested. These can be nodes, relationships, or properties on them.
ORDER BY is a sub-clause following RETURN. It specifies that the output should be sorted, and how.
We see that variable names are crucial in a Cypher query. A PathTree represents all the paths in the corresponding PathQuery. Since a constraint, view or sort order can be placed on any path, we need to assign a variable name to each node of the PathTree.
We cannot just use the component name as the variable name. This is because a component can be repeated in the same path. For example, in Gene.chromosome.gene.length, gene appears twice. Having a unique variable name for each TreeNode is crucial to generating the Cypher query.
To create the variable names, I simply join the components of the path with an underscore instead of a dot and convert the result to lower case. For example, the TreeNode for the path Gene.homologues.dataSets.name will have the variable name gene_homologues_datasets_name. This approach works as long as the generated name does not exceed the maximum length allowed for a variable name.
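The naming rule is small enough to sketch in a few lines of Python. This is an illustration, not the project's code, and the function names `variable_name` and `prefix_variable_names` are hypothetical. The second helper shows that every prefix of a path gets its own name, even when a component such as gene is repeated.

```python
def variable_name(path):
    """Replace dots with underscores and lower-case the whole path."""
    return path.replace(".", "_").lower()


def prefix_variable_names(path):
    """Variable names for every prefix of the path, root first."""
    parts = path.split(".")
    return [variable_name(".".join(parts[: i + 1])) for i in range(len(parts))]


print(variable_name("Gene.homologues.dataSets.name"))
print(prefix_variable_names("Gene.chromosome.gene.length"))
```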
A TreeNode for the path Gene.chromosome.gene.length stores the following information.
type : Whether this TreeNode represents a Graphical Node or Relationship or a Property on them.
name : The last component of the path, i.e. length.
variable name : The one we generated using approach shown above, i.e. gene_chromosome_gene_length.
graphical name : The name by which we refer to this entity/property in the InterMine Neo4j graph. Here we keep it the same as name, i.e. length.
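The four fields listed above could be modelled roughly as follows. This is a Python sketch of the data only, not the actual TreeNode class. The enum `TreeNodeType` and the Python field names are assumptions, and the real class presumably also keeps links to its parent and children.

```python
from dataclasses import dataclass
from enum import Enum


class TreeNodeType(Enum):
    NODE = "node"
    RELATIONSHIP = "relationship"
    PROPERTY = "property"


@dataclass
class TreeNode:
    type: TreeNodeType    # graph node, relationship, or property
    name: str             # last component of the path
    variable_name: str    # unique name used in the Cypher query
    graphical_name: str   # name used in the InterMine Neo4j graph


# The TreeNode for the path Gene.chromosome.gene.length
length_node = TreeNode(
    type=TreeNodeType.PROPERTY,
    name="length",
    variable_name="gene_chromosome_gene_length",
    graphical_name="length",
)
print(length_node)
```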
The return clause always starts with the RETURN keyword. After that, we add one expression for each view in the PathQuery, separated by commas. Each expression consists of the variable name and the respective graphical name.
The order by clause always starts with the ORDER BY keyword. After that, we append the variable name and the respective graphical name for the path of each sortOrder in the PathQuery, followed by the sort type (ascending or descending).
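The two clauses can be sketched in Python as below. This is a hedged illustration, not the project's code. It assumes that for a property path the expression is the variable name of the owning node, a dot, and the property's graphical name, which is one reading of the description above. The function names and the tuple layout are hypothetical.

```python
def create_return_clause(views):
    """views: list of (owner_variable_name, graphical_name) pairs."""
    expressions = [f"{var}.{prop}" for var, prop in views]
    return "RETURN " + ", ".join(expressions)


def create_order_by_clause(sort_orders):
    """sort_orders: list of (owner_variable_name, graphical_name, direction).

    direction is the Cypher keyword "ASC" or "DESC".
    """
    expressions = [f"{var}.{prop} {direction}" for var, prop, direction in sort_orders]
    return "ORDER BY " + ", ".join(expressions)


# Views on Gene.chromosome.gene.length and Gene.homologues.dataSets.name,
# sorted by the first one in ascending order.
views = [
    ("gene_chromosome_gene", "length"),
    ("gene_homologues_datasets", "name"),
]
sort_orders = [("gene_chromosome_gene", "length", "ASC")]

print(create_return_clause(views))
print(create_order_by_clause(sort_orders))
```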
Creating the Match clause is more involved. The match clause always starts with the MATCH keyword. After that, for each edge in the PathTree we write two nodes and the relationship between them. If two adjacent TreeNodes both represent graph nodes, we join them with a dummy relationship. Otherwise, we put the relationship represented by the TreeNode between them.
The recursive createMatchClause() method takes the Query object and the root TreeNode as parameters.
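The recursion described above can be sketched in Python as follows. This is not the actual createMatchClause() implementation: a plain list of patterns stands in for the Query object, the stripped-down `TreeNode` type is hypothetical, and using the graphical name of a node as its label is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

NODE, RELATIONSHIP, PROPERTY = "node", "relationship", "property"


@dataclass
class TreeNode:
    type: str
    variable_name: str
    graphical_name: str
    children: List["TreeNode"] = field(default_factory=list)


def create_match_clause(patterns, parent):
    """Recursively append one pattern per edge below `parent`."""
    for child in parent.children:
        if parent.type == NODE and child.type == NODE:
            # Two adjacent graph nodes: join them with a dummy relationship.
            patterns.append(
                f"({parent.variable_name})-[]-"
                f"({child.variable_name}:{child.graphical_name})"
            )
        elif parent.type == NODE and child.type == RELATIONSHIP:
            # The relationship TreeNode sits between `parent` and its node children.
            for target in child.children:
                if target.type == NODE:
                    patterns.append(
                        f"({parent.variable_name})"
                        f"-[{child.variable_name}:{child.graphical_name}]-"
                        f"({target.variable_name}:{target.graphical_name})"
                    )
        # Property TreeNodes add no pattern; recursing into them does nothing.
        create_match_clause(patterns, child)


def match_clause(root):
    """Start from the labelled root node, then walk the tree."""
    patterns = [f"({root.variable_name}:{root.graphical_name})"]
    create_match_clause(patterns, root)
    return "MATCH " + ", ".join(patterns)


# PathTree for Gene.chromosome.gene.length
length = TreeNode(PROPERTY, "gene_chromosome_gene_length", "length")
inner_gene = TreeNode(NODE, "gene_chromosome_gene", "Gene", [length])
chromosome = TreeNode(NODE, "gene_chromosome", "Chromosome", [inner_gene])
root = TreeNode(NODE, "gene", "Gene", [chromosome])

print(match_clause(root))
```

Every later pattern reuses a variable name introduced earlier, so each label is written only once.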
In the next post, I will cover the generation of the Where clause. It is the most complex of all because it involves converting more than 30 PathQuery constraints into their equivalent Cypher expressions.