In this post I am sharing with you the Project Report that I have submitted as the work product for Google Summer of Code 2017. It summarizes all the blog posts written by me during this summer.
- Organisation - InterMine
- Github / @intermine
- Website / InterMine.org
- Project: Prototype a RESTful API Querying Neo4j Database
- Mentors:
- Sam / @sammyjava
- Daniela / @danielabutano
- Vivek / @vivekkrish
- Student: Yash Sharma / @yasharmaster
Table of contents
- Introduction
- Quick links
- How to use the code
- Metadata in Neo4j
- PathQuery to Cypher
- RESTful API
- Conclusion
Introduction
This document is for the purpose of review and for those interested to gain understanding about the work that I did for InterMine as a GSoC 2017 student developer. In this project we developed a prototype of InterMine which uses Neo4j database. The rest of the report describes the various tasks done during the summer.
Quick Links
How to use the code
Test the API via its Swagger UI
- Visit intermine-neo4jwebapp.herokuapp.com to test the API via its Swagger UI.
Host Server Locally
- Visit the project repo intermine/neo4j.
- Follow instructions given in README.md on how to host the server locally.
Metadata in Neo4j
Note - The metadata related code is in org.intermine.neo4j.metadata package.
In the present version of InterMine, the data model is stored in an external XML file. For each entity, its attributes, references and collections are stored in this XML file. For example, a part of the data model file which describes the BioEntity class looks like this.
While developing Neo4j prototype of InterMine, to reduce the dependency on external files, it was decided to store the schema in the Neo4j database itself. Since we can only store Nodes & Directed Relationships in Neo4j, therefore it brought up two issues:
- The data model must to be in the form of a graph itself so that it can be stored in the Neo4j database.
- The data model should represent all the existing Nodes in the database and the Relationships among them.
Since the data model is a graph and it stores information about the InterMine graph, we call it a metagraph. To create the schema/data-model for the database,
- Some sample data is loaded in Neo4j which adheres to the data model.
- As per the Nodes & Relationship of the sample data, metagraph is generated & stored in Neo4j.
- The data model can be retrieved and accessed using the Model class and its methods.
Metagraph Structure
Each node in the metagraph is assigned :Metagraph
label. All the metagraph nodes are further classified into two types, NodeType and RelType. As the name suggests, each NodeType node represents a specific type of nodes and each RelType node represents a specific type of relationship in the IM graph. So each node has either :NodeType
or :RelType
label depending on which entity it represents.
NodeType
A metagraph node labelled :NodeType
contains following two properties.
-
labels - A list containing the labels of the nodes that are represented by the
:NodeType
node. For example, [“Gene”,”SequenceFeature”,”BioEntity”]. -
keys - A list containing the keys of all the properties exist amongst the nodes that are represented by the
:NodeType
node. For example, [“primaryIdentifier”, “secondaryIdentifier”, “symbol”].
Each :NodeType
node is uniquely identified by its labels property. So, for all the nodes in the IM graph which are labelled :Gene
, :SequenceFeature
, :BioEntity
, there exists one :NodeType
node in the metagraph which has its labels property set as [“Gene”,”SequenceFeature”,”BioEntity”].
RelType
A metagraph node labelled :RelType
contains following two properties.
-
type - A string denoting the
type
of the relationships that are represented by the:RelType
node. For example, “HOMOLOGUE_OF”. -
keys - A list containing the keys of all the properties exist amongst the relationships that are represented by the
:RelType
node. For example, [“DataSet”].
Each :RelType
node is uniquely identified by its type property. So, for all the relationships of type HOMOLOGUE_OF in the IM graph, there exists one :RelType
node in the metagraph which has its type property set as “HOMOLOGUE_OF”.
Relationships in MetaGraph
Metagraph should not only contain information about the properties of various entities in the IM graph but it should also store how different types of nodes are connected to each other. To represent this information, we make use of Neo4j relationships.
We know that each :RelType
node represents a type of relationships that exist in the IM graph. Now, we create two outgoing relationships/edges from each :RelType
node - :StartNodeType
and :EndNodeType
. These edges end on a :NodeType
node.
Thus the metagraph path (a:RelType)-[:StartNodeType]->(b:NodeType)
shows that the relationships represented by node a
starts from the nodes represented by the node b
. Same case follows for :EndNodeType
.
PathQuery to Cypher
Note - The Path Query to Cypher conversion code is in org.intermine.neo4j.cypher package.
The InterMine graph is queried by the user via the API using Path Queries. On the other hand, Neo4j database is queried by Cypher Query Language. Therefore, we needed to convert the Path Query given by the user to Cypher Query. This conversion was the most difficult part of the project. Following is an example Path Query.
As you can see, various paths make up most of the Path Query. The information represented by paths is
- either returned as views,
- constrained to filter the data or
- used to order the results etc.
After pondering over the paths, I concluded that a Tree would be the best possible representation of all the paths of the Path Query. So, I created a data structure called a Path Tree and came up with an approach of Path Query to Cypher conversion which makes use of the Path Tree.
Roughly, the following approach was used by me to convert Path Query to Cypher.
The above approch is implemented in QueryGenerator class. After converting the Path Query given above, we will get the following Cypher query.
RESTful API
Note - The REST API code is in intermine-neo4jwebapp sub-project of the intermine/neo4j repository.
InterMine-Neo4j API
The InterMine-Neo4j API is developed using Jersey and is documented using Swagger. The InterMine-Neo4j API can be explored via Swagger UI at intermine-neo4jwebapp.herokuapp.com. There are two major endpoints in the API so far,
- query/cypher - This returns the Cypher equivalent of any Path Query.
- query/results - This endpoint returns the results of a Path Query by converting it into Cypher and running that Cypher against InterMine-Neo4j Graph.
API Response
We have kept InterMine-Neo4j API JSON response exactly similar to the InterMine API response. This would ensure that the applications that consume existing InterMine API can work seamlessly with the new API. The API response is being created manually in QueryResult class.
Conclusion
Contributing to InterMine this summer has been an amazing experience. It was my first time working on a project with a global team of experienced developers. I have got a chance to explore the latest technologies like Neo4j, Jersey, Gradle, Swagger etc.
I am grateful to my mentors Sam, Vivek and Daniela, for their guidance, direction and support throughout the summer. It is only because of their support and encouragement that I could complete this project. The entire InterMine community is welcoming and friendly. Overall it was a great learning experience!