GSoC 2020 Proposal: GORA-650 Add a data store for ArangoDB
Project: Add a data store for ArangoDB
Project: Apache Gora
Mentors: Furkan Kamaci, Kevin Ratnasekera
Background:
The Apache Gora [1] open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores, distributed in-memory key/value stores, in-memory data grids, in-memory caches, distributed multi-model stores, and hybrid in-memory architectures.
Moreover, Gora facilitates data modelling through a paradigm called object-to-datastore mapping, which aims to bring to the NoSQL realm a solution similar to ORM (object-relational mapping). The community has accordingly developed support for a wide variety of backends such as MongoDB, HBase, and Solr. As part of the efforts to extend Gora support to new data stores, this project aims to implement a backend for ArangoDB [2].
Multi-model databases are becoming more popular because of their native support for polyglot persistence. Polyglot persistence means that data is best stored using multiple physical data models, each chosen according to how the data is used by individual applications or by components of a single application. Multi-model databases acknowledge this need and combine several data models in one system, which reduces operational complexity and cost, improves extensibility, and helps maintain data consistency. Apache Gora currently supports OrientDB [3] as a multi-model database; this project proposes extending multi-model database support to ArangoDB [2].
Solution:
Apache Gora [1] provides a reusable set of base implementations to extend when implementing a datastore for any persistent storage system. The guide at [4] gives detailed information on writing a datastore for any backend. The main contracts are:
QueryBase
ResultBase
DataStoreBase
By implementing the contracts above, one can use Apache Gora's key-value based API to access, persist, and manipulate data irrespective of the physical data model of the underlying backend. The Apache Gora datastore API provides methods such as put, get, connect/disconnect to the backend, create/delete schema, and execute queries. These contracts have to be implemented using the Java driver for ArangoDB [5]. The driver provides a client API for talking to the ArangoDB backend to store and retrieve data and to execute queries, as per the API specification given by ArangoDB. All ArangoDB client properties should be externalized in the gora.properties file. Common properties include the backend database host, port, etc.
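As an illustration of externalizing the client settings, the sketch below reads connection properties the way a store might during initialization. The key names (gora.arangodb.server.host, gora.arangodb.server.port, gora.arangodb.db), the default database name, and the class name are assumptions made for this example; the final module may name them differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Sketch of a typed view over the ArangoDB entries in gora.properties. */
final class ArangoDBStoreParameters {
  // Hypothetical key names; the real module may choose different ones.
  static final String HOST_KEY = "gora.arangodb.server.host";
  static final String PORT_KEY = "gora.arangodb.server.port";
  static final String DATABASE_KEY = "gora.arangodb.db";

  final String host;
  final int port;
  final String database;

  private ArangoDBStoreParameters(String host, int port, String database) {
    this.host = host;
    this.port = port;
    this.database = database;
  }

  /** Reads the connection settings, falling back to defaults when a key is absent. */
  static ArangoDBStoreParameters load(Properties props) {
    String host = props.getProperty(HOST_KEY, "localhost");
    // 8529 is the port an ArangoDB server listens on by default.
    int port = Integer.parseInt(props.getProperty(PORT_KEY, "8529").trim());
    String database = props.getProperty(DATABASE_KEY, "gora"); // assumed default
    return new ArangoDBStoreParameters(host, port, database);
  }

  public static void main(String[] args) throws IOException {
    // Inline stand-in for the contents of a gora.properties file.
    String file = String.join("\n",
        "gora.arangodb.server.host=db.example.org",
        "gora.arangodb.server.port=8530",
        "gora.arangodb.db=webpages");
    Properties props = new Properties();
    props.load(new StringReader(file));

    ArangoDBStoreParameters params = load(props);
    System.out.println(params.host + ":" + params.port + "/" + params.database);
  }
}
```

Keeping the parsing in one small class means the store itself only sees validated values, and a missing key degrades to a local default instead of a failure at connect time.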
One of the key aspects of the datastore implementation is the design of how Avro persistent data beans are mapped to the physical data model of ArangoDB; in Apache Gora terminology, the design of the ArangoDB mapping file. The mapping file gora-arangodb-mapping.xml describes how each field of an Avro data bean is mapped to the physical data model of ArangoDB. It may carry field attributes such as the field name and the field data type.
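To make the mapping idea concrete, here is a minimal sketch of the in-memory structure that a parsed gora-arangodb-mapping.xml could produce: one target collection plus, per bean field, a document attribute name and a type hint. The class, the method names, and the example bean fields are hypothetical and only illustrate the shape of the data; the actual mapping design is part of the proposed work.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the in-memory form a parsed gora-arangodb-mapping.xml might take. */
final class ArangoDBMappingSketch {

  /** Target of one mapped bean field: document attribute name and a type hint. */
  static final class DocumentField {
    final String name;
    final String type;

    DocumentField(String name, String type) {
      this.name = name;
      this.type = type;
    }
  }

  private final String collection;
  private final Map<String, DocumentField> fields = new LinkedHashMap<>();

  ArangoDBMappingSketch(String collection) {
    this.collection = collection;
  }

  /** Registers how one Avro bean field is stored; returns this for chaining. */
  ArangoDBMappingSketch map(String avroField, String documentField, String type) {
    fields.put(avroField, new DocumentField(documentField, type));
    return this;
  }

  String collection() {
    return collection;
  }

  /** Looks up the document field for a bean field, failing loudly if unmapped. */
  DocumentField documentFieldFor(String avroField) {
    DocumentField field = fields.get(avroField);
    if (field == null) {
      throw new IllegalArgumentException("No mapping for Avro field: " + avroField);
    }
    return field;
  }

  public static void main(String[] args) {
    // Hypothetical bean with the fields "name" and "dateOfBirth".
    ArangoDBMappingSketch mapping = new ArangoDBMappingSketch("employees")
        .map("name", "name", "string")
        .map("dateOfBirth", "dob", "long");
    DocumentField dob = mapping.documentFieldFor("dateOfBirth");
    System.out.println(mapping.collection() + "." + dob.name + " : " + dob.type);
  }
}
```

During put and get, the store would consult such a structure to translate between bean field names and document attribute names.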
For the query and result implementations, ArangoDB includes its own query language, the ArangoDB Query Language (AQL [6]), which is similar to Structured Query Language (SQL) and is used to manipulate data stored in the ArangoDB backend. It is unclear at this point whether a stable, released query builder for AQL exists for Java; some design-level discussion took place in [7] and some work has been done in [8]. One possibility is to write the query-to-object mapping from scratch for the queries most essential to Apache Gora.
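If the queries are written from scratch, a key-range query could be translated into an AQL string plus bind parameters roughly as sketched below. The class and method names are hypothetical, and treating a non-positive limit as "no limit" is an assumption of this sketch. The AQL itself uses standard constructs: a collection bind parameter (written @@collection in the query and keyed as "@collection" in the bind variables), a FILTER on the document's _key attribute, and LIMIT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: turns a key-range query into AQL text plus bind variables. */
final class AqlKeyRangeQuery {
  final String aql;
  final Map<String, Object> bindVars;

  private AqlKeyRangeQuery(String aql, Map<String, Object> bindVars) {
    this.aql = aql;
    this.bindVars = bindVars;
  }

  /**
   * @param startKey inclusive lower bound on _key, or null for none
   * @param endKey   inclusive upper bound on _key, or null for none
   * @param limit    maximum number of results; values <= 0 mean no limit (assumption)
   */
  static AqlKeyRangeQuery build(String collection, String startKey, String endKey, long limit) {
    Map<String, Object> vars = new LinkedHashMap<>();
    // Collection bind parameters carry an extra '@' in both query and key.
    vars.put("@collection", collection);

    List<String> filters = new ArrayList<>();
    if (startKey != null) {
      filters.add("doc._key >= @startKey");
      vars.put("startKey", startKey);
    }
    if (endKey != null) {
      filters.add("doc._key <= @endKey");
      vars.put("endKey", endKey);
    }

    StringBuilder aql = new StringBuilder("FOR doc IN @@collection");
    if (!filters.isEmpty()) {
      aql.append(" FILTER ").append(String.join(" AND ", filters));
    }
    if (limit > 0) {
      aql.append(" LIMIT @limit");
      vars.put("limit", limit);
    }
    aql.append(" RETURN doc");
    return new AqlKeyRangeQuery(aql.toString(), vars);
  }

  public static void main(String[] args) {
    AqlKeyRangeQuery query = build("employees", "a", "m", 10);
    System.out.println(query.aql);
    System.out.println(query.bindVars);
  }
}
```

Passing user-supplied keys as bind variables rather than concatenating them into the query text avoids AQL injection; the resulting string and map would then be handed to the Java driver's query call.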
Integration tests are required for proper testing of the ArangoDB module. Apache Gora provides base datastore test case classes that can be extended as required by a custom datastore. Each test needs to start and stop a lightweight ArangoDB server instance programmatically. Since the ArangoDB server is written in C++, it cannot be embedded in the JVM; the tests therefore need to use Testcontainers [9] to spin up an ArangoDB server instance (a process running inside a Docker container) programmatically from the Java tests for Apache Gora.
Deliverables:
Maven module for the new datastore (gora-arangodb)
This will implement the following Gora core contracts:
QueryBase
ResultBase
DataStoreBase
These consist of methods to connect to and disconnect from the ArangoDB backend, methods to get and put persistent data beans, methods to create and delete schemas, and methods to execute key-value based queries over the ArangoDB backend.
Unit tests for the ArangoDB module. These should include tests for datastore functionality (datastore base tests), test cases for mapping creation, as well as tests for workloads such as MapReduce, Spark, and Flink.
ArangoDB Module documentation for Apache Gora website.
Scheduling:
Time period, followed by expected outcomes:

Community Bonding Period: May 04 - June 01 (4 weeks)
Setting up development environment
Research on ArangoDB document data model, coding examples
Fix bugs and improvements to code base as warm up
Create initial Maven module structure for datastore

Coding Period 1: June 01 - July 03 (4 weeks)
Design for mapping file gora-arangodb-mapping.xml of ArangoDB store.
Implementation of DataStoreBase interface methods related to connect/disconnect ArangoDB client to backend, get, put, create schema, delete schema.

Coding Period 2: July 03 - July 31 (4 weeks)
Implementation of DataStoreBase interface methods related to query execution. That should include implementation to interfaces QueryBase, ResultBase.

Coding Period 3: July 31 - August 31 (4 weeks)
Unit tests for datastore and workloads
ArangoDB module documentation for Apache Gora website.
Community Engagement:
Communicated with a potential mentor and submitted patches for the following Jira tickets.
About:
I am Dinuka Perera, currently a computer science master's student at the University of Stuttgart, Germany. I have well over 5 years' worth of experience in Java and related data technologies.
Commitment:
I will be able to allocate more than 45 hours of work per week throughout the entire GSoC period. I don't have any other commitments during that period. My motivation for participating in GSoC with Apache Gora is to become a long-term contributor and earn merit to become an Apache committer.
References: