Before delving into graph databases, I am going to write down some thoughts on modularity. When starting to develop a prototype system, especially when working alone, it is a good idea to focus on the most important new ideas rather than waste time on irrelevant things. Besides, systems are inherently wide in scope and include multiple separate parts, so constructing them is time-consuming.
Developing software systems becomes remarkably laborious if one wishes to do everything on one’s own. The mythical heroic epics of software development reach their climax when, after extensive trials and learning, all existing solutions are found to be lacking in some respect and in the end everything is developed from scratch. These attitudes are slowly fading into obscurity. The current trend is to develop only the essential parts of the system in-house and to construct the rest from ready-made building blocks. The reasons for this change are that the modularity of software has improved and libraries have become better, but also that hardware has become capable enough that it is no longer mandatory to squeeze every last processing cycle out of the code. The mindset has changed as well: developers are more tolerant of not every detail of the system being exactly just so.
My own web pages are a good example of modular development. I did not want to spend much time building the platform itself, because that is not at the core of my business. The purpose of the web pages is to serve as a marketing tool, a channel for distributing information and a blogging platform, so the relevant task is producing the content. I need servers for various purposes, so using just a web hosting service was not sufficient, but I absolutely didn’t want to waste too much time on hardware either.
As I described in the previous part of this series, I got cloud servers from Nebula for my company. They are virtual servers residing in a datacenter, and they can be brought online and shut down on demand. OpenStack provides an easy-to-use interface for managing the servers in a browser. SSH keys can be used to make maintenance more secure, and the virtual servers are pre-configured and isolated units that perform only the task assigned to them. The environment can be controlled very tightly, which also makes it more manageable from the security point of view, and resources are used more efficiently.
Large amounts of data and large numbers of users cannot be handled with a single machine. It is also easy to construct clusters of servers from virtual machines, and there are ready-made solutions for balancing the load between the servers in the cluster. For instance, Apache Spark can be used to query a large mass of data in such a way that multiple machines process the search at the same time and the results are combined at the end. The Neo4j graph database also supports distributing the database among multiple servers. There are also solutions for combining Neo4j and Spark.
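To make the query pattern described above concrete, here is a minimal sketch in plain Python. It is not Spark code: a thread pool stands in for the machines of a cluster, and the names `search_partition`, `distributed_search` and the sample data are my own assumptions for illustration. Each worker searches its own slice of the data, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, keyword):
    """Return the records in one partition that contain the keyword."""
    return [record for record in partition if keyword in record]


def distributed_search(partitions, keyword, max_workers=4):
    """Search all partitions in parallel and combine the partial results.

    Hypothetical helper: threads play the role of cluster machines here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = pool.map(
            lambda partition: search_partition(partition, keyword), partitions
        )
        # pool.map yields results in partition order, so the merge is deterministic.
        return [record for part in partial_results for record in part]


# Made-up sample data, split into three partitions.
partitions = [
    ["graph database", "relational database"],
    ["web server", "graph theory"],
    ["load balancer"],
]
matches = distributed_search(partitions, "graph")
print(matches)
```

In a real cluster the partitions would live on different machines and the framework would take care of shipping the search function to the data, but the split, search and combine structure stays the same.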
Cloud servers make hardware modular, expandable and manageable. Docker does the same for software. Docker containers, which can be seen as ‘cargo containers’ for software, behave like very lightweight virtual machines intended for running one restricted service. One server, such as a virtual server inside a data center, can run multiple Docker containers at the same time. Each one can contain a part of the system, and these parts can communicate with each other. The tasks and interfaces of each container can be specified in detail.
As an example, on my own web server I have a MySQL database in one Docker container and the WordPress content management system in another. With the Docker Compose tool I can configure this system in one file and run it with one command. The data in the database and the files used by the content management system can reside on a separate storage volume that can be detached from the cloud server and attached to another. In a few minutes I can set up an identical environment on another virtual server using OpenStack and Docker, attach the storage volume or a backup copy of it, and associate the IP address with the new server. Using the APIs, I could in principle automate this procedure with a script that I could run on my own laptop. Similarly, I could create scripts that increase the resources of the server as the load increases and decrease them when the demand drops.
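The decision logic of such a scaling script could be sketched roughly as follows. This is only an illustration under my own assumptions: the function name `target_server_count`, the load thresholds and the server limits are hypothetical, and the actual OpenStack calls that would start or stop servers are deliberately left out.

```python
def target_server_count(current, load, scale_up_at=0.8, scale_down_at=0.3,
                        minimum=1, maximum=4):
    """Return how many servers should run, given the average load (0.0 to 1.0).

    All thresholds and limits are hypothetical defaults for illustration.
    """
    if load > scale_up_at:
        desired = current + 1   # demand is high: add one server
    elif load < scale_down_at:
        desired = current - 1   # demand is low: remove one server
    else:
        desired = current       # load is in the comfortable range
    # Never go below the minimum or above the maximum.
    return max(minimum, min(maximum, desired))


print(target_server_count(current=1, load=0.9))
print(target_server_count(current=2, load=0.1))
```

Leaving a gap between the two thresholds keeps the script from starting and stopping servers back and forth when the load hovers around a single limit.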
Modular thinking extends even to individual pieces of software. WordPress was an easy choice for a content management system, because it is very modular and there are lots of plugins available for extending its functionality. For this reason, it is possible to keep the basic system simple and offer it for free. Plugins can be bought to enable advanced features. For me as a business user, it was a very simple and fast solution to install the basic WordPress system in a Docker container running on a virtual server, purchase an easily customizable theme, and enable automatic spam protection, search engine optimization and backups with suitable plugins. I purchased the graphical elements and layout instructions from a professional graphic designer. It took only very little time to set up and customize the website, and I could focus on the actual content production, as I wished. When the site goes down, I get an automatic alert, and I can open an SSH connection to investigate and fix the situation immediately.
When developing my own system, I want to follow the kind of modular thinking described here. On the one hand, I want to make my own system modular, so that it can be easily adopted and integrated with other systems. On the other hand, I want to use ready-made building blocks in its development, so that I can dedicate as much time as possible to developing the actual core pieces of the system.
Next, I will finally get into graph databases and some of the core topics of my project.