modular system development

Before delving into graph databases, I am going to write down some thoughts on modularity. When starting to develop a prototype system, especially alone, it is best not to waste time on irrelevant things but to focus on the most important new ideas. Besides, systems are inherently wide in scope and consist of multiple separate parts, so constructing them is time-consuming.

Developing software systems becomes remarkably laborious if one wishes to do everything on one’s own. Mythical heroic epics of software development reach their climax when, after extensive trials and learning, all existing solutions are found to be lacking in some respect, and in the end everything is developed from scratch. These attitudes are slowly fading into obscurity. The current trend is to develop only the essential parts of the system in-house and to construct the rest from ready-made building blocks. One reason for this change is that the modularity of software has improved and libraries have become better, but also that hardware has become capable enough that it is no longer mandatory to squeeze every last processing cycle out of the code. The mindset has changed as well: developers are more tolerant of not every detail of the system being exactly just so.

My own web pages are a good example of modular development. I did not want to spend much time building the platform itself, because that is not at the core of my business. The purpose of the web page is to serve as a marketing tool, a channel for distributing information and a blogging platform, so the relevant task is producing the content. I need servers for various purposes, so a plain web hosting service was not sufficient, but I absolutely didn’t want to waste too much time on hardware either.

As I described in the previous part of this series, I got cloud servers for my company from Nebula. They are virtual servers residing in a data center, and they can be brought online and shut down on demand. OpenStack provides an easy-to-use browser interface for managing the servers. SSH keys can be used to make maintenance more secure, and the virtual servers are pre-configured, isolated units that perform only the task appointed to them. The environment can be controlled very tightly, which makes it more manageable also from a security point of view, and resources are used more efficiently.

Large amounts of data and users cannot be handled with a single machine. Virtual machines also make it easy to construct server clusters, and there are ready-made solutions for balancing the load between the servers in a cluster. For instance, Apache Spark can be used to query a large mass of data so that multiple machines process the search in parallel and the results are finally combined. The Neo4j graph database likewise supports distributing the database among multiple servers, and there are solutions for combining Neo4j and Spark.

Cloud servers make hardware modular, expandable and manageable. Docker does the same for software. Docker containers, which can be seen as ‘cargo containers’ for software, behave like very lightweight virtual machines, each intended for running one restricted service. One server, such as a virtual server inside a data center, can run multiple Docker containers at the same time. Each container can hold one part of the system, these parts can communicate with each other, and the tasks and interfaces of each container can be specified in detail.

As an example, on my own web server I have a MySQL database in one Docker container and the WordPress content management system in another. With the Docker Compose tool I can configure this whole setup in one file and start it with one command, along the lines of the sketch below. The data in the database and the files used by the content management system can reside on a separate storage volume that can be detached from one cloud server and attached to another. In a few minutes I can set up an identical environment on another virtual server using OpenStack and Docker, attach the storage volume or a backup copy of it, and associate the IP address with the new server. Using the API interfaces I could in principle automate this procedure into a script that I could run on my own laptop. Similarly, I could create scripts that increase the resources of the server as the load grows and decrease them when demand falls.
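To give an idea of what such a file looks like, here is a minimal Docker Compose sketch along the lines of the official MySQL and WordPress images’ documentation. It is not my actual configuration: the passwords are placeholders and the volume name is arbitrary.

```yaml
# docker-compose.yml -- minimal WordPress + MySQL sketch (placeholder credentials)
version: '2'
services:
  db:
    image: mysql:5.7
    volumes:
      - db_data:/var/lib/mysql        # database files live on a named volume
    environment:
      MYSQL_ROOT_PASSWORD: changeme   # placeholder, not a real password
      MYSQL_DATABASE: wordpress
  wordpress:
    image: wordpress:latest
    depends_on:
      - db
    ports:
      - "80:80"                       # serve the site on the host's port 80
    environment:
      WORDPRESS_DB_HOST: db:3306      # the service name doubles as a hostname
      WORDPRESS_DB_PASSWORD: changeme # must match the database password above
volumes:
  db_data:
```

With this file in place, running docker-compose up -d starts both containers in the background, and docker-compose down stops them again.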

The modular thinking extends even to individual pieces of software. WordPress was an easy choice for a content management system, because it is very modular and there are lots of plugins available for extending its functionality. This is what makes it possible to keep the basic system simple and offer it for free, with plugins available for purchase to enable advanced features. For a business user like me it was a very simple and fast solution to install the basic WordPress system in a Docker container running on a virtual server, purchase an easily customizable theme, and enable automatic spam protection, search engine optimization and backups with suitable plugins. The graphical elements and layout instructions I purchased from a professional graphic designer. It took only very little time to set up and customize the website, and I could focus on the actual content production, as I wished. If the site goes down, I get an automatic alert and can open an SSH connection to investigate and fix the situation immediately.

When developing my own system, I want to follow the kind of modular thinking described here. On one hand, I want to make the system itself modular, so it can be easily adopted and integrated with other systems. On the other hand, I want to use ready-made building blocks in its development, so I can dedicate as much time as possible to the actual core pieces of the system.

Next, I will finally get into graph databases and some of the core topics of my project.

 

project startup

I am now starting a new series of articles in which I showcase my own internal development project. The purpose is to demonstrate my way of working and the things I am able to do, and also to present new kinds of ideas for developing digital systems.

The goal of the project is to construct a learning data analysis system that can be used on many different devices as well as in web browsers. In the short term it will serve as a small-scale reference project for presenting and trying out various ideas; in the longer term I hope to develop a marketable customer product from it. That would of course require growing the company or finding suitable partner companies. However, technology has reached the point where it is possible to create a prototype by myself.

Usually, when I’m starting a new project, I begin with an exploration phase in which I evaluate various techniques and tools to find the ones suitable for getting the expected results. I also typically perform some tests and build quick prototypes to get an intuitive feeling for how the ideas work in practice. I try to avoid getting too attached to tools and techniques that I’m familiar with and instead try to discover the most suitable options, since I’m not afraid of having to learn new things. Some constraints have to be set, however, so I can get started quickly; it is not a good idea to do everything with completely new tools.

In this project, the most important constraints arise from my wish to use functional programming and graph databases in the core modules that process data. I will write more about both later, but the decision is based on my prior experience that functional programming makes the development of challenging algorithmic code faster and more manageable compared with more traditional programming languages. The data processing and analysis methods I plan to use are graph-based, and the data to be processed will be very heterogeneous, so graph databases will make the processing easier and more intuitive.
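As a small taste of why the functional style appeals to me for graph-shaped problems, here is a toy illustration of my own devising, not code from the project: a graph stored as an adjacency map, with reachability written as a short recursive traversal.

```haskell
-- Toy illustration (my own example, not project code): a graph as an
-- adjacency map, and the set of nodes reachable from a starting node.
import qualified Data.Map.Strict as M
import qualified Data.Set        as S

type Graph a = M.Map a [a]

reachable :: Ord a => Graph a -> a -> S.Set a
reachable g start = go (S.singleton start) [start]
  where
    go seen []       = seen
    go seen (x:rest) =
      let next = filter (`S.notMember` seen) (M.findWithDefault [] x g)
      in  go (foldr S.insert seen next) (next ++ rest)
```

For example, reachable (M.fromList [(1,[2,3]),(2,[4])]) 1 evaluates to the set {1,2,3,4}. The whole traversal fits in a few lines, with no mutable state to keep track of.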

The project must obviously contain server components, since the same data must be accessible from many different devices and web browsers. I do not wish to waste time configuring hardware, so rented cloud servers are an easy solution. I want to use servers located in Finland, and the service must be scalable so that it will be possible to grow the business in the future. Because I work alone, I want to take advantage of readily available modules that have been tested and proven to work; for this reason, the cloud platform must support Docker, which enables the use of such packaged, modular software components.

Based on these constraints, the initial project environment looks like this: the cloud server provider will be Nebula, a Finnish company that can provide me a scalable service based on OpenStack servers with out-of-the-box Docker support. As the graph database engine I have chosen Neo4j, because it scales well for future applications, it has existing interfaces to many other platforms and tools, and an official Docker image is available. I was also convinced by the architecture of the database itself, but I will discuss that in more detail in a future article.
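Since the official Neo4j image is on Docker Hub, bringing the database up can be expressed as another Compose file. The following is a sketch based on the image’s documented ports and data directory, not my final configuration; the volume name is my own.

```yaml
# Minimal sketch for running the official Neo4j image (not my final configuration)
version: '2'
services:
  neo4j:
    image: neo4j:latest
    ports:
      - "7474:7474"        # HTTP interface and the built-in browser UI
      - "7687:7687"        # Bolt protocol for driver connections
    volumes:
      - neo4j_data:/data   # keep the database on a detachable volume
volumes:
  neo4j_data:
```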

Server-side programming I will initially do with Haskell and Yesod, because I’m familiar with them and can get started quickly. For production-level applications I will have to look for more mature tools, though; I have been eyeing the Scala programming language and Apache Spark as the most promising options, but I will look into other tools as well. On the browser side I want to keep things simple and light, and among the tools I’m familiar with, Bootstrap and jQuery look like the best options at the moment. On desktop and mobile applications I don’t want to spend too much time, so Qt and Android are natural choices, since they are very common and I’m familiar with them.
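To show how little ceremony is needed to get started, here is roughly the minimal application from the Yesod book. It is a generic hello-world sketch rather than project code, and the port number is arbitrary.

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes       #-}
{-# LANGUAGE TemplateHaskell   #-}
{-# LANGUAGE TypeFamilies      #-}
import Yesod

-- The foundation type of the application.
data App = App

-- Declare one route: GET / is dispatched to getHomeR.
mkYesod "App" [parseRoutes|
/ HomeR GET
|]

instance Yesod App

-- Render a minimal HTML page through the default layout.
getHomeR :: Handler Html
getHomeR = defaultLayout [whamlet|Hello from the prototype!|]

main :: IO ()
main = warp 3000 App
```

Running this and pointing a browser at port 3000 serves the page; the real prototype will grow from this kind of skeleton.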

The next step will be setting up the Neo4j database on the server and developing a light prototype application for browsing the database and making small changes. I will report on my progress next week, and also discuss graph databases in general.