If you have ever worked with cloud services, you have probably come across the widespread opinion that moving an application to the cloud is a panacea for all of its problems. I regularly encounter this attitude in meetings with a wide variety of customers. Unfortunately, this opinion is wrong: part of the responsibility still lies with the client.
To make Cloud Native applications fault tolerant, you need to follow certain principles at both the program code level and the level of the underlying infrastructure. However, even adherence to these principles does not guarantee complete protection against all risks.
Therefore, if you are faced with the task of building a 100% fault-tolerant and highly available (High Availability, HA) system, I recommend the Multicloud Native Service approach, which combines the best of the Multi-Cloud and Cloud-Native approaches. This approach is not limited to the use of containers: for an application to survive any failure, you also need to think about the infrastructure and, in particular, use not one but at least two independent sites, for example, a public cloud provider and a private infrastructure.
I will detail the advantages of this approach and the options for implementing it, the difficulties you can face when using it and the methods for solving them, and the characteristics an ideal cloud provider should have.
Principles Of Building Cloud-Native Applications
Before considering options for building Cloud Native architectures, you need to have your applications ready to be migrated to the cloud. According to the Cloud Native Computing Foundation (CNCF), which has become a kind of ambassador for the development of cloud technologies, five basic principles distinguish Cloud Native applications from others.
1. Agility
This is the ability to quickly deploy and configure your system on any new site, which becomes especially relevant when you unexpectedly have to change cloud providers. Means of achieving agility include CI/CD and the Infrastructure as Code (IaC) approach. Let’s dwell on the second one.
Your entire infrastructure should be described declaratively in the form of code. The description may include a list of virtual machines required by the service, requirements for their configuration, network topology, DNS names, and so on. Terraform has recently become the de facto standard for this task, but there are also alternatives: Ansible, Salt, Cloudify, Foreman, Pulumi, AWS CloudFormation.
The purpose of these tools is to let you quickly deploy and configure the infrastructure and keep its state under control, including when changing providers. With a small number of virtual machines, manual configuration is quite feasible. But when the machines number in the hundreds and are complemented by a complex network topology, transferring the infrastructure manually can take several weeks at best, or even months.
The goal of the IaC approach is to shorten this time. With declarative descriptions, all that is required during the transition is to use a different Terraform provider, change the environment variables, and adapt the code a little.
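To make the declarative idea more concrete, here is a minimal Python sketch of the "compare declared state with actual state" step that IaC tools perform before they change anything. It is not Terraform's implementation: the `VM` model, the `plan` function, and the machine names are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Declared or observed state of one virtual machine (hypothetical model)."""
    name: str
    cpus: int
    memory_gb: int


def plan(desired: list[VM], actual: list[VM]) -> dict[str, list[str]]:
    """Compare the declared state with the observed state and list the
    actions needed to converge, without performing them."""
    want = {vm.name: vm for vm in desired}
    have = {vm.name: vm for vm in actual}
    return {
        "create": sorted(want.keys() - have.keys()),
        "delete": sorted(have.keys() - want.keys()),
        "update": sorted(n for n in want.keys() & have.keys() if want[n] != have[n]),
    }


# The same declaration is reused on every site.
desired = [VM("web-1", 2, 4), VM("web-2", 2, 4), VM("db-1", 8, 32)]

# On a freshly chosen site nothing exists yet, so everything is created.
print(plan(desired, actual=[]))

# On the current site, db-1 is undersized and an old machine is left over.
current_site = [VM("web-1", 2, 4), VM("db-1", 4, 16), VM("old-1", 1, 1)]
print(plan(desired, actual=current_site))
```

The point of the sketch is that the declaration stays the same when the site changes; only the observed state, and therefore the resulting plan, differs.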
It is considered good practice to combine tools for declarative description with post-install scripts: after the infrastructure starts, they configure the information system as needed, for example by installing the latest updates to the software used. Ansible Playbooks, Puppet, Chef, and other tools are intended for this purpose, or you can use cloud-init.
2. Possibility Of Operation
This is the ability to manage the life cycle of your systems in an automated manner. This property is closely related to agility and primarily involves the use of CI/CD pipelines for automatic application deployment. You should be able to quickly release new builds, track the status of their execution in different environments, and roll back in case of failures. If you are building an application based on microservices, it is recommended to set up an independent life cycle (a separate build) for each service.
Be sure to provide for rolling back updates and for updating only part of your information systems at a time (Rolling Update); this helps avoid rollout errors that make the whole service unavailable. The modern IT market offers many solutions for building CI/CD: GitLab CI/CD, Jenkins, GoCD, and others.
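The rolling update with rollback described above can be sketched in a few lines of Python. This is a simplified simulation, not the behavior of any particular CI/CD tool: the `rolling_update` function, the `is_healthy` callback, and the instance names are assumptions made for the example.

```python
from __future__ import annotations

from typing import Callable


def rolling_update(
    instances: dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
    batch_size: int = 1,
) -> bool:
    """Update instances batch by batch. On the first failed health check,
    restore every instance to its previous version and stop.
    Returns True if the rollout completed."""
    previous = dict(instances)  # snapshot used for rollback
    names = list(instances)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            instances[name] = new_version
        if not all(is_healthy(name, new_version) for name in batch):
            instances.update(previous)  # roll back everything touched so far
            return False
    return True


# A good build reaches the whole fleet.
fleet = {"app-1": "1.0", "app-2": "1.0", "app-3": "1.0"}
ok_good = rolling_update(fleet, "1.1", lambda name, version: True)
print(ok_good, fleet)

# A build that fails on app-2 is rolled back; app-3 is never touched.
fleet_bad = {"app-1": "1.0", "app-2": "1.0", "app-3": "1.0"}
ok_bad = rolling_update(fleet_bad, "1.2", lambda name, version: name != "app-2")
print(ok_bad, fleet_bad)
```

Because only one batch is updated at a time, a faulty build affects at most `batch_size` instances before the rollback starts, and the rest of the service keeps running the old version.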
3. Observation Capability
This is the ability to monitor the infrastructure and the business metrics of your systems. The importance of monitoring for any application, be it a startup or an enterprise solution, can hardly be overestimated. However, I regularly come across projects where monitoring is absent, which in most cases leads to serious problems.
When moving to the cloud, monitoring will, at a minimum, help you verify the provider’s stated SLA and give you a performance history for analyzing disputed situations; at best, it will help you find bottlenecks in the architecture and fix them before they affect users.
In most cases, cloud providers support all standard monitoring tools such as Prometheus, Grafana, Fluentd, OpenTracing, Jaeger, Zabbix, and others, and may also offer their own built-in monitoring.
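As an illustration of checking a stated SLA against your own monitoring data, here is a minimal Python sketch. The probe schedule (one probe per minute), the number of failed probes, and the 99.95% SLA figure are all assumed values for the example and are not taken from any real provider.

```python
from __future__ import annotations


def availability_percent(probes: list[bool]) -> float:
    """Share of successful availability probes, as a percentage."""
    if not probes:
        raise ValueError("no probe results")
    return 100.0 * sum(probes) / len(probes)


def downtime_budget_minutes(sla_percent: float, period_minutes: int) -> float:
    """Downtime that the SLA permits over the given period."""
    return period_minutes * (100.0 - sla_percent) / 100.0


# Hypothetical month: one probe per minute for 30 days, 30 of them failed.
MINUTES_IN_30_DAYS = 30 * 24 * 60
probes = [True] * (MINUTES_IN_30_DAYS - 30) + [False] * 30

sla = 99.95  # assumed SLA figure
measured = availability_percent(probes)
budget = downtime_budget_minutes(sla, MINUTES_IN_30_DAYS)

print(f"measured availability: {measured:.3f}%")
print(f"downtime budget: {budget:.1f} min")
print("SLA met" if measured >= sla else "SLA breached")
```

In this made-up month, 30 minutes of failed probes exceed the roughly 21.6 minutes that a 99.95% SLA allows over 30 days, so your own records would give you grounds for a dispute.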
4. Elasticity
This is the ability to scale your applications based on changing workloads. When it comes to auto-scaling, Kubernetes is, of course, the first tool that comes to mind, as it has become a kind of standard for orchestrating containerized applications.
Indeed, it supports all the necessary levels of scaling, and some providers even have autoscaling at the cluster level out of the box. However, I note that its use is by no means mandatory: you can build an elastic application based on pure IaaS (Infrastructure as a Service), using monitoring and pre-configured hooks to handle load changes. Yes, Kubernetes greatly simplifies this process, but remember that it is just a tool.
To be consistent with the Cloud Native approach, it is imperative to plan your architecture so that there are no nodes that scale only vertically (by increasing or decreasing the allocated resources), since you will ultimately be limited by the capacity of the underlying hardware or hypervisors in the cloud. Try to ensure that each node can scale horizontally by changing the number of application instances.
Also, do not forget the downside of scaling: exposure to DDoS attacks. Always choose the autoscaling boundaries (the minimum and maximum number of nodes) carefully and use anti-DDoS protection. Otherwise, DDoS traffic, which the autoscaling system cannot distinguish from legitimate load, can lead to severe financial losses.
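The role of the autoscaling boundaries is easy to see in a small Python sketch. It uses a proportional scaling rule of the same form as the one documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × current metric / target metric)) and then clamps the result. The function name, the load figures, and the bounds are assumptions chosen for the example.

```python
import math


def desired_replicas(
    current: int,
    current_load: float,
    target_load: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Proportional scaling rule, clamped to the configured bounds."""
    raw = math.ceil(current * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))


# Assumed setup: 4 replicas, target of 60 requests/s per replica, bounds 2..10.
normal_peak = desired_replicas(4, current_load=90, target_load=60,
                               min_replicas=2, max_replicas=10)
under_attack = desired_replicas(4, current_load=3000, target_load=60,
                                min_replicas=2, max_replicas=10)
quiet_night = desired_replicas(4, current_load=5, target_load=60,
                               min_replicas=2, max_replicas=10)

print(normal_peak, under_attack, quiet_night)
```

Without the upper bound, the attack scenario would request 200 replicas; with it, the bill is capped at 10 replicas, and filtering the malicious traffic is left to the anti-DDoS layer.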
5. Fault Tolerance
This is the ability to quickly and automatically recover applications in case of failures, with minimal impact on users. Despite certain guarantees from cloud providers (SLA), it is imperative to provide resiliency at the level of the applications and infrastructure themselves.
Here are some basic recommendations:
Have a DRP (Disaster Recovery Plan) and routinely run exercises and stress tests against it to identify single points of failure (SPOF). It is essential to take all possible failure domains into account, including checking how the system behaves in production when individual nodes or an entire leg of the system fail, while of course keeping the ability to turn them back on quickly.
Use geo-replication to ensure the high availability of your data. Geo-replication guarantees the storage of copies of data in multiple data centers. But keep in mind that geo-distributed storage has higher latency and is not suitable for solutions that require low latency, such as high-load databases.
Use load balancing at all levels of your application: in front of web servers, application servers, and database servers. All network calls should be directed not to the addresses of individual virtual machines but to the addresses of load balancers, to avoid creating additional points of failure. Most cloud providers offer Load Balancer as a Service (LBaaS). Plan capacity for the failure of some nodes so that the redistributed load does not kill the remaining instances of the service. Also, use a reliable GSLB (Global Server Load Balancing) solution to balance between different providers.
Use backups. In my practice, I have come across services and information systems where a correct Cloud Native architecture was built and backups were abandoned entirely because they had become irrelevant. So maintaining backups is desirable but not strictly necessary, though only if you use the Multi-Cloud Native approach correctly. When in doubt, always make backups!
Conduct security audits, such as penetration tests, to check for vulnerabilities and resilience to different types of attacks.
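The load balancing recommendation above warns that redistributed load can kill the surviving instances. A minimal Python sketch of that capacity check follows; it assumes the balancer spreads the same total load evenly over the remaining nodes, and the function names and the numbers (900 requests/s in total, 400 requests/s per node) are hypothetical.

```python
import math


def load_per_survivor(total_load: float, nodes: int, failed: int) -> float:
    """Load on each remaining node after `failed` of `nodes` drop out,
    assuming the total load is spread evenly over the survivors."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_load / survivors


def nodes_needed(total_load: float, node_capacity: float, tolerated_failures: int) -> int:
    """Smallest node count that still carries the load after the
    tolerated number of node failures."""
    return math.ceil(total_load / node_capacity) + tolerated_failures


TOTAL_LOAD = 900.0     # requests/s, assumed
NODE_CAPACITY = 400.0  # requests/s per node, assumed

# Three nodes handle the load in normal operation (300 each), but losing
# one pushes the other two above capacity, which can cascade.
three_nodes_ok = load_per_survivor(TOTAL_LOAD, nodes=3, failed=1) <= NODE_CAPACITY

# Four nodes leave enough headroom to lose one.
four_nodes_ok = load_per_survivor(TOTAL_LOAD, nodes=4, failed=1) <= NODE_CAPACITY

print(three_nodes_ok, four_nodes_ok, nodes_needed(TOTAL_LOAD, NODE_CAPACITY, 1))
```

The same arithmetic applies one level up when a GSLB shifts traffic between providers: each site has to be sized for the load it inherits when the other one is unavailable.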