How To Run Tools On Kubernetes For Data Access

Presto and Hive Metastore in Kubernetes for data access

To give analysts access to data, you can use Presto, a distributed SQL query engine for running queries against Big Data. It is well suited to ad hoc analytics tasks.

Advantages of running Presto on Kubernetes

Running Presto on Kubernetes allows you to take advantage of all the autoscaling flexibility that is difficult to implement in a classic deployment. At rest, Presto consumes a minimum of resources and does not require a powerful cluster to run. But when analysts start sending a lot of requests, the load grows. An autoscaling Kubernetes cluster will allocate the required amount of capacity, and this will allow all analysts to work simultaneously without having to compete for resources. When the load subsides, unnecessary resources will automatically return to the cloud.
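
One way to get this behaviour at the pod level is a HorizontalPodAutoscaler on the Presto workers; when the extra worker pods no longer fit on the existing nodes, a cluster autoscaler can add nodes. The sketch below only builds such a manifest as a Python dictionary. The Deployment name `presto-worker`, the replica bounds, and the 70% CPU target are assumptions chosen for illustration, not values taken from any particular Presto chart.

```python
import json


def presto_worker_hpa(deployment="presto-worker", min_replicas=1,
                      max_replicas=10, cpu_target_percent=70):
    """Build an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict.

    All default values are hypothetical and should be tuned per cluster.
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            # The workload whose replica count is adjusted.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            # Scale on average CPU utilization across worker pods.
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_target_percent},
                },
            }],
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be saved and applied with kubectl.
    print(json.dumps(presto_worker_hpa(), indent=2))
```

CPU is only one possible signal; the number of queued queries may be a better trigger for Presto, but that requires custom metrics and is left out of this sketch.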

How To Run Presto On Kubernetes?

Currently, there are a Presto Operator and a Presto Helm Chart from Starburst. Either one lets you quickly deploy Presto to Kubernetes.

Presto And Hive Metastore: Presto can receive data from different sources, such as Hadoop and PostgreSQL, and run queries over the combined data. If you store data in S3, Presto works with it through the Hive Metastore, which lets you represent data in S3 not just as a set of files but as a set of tables. Analysts do not need to know which S3 bucket the data is stored in: that mapping lives in the Hive Metastore. They can use SQL to access the data and work in the usual way.
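
As a rough illustration of how Presto is pointed at the Hive Metastore, the sketch below renders a catalog properties file. The property names follow the Presto Hive connector documentation, but the connector name can differ between distributions and versions, so treat it as an assumption. The metastore address, the S3 endpoint, and the table name in the sample query are all hypothetical.

```python
def hive_catalog_properties(metastore_uri, s3_endpoint=None):
    """Render the contents of a Presto catalog file such as hive.properties.

    metastore_uri: Thrift address of the Hive Metastore service.
    s3_endpoint:   optional endpoint for S3-compatible object storage.
    """
    lines = [
        # Connector name as used by Presto; may differ in other distributions.
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri={metastore_uri}",
    ]
    if s3_endpoint:
        lines.append(f"hive.s3.endpoint={s3_endpoint}")
        # Path-style access is commonly needed for non-AWS S3 services.
        lines.append("hive.s3.path-style-access=true")
    return "\n".join(lines) + "\n"


# Analysts then address data as catalog.schema.table, not as bucket paths.
# The schema and table names here are made up.
SAMPLE_QUERY = "SELECT count(*) FROM hive.sales.orders"

if __name__ == "__main__":
    print(hive_catalog_properties(
        "thrift://hive-metastore.data.svc:9083",   # hypothetical service
        s3_endpoint="https://s3.example.com",      # hypothetical endpoint
    ))
```

The file name determines the catalog name, which is why the sample query starts with `hive.`.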

Superset In Kubernetes For Building Dashboards

To use the collected data for BI analytics, you need to present it as charts, dashboards, and other easy-to-read forms. Superset is suitable for this: it is a business intelligence tool for exploring and visualizing data, an Open Source analog of Tableau. Superset is also flexible and Cloud-Native, and it can use various services as a backend.

Out of the box, it supports integration with Presto, Greenplum, Hadoop, and many other systems. It also ships with many ready-made visualizations, and there are tools for creating your own. If you integrate it with Presto, you can work with data in S3 using Superset as the SQL IDE. An alternative to Superset is Metabase, which can also be run in Kubernetes.
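
Superset connects to databases through SQLAlchemy URIs, so the Presto integration mostly comes down to one connection string. The helper below is a small sketch of how such a URI could be assembled; the host name, user, and schema are placeholders, and authentication options beyond a plain user name are deliberately left out.

```python
from urllib.parse import quote


def presto_sqlalchemy_uri(host, port=8080, catalog="hive",
                          schema=None, user=None):
    """Build a presto:// SQLAlchemy URI of the form
    presto://[user@]host:port/catalog[/schema].
    """
    # Escape the user name so characters like '@' do not break the URI.
    auth = f"{quote(user, safe='')}@" if user else ""
    path = f"/{catalog}" + (f"/{schema}" if schema else "")
    return f"presto://{auth}{host}:{port}{path}"


if __name__ == "__main__":
    # Hypothetical in-cluster service name for the Presto coordinator.
    print(presto_sqlalchemy_uri("presto.data.svc", schema="default",
                                user="analyst"))
```

The resulting string is what you would paste into the database connection form in Superset.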

The benefit of running Superset on Kubernetes: Superset is designed for high availability. It is a Cloud-Native tool that scales well in large distributed environments and can serve several hundred users simultaneously.

Airflow In Kubernetes For Workflow Management

To populate the data warehouse, you need an orchestrator or workflow management platform. It lets you schedule tasks and define the order in which they run, depending on the result of the previous task. The de facto standard in this area today is Airflow, a platform for developing, scheduling, and monitoring data processing workflows.
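
The core idea of an orchestrator, running tasks in dependency order and reacting to the result of upstream tasks, can be shown with the standard library alone. This is a toy sketch, not Airflow's API: the `run_pipeline` function and the task names are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(tasks, deps):
    """Run tasks in dependency order.

    tasks: task name -> callable returning True on success.
    deps:  task name -> set of upstream task names.
    A task runs only if every upstream task succeeded; otherwise it is skipped.
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, set())
        if all(status[u] == "success" for u in upstream):
            status[name] = "success" if tasks[name]() else "failed"
        else:
            status[name] = "skipped"
    return status


if __name__ == "__main__":
    tasks = {
        "extract": lambda: True,
        "transform": lambda: False,   # simulate a failing step
        "load": lambda: True,
    }
    deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
    print(run_pipeline(tasks, deps))
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering logic.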

The benefits of running Airflow on Kubernetes are the same as for the other tools: flexible scaling and sandboxing.

How to run Airflow on Kubernetes

KubernetesPodOperator – in this case, only some Airflow tasks are moved to Kubernetes. A separate pod is created inside Kubernetes for each of them. CeleryExecutor is still used as the Executor; this is the traditional way of using Airflow.
KubernetesExecutor – in this case, a separate Worker pod is created directly inside Kubernetes for each Airflow task, and that Worker can create further pods if necessary. If you use KubernetesPodOperator and KubernetesExecutor at the same time, the Worker pod is created first; it then creates the next pod and runs the Airflow task in it.
KubernetesExecutor is good because it creates Workers only on demand, which saves resources, while still letting you scale out by allocating more resources as the load grows. Keep in mind, however, that pods are not created instantly. If you have hundreds or thousands of tasks that each run for just a few minutes, it is better to use CeleryExecutor and move some of the workloads to Kubernetes using KubernetesPodOperator.
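
The trade-off above can be made concrete with a back-of-the-envelope calculation: the shorter the task, the larger the share of time spent waiting for its pod to start. The sketch below encodes that reasoning as a simple rule of thumb. The 30-second startup time and both thresholds are assumptions picked for illustration; measure your own cluster before relying on any of them.

```python
def pod_overhead_share(pod_startup_s, task_runtime_s):
    """Fraction of a task's wall-clock time spent waiting for its pod."""
    return pod_startup_s / (pod_startup_s + task_runtime_s)


def suggest_executor(task_count, task_runtime_s, pod_startup_s=30.0,
                     many_tasks=100, max_overhead=0.1):
    """Rule of thumb only: many short tasks favour long-lived Celery workers.

    many_tasks and max_overhead are hypothetical thresholds.
    """
    overhead = pod_overhead_share(pod_startup_s, task_runtime_s)
    if task_count >= many_tasks and overhead > max_overhead:
        return "CeleryExecutor"
    return "KubernetesExecutor"


if __name__ == "__main__":
    # 1000 two-minute tasks: pod startup is a large share of each run.
    print(suggest_executor(task_count=1000, task_runtime_s=120))
    # 20 one-hour tasks: startup cost is negligible.
    print(suggest_executor(task_count=20, task_runtime_s=3600))
```

With CeleryExecutor chosen, heavy or unusual tasks can still be sent to Kubernetes individually through KubernetesPodOperator.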

By default, Airflow running on Kubernetes will store logs in temporary storage. To keep the logs always available, you need to connect persistent storage, for example, S3. This applies to all tools that run on Kubernetes.
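
For Airflow, remote logging is usually switched on through configuration, which on Kubernetes is most easily passed as environment variables. The sketch below builds those variables using the option names from the `[logging]` section of Airflow 2.x (older releases kept them under `[core]`). The bucket name, prefix, and connection ID are placeholders.

```python
def airflow_remote_logging_env(bucket, prefix="airflow/logs",
                               conn_id="aws_default"):
    """Return Airflow settings that send task logs to S3, as env vars."""
    return {
        "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
        "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER":
            f"s3://{bucket}/{prefix.strip('/')}",
        # Airflow connection that holds the S3 credentials.
        "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": conn_id,
    }


def as_k8s_env(env):
    """Convert a dict into the list-of-entries form used in a pod spec."""
    return [{"name": key, "value": value} for key, value in env.items()]


if __name__ == "__main__":
    # "my-data-lake" is a hypothetical bucket name.
    for entry in as_k8s_env(airflow_remote_logging_env("my-data-lake")):
        print(entry)
```

The same variables need to be set on every Airflow component that writes or reads task logs, such as workers and the webserver.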

Amundsen In Kubernetes: Data Discovery

Next, let’s talk about the Data Discovery problem. Let’s say your storage has grown, and there are already thousands of tables in it. When a new analyst joins a project, they need to get acquainted with all this data and understand what is stored where. Often this is solved through personal communication: the newcomer asks colleagues for help. This takes a long time, and it distracts specialists from their main work.

The open-source Amundsen platform is designed to solve this problem. It has a UI that lets users find data easily. You can fill Amundsen with metadata manually, or automatically if you integrate the tool with Airflow. You can also collect statistics on tables, and there is search plus the ability to set tags and specify the data owner, the table type, and so on. This significantly increases the productivity and efficiency of data warehouse use and helps democratize access to data.

JupyterHub In Kubernetes: Train Models And Experiment

To train models and conduct experiments in Big Data, JupyterHub is often used; this is also an industry standard.

Benefits Of Running JupyterHub On Kubernetes

Load scaling. Hosting JupyterHub on Kubernetes allows idle resources to be automatically returned to the cloud. For example, an analyst needs 50 cores to work with a Jupyter Notebook. Once the work is finished, the resources are no longer required, and you can configure the interval after which they return to the cloud. The Jupyter Notebook is then stopped automatically, but all results are saved. When the data scientist returns, they simply restart it and continue working.
Isolated environments. In a traditional deployment, a single version of JupyterHub and its libraries is installed on the server. If an experiment requires different versions, the entire cluster has to be updated. Containers allow each specialist to create their own environment based on an individual Docker image with the programs and libraries they need. Updating or installing libraries in one environment does not affect the work of other data scientists in any way.
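
Both benefits map onto settings of the Zero to JupyterHub Helm chart: `cull` controls when idle notebook servers are stopped, and `singleuser.profileList` lets each user pick a Docker image. The sketch below generates such a values file, assuming that chart is used. The timeout values, profile names, and image references are hypothetical.

```python
import json


def jupyterhub_values(idle_timeout_s=3600, check_every_s=600, profiles=None):
    """Build Helm values that enable idle culling and per-user images.

    profiles: display name -> Docker image; the defaults are made-up examples.
    """
    if profiles is None:
        profiles = {
            "Python (base)": "registry.example.com/ds/base-notebook:1.0",
            "Python (GPU, TensorFlow)": "registry.example.com/ds/tf-notebook:2.0",
        }
    return {
        # Stop single-user servers that have been idle for idle_timeout_s.
        "cull": {
            "enabled": True,
            "timeout": idle_timeout_s,
            "every": check_every_s,
        },
        # Each profile starts the user's server from a different image.
        "singleuser": {
            "profileList": [
                {"display_name": name,
                 "kubespawner_override": {"image": image}}
                for name, image in profiles.items()
            ],
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be saved as a values file.
    print(json.dumps(jupyterhub_values(), indent=2))
```

Once a culled server's pods are gone, a cluster autoscaler can remove the nodes they were running on, which is how the resources end up back in the cloud.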

Kubeflow: MLOps In Kubernetes

It is important to deploy machine learning models to production quickly; otherwise, the data will become outdated, and there will be problems with the reproducibility of experiments. But sometimes the process is structured in such a way that handing models over from a Data Scientist to a Data Engineer takes a long time.

MLOps helps to cope with this problem. It is an approach that standardizes the development of machine learning models and reduces the time it takes to roll them out to production. With its help, new models reach production quickly and begin to benefit the business. But to apply this approach, you need special tools.

One such tool is Kubeflow, a machine learning and Data Science platform. Kubeflow includes JupyterHub, so you don’t have to deploy it separately. It also helps solve the problems of tracking experiments, models and artefacts. Plus, Kubeflow allows you to bring models into production in a few minutes and make them available as a service.

Note: we’ll talk more about MLOps and Kubernetes in a separate article. In it, we create a Kubernetes cluster, deploy Kubeflow in it, train and publish the model.

We also hosted a webinar on MLflow. There is a video and a repository with instructions.

Advantages of running Kubeflow on Kubernetes: Kubeflow was created specifically for Kubernetes, so it cannot really be run outside of it. Instead, it is worth mentioning the advantages of Kubeflow over its non-Kubernetes counterparts: fast publishing of models, orchestration of complex pipelines, and a convenient UI for managing experiments and monitoring models.

How to run Kubeflow in Kubernetes: there are detailed instructions on the official website. Alternatively, you can make your life easier by deploying Kubeflow to cloud Kubernetes using this tutorial.

But it is worth considering that Kubeflow is still under active development, so it is somewhat raw. There is an alternative, MLflow, which is a more stable platform but works with Kubernetes only in experimental mode. If we compare the two, Kubeflow scales better and is more functional and promising. MLflow is easier to use and more mature as a product, so it is suitable for industrial use. However, it does not have the same breadth of functionality as Kubeflow (for example, MLflow does not have a built-in JupyterHub).