Deploy a Spark cluster + Jupyter notebook (sys-admin nomination required)

Prerequisites

The user must be registered in the IAM system of INFN Cloud: https://iam.cloud.infn.it/login. Only registered users can log in to the INFN Cloud dashboard: https://my.cloud.infn.it/login.

User responsibilities

The solution described in this guide consists of the deployment of a Spark cluster on top of a virtual machine instantiated on the INFN Cloud infrastructure. Instantiating a VM comes with the responsibility of maintaining it and all the services it hosts.

Please read the INFN Cloud AUP in order to understand the responsibilities you have in managing this service.

Spark cluster configuration

Note

If you belong to multiple projects, i.e. multiple IAM groups, after logging into the dashboard select, from the upper right corner, the project to be used for the deployment you intend to perform. Not all solutions are available for all projects. The resources used for the deployment will be accounted to the selected project and will impact its available quota. See the figure below.

/users_guides/img/project_selection.png

After selecting the project, choose the "Spark + Jupyter cluster" button from the list of available solutions.

/users_guides/img/spark/howto4_fig1.png

The configuration menu is shown. Parameters are split into two pages: "Basic" and "Advanced" configuration.

Basic configuration

The default parameters are ready for the submission of a cluster composed of 1 master and 1 slave, both with 4 CPUs and 8 GB of RAM. By default, the provider where the cluster will be instantiated is automatically selected by the INFN Cloud Orchestrator Service.

The user must specify (see Figure 1):

  • a human-readable name for your deployment (max 50 characters)
  • certificate_type:
      - letsencrypt-prod: meant for production purposes and must be used only for that reason, as there is a limited number of these certificates. Once this limit is reached, the deployment will fail and it will take several days until new certificates become available again.
      - letsencrypt-staging: suggested for testing purposes, as these certificates are not limited in number.
      - selfsigned: a self-signed certificate.
  • a password that will be required to access the Kubernetes dashboard and the Grafana monitoring as admin user
  • the number of slaves
  • the number of vCPUs for each K8s slave node VM
  • the memory size for each K8s slave node VM
  • the disk size for each K8s slave node VM
  • optionally, an S3 storage endpoint (http://endpoint:9000) and a list of its buckets to be mounted as persistent storage on the Jupyter notebook
  • the number of vCPUs and memory size of the K8s master VM
/users_guides/img/spark/howto4_fig2.png
Figure 1: basic input data configuration.

Advanced configuration

The user can select (see Figure 2):

  • the timeout for the deployment
  • "no cluster deletion" in case of failure
  • whether to send a confirmation email when the deployment completes
/users_guides/img/spark/howto4_fig3.png
Figure 2: advanced configuration tab.

Deployment result

To check the status of a deployment and its details, select the "deployments" button. Here all the user's deployments are listed with their "deployment uuid", "status", "creation time" and "provider" (see Figure 3).

/users_guides/img/spark/howto4_fig4.png
Figure 3: list of user deployments.

For each deployment, the "Details" button allows:

  • to get the details of the deployment: overview info, input values and output values, such as the Kubernetes dashboard and Jupyter notebook endpoints (see Figure 4a)
  • to edit the description of the deployment
  • to retrieve the deployment log file, which contains error messages in case of failure
  • to show the TOSCA template of the cluster
  • to request new ports to be opened
  • to retrieve the VM details (see Figure 4b for an example)
  • to delete the cluster
  • to lock the deployment (this hides the Delete action)
/users_guides/img/spark/howto4_fig5.png
Figure 4a: deployment output values.
/users_guides/img/spark/howto4_fig5b.png
Figure 4b: VM details screen.

If the creation of a deployment fails, an additional option ("retry") appears in the dropdown menu, allowing the user to resubmit the deployment with the same parameters:

/users_guides/img/create_failed.png
Figure 4: Deployment creation failed

If the deletion of a deployment fails, resulting in the status being set to DELETE_FAILED, the "delete (force)" button is displayed in the list of available actions, allowing the user to force the deletion of the deployment:

/users_guides/img/delete_failed.png
Figure 5: Deployment deletion failed

Use Spark from Jupyter

By clicking on the jupyter_endpoint link, you'll be asked to authenticate with IAM and to choose the size of your personal Jupyter server (see Figure 6).

/users_guides/img/spark/howto4_fig6.png
Figure 6: Jupyter server options.

This will start a Jupyter notebook with your S3 bucket(s) mounted on the file-system, as shown in Figure 7.

/users_guides/img/spark/howto4_fig7.png
Figure 7: Jupyter instance dashboard.
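
Since the buckets are mounted as regular directories, their content can be accessed with ordinary file operations. A minimal sketch, assuming a bucket named "my-bucket" mounted in the notebook home directory (the mount path and file names below are hypothetical; check the Jupyter file browser for the real ones):

  import os

  # Hypothetical mount point of the S3 bucket: adjust to what the file browser shows
  bucket_path = os.path.expanduser("~/my-bucket")

  # List the files available in the mounted bucket
  print(os.listdir(bucket_path))

  # Read a file exactly as if it were local
  with open(os.path.join(bucket_path, "data.csv")) as f:
      print(f.readline())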

You can then upload your preferred notebook (or take one previously uploaded to your S3 bucket) and open it in Jupyter. Click on the star button (shown in Figure 8) to connect to the underlying cluster by creating the Spark Context and Session.

/users_guides/img/spark/howto4_fig8.png
Figure 8: Jupyter notebook example.
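
The star button creates the Spark Context (sc) and Session (spark) for you. For reference only, a minimal sketch of an equivalent manual setup in PySpark (the master URL below is a placeholder, not the real address of your cluster):

  from pyspark import SparkConf
  from pyspark.sql import SparkSession

  # Placeholder master URL: replace with the address of your deployed cluster
  conf = SparkConf().setAppName("my-notebook").setMaster("spark://spark-master:7077")

  # Build the session; the SparkContext is then available as an attribute of it
  spark = SparkSession.builder.config(conf=conf).getOrCreate()
  sc = spark.sparkContext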

In the Spark cluster connection box you can specify the Spark configuration, as shown in Figure 9.

/users_guides/img/spark/howto4_fig9.png
Figure 9: Spark cluster connection configuration box.
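
The settings accepted here are standard Spark configuration properties. For example, you might size the executors to fit the slave nodes you requested at deployment time (the values below are purely illustrative, not defaults of this deployment):

  spark.executor.instances  2
  spark.executor.memory     2g
  spark.executor.cores      2
  spark.driver.memory       1g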

After clicking the Connect button and waiting a few seconds, you'll see the connection details as shown in Figure 10.

/users_guides/img/spark/howto4_fig10.png
Figure 10: Spark connection details.

Go back to the notebook and use the sc and spark variables to execute Spark operations.
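
A minimal sketch of a first sanity check, using only the sc and spark variables created by the connection step:

  # Sum the integers 0..99 on the cluster via the SparkContext
  print(sc.parallelize(range(100)).sum())   # expected output: 4950

  # Create a small DataFrame via the SparkSession and inspect it
  df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
  df.show()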

Troubleshooting

In both the automatic and manual scheduling cases, the success of the creation depends on the availability of resources at the chosen provider; if resources are insufficient, "no quota" is reported as the failure reason.

Known issue: the Jupyter notebook takes time to start and can occasionally fail due to a timeout. In this case, go back to the control panel and restart the notebook.

Contact for support: cloud-support@infn.it

Resource Availability Less Than Requested For a Spark Server

A user may request resources for a Spark server that are not available in the Kubernetes cluster. In this case, a warning message is shown indicating that there is insufficient CPU and/or memory. While the request is pending, it is not possible to cancel the deployment from the JupyterHub UI.

/users_guides/img/spark/howto4_fig11.png
Figure 11: resource warning while deploying Spark Server.

Jupyter returns a "Spawn failed" error after 600 seconds; after that, the user can redeploy the server.

/users_guides/img/spark/howto4_fig12.png
Figure 12: Spark Spawn Failure after 600 seconds.