Deploy a Spark cluster + Jupyter notebook (sys-admin nomination required)

Prerequisites

The user must be registered in the IAM system of INFN Cloud: https://iam.cloud.infn.it/login. Only registered users can log in to the INFN Cloud dashboard: https://my.cloud.infn.it/login.

User responsibilities

The solution described in this guide consists of the deployment of a Spark cluster on top of a virtual machine instantiated on the INFN Cloud infrastructure. Instantiating a VM comes with the responsibility of maintaining it and all the services it hosts.

Please read the INFN Cloud AUP in order to understand the responsibilities you have in managing this service.

Spark cluster configuration

Note

If you belong to multiple projects, i.e. multiple IAM groups, after logging into the dashboard select, from the upper right corner, the project to be used for the deployment you intend to perform. Not all solutions are available for all projects. The resources used for the deployment will be accounted to the selected project and will impact its available quota. See the figure below.

/users_guides/img/project_selection.png

After selecting the project, choose the "Spark + Jupyter cluster" button from the list of available solutions.

/users_guides/img/spark/howto4_fig1.png

The configuration menu is shown. Parameters are split into two pages: "Basic" and "Advanced" configuration.

Basic configuration

The default parameters are ready for the submission of a cluster composed of 1 master and 1 slave, both with 4 CPUs and 8 GB of RAM. By default, the provider where the cluster will be instantiated is automatically selected by the INFN Cloud Orchestrator Service.

The user must specify (see Figure 1):

  • a human-readable name for your deployment (max 50 characters)
  • certificate_type:
      - letsencrypt-prod: meant for production purposes and must be used only for that reason, as there is a limited number of these certificates. Once this limit is reached, the deployment will fail and it will take several days until new certificates become available again.
      - letsencrypt-staging: suggested for testing purposes, as these certificates are not limited in number.
      - selfsigned: a self-signed certificate.
  • a password that will be required to access the Kubernetes dashboard and the Grafana monitoring as admin user
  • the number of slaves
  • the number of vCPUs for each K8s slave node VM
  • the memory size for each K8s slave node VM
  • the disk size for each K8s slave node VM
  • optionally, an S3 storage endpoint (http://endpoint:9000) and a list of its buckets to be mounted as persistent storage on the Jupyter notebook
  • the number of vCPUs and memory size of the K8s master VM
/users_guides/img/spark/howto4_fig2.png
Figure 1: basic input data configuration.

Advanced configuration

The user can select (see Figure 2):

  • the timeout for the deployment
  • "no cluster deletion" in case of failure
  • whether to send a confirmation email when the deployment completes
/users_guides/img/spark/howto4_fig3.png
Figure 2: advanced configuration tab.

Deployment result

To check the status of a deployment and its details, select the "deployments" button. Here all the user's deployments are listed with their "deployment uuid", "status", "creation time" and "provider" (see Figure 3).

/users_guides/img/spark/howto4_fig4.png
Figure 3: list of user deployments.

For each deployment, the "Details" button allows:

  • to get the details of the deployment: overview info, input values and output values, such as the Kubernetes dashboard and Jupyter notebook endpoints (see Figure 4a)
  • to edit the description of the deployment
  • to retrieve the deployment log file, which contains error messages in case of failure
  • to show the TOSCA template of the cluster
  • to request new ports to be opened
  • to retrieve the VM details (see Figure 4b for an example)
  • to delete the cluster
  • to lock the deployment (this hides the Delete action)
/users_guides/img/spark/howto4_fig5.png
Figure 4a: deployment output values.
/users_guides/img/spark/howto4_fig5b.png
Figure 4b: VM details screen.

If the creation of a deployment fails, an additional option ("retry") appears in the dropdown menu, allowing the user to resubmit the deployment with the same parameters:

/users_guides/img/create_failed.png
Figure 4: Deployment creation failed

If the deletion of a deployment fails, resulting in the status being set to DELETE_FAILED, the "delete (force)" button is displayed in the list of available actions, allowing the user to force the deletion of the deployment:

/users_guides/img/delete_failed.png
Figure 5: Deployment deletion failed

Use Spark from Jupyter

By clicking on the jupyter_endpoint link, you'll be asked to authenticate with IAM and to choose the size of your personal Jupyter server (see Figure 6).

/users_guides/img/spark/howto4_fig6.png
Figure 6: Jupyter server options.

This will start a Jupyter notebook with your S3 bucket(s) mounted on the file-system, as shown in Figure 7.

/users_guides/img/spark/howto4_fig7.png
Figure 7: Jupyter instance dashboard.
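
Since the buckets are mounted as regular directories, their content can be accessed with ordinary file operations. A minimal sketch, assuming a bucket named "my-bucket" mounted in the notebook home directory (the mount path and file names below are hypothetical; check the Jupyter file browser for the real ones):

  import os

  # Hypothetical mount point of the S3 bucket: adjust to what the file browser shows
  bucket_path = os.path.expanduser("~/my-bucket")

  # List the files available in the mounted bucket
  print(os.listdir(bucket_path))

  # Read a file exactly as if it were local
  with open(os.path.join(bucket_path, "data.csv")) as f:
      print(f.readline())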

You can then upload your preferred notebook (or take one previously uploaded to your S3 bucket) and open it in Jupyter. Click on the star button (shown in Figure 8) to connect to the underlying cluster by creating the Spark Context and Session.

/users_guides/img/spark/howto4_fig8.png
Figure 8: Jupyter notebook example.
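
The star button creates the Spark Context (sc) and Session (spark) for you. For reference only, a minimal sketch of an equivalent manual setup in PySpark (the master URL below is a placeholder, not the real address of your cluster):

  from pyspark import SparkConf
  from pyspark.sql import SparkSession

  # Placeholder master URL: replace with the address of your deployed cluster
  conf = SparkConf().setAppName("my-notebook").setMaster("spark://spark-master:7077")

  # Build the session; the SparkContext is then available as an attribute of it
  spark = SparkSession.builder.config(conf=conf).getOrCreate()
  sc = spark.sparkContext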

In the Spark cluster connection box you can specify the Spark configuration, as shown in Figure 9.

/users_guides/img/spark/howto4_fig9.png
Figure 9: Spark cluster connection configuration box.
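
The settings accepted here are standard Spark configuration properties. For example, you might size the executors to fit the slave nodes you requested at deployment time (the values below are purely illustrative, not defaults of this deployment):

  spark.executor.instances  2
  spark.executor.memory     2g
  spark.executor.cores      2
  spark.driver.memory       1g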

After clicking the Connect button and waiting a few seconds, you'll see the connection details as shown in Figure 10.

/users_guides/img/spark/howto4_fig10.png
Figure 10: Spark connection details.

Go back to the notebook and use the sc and spark variables to execute Spark operations.
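
A minimal sketch of a first sanity check, using only the sc and spark variables created by the connection step:

  # Sum the integers 0..99 on the cluster via the SparkContext
  print(sc.parallelize(range(100)).sum())   # expected output: 4950

  # Create a small DataFrame via the SparkSession and inspect it
  df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
  df.show()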

Troubleshooting

In both the automatic and manual scheduling cases, the success of the creation depends on the availability of resources at the chosen provider; if resources are insufficient, "no quota" is reported as the failure reason.

Known issue: the Jupyter notebook takes time to start and can occasionally fail due to a timeout. In this case, go back to the control panel and restart the notebook.

Contact for support: cloud-support@infn.it

Resource Availability Less Than Requested For a Spark Server

A user may request resources for a Spark server that are not available in the Kubernetes cluster. In this case, a warning message is shown indicating that there is insufficient CPU and/or memory. While the request is pending, it is not possible to cancel the deployment from the JupyterHub UI.

/users_guides/img/spark/howto4_fig11.png
Figure 11: resource warning while deploying Spark Server.

Jupyter returns a "Spawn failed" error after 600 seconds; after that, the user can redeploy the server.

/users_guides/img/spark/howto4_fig12.png
Figure 12: Spark Spawn Failure after 600 seconds.