Deploy an HTCondor cluster on Kubernetes (sys-admin nomination required)

Note

This service is only available for pledged projects that made an explicit request for it.

Prerequisites

The user has to be registered in the IAM system for INFN-Cloud https://iam.cloud.infn.it/. Only registered users can log in to the INFN-Cloud dashboard https://my.cloud.infn.it.

User responsibilities

Important

The solution described in this guide consists in the instantiation of Virtual Machines on the INFN Cloud infrastructure. The instantiation of a VM comes with the responsibility of maintaining it and all the services it hosts.

Please read the INFN Cloud AUP in order to understand the responsibilities you have in managing this service.

Selection of the Deployment type

Note

If you belong to multiple projects, i.e. multiple IAM groups, after logging into the INFN-Cloud dashboard select, from the lower left corner, the project to be used for the deployment you intend to perform. Not all solutions are available for all projects. The resources used for the deployment will be accounted to the respective project and will impact its available quota. See the figure below.

../../../_images/project_selection1.png

Once the project is selected, choose the “HTCondor cluster” button from the list of solutions available for your group:

../../../_images/htcondor_menu.png

Figure 1: Use-cases panel in the Dashboard

A menu is made available, as in the figure below:

../../../_images/htcondor_config.png

Figure 2: HTCondor deployment configuration

“Description” is a mandatory field.

Parameters are split into three tabs: “General”, “IAM Integration” and “Advanced”.

Deploy/Use New Solution

This deployment instantiates a K8s cluster which is then exploited to automatically deploy a working HTCondor cluster (the HTCondor versions currently available are 8.9.9 and 9.0). The HTCondor services are deployed using dedicated pods in K8s. The pods are instantiated on the K8s cluster, which is composed of one master and the number of slaves requested at submission time. The HTCondor cluster deployment is composed of three main components: the central manager (collector, negotiator and condor connection broker daemons), the scheduler and the worker node. Each one runs on a dedicated K8s pod (a quick way to inspect these pods is sketched after the list below). For details about the HTCondor architecture, services and usage please read the official HTCondor manual.

Pods description:
  • the Central Manager pod hosts the connection broker, collector and negotiator services, which are in charge of, respectively, allowing network communication, collecting the resources available in the HTCondor cluster, and matching job requirements with the available job slots. In particular, the CCB service runs on this pod in order to grant inbound connectivity to the cluster from the Internet, since the HTCondor cluster runs inside a private network;
  • the SCHEDD pod hosts the schedd service, in charge of accepting submissions and storing batch jobs. The schedd can also be queried to retrieve the job status;
  • the WN pod hosts the startd service, in charge of executing the jobs submitted through the schedd service.
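
Once the deployment is complete, a quick sanity check is to list these pods with kubectl. This is only a sketch: the kubeconfig file comes from the deployment output values, and its local path, as well as the exact pod and deployment names, depend on your deployment.

    [~]$ export KUBECONFIG=$HOME/kubeconfig    # kubeconfig retrieved from the deployment outputs (path is an example)
    [~]$ kubectl get pods -o wide              # the central manager, schedd and WN pods should be in Running state
    [~]$ kubectl get deployments               # the WN deployment (e.g. "wn-pod") is the one to scale for more workers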

Basic configuration

../../../_images/htcondor_config.png

Figure 3: HTCondor general configuration parameters

Default parameters are ready for the submission of a cluster composed of 1 K8s master and 3 K8s slaves. By default, the provider where the cluster will be instantiated is automatically selected by the INFN Cloud orchestrator service. Warning: by default the HTCondor cluster will be instantiated with 1 worker node. If you need more worker nodes, you have to scale the number of WN pods manually from the K8s dashboard.

You need to specify:

  • Deployment description:

    • a human readable name for the deployment (max 50 characters).
    • this is a mandatory field.
  • certificate_type:

    • the type of X.509 certificate used for the dashboard web pages. It can be:
    • ‘letsencrypt-prod’ meant for production purposes and must be used only for that reason, as there is a limited number of them. Once this limit is reached, the deployment will fail and it will take several days until these certificates become available again.
    • ‘letsencrypt-staging’ suggested for testing purposes, as these certificates are not limited in number.
    • ‘selfsigned’ a self-signed certificate.
  • admin_token:

    • admin password for accessing Grafana dashboard
  • storage_size:

    • Size (in GB) of the volume to be mounted in /data. Maximum value: 100. It will be used for the HTCondor schedd spool directory.
  • number_of_slaves:

    • default (and minimum)=3. This number is related to the number of K8s slaves, not HTCondor worker nodes. You can only increase it, otherwise the deployment will fail.
  • htcondor_image:

    • The HTCondor WN image name.
  • htcondor_image_tag:

    • The HTCondor WN tag name, related to the HTCondor version. Currently the HTCondor versions that can be installed are 8.9.9-el7 and 9.0-el7.
  • master_flavor:

    • number of vCPUs and memory size of the K8s master VM (medium: 2 vCPUs, 4 GB RAM or large: 4 vCPUs, 8 GB RAM)
  • node_flavor:

    • number of vCPUs and memory size of each K8s node VM (medium: 2 vCPUs, 4 GB RAM or large: 4 vCPUs, 8 GB RAM)

IAM configuration

Figure 4: IAM configuration tab.

  • cluster_secret
    • default=’testme’; this is the token for HTCondor daemon-to-daemon authentication (the authentication method is a shared secret, to be replaced with IDTOKENS when remote WNs will be used).
  • iam_server
    • default=’iam.cloud.infn.it’; this is the IAM server name for HTCondor authN.

Advanced configuration

In this section you can enable the following options:

  • “Configure scheduling”: choose automatic (default) or manual scheduling, allowing the user to perform the deployment by:

    • taking advantage of the PaaS Orchestrator scheduling capabilities (the recommended way), or
    • performing a direct submission towards one of the providers available, to be selected from the drop-down menu
  • “Set deployment creation timeout (minutes)”: the amount of time to wait before the deployment is considered failed

  • “Do not delete the deployment in case of failure”: useful in case further debugging is needed

  • “Send a confirmation email when complete”: define whether to send a confirmation email when the deployment is complete

Figure 5: Advanced configuration tab.

Deployment results

To check the status of the deployment and its details, select the “Deployments” button. There you will find the list of all your deployments, as shown below.

Figure 6: list of user deployments.

For each deployment, the “Actions” button allows you:

Figure 7: Available actions for each deployment.

  • to get the details of the deployment: overview info, input values and output values, once the deployment is completed (see fig.8)
  • to show the TOSCA template of the deployment
  • to retrieve the deployment log file that contains error messages in case of failure
  • to show the details and the SSH private keys of the VMs created
  • to lock it
  • to delete the deployment

Clicking on the deployment UUID you can see its details:

  • Overview of the cluster - where you can find information on the location of your deployment
  • The Input Values - the values you provided to create the deployment
  • The Output Values - the values you can use to access the K8s dashboard and details about the K8s master and slaves. There is also information about the HTCondor services, such as _condor_SCHEDD_HOST and _condor_COLLECTOR_HOST, values necessary for the configuration of submission nodes.

Figure 8: deployment output values.

You can access Grafana through the grafana_endpoint URL provided in the output values, using as token the admin_token you provided in the input.
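
A quick reachability check can be done with curl; this is only a sketch, where the URL placeholder has to be replaced with the grafana_endpoint from the output values (the -k option is needed with self-signed or staging certificates):

    [~]$ curl -k -I https://<grafana_endpoint>    # a 2xx/3xx response means the Grafana endpoint is reachable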

You can access the K8s dashboard using either the bearer token contained in the kubeconfig file or the kubeconfig file itself.
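
If the kubeconfig provided in the output values uses token-based authentication, a possible way to extract the bearer token for the dashboard login is the following sketch (the local path of the kubeconfig file is an assumption):

    [~]$ export KUBECONFIG=$HOME/kubeconfig
    [~]$ kubectl config view --raw -o jsonpath='{.users[0].user.token}'    # prints the bearer token, if present in the kubeconfig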

From the K8s dashboard you can see all the running pods. By default the number of HTCondor worker nodes is one. You can increase it from the dashboard by going to Workloads –> Deployments –> wn-pod –> scale and setting the number of workers you need, or by using kubectl as sketched below.

Figure 9: How to scale HTCondor WNs from K8s dashboard.
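
If you prefer the command line, the same scaling can be done with kubectl; the sketch below assumes the WN deployment is named wn-pod (as shown in the dashboard) and that it lives in the default namespace:

    [~]$ export KUBECONFIG=$HOME/kubeconfig
    [~]$ kubectl scale deployment wn-pod --replicas=3    # request 3 HTCondor worker node pods
    [~]$ kubectl get pods                                # check that the new WN pods reach the Running state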

How to submit HTCondor jobs to the deployed cluster

Configuration of the submission node

Using a standard User Interface

The User Interface must have the correct version of HTCondor installed (8.9.9 or 9.0.x).

  • create an HTCondor source script

    [~]$ cat htcondor/.htc.rc
    
    #!/bin/bash
    export _condor_AUTH_SSL_CLIENT_CAFILE=$HOME/.ca.crt
    export _condor_COLLECTOR_HOST=<hostname of collector here>:<port of the collector>
    export _condor_SCHEDD_HOST=<hostname of schedd here>
    export _condor_SCHEDD_NAME=<hostname of schedd here>
    export _condor_SEC_DEFAULT_ENCRYPTION=REQUIRED
    export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS
    export _condor_SCITOKENS_FILE=$HOME/.token
    export _condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
    
  • create the IAM token

    • configure the agent for a WLCG-Profile token (choosing the name of the oidc profile as you prefer, e.g. infncloud-wlcg)

      [~]$ eval `oidc-agent`
      [~]$ oidc-gen --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
      
      • provide the IAM service value: https://iam.cloud.infn.it/
      • insert the following scopes for the client:
        • openid profile email offline_access wlcg wlcg.groups (for HTCondor 8.9.9)
        • openid compute.create offline_access compute.read compute.cancel compute.modify wlcg wlcg.groups (for HTCondor 9.0)
      • The result will be a link and a string code: open the link in a browser, insert the required code and authorize the client.
      • After your approval, the oidc-gen command will automatically move to the next step, allowing you to set an optional password for configuration encryption.
    • obtain your token with the registered oidc profile (i.e. infncloud-wlcg)

      [~]$ oidc-token infncloud-wlcg
      
      • N.B. If you already have a profile registered and the previous command fails, you have to remove it and re-run the registration procedure:
      [~]$ oidc-gen -d infncloud-wlcg
      
    • once you have your token, copy it to the location defined by the _condor_SCITOKENS_FILE variable in your source script ($HOME/.token in the example above); a complete session sketch is shown below
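
Putting the steps together, a typical session on the User Interface looks like the following sketch; the hostnames and ports come from the deployment output values, and the file locations match the example .htc.rc above (the CA certificate in $HOME/.ca.crt must also be in place):

    [~]$ source ~/htcondor/.htc.rc                    # load the HTCondor client environment defined above
    [~]$ oidc-token infncloud-wlcg > $HOME/.token     # store the token where _condor_SCITOKENS_FILE points
    [~]$ condor_status -schedd                        # the schedd of the remote cluster should be listed
    [~]$ condor_q -allusers                           # query the remote schedd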

Using docker container

  • run the minimal HTC container and configure the “UI” for job submission to the HTCondor cluster already deployed (use the image dodasts/mini-htc:v0 for HTCondor 8.9.9 or dodasts/mini-htc:9.0-el7 for HTCondor 9.0)

    [~]$ docker run -ti dodasts/mini-htc:<tag> bash
    
  • modify the condor_config.local file, setting the COLLECTOR and SCHEDD host and name with the IP returned in the output values on the dashboard

    [~]$ cat /etc/condor/condor_config.local
    
    COLLECTOR_HOST = <IP>.myip.cloud.infn.it:30618
    SCHEDD_HOST = <IP>.myip.cloud.infn.it
    SCHEDD_NAME = <IP>.myip.cloud.infn.it
    
  • create the IAM token

    • configure the agent for a WLCG-Profile token (choosing the name of the oidc profile as you prefer, e.g. infncloud-wlcg)

      [~]$ eval `oidc-agent`
      [~]$ oidc-gen --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
      
      • provide the IAM service value: https://iam.cloud.infn.it/
      • insert the following scopes for the client:
        • openid profile email offline_access wlcg wlcg.groups (for HTCondor 8.9.9)
        • openid compute.create offline_access compute.read compute.cancel compute.modify wlcg wlcg.groups (for HTCondor 9.0)
      • The result will be a link and a string code: open the link in a browser, insert the required code and authorize the client.
      • After your approval, the oidc-gen command will automatically move to the next step, allowing you to set an optional password for configuration encryption.
    • obtain your token with the registered oidc profile (i.e. infncloud-wlcg)

      [~]$ oidc-token infncloud-wlcg
      
      • N.B. If you already have a profile registered and the previous command fails, you have to re-authenticate with the command

        [~]$ oidc-gen --reauthenticate --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
        

        or remove the profile and re-run the registration procedure

        [~]$ oidc-gen -d infncloud-wlcg
        
    • once you have your token, copy it to the location defined in the condor_config.local file (/tmp/token if you left the default configuration); a complete session sketch is shown below
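
For reference, a typical session with the containerized “UI” looks like the following sketch; the commands are the same ones described above, and the image tag, hostnames and /tmp/token location have to match your deployment:

    [~]$ docker run -ti dodasts/mini-htc:9.0-el7 bash
    # inside the container:
    [~]$ vi /etc/condor/condor_config.local              # set COLLECTOR_HOST, SCHEDD_HOST, SCHEDD_NAME from the dashboard outputs
    [~]$ eval `oidc-agent`
    [~]$ oidc-gen --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
    [~]$ oidc-token infncloud-wlcg > /tmp/token          # store the token where the default configuration expects it
    [~]$ condor_q -nobatch -allusers                     # verify the connection to the remote schedd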

Job submission

  • Start with the job submission by creating an executable “simple” and a condor job description “sub”

    [~]$ cat simple

    #!/bin/bash

    sleep 100
    echo $HOSTNAME

    [~]$ cat sub

    universe   = vanilla
    executable = simple
    log        = simple.log
    output     = simple.out
    error      = simple.error
    +OWNER = "condor"
    queue

  • Now you can submit your first job (using -spool because the schedd is on the remote HTCondor cluster)

    [~]$ condor_submit -spool sub

  • To check the status of the job

    [~]$ condor_q -nobatch -allusers

  • To check the status of the cluster

    [~]$ condor_status -any

  • To retrieve your output, log and error files

    [~]$ condor_transfer_data -all