Deploy an HTCondor cluster on Kubernetes (sys-admin nomination required)¶
Note
This service is only available for pledged projects that made an explicit request for it.
Prerequisites¶
The user has to be registered in the IAM system for INFN-Cloud https://iam.cloud.infn.it/. Only registered users can log in to the INFN-Cloud dashboard https://my.cloud.infn.it.
- For more details regarding registration please see Getting Started
User responsibilities¶
Important
The solution described in this guide consists in the instantiation of Virtual Machines on the INFN Cloud infrastructure. Instantiating a VM comes with the responsibility of maintaining it and all the services it hosts.
Please read the INFN Cloud AUP in order to understand the responsibilities you have in managing this service.
Selection of the Deployment type¶
Note
If you belong to multiple projects, i.e. multiple IAM groups, after logging into the INFN-Cloud dashboard select, from the lower left corner, the project to be used for the deployment you intend to perform. Not all solutions are available for all projects. The resources used for the deployment will be accounted to the selected project and will impact its available quota. See the figure below.
Once the project is selected, choose the “HTCondor cluster” button from the list of solutions available for your group:
A menu is made available, as in the figure below:
“Description” is a mandatory field.
Parameters are split into three pages: “General”, “IAM Integration” and “Advanced”
Deploy/Use New Solution¶
This deployment instantiates a K8s cluster, which is then used to automatically deploy a working HTCondor cluster (the HTCondor versions currently available are 8.9.9 and 9.0). The HTCondor services run in dedicated pods on the K8s cluster, which is composed of one master and the number of slaves requested at submission time. The HTCondor cluster deployment consists of three main components: the central manager (collector, negotiator and condor connection broker daemons), the scheduler and the worker node. Each one runs on a dedicated K8s pod. For details about the HTCondor architecture, services and usage please read the official HTCondor manual.
- Pods description:
- the Central Manager pod hosts the connection broker, collector and negotiator services, which are in charge of, respectively, allowing network communication, collecting the resources available in the HTCondor cluster and matching job requirements with available job slots. In particular, the CCB service runs on this pod to grant inbound connectivity to the cluster from the Internet, since the HTCondor cluster runs inside a private network;
- the SCHEDD pod hosts the schedd service, in charge of accepting submissions and storing batch jobs. The schedd can also be queried to retrieve the job status;
- the WN pod hosts the startd service, in charge of executing the jobs submitted through the schedd service.
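Once the cluster is deployed (see the sections below), a quick way to verify that these pods are up is to list them with kubectl, using the kubeconfig file provided in the deployment outputs; the file name ./kubeconfig below is an assumption, and the namespace may differ in your cluster:
[~]$ kubectl --kubeconfig ./kubeconfig get pods -o wide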
Basic configuration¶
Default parameters are ready for the submission of a cluster composed of 1 K8s master and 3 K8s slaves. By default, the provider where the cluster will be instantiated is automatically selected by the INFN Cloud orchestrator service. Warning: by default the HTCondor cluster will be instantiated with 1 worker node. If you need more worker nodes, you have to manually scale the number of WN pods from the K8s dashboard.
You need to specify:
Deployment description:
- a human readable name for the deployment (max 50 characters).
- this is a mandatory field.
certificate_type:
- the type of X.509 certificate used by the dashboard web pages. It can be:
- ‘letsencrypt-prod’, meant for production and to be used only for that purpose, as there is a limited number of such certificates. Once this limit is reached, the deployment will fail and it will take several days until these certificates become available again.
- ‘letsencrypt-staging’, suggested for testing purposes, as these certificates are not limited in number
- ‘selfsigned’, a self-signed certificate
admin_token:
- admin password for accessing the Grafana dashboard
storage_size:
- Size (in GB) of the volume to be mounted in /data. Maximum value: 100. It will be used for the HTCondor schedd spool directory.
number_of_slaves:
- default (and minimum)=3. This number refers to the number of K8s slaves, not to the number of HTCondor worker nodes. You can only increase it; otherwise the deployment will fail.
htcondor_image:
- default=’htcondor/execute’, the HTCondor WN image name. You can use your own Docker image built from the default one: https://hub.docker.com/r/htcondor/execute
htcondor_image_tag:
- The HTCondor WN tag name, related to the HTCondor version. Currently the HTCondor versions that can be installed are 8.9.9-el7 and 9.0-el7
master_flavor:
- number of vCPUs and memory size of the K8s master VM (medium: 2 vCPUs, 4 GB RAM or large: 4 vCPUs, 8 GB RAM)
node_flavor:
- number of vCPUs and memory size of each K8s node VM (medium: 2 vCPUs, 4 GB RAM or large: 4 vCPUs, 8 GB RAM)
IAM configuration¶
- cluster_secret
- default=’testme’; this is the token for HTCondor daemon-to-daemon authentication (the authentication method is the shared secret; it will be replaced by IDTOKENS when remote WNs are used).
- iam_server
- default=’iam.cloud.infn.it’; this is the IAM server name used for HTCondor authentication.
Advanced configuration¶
In this section you can enable the following options:
“Configure scheduling”: the automatic (default) or manual scheduling, allowing the user to perform the deployment by:
- taking advantage of the PaaS Orchestrator scheduling capabilities (the recommended way), or
- performing a direct submission towards one of the available providers, to be selected from the drop-down menu
“Set deployment creation timeout (minutes)”: the amount of time to wait before the deployment is considered failed
“Do not delete the deployment in case of failure”: useful in case further debugging is needed
“Send a confirmation email when complete”: define whether to send a confirmation email when the deployment is complete
Deployment results¶
To check the status of the deployment and its details, select the “Deployments” button. There you will find the list of all your deployments, as shown below.
For each deployment, the “Actions” button allows you:
- to get the details of the deployment: overview info, input values and output values, once the deployment is completed (see fig.8)
- to show the TOSCA template of the deployment
- to retrieve the deployment log file that contains error messages in case of failure
- to show the details and the SSH private keys of the VMs created
- to lock it
- to delete the deployment
Clicking on the deployment UUID you can see its details:
- Overview of the cluster - where you can find information on the location of your deployment
- The Input Values - the values you provided to create the VMs
- The Output Values - values you can use to access the K8s dashboard and details about the K8s master and slaves. There is also information about the HTCondor services, such as _condor_SCHEDD_HOST and _condor_COLLECTOR_HOST, which are needed to configure the submission nodes.
You can access the grafana_endpoint linked to the URL provided in the output, using the admin_token you entered in the input as token.
You can access the K8s dashboard either using the bearer token contained in the kubeconfig or the kubeconfig file itself.
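If you prefer the bearer token, one possible way to extract it from the kubeconfig file is a simple grep, assuming the file is saved as kubeconfig and stores the token in a user token: field:
[~]$ grep 'token:' kubeconfig | awk '{print $2}'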
From the K8s dashboard you can see all the running pods. By default the number of HTCondor worker nodes is one. You can increase it from the dashboard by going to Workloads –> Deployments –> wn-pod –> Scale and setting the number of workers you need.
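As an alternative to the dashboard, the same scaling operation can be done with kubectl. This is only a sketch: it assumes the worker-node Deployment is named wn-pod, as in the dashboard path above, lives in the default namespace, and that the kubeconfig from the deployment outputs is saved as ./kubeconfig.
[~]$ kubectl --kubeconfig ./kubeconfig scale deployment wn-pod --replicas=3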
How to submit HTCondor jobs to the deployed cluster¶
Configuration of the submission node¶
Using a standard User Interface¶
The User Interface must have the correct version of HTCondor installed (8.9.9 or 9.0.x)
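You can quickly check the HTCondor version installed on the User Interface with:
[~]$ condor_version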
create an HTCondor source script
[~]$ cat htcondor/.htc.rc
#!/bin/bash
export _condor_AUTH_SSL_CLIENT_CAFILE=$HOME/.ca.crt
export _condor_COLLECTOR_HOST=<hostname of collector here>:<port of the collector>
export _condor_SCHEDD_HOST=<hostname of schedd here>
export _condor_SCHEDD_NAME=<hostname of schedd here>
export _condor_SEC_DEFAULT_ENCRYPTION=REQUIRED
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS
export _condor_SCITOKENS_FILE=$HOME/.token
export _condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
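Once created, source the file in the shell you will use for job submission so that the condor commands pick up these settings (the path assumes the location shown above):
[~]$ source $HOME/htcondor/.htc.rc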
create the IAM token
configure the agent for a WLCG-profile token (choosing the oidc profile name as you prefer, e.g. infncloud-wlcg)
[~]$ eval `oidc-agent`
[~]$ oidc-gen --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
- provide the IAM service value: https://iam.cloud.infn.it/
- insert the following scopes for the client:
- openid profile email offline_access wlcg wlcg.groups (for HTCondor 8.9.9)
- openid compute.create offline_access compute.read compute.cancel compute.modify wlcg wlcg.groups (for HTCondor 9.0)
- The result will be a link with a string code. Open the link in a browser, insert the required code and authorize the client.
- After your approval, the oidc-gen command will automatically move to the next step, allowing you to set an optional password to encrypt the configuration
obtain your token with the registered oidc profile (e.g. infncloud-wlcg)
[~]$ oidc-token infncloud-wlcg
- N.B. If you already have a profile registered and the previous command fails, you have to remove it and re-run the registration procedure:
[~]$ oidc-gen -d infncloud-wlcg
once you have your token, copy it to the token location defined in the configuration above, i.e. the path set in _condor_SCITOKENS_FILE ($HOME/.token if you kept the script as is)
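For example, assuming the token path configured in the script above ($HOME/.token), you can write (or refresh) the token file with:
[~]$ oidc-token infncloud-wlcg > $HOME/.token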
Using docker container¶
install the minimal HTC container and configure the “UI” for job submission to the already deployed HTCondor cluster (use the image dodasts/mini-htc:v0 for HTCondor 8.9.9 or dodasts/mini-htc:9.0-el7 for HTCondor 9.0)
[~]$ docker run -ti dodasts/mini-htc:<tag> bash
modify the condor_config.local file, setting the COLLECTOR and SCHEDD host and name with the IP returned in the dashboard outputs
[~]$ cat /etc/condor/condor_config.local
COLLECTOR_HOST = <IP>.myip.cloud.infn.it:30618
SCHEDD_HOST = <IP>.myip.cloud.infn.it
SCHEDD_NAME = <IP>.myip.cloud.infn.it
create the IAM token
configure the agent for a WLCG-profile token (choosing the oidc profile name as you prefer, e.g. infncloud-wlcg)
[~]$ eval `oidc-agent`
[~]$ oidc-gen --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
- provide the IAM service value: https://iam.cloud.infn.it/
- insert the following scopes for the client:
- openid profile email offline_access wlcg wlcg.groups (for HTCondor 8.9.9)
- openid compute.create offline_access compute.read compute.cancel compute.modify wlcg wlcg.groups (for HTCondor 9.0)
- The result will be a link with a string code. Open the link in a browser, insert the required code and authorize the client.
- After your approval, the oidc-gen command will automatically move to the next step, allowing you to set an optional password to encrypt the configuration
obtain your token with the registered oidc profile (e.g. infncloud-wlcg)
[~]$ oidc-token infncloud-wlcg
N.B. If you already have a profile registered and the previous command fails, you have to re-authenticate with the command
[~]$ oidc-gen --reauthenticate --flow device --dae https://iam.cloud.infn.it/devicecode infncloud-wlcg
or remove the profile and re-run the registration procedure
[~]$ oidc-gen -d infncloud-wlcg
once you have your token, copy it to the same location defined in the condor_config.local file (/tmp/token if you left the default configuration)
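For example, assuming the default /tmp/token location, you can write (or refresh) the token file inside the container with:
[~]$ oidc-token infncloud-wlcg > /tmp/token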
Job submission¶
- Start the job submission by creating an executable "simple" and a condor job description file "sub"
[~]$ cat simple
#!/bin/bash
sleep 100
echo $HOSTNAME
[~]$ cat sub
universe = vanilla
executable = simple
log = simple.log
output = simple.out
error = simple.error
+OWNER = "condor"
queue
- Now you can submit your first job (using -spool because the schedd is on the remote HTCondor cluster)
[~]$ condor_submit -spool sub
- To check the status of the job
[~]$ condor_q -nobatch -allusers
- To check the status of the cluster
[~]$ condor_status -any
- To retrieve your output, log and error file
[~]$ condor_transfer_data -all
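If you need to cancel a job, you can use the standard condor_rm command with the cluster id printed by condor_submit and shown by condor_q:
[~]$ condor_rm <cluster id>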