slurmstepd: error: Unable to get current working directory: No such file or directory

This is one of the hardest errors to diagnose in Slurm cluster maintenance. AI models fail to explain why it appears, reading the Slurm codebase does not help, and raising the logging verbosity does little to reveal its source.

This error occurs when the spool directory used by Slurm on the compute nodes (not the login or management nodes), typically /var/spool/slurmd, is not owned by the user configured as the effective user for Slurm task execution.

The following command shows which effective user Slurm uses to set up task execution for submitted jobs. Do not confuse this user with the owner of the running job processes: those are owned by the UID and GID of the person who queued the job.

$ scontrol show config | grep -i SlurmdUser

Therefore, /var/spool/slurmd must be owned by SlurmdUser and its primary group (usually root). Otherwise, every job submitted to that node will return the error:

slurmstepd: error: Unable to get current working directory: No such file or directory
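
The ownership check described here can be scripted. The following is a minimal sketch, assuming a Linux compute node with GNU stat; check_spool_owner is a hypothetical helper name (not part of Slurm), and it compares numeric UIDs so that the result does not depend on how user names are resolved on the node.

```shell
# Sketch: compare the owner of a spool directory with an expected UID.
# check_spool_owner is a hypothetical helper, not a Slurm command.
check_spool_owner() {
    dir=$1
    expected_uid=$2
    # "stat -c" is GNU coreutils syntax (Linux assumed).
    actual_uid=$(stat -c '%u' "$dir") || return 2
    if [ "$actual_uid" = "$expected_uid" ]; then
        echo "OK: $dir is owned by UID $expected_uid"
    else
        echo "MISMATCH: $dir is owned by UID $actual_uid, expected UID $expected_uid"
        return 1
    fi
}
```

On a compute node one might call it as check_spool_owner /var/spool/slurmd "$(id -u root)", replacing root with whichever user the scontrol command reports as SlurmdUser.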

After fixing the ownership of /var/spool/slurmd, slurmd must be restarted on that particular node.
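
As an illustration of the fix, here is a hedged sketch. fix_spool_owner is a hypothetical wrapper around chown; the root user and group and the use of systemd to restart slurmd are assumptions that may not match every installation.

```shell
# Sketch: set the owner and group of a spool directory.
# fix_spool_owner is a hypothetical helper, not a Slurm command.
fix_spool_owner() {
    dir=$1
    user=$2
    group=$3
    # Only the directory itself is changed; whether a recursive chown
    # is needed depends on the node and is not covered here.
    chown "$user:$group" "$dir"
}

# On the affected compute node, as root (left commented out on purpose):
# fix_spool_owner /var/spool/slurmd root root
# systemctl restart slurmd   # assumes slurmd runs as a systemd service
```

Replace root with the SlurmdUser and primary group that apply to your cluster before running the commented commands.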
