Super slow /usr/bin/env invocation

Tag: linux , cluster-computing Author: moveant Date: 2012-12-01

I am working on a computing cluster and I have a very weird /usr/bin/env behaviour... In two words, it works very slow. On the head node:

$ time /usr/bin/env which
<which output>

real    0m0.025s
user    0m0.001s
sys     0m0.001s

On the computing node:

$ qsub -I                                                                                                                
qsub: waiting for job 176620.scyld.localdomain to start
qsub: job 176620.scyld.localdomain ready

-bash-3.2$ time which
<which output>

real    0m0.003s
user    0m0.000s
sys     0m0.003s

-bash-3.2$ time /usr/bin/env /usr/bin/which

<which output>
real    0m0.003s
user    0m0.000s
sys     0m0.003s


-bash-3.2$ time /usr/bin/env which
<which output>

real    5m0.003s
user    0m0.001s
sys     0m0.001s

ps ax reports this:

12884 pts/3    S+     0:00 /usr/bin/env which

It takes 5 min to print usage banner for which. Any ideas why this could happen?

Edit 1:

Additional info about which:

-bash-3.2$ type -a which
which is /usr/bin/which
-bash-3.2$ file /usr/bin/which
/usr/bin/which: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), stripped
-bash-3.2$ echo $PATH
/bin:/usr/bin:/home/gusev/.rvm/bin:/home/gusev/bin

Edit 2

I've strace'd /usr/bin/env which and it gets stuck at

execve("/bin/which", ["which"], [/* 47 vars */]

Now running a plain

/bin/which

also stucks, but this file does not exist:

-bash-3.2$ ls /bin/which
ls: /bin/which: No such file or directory

/bin is mounted over NFS:

-bash-3.2$ mount | grep bin
10.54.0.1:/bin on /bin type nfs (nolock,nonfatal)
10.54.0.1:/usr/bin on /usr/bin type nfs (nolock,nonfatal)

So this may be a networking issue...

Edit 3:

which which works perfectly fine:

-bash-3.2$ time which which
/usr/bin/which

real    0m0.002s
user    0m0.000s
sys     0m0.002s

The output of strace -e trace=execve /usr/bin/env which is

execve("/usr/bin/env", ["/usr/bin/env", "which"], [/* 47 vars */]) = 0
execve("/bin/which", ["which"], [/* 47 vars */]) = -1 ENOENT (No such file or directory)
execve("/usr/bin/which", ["which"], [/* 47 vars */]) = 0
<which output>

Edit 4:

The hang time is exactly 5 minutes always. Looks like it is some kind of default value timeout.

Other Answer1

It's likely that it's the which command, not the env command, that's causing the problem.

Since you're seeing very different results for

time /usr/bin/env /usr/bin/which

vs.

time /usr/bin/env which

it's likely that you have another which command in your $PATH, perhaps in /usr/local/bin or $HOME/bin. What does type -a which tell you? What does your $PATH look like?

Note that which can be either a shell script or an executable. If it's a shell script, try grabbing a copy of it and adding set -x to see what it's doing.

comments:

I've updated my post... Basically which is just a plain executable. And it is the only one which in $PATH.
@user1179926: What happens when you invoke which directly?
it immediately outputs usage banner.

Other Answer2

This issue, as well as that described in your previous question, appears to be caused by execve taking a long time to return on your compute notes. The fact that the dirs in your path are NFS mounted may be a contributing factor.

Running the commands through strace, we see that env uses repeated calls to execve to probed for the existence of the command in each path:

[[email protected]]$ echo $PATH
/home/me/bin:/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/me/work/bin

[[email protected]]$ strace -e execve /usr/bin/env which
execve("/usr/bin/env", ["/usr/bin/env", "which"], [/* 53 vars */]) = 0
execve("/home/me/bin/which", ["which"], [/* 53 vars */]) = -1 ENOENT (No such file or directory)
execve("/usr/lib/lightdm/lightdm/which", ["which"], [/* 53 vars */]) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/which", ["which"], [/* 53 vars */]) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/which", ["which"], [/* 53 vars */]) = -1 ENOENT (No such file or directory)
execve("/usr/sbin/which", ["which"], [/* 53 vars */]) = -1 ENOENT (No such file or directory)
execve("/usr/bin/which", ["which"], [/* 53 vars */]) = 0

As you have confirmed in the comments above, which which does not suffer the same issue and that's because it uses stat instead of execve to probe the paths:

[[email protected]]$ strace -e execve,stat /usr/bin/which which
execve("/usr/bin/which", ["/usr/bin/which", "which"], [/* 53 vars */]) = 0
stat("/home/me", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/me/bin/which", 0x7fff79ae8760) = -1 ENOENT (No such file or directory)
stat("/usr/lib/lightdm/lightdm/which", 0x7fff79ae8760) = -1 ENOENT (No such file or directory)
stat("/usr/local/sbin/which", 0x7fff79ae8760) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/which", 0x7fff79ae8760) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/which", 0x7fff79ae8760) = -1 ENOENT (No such file or directory)
stat("/usr/bin/which", {st_mode=S_IFREG|0755, st_size=946, ...}) = 0
/usr/bin/which

I'm afraid cannot offer any suggestions to address the underlying problem, but in the mean-time you can work around the issue by:

  1. Using full paths to commands instead of having env resolve them for you.
  2. If you really wish to use env, whenever possible reorder your $PATH to minimise the search. E.g.:

    PATH=/usr/bin:$PATH /usr/bin/env which   # place most likely path first
    

comments:

Thanks a lot for your comment. Unfortunately, the problem with env was just a tip of the iceberg: i have a lot of software running into this problem and I cannot fix all of that.
In that case, it sounds like an issue you need to take up with your sysadmin. Are the other cases suffering from the same symptoms?
The problem is: I am the sysadmin. I noticed the problem when some (not all) software suddenly just hanged in there, spending no CPU cycles for days. After analysis I boiled the problem down to the one described in the post.
Oh dear. Sorry I can't be of more help. Good luck.

Other Answer3

In the end, I found out that I have a very long PATH environment variable. And probable it somehow affected invoking execve of NFS share.

So I've moved a bunch of executables into a signle directory and replaces many entries in PATH with a single one. I do not expriencing any problems since then.