[Octopus-users] debugging help: node file
Emily
listemily at eml.cc
Tue Feb 16 20:41:28 WET 2010
Hi Octopus Users,
I'm trying to run octopus_mpi on a cluster and it fails completely, I
guess because one node on the cluster is down.
There is one node down (n166). It has been taken offline. When I run
octopus_mpi it creates a file called PI### where the ### are some
numbers. This file contains a listing of nodes and the path to the
octopus_mpi executable. But in that file, only the first node is one of
the nodes that the queueing system assigned to my job, and the rest are
n166. The job fails immediately because it cannot ssh to n166. The
PI### file disappears and no other files are created by octopus.
So what I'd like to know is, what part of the octopus code creates this
file PI###? I'd like to look at the source code and try to figure out
what it might be doing, and how it is interacting with our cluster.
Unfortunately PI is not a rare string that I can easily search for.
Thanks, Emily
Here's an example of the node list file:
$ cat PI21837
n456 0 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
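
Because the PI### file disappears as soon as the job fails, a saved copy could be checked against the nodes the queueing system actually assigned. The following is only a rough sketch of such a check. It assumes each line has the form "<host> <number> <path>" as in the listing above. The function names and the `assigned` set are hypothetical, and the real assigned-node list would have to come from the queueing system.

```python
from collections import Counter

# Copy of the node list file shown above.
SAMPLE = """\
n456 0 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
n166 1 /home/et/octopus/Oct3/bin/octopus_mpi
"""


def parse_node_file(text):
    """Parse lines of the assumed form '<host> <number> <path>'."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        host, number, path = parts
        entries.append((host, int(number), path))
    return entries


def unexpected_hosts(entries, assigned):
    """Count entries whose host is not in the assigned set."""
    return Counter(host for host, _, _ in entries if host not in assigned)


if __name__ == "__main__":
    # Hypothetical: the set of nodes the queueing system gave the job.
    assigned = {"n456"}
    bad = unexpected_hosts(parse_node_file(SAMPLE), assigned)
    for host, count in bad.items():
        print(f"{host}: {count} entries not in the assigned node list")
```

Run on the listing above with only n456 marked as assigned, this reports the five n166 entries, which matches what shows up by eye in the file.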