bwbug: beowulf network problem?
summers at stsci.edu
Mon Oct 7 13:47:21 PDT 2002
In June, Gregory Hildstrom posted that his cluster was experiencing
some lock-ups, possibly due to a high NFS load. I've just had a user
begin running some computations on my cluster that appear to
be causing a similar problem. I haven't looked at his code to determine
the NFS load, but it sends lots of MPI traffic across the network.
Over the last week several of my nodes have gone catatonic - alive,
but not responding. About half the time, they have woken up,
although with their system clocks out of sync. Other times I've
had to hard reboot.
The last failure matched what Gregory had found: no operating
system on the drive.
I was just wondering if Gregory or anyone else had found a
solution to this problem (beyond banning the offending user).
I've had a Gigabit Ethernet switch on order for some time, but
purchasing was slow and it got stuck at a West Coast dock.
Space Telescope Science Institute 410-338-4749
3700 San Martin Drive 410-338-4767 (FAX)
Baltimore, MD 21218 summers at stsci.edu