bwbug: beowulf network problem?

Frank Summers summers at stsci.edu
Mon Oct 7 13:47:21 PDT 2002


In June, Gregory Hildstrom posted that his cluster was experiencing
some lock-ups, possibly due to a high NFS load. I've just had a user
begin running some computations on my cluster that appear to
be causing a similar problem. I haven't looked at his code to determine
the NFS load, but it sends lots of MPI traffic across the network.

Over the last week several of my nodes have gone catatonic - alive,
but not responding. About half the time, they have woken up,
although with their system clocks out of sync. Other times I've
had to hard reboot.

The most recent failure matches what Gregory had found - no operating
system on the drive.

I was just wondering if Gregory or anyone else had found a
solution to this problem (beyond banning the offending user).
I've had a gigabit ethernet switch on order for some time, but
purchasing was slow and it got stuck at a west coast dock.

Frank

-- 
Space Telescope Science Institute           410-338-4749
3700 San Martin Drive                            410-338-4767 (FAX)
Baltimore, MD  21218                             summers at stsci.edu
             http://terpsichore.stsci.edu/~summers/


