bwbug: LinuxCluster Problem @ Naval Surface Warfare Center

Greg Hildstrom ghildstr at hotmail.com
Thu Jun 20 04:49:02 PDT 2002


Hello, and thank you in advance for your time.

Background:
-----------
I work for the Naval Surface Warfare Center in Bethesda, MD. We have just 
finished building our third small Linux cluster, a 6-node cluster for 
analyzing ship structure data.

Symptoms:
---------
The cluster works fine for a period of time (less than one day). The server 
node does its job until it has spent a significant amount of time under 
high network load (mostly NFS traffic); then it locks up hard with two 
keyboard lights flashing. Every subsequent reboot ends in a kernel panic 
saying that there is no ext3 filesystem on / (the root partition). So far, 
reinstalling the operating system is the only fix. The compute nodes are 
fine.
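
Because the lockup follows sustained network load, it may help to log 
per-interface byte counters to disk so that the last readings before a hang 
survive the reboot. The following is only a minimal sketch under stated 
assumptions: it assumes the usual /proc/net/dev layout (receive bytes in the 
first column, transmit bytes in the ninth), and the function names and the 
sample counter values are hypothetical, made up for illustration.

```python
# Sketch: derive per-interface byte rates from two /proc/net/dev snapshots.
# The sample text below is hypothetical; on a real node the text would come
# from open("/proc/net/dev").read().

SAMPLE_BEFORE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:     500       5    0    0    0     0          0         0      500       5    0    0    0     0       0          0
  eth0:    1000      10    0    0    0     0          0         0     2000      20    0    0    0     0       0          0
"""

SAMPLE_AFTER = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:     500       5    0    0    0     0          0         0      500       5    0    0    0     0       0          0
  eth0:    6000      60    0    0    0     0          0         0   102000     120    0    0    0     0       0          0
"""


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev-style text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        if len(values) < 9:
            continue  # not a full counter row
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def byte_rates(before, after, seconds):
    """Return {interface: (rx_bytes_per_s, tx_bytes_per_s)} between snapshots."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return rates


# Hypothetical usage: the two snapshots are assumed to be 10 seconds apart.
rates = byte_rates(parse_net_dev(SAMPLE_BEFORE), parse_net_dev(SAMPLE_AFTER), 10.0)
for name in sorted(rates):
    print(name, rates[name])
```

Appending each reading to a file and flushing it after every write would 
leave a record of the load on each NIC up to the moment of the lockup.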

Hardware Configuration:
-----------------------
1 Server Node:
-ASUS A7V266-E
-AMD Athlon XP 2000
-1.5 GB DDR 2100
-Seagate Barracuda 80 GB 7200 RPM Drive
-Hard Drive Plugged into the normal IDE/ATA Controller, not the Promise 
Controller
-5 x 3Com 3C905C-TX 100BT NICs with crossover cables for each compute node
-NVidia GeForce 2 Video Card
4 Compute Nodes:
-ASUS A7V266-E
-AMD Athlon XP 2000
-512 MB DDR 2100
-Seagate Barracuda 80 GB 7200 RPM Drives
-Hard Drive Plugged into the normal IDE/ATA Controller, not the Promise 
Controller
-3Com 3C905C-TX 100BT NIC
-ATI Rage Pro 128 Video Cards
1 Data Archive:
-P4 2.2 GHz
-6 x Seagate Barracuda 80 GB 7200 RPM Drives
-3Com 3C905C-TX 100BT NIC
-ATI Rage Pro 128 Video Card

Software Configuration:
-----------------------
Red Hat Linux 7.3
Stock Red Hat 7.3 Kernel
lam-6.5.6-4.i386.rpm //for cluster communication; this is what our software 
uses
NFS //this is how the nodes get the data from the server node
SMB //this is how we get our results to our windows users
FTP
RSH
Telnet
gkrellm-1.2.9-1.i386.rpm //for monitoring network traffic, cpu load, and 
disk io
linuxconf-1.28-1.i386.rpm //provided by linuxconf for lazy configuration
PowerChutePlus-4.5.3.i386.rpm //provided by APC for power protection
eth0 - 192.168.1.254 //connection to node1
eth1 - 192.168.2.254 //connection to node2
eth2 - 192.168.3.254 //connection to node3
eth3 - 192.168.4.254 //connection to node4
eth4 - DHCP          //connection to the on-base network
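
The static addresses above appear to put each crossover link on its own 
subnet, so that traffic for a given node leaves on the matching NIC. A 
minimal sketch of checking that with Python's ipaddress module follows. The 
/24 prefix (netmask 255.255.255.0) is an assumption, since the netmask is 
not listed above, and the helper names are hypothetical.

```python
import ipaddress

# Server-side addresses from the list above. The /24 prefix is an
# assumption; the netmask is not stated in the configuration.
SERVER_LINKS = {
    "eth0": "192.168.1.254/24",  # connection to node1
    "eth1": "192.168.2.254/24",  # connection to node2
    "eth2": "192.168.3.254/24",  # connection to node3
    "eth3": "192.168.4.254/24",  # connection to node4
}


def link_networks(links):
    """Map each interface name to the network its address belongs to."""
    return {name: ipaddress.ip_interface(addr).network
            for name, addr in links.items()}


def overlapping_pairs(networks):
    """Return interface pairs whose networks overlap (ideally none)."""
    names = sorted(networks)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if networks[a].overlaps(networks[b])]


networks = link_networks(SERVER_LINKS)
for name in sorted(networks):
    print(name, networks[name])
print("overlapping pairs:", overlapping_pairs(networks))
```

With a wider netmask (for example /16) all four links would fall into the 
same network and the check would report overlaps, which is one thing to 
rule out when traffic does not go out the expected interface.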

Please do not hesitate to email me or call me with suggestions. Thank you.

Gregory Alan Hildstrom
Research Scientist
George Washington University and
Naval Surface Warfare Center
ghildstr at hotmail.com or ghildstr at gwu.edu
Desk: 3012273967
Cell: 2406263703
