Get your network cables out. Install Linux on the first non-head node. Follow these steps for each non-head node.
Going with my example node names and IP addresses, this is what I chose during setup:
    Workstation
    auto partition
    remove all partitions on system
    use LILO as the boot loader
    put boot loader on the MBR
    host name wolf01
    ip address 192.168.0.101
    add the user "wolf"
    same password as on all other nodes
    NO firewall
The ONLY package installed: network servers. Un-select all other packages.
It doesn't matter what else you choose; this is the minimum that you need. Why fill the box up with non-essential software you will never use? My research has been concentrated on finding that minimal configuration to get up and running.
Here's another very important point: when you move on to an automated install and config, you really will NEVER log in to the box. Only during setup and install do I type anything directly on the box.
When the computer starts up, it will complain if it does not have a keyboard connected. I was not able to modify the BIOS, because I had older discarded boxes with no documentation, so I just connected a "fake" keyboard.
I am in the computer industry, and see hundreds of keyboards come and go, and some occasionally end up in the garbage. I get the old dead keyboard out of the garbage, remove JUST the cord with the tiny circuit board up there in the corner, where the num lock and caps lock lights are. Then I plug the cord in, and the computer thinks it has a complete keyboard without incident.
Again, you would be better off modifying your BIOS, if you are able to. This is just a trick to use in case you can't get into the BIOS setup program.
After your newly installed box reboots, log on as root again, and...
do the same chkconfig commands stated above to set up the right services.
modify hosts; remove "wolfnn" from localhost, and just add wolfnn and wolf00.
install lam
create the /mnt/wolf directory and set up security for it.
do the ssh configuration
Up to this point, we are pretty much the same as the head node. I do NOT do the modification of the exports file.
Also, do NOT add this line to the .bash_profile:
    sh -c 'ssh-add && bash'
Recall that on the head node, we created a file "authorized_keys". Copy that file, created on your head node, to the ~/.ssh directory on each slave node. The HEAD node will log on to all the SLAVE nodes.
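A minimal sketch of that copy, assuming scp is installed, that you run it as the user wolf on the head node, and that ~/.ssh already exists on the slave from the ssh configuration step. The function name push_key is made up for this example.

```shell
# Hypothetical helper: copy the head node's authorized_keys to one slave.
# Run as user "wolf" on the head node; scp will ask for the password,
# because key-based login is not in place on the slave yet.
push_key() {
    node=$1
    scp "$HOME/.ssh/authorized_keys" "wolf@$node:.ssh/authorized_keys"
}
```

For the example node, you would call it as: push_key wolf01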
The requirement, as stated in the LAM user manual, is that there should be no interaction required when logging in from the head to any of the slaves. So, copying the public key from the head node into each slave node, in the file "authorized_keys", tells each slave that "wolf user on wolf00 is allowed to log on here without any password; we know it is safe."
However, you may recall that the documentation states that the first time you log on, ssh will ask for confirmation. So, only once, after doing the above configuration, go back to the head node and type ssh wolfnn, where "wolfnn" is the name of your newly configured slave node. Answer "yes" when it asks for confirmation, and that will be the last time you have to interact.
Prove it by logging off, and then ssh back to that node, and it should just immediately log you in, with no dialog whatsoever.
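If you would rather have the head node tell you whether a login is truly silent, here is a small sketch. It assumes OpenSSH, whose BatchMode option makes ssh fail rather than prompt. The function name check_node is made up for this example.

```shell
# Hypothetical helper: report whether a node accepts a login with no prompts.
# BatchMode=yes makes ssh give up instead of asking for a password or a
# host-key confirmation; ConnectTimeout keeps a dead node from hanging us.
check_node() {
    node=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
        echo "$node: ok"
    else
        echo "$node: needs attention"
    fi
}
```

Run it from the head node as the user wolf, for example: check_node wolf01. A node that still wants a password or the first-time "yes" shows up as "needs attention".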
As root, enter these commands:
    cat >> /etc/fstab
    wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0
    <control d>
What we did here was tell the node to automatically mount, at startup, the directory we exported in the /etc/exports file on the head node. More discussion regarding nfs later in this document.
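Typing that by hand twice would leave two identical lines in fstab. If you script the step, a sketch like the following adds the line only when it is missing. The function name add_wolf_mount is made up, and the file is passed in as a parameter so you can try it on a copy of fstab first.

```shell
# Hypothetical helper: append the NFS mount line only if it is not there yet.
# $1 is the fstab file to edit (try a copy before touching /etc/fstab).
add_wolf_mount() {
    fstab=$1
    entry='wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0'
    # -x matches the whole line, -F treats the entry as a fixed string
    if grep -qxF "$entry" "$fstab"; then
        echo "already present"
    else
        echo "$entry" >> "$fstab"
        echo "added"
    fi
}
```

As root: add_wolf_mount /etc/fstab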
Then modify /etc/lilo.conf.
The 2nd line of this file says
    timeout=nn
Modify that line to say:
    timeout=1200
After the file is modified, we apply the change. Type "/sbin/lilo", and it will display back "Added linux *" to confirm that it took the changes you made to the lilo.conf file:
    /sbin/lilo
    Added linux *
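For an automated setup, the same edit can be sketched without opening an editor. This assumes the file already has a line beginning with "timeout=", as described above. The function name set_lilo_timeout is made up for this example.

```shell
# Hypothetical helper: rewrite the timeout= line of a lilo.conf-style file.
# $1 is the file, $2 the new value (LILO counts it in tenths of a second).
set_lilo_timeout() {
    conf=$1
    tenths=$2
    # Write to a temporary file, then replace the original
    sed "s/^timeout=.*/timeout=$tenths/" "$conf" > "$conf.new" &&
        mv "$conf.new" "$conf"
    # Show the resulting line so the change can be checked by eye
    grep '^timeout=' "$conf"
}
```

As root: set_lilo_timeout /etc/lilo.conf 1200, and then run /sbin/lilo as shown above, because LILO only picks up the change when it is re-run.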
Why do I do this lilo modification? If you were researching Beowulf on the web, and understand everything I have done so far, you may wonder, "I don't remember reading anything about lilo.conf."
All my Beowulf nodes share a single power strip. I turn on the power strip, and every box in the cluster starts up immediately. As the startup procedure progresses, each box mounts its file systems. Because the non-head nodes mount the shared directory from the head node, they have to wait a little bit until the head node is up, with NFS ready to go. So I make each slave node wait 2 minutes in the lilo step: LILO counts the timeout in tenths of a second, so timeout=1200 is 120 seconds. Meanwhile, the head node comes up and makes the shared directory available. By the time lilo has finished waiting, the slave nodes can boot and mount it.