System Administration

IPMI tool for remote power management and wrong gateway issue

Almost all of Fractal's AMD nodes are equipped with IPMI management, which enables remote power on/off control. For example:

ipmitool -I lan -U <username> -H <ip here> chassis power status

ipmitool -I lan -U <username> -H <ip here> chassis power reset/on/off
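To avoid typos in the power action, the commands above can be wrapped in a small helper. This is only a minimal sketch: the function name ipmi_power_cmd, the user name and the IP address are made up for illustration, and the helper only prints the command (a dry run) instead of executing it.

```shell
# Hypothetical helper: print the ipmitool command for a chassis power action.
# It only prints the command line (dry run); nothing is sent to the node.
ipmi_power_cmd() {
    # $1 = IPMI user, $2 = BMC IP address, $3 = power action
    case "$3" in
        status|on|off|cycle|reset|soft) ;;
        *) echo "unsupported power action: $3" >&2; return 1 ;;
    esac
    printf 'ipmitool -I lan -U %s -H %s chassis power %s\n' "$1" "$2" "$3"
}

# Example with a placeholder user and a documentation-range IP address
ipmi_power_cmd admin 192.0.2.10 status
```

Once the printed command looks right, it can be copied to the terminal and run by hand.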

Some of the nodes (compute-0-11 to 22) stopped working after the reboot on May 22, 2018, because the gateway was wrong. I tried to fix this gateway problem by

su

ssh compute-0-22

ipmitool mc reset cold

ipmitool lan set 1 defgw ipaddr xxx.xx.xxx.x

...
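One way to check which gateway a node's BMC currently uses is to read it back from "ipmitool lan print 1". Below is a minimal sketch, assuming the output contains a "Default Gateway IP" line. The function name bmc_gateway and the addresses in the sample text are hypothetical.

```shell
# Hypothetical helper: extract the default gateway from the text printed by
# "ipmitool lan print 1", read from standard input.
bmc_gateway() {
    awk -F': ' '/^Default Gateway IP/ { print $2 }'
}

# Sample text in the layout of "ipmitool lan print" (addresses are made up)
sample='IP Address              : 192.0.2.22
Subnet Mask             : 255.255.255.0
Default Gateway IP      : 192.0.2.1
Default Gateway MAC     : 00:00:00:00:00:00'

printf '%s\n' "$sample" | bmc_gateway
```

On a real node the call would be "ipmitool lan print 1 | bmc_gateway", run as root.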

Read more about IPMI tool for remote power management and wrong gateway issue

Enabling the /scratch folder in a Rocks system

A scratch folder usually refers to a folder that is fast for reading and writing. If you use a supercomputer, like those in XSEDE or NERSC, you will find that the SCRATCH folder is the best place to run your simulations (though storing data there is not recommended). There may be multiple reasons why the scratch folder is faster than a normal disk. In our own in-house cluster, Fractal, the scratch folder is fast simply because it is local and large.

Fractal has one main node and multiple compute nodes, where most of the data is stored on four hard disks on the main node...

Read more about Enabling the /scratch folder in a Rocks system

Fixing DHCP failure for insert-ethers in Fractal

Two nodes in Fractal, compute-0-0 and compute-0-4, were down because of a wrong distro installation and were removed with the command

insert-ethers --remove="compute-0-0"

However, after we moved Fractal to MGHPCC and changed the main node's IP address, these compute nodes could not receive any DHCP installation offer from the main node. Since the nodes had already been deleted from Rocks' list, we used the standard command for installing a new node, insert-ethers, and set the node to boot from the PXE network. But after waiting for 10-15 minutes, the communication...

Read more about Fixing DHCP failure for insert-ethers in Fractal

Set up Environmental Module in a CentOS6 + Rocks system

Module is a convenient way to manage shell environment variables and aliases, and it is simple enough that novice Linux users can pick the software/compiler they want. As far as I know, there are two module packages: Lmod and Environmental Module, which uses the Tcl language.

Our cluster OS came with Environmental Module 3.2.10.

1. Customize module path (only needed for adding a new parent folder for modulefiles)

The Module software searches an environment variable, MODULEPATH, for available modulefiles. The first step is to add a...
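As an illustration of how MODULEPATH can be extended, here is a minimal sketch that prepends a directory only if it is not already listed. The function name add_modulepath and the directory /share/apps/modulefiles are hypothetical, not taken from our cluster setup.

```shell
# Hypothetical helper: prepend a directory to MODULEPATH unless it is
# already one of the colon-separated entries.
add_modulepath() {
    case ":${MODULEPATH:-}:" in
        *":$1:"*) ;;  # already listed, nothing to do
        *) MODULEPATH="$1${MODULEPATH:+:$MODULEPATH}" ;;
    esac
    export MODULEPATH
}

# Example with a made-up modulefile directory
add_modulepath /share/apps/modulefiles
echo "$MODULEPATH"
```

Inside an interactive session, "module use <directory>" has a similar effect. The function form is handy in a shell startup file because running it twice does not duplicate the entry.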

Read more about Set up Environmental Module in a CentOS6 + Rocks system

[SOLVED] slow ssh log-in

Problem: Slow SSH log-in

Fractal has been getting more and more troublesome recently... There is a huge lag for ssh and also for su. It took me a while to debug. In the end, my labmate, Aravind, found the culprit: log file explosion. Here are some of the symptoms observed and the final solution offered by Aravind. :)

1. Use "-vvv" mode for ssh to show debugging information.

$ ssh -vvv username@fractal.mit.edu
OpenSSH_7.2p2, LibreSSL 2.4.1
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 20: Applying options for...

Read more about [SOLVED] slow ssh log-in

[SOLVED] Shutdown failure and 'kernel panic' during the reboot

Problem

The cluster was shut down remotely because the AC in 4-033 needed to be fixed. Two days later, however, we found that the shutdown process had hung. All the nodes were off except the main node. The screen showed:

Shutting down. . . Shutting dow[ OK ]:

Shutting down xfs: [ OK ]

Stopping httpd: [OK]

Stopping sshd: [OK]

Shutting down postfix: [ OK ]

Shutting down dhcpd: [ OK ]

Shutting down MySQL: [ OK ]

Shutting down GANGLIA gmond: [ OK ]...

Read more about [SOLVED] Shutdown failure and 'kernel panic' during the reboot

Fractal moves to 4-033

After spending years on the third floor of NW12, without proper AC or proper mounting, Fractal has moved to Building 4 and opened a brand new era on the main campus! I also finally got the root password and started my life as a system admin, which means my name will be on Fractal's emergency contact list and I will be the one taking care of things whenever there is a power outage or a system problem. (Not sure whether this is good or bad :P)

The move took us, all the computational grad students in the group, 7 business days, plus a lot of waiting for shipments during Christmas and New Year...

Read more about Fractal moves to 4-033