Server Room Temperature Monitoring

Šarūnas Burdulis
Version 1, August 10, 2005

Contents

1 Introduction
2 Temperature Sensors
3 DigiTemp software
4 Automated Monitoring

1 Introduction

Server room monitoring has to be automated so that administrators are warned when the room temperature rises above a certain level, which most likely indicates a failure of the air conditioning system. The thermostat is set to keep the server room at 72 F, so the monitor should start sending warnings once the temperature reaches, for example, 82 F. In addition, the monitor should shut the servers down if the temperature exceeds a set maximum. For IBM e325 servers (math-0n.grid) the maximum operating temperature is 95 F, and it should be similar for the Dell machines (gauss and webwork). We set 90 F as the upper limit, i.e. the point at which servers are shut down to avoid permanent damage. Email notifications are sent to both the @math and @dartmouth.edu accounts of system and department administrators.

2 Temperature Sensors

We use semiconductor sensors on a 1-wire bus. The sensors are Dallas Semiconductor DS9097-U, packaged and supplied by iButtonLink as a “Link45” kit. The kit includes two sensors in plastic enclosures, Ethernet cables and an RJ45-to-RS232 adapter. The temperature range of the DS9097-U is at least -55 to +125 C and the precision is within 1 C. The sensors draw parasitic power from the serial bus.

Both sensors are on the same 1-wire bus, which is connected to the second serial port (/dev/ttyS1) on gauss. Sensor 0 is placed on top of the gauss case, sensor 1 on top of the webwork case.

3 DigiTemp software

DigiTemp is free, open-source software that scans a 1-wire bus and reads sensor data. It is available as the digitemp package in Debian and includes digitemp_* binaries for reading data from specific sensors. As we use DS9097-U, a symlink is created for convenience:

# ln -s /usr/bin/digitemp_DS9097U /usr/bin/digitemp

DigiTemp requires a configuration file, either .digitemprc in the current directory or one specified on the command line. A starting .digitemprc can be created with:

# digitemp -i -s /dev/ttyS1

Among other things, the DigiTemp config file defines the format string in which sensor data is written to STDOUT. There is no central DigiTemp config file on gauss; we specify one on the command line depending on the task (MRTG graphing or the monitoring/shutdown script).

4 Automated Monitoring

Monitoring is done by the /usr/local/sbin/local-1utemp shell script, run by cron every 15 minutes. The DigiTemp configuration file used by the monitoring script is /usr/local/sbin/local-1utemp-digitemp.conf. It sets the output format with HUM_FORMAT "%.0F", so that temperature is returned in degrees Fahrenheit, one line per sensor. The script (included at the end of this section) reads temperatures ($TEMP0, $TEMP1) from both sensors and compares them to the preset warning and shutdown thresholds ($WARN_TEMP, $SHUT_TEMP). If either reading exceeds $SHUT_TEMP, an immediate shutdown is initiated on the machines listed in $SERVERS and then on gauss, and an email notification is sent to the addresses listed in $ADMINS. If only $WARN_TEMP is exceeded, only an email warning is sent.
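The 15-minute schedule could be expressed as a system crontab entry. The sketch below is an assumption rather than the installed configuration: the text does not say whether the job lives in root's crontab or under /etc/cron.d, and the helper name write_cron_entry and the target file are hypothetical.

```shell
# Hypothetical helper: write a cron.d-style entry that runs the monitoring
# script every 15 minutes as root. The target file is passed as an argument,
# so the function can be tried on a scratch file first.
write_cron_entry() {
    # Fields: minute hour day-of-month month day-of-week user command
    printf '%s\n' '*/15 * * * * root /usr/local/sbin/local-1utemp' > "$1"
}

# Example (as root), assuming /etc/cron.d is used on gauss:
#   write_cron_entry /etc/cron.d/local-1utemp
```

Anything the script prints is mailed by cron to the owner of the crontab, which is separate from the notifications the script itself sends to $ADMINS.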

Remote shutdown. Machines other than gauss are shut down remotely via SSH. This is arranged as follows:

  1. A user mathdown is added to each system that should be remotely “shutdownable”. The directory /home/mathdown/.ssh is created.
  2. mathdown is added to /etc/sudoers for /sbin/shutdown, with no password asked:
    mathdown ALL=NOPASSWD: /sbin/shutdown

  3. The public key of root@gauss (/root/.ssh/id_dsa.pub) is copied to the remote machines, into the file
    /home/mathdown/.ssh/authorized_keys. The key is not passphrase-protected.

As a result, any system prepared in such a way can be remotely shut down from gauss by:

# ssh mathdown@hostname sudo shutdown -h -P now "message to users"
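
Before relying on this in an emergency, the SSH and sudo setup can be checked without halting anything. The sketch below only builds and prints the commands. It uses shutdown's -k option (send the warning message but do not actually shut down), the same option the monitoring script keeps commented out for testing. The function name build_shutdown_cmd and the short host list are illustrative assumptions.

```shell
# Hypothetical helper: print the remote shutdown command for one host.
# $1 = host, $2 = extra shutdown flag (e.g. "-k" for a test run), $3 = message
build_shutdown_cmd() {
    printf "ssh mathdown@%s 'sudo shutdown %s -h -P now \"%s\"'\n" "$1" "$2" "$3"
}

# Print (do not run) a test command for two of the servers.
for host in webwork math-01.grid; do
    build_shutdown_cmd "$host" "-k" "TEST ONLY, NOT A REAL SHUTDOWN."
done
```

Running one of the printed lines by hand from root@gauss should broadcast the message on the remote machine without prompting for a password.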

/usr/local/sbin/local-1utemp script:

#! /bin/bash
#
# Script to read temperature from sensors in 1U (attached to gauss:/dev/ttyS1)
# and either notify admins or shutdown servers, if temperature
# is above defined limits.
#
# Docs in /usr/local/share/doc/sysadmin/digitemp/
#
# Sarunas, 2005 08 05
#

MAIL=/usr/bin/mail
DIGITEMP=/usr/bin/digitemp
SSH=/usr/bin/ssh

# whom to notify by email
ADMINS="sarunas trs awg sysadmin helpdesk deptadmin \
Anne.Webster.Grant@Dartmouth.EDU \
Thomas.R.Shemanske@Dartmouth.EDU \
Sarunas.Burdulis@Dartmouth.EDU"

# user that is allowed /sbin/shutdown
# in /etc/sudoers on remote machines
# and has root@gauss' pub. key in .ssh/authorized_keys
REMOTEUSER=mathdown

HOSTNAME=`hostname -s`
TIMESTAMP=`/bin/date +%Y-%m-%d/%H:%M`

# Temp.(F deg.) limit to start sending warnings
WARN_TEMP=70
# Temp.(F deg.) to initiate shutdown of servers
SHUT_TEMP=90

# -k --- test only
#SHUT_MESSAGE="TEST ONLY, NOT A REAL SHUTDOWN. Server room 1U overheated."
#SHUT_COMMAND="shutdown -k -h -P now \"${SHUT_MESSAGE}\" 2>&1"
SHUT_MESSAGE="Server room 1U overheated."
SHUT_COMMAND="shutdown -h -P now \"${SHUT_MESSAGE}\" 2>&1"

# servers to shutdown via SSH
SERVERS="webwork \
math-01.grid \
math-02.grid \
math-03.grid \
math-04.grid \
math-05.grid \
math-06.grid "
# Note: gauss is shut down locally, by $SHUT_COMMAND on a separate line

#SERVERS="math-02.grid"

# get temperature from sensors
# sensor 0
TEMP0=`$DIGITEMP -q -t 0 -c /usr/local/sbin/local-1utemp-digitemp.conf 2>&1`
# sensor 1
TEMP1=`$DIGITEMP -q -t 1 -c /usr/local/sbin/local-1utemp-digitemp.conf 2>&1`

# if either of sensors returns temp. in excess of $SHUT_TEMP
#   --- notify $ADMINS, shutdown $SERVERS, shutdown gauss
# else if in excess of $WARN_TEMP
#   --- notify $ADMINS
#
if [ $TEMP0 -gt $SHUT_TEMP ] || [ $TEMP1 -gt $SHUT_TEMP ] ; then

    # make a printable list
    for SRV in $SERVERS ; do
        SRVS="${SRVS}\n\t$SRV"
    done

    # send mail
    SUBJECT="EMERGENCY: Shutting servers down!"
    MESSAGE="\n\
             \nWARNING: This is an emergency!\n\
             \nAir temperature in 1U has reached the ${SHUT_TEMP}F limit.\n\
             \n\tSensor 0:  ${TEMP0}F\
             \n\tSensor 1:  ${TEMP1}F\
             \n\tDate/Time: ${TIMESTAMP}\
             \n\n\
             \nSERVERS LISTED BELOW WILL BE SHUT DOWN NOW:\
             ${SRVS}\
             \n\tgauss\n\
             \nPlease check 1U immediately. Call Work Control 6-2508.\n\
             \nEND"
    echo -e "${MESSAGE}" | $MAIL -s "$SUBJECT" $ADMINS

    # remote shutdown via ssh
    for SRV in $SERVERS ; do
        CMD="$SSH $REMOTEUSER@$SRV 'sudo $SHUT_COMMAND'"
        eval $CMD
        #echo $CMD
    done

    #######################
    #                     #
    # shutdown gauss !!!  #
    #                     #
    eval $SHUT_COMMAND    #
    #                     #
    #######################

elif [ $TEMP0 -gt $WARN_TEMP ] || [ $TEMP1 -gt $WARN_TEMP ] ; then

    # send warning by mail
    SUBJECT="WARNING: Air temp. in 1U is >${WARN_TEMP}F!"
    MESSAGE="ONLY\n\
             \nWARNING: This is an emergency!\n\
             \n\tSensor 0:  ${TEMP0}F\
             \n\tSensor 1:  ${TEMP1}F\
             \n\tDate/Time: ${TIMESTAMP}\
             \n\n\
             \nPlease check 1U immediately. Call Work Control 6-2508. See http://gauss/mrtg.\n\
             \nServers will be shut down automatically when air temperature reaches ${SHUT_TEMP}F.\n\
             \nEND"
    echo -e "${MESSAGE}" | $MAIL -s "$SUBJECT" $ADMINS
fi

# the end
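
One case the script does not guard against is a failed read. If digitemp prints an error message instead of a number, the numeric tests fail with an error and neither branch runs, so no email reaches $ADMINS. A minimal sketch of a guard follows, assuming whole-number readings as produced by the "%.0F" format. The function names is_integer and classify_reading are hypothetical and not part of the original script, and the 82 F and 90 F values are the example thresholds from the introduction.

```shell
# Succeeds only if $1 is an optionally signed whole number.
is_integer() {
    case "$1" in
        # empty, lone minus, any non-digit/non-minus char, or a minus
        # sign anywhere other than the first position
        ''|-|*[!0-9-]*|?*-*) return 1 ;;
        *) return 0 ;;
    esac
}

# Prints one of: invalid, shutdown, warn, ok.
# $1 = reading, $2 = warning threshold, $3 = shutdown threshold (all in F).
# Comparisons are strict (-gt), as in the script above.
classify_reading() {
    if ! is_integer "$1"; then
        echo invalid
    elif [ "$1" -gt "$3" ]; then
        echo shutdown
    elif [ "$1" -gt "$2" ]; then
        echo warn
    else
        echo ok
    fi
}

classify_reading 75 82 90             # ok
classify_reading 85 82 90             # warn
classify_reading 91 82 90             # shutdown
classify_reading "read error" 82 90   # invalid
```

An invalid reading would probably warrant its own email, since a dead sensor or a disconnected bus leaves the room unmonitored.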