Monday, March 11, 2013

Troubleshooting full filesystems


Question

This technote addresses some reasons a filesystem can become full, and ways to find out what may be filling it up in order to free the space. It also discusses some techniques to monitor filesystems for space.

Answer

Reasons a filesystem can be full
* Used cp or tar on sparse files to copy them into the filesystem.

If a sparse file, such as a database table, is copied into the filesystem using 'cp' or 'tar', those utilities will not preserve the sparseness. They fill any null space in the file with zeroes, which may make the copy much larger on disk.

Use the 'fileplace' command on the source file to check if it is sparse:

# fileplace myfile

File: sparse Size: 51226 bytes Vol: /dev/hd1
Blk Size: 4096 Frag Size: 4096 Nfrags: 1

Logical Extent
--------------
00006806 1 frags 4096 Bytes, 100.0%
unallocated 12 frags 49152 Bytes 0.0%

The "unallocated" file fragments are filesystem data blocks that are associated with this file, but contain no data. When 'cp' or 'tar' copies this file it will no longer be sparse.
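
Where 'fileplace' is not at hand, a rough check is to compare a file's apparent size with the space actually allocated to it. The following is a minimal sketch, not part of the original technote; the function names looks_sparse and check_sparse are hypothetical, and it assumes 'ls -l' prints the size in the fifth column and 'du -k' reports allocated kilobytes.

```shell
# Return success (0) when the allocated space is smaller than the
# apparent size, which is the signature of a sparse file.
# Arguments: apparent size in bytes, allocated space in KB.
looks_sparse() {
    [ $(( $2 * 1024 )) -lt "$1" ]
}

# Report on one file; size comes from 'ls -l', allocation from 'du -k'.
# (check_sparse is a hypothetical helper name.)
check_sparse() {
    size_bytes=$(ls -l "$1" | awk '{ print $5 }')
    alloc_kb=$(du -k "$1" | awk '{ print $1 }')
    if looks_sparse "$size_bytes" "$alloc_kb"; then
        echo "$1: possibly sparse ($size_bytes bytes, $alloc_kb KB allocated)"
    else
        echo "$1: not sparse"
    fi
}
```

With the figures from the fileplace output above (51226 bytes in size, one 4096-byte fragment allocated), looks_sparse 51226 4 succeeds. The result is only a hint, since compressed filesystems can also allocate less space than the file size.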

For more information see Technote T1000145 - About Sparse Files

* Large log files or data files created by an application

Use the "-size" option of the find command to look for large files. If you specify "+number", find reports all files larger than that number. The "-size" argument is in 512-byte blocks:

# find /mtpt -xdev -size +2048 -ls

This will:
  • search all files and directories under the mount point /mtpt
  • not descend into other filesystems mounted lower in the filesystem tree (-xdev)
  • report on files larger than 2048 512-byte blocks (-size +2048)
    2048*512 = 1048576 bytes, or 1 MB
  • list the files found using output similar to 'ls -l' (-ls)

If that threshold lists too many files, raise it in larger increments:

2048 = 1 MB
20480 = 10 MB
204800 = 100 MB
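
The block counts above follow from dividing the size in bytes by 512. A small helper, shown here only as a sketch with the hypothetical name mb_to_blocks, saves doing the arithmetic by hand:

```shell
# Convert megabytes to the 512-byte blocks that 'find -size' expects.
# (mb_to_blocks is a hypothetical helper name.)
mb_to_blocks() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

# Example: list files larger than 10 MB under a mount point.
# find /mtpt -xdev -size +"$(mb_to_blocks 10)" -ls
```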

* Other general search techniques

If the filesystem has recently filled up, use the -newer flag to find recently modified files. The -newer flag compares against the modification time of a reference file, which can be created with the following touch command:

$ touch <mmddhhmm> <filename>

From left to right, the following correspondences apply:
    • mm is month
    • dd is day
    • hh is hour (24 hour format)
    • mm is minute

Execute the following command:

# find /mtpt -xdev -newer <touched_file> -ls
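
The two steps can be wrapped in small functions. This is a sketch under stated assumptions: it uses the POSIX 'touch -t' form with a full CCYYMMDDhhmm timestamp instead of the mmddhhmm form shown above, and the names make_reference and files_newer_than are hypothetical.

```shell
# Create (or re-stamp) a reference file with a chosen modification time.
# STAMP uses the POSIX 'touch -t' form CCYYMMDDhhmm, e.g. 201303110900.
make_reference() {
    touch -t "$1" "$2"
}

# List regular files under DIR that were modified after REF,
# without crossing into other filesystems.
files_newer_than() {
    find "$1" -xdev -type f -newer "$2" -print
}

# Example:
# make_reference 201303110900 /tmp/ref
# files_newer_than /mtpt /tmp/ref
```

Keeping the reference file outside the filesystem being searched avoids adding to a filesystem that is already full.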

Another useful flag for the find command, -mtime, locates files that have been modified in the last 24 hours.

For example:

# find /<filesystem_name> -xdev -mtime 0 -ls

* Use du to add up the file and directory sizes in the filesystem:

# du -xk

This gives total sizes in KB for files and directories. The -x flag keeps du within the same filesystem.

If the KB figures are too large to read easily, "m" and "g" can be used for megabytes and gigabytes:

# du -xm      Shows output in MB
# du -xg      Shows output in GB

Using du together with the sort command lets you identify the biggest directories, which appear last in the sorted output:

# du -xk | sort -n
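
On a large filesystem the sorted list can be very long. One way to keep only the largest entries, sketched here with the hypothetical name largest_entries, is to reverse the sort and cut the output with head:

```shell
# Keep the N largest entries (default 10) from 'du -k' style output
# read on standard input, largest first.
# (largest_entries is a hypothetical helper name.)
largest_entries() {
    sort -rn | head -n "${1:-10}"
}

# Example: the 20 largest directories in the current filesystem.
# du -xk | largest_entries 20
```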

See Technote T1000401 - Why Numbers from "du -s" and "df" Disagree

NOTE: Before removing any files, the user should check to see if the file is currently in use by an active user process. Execute the following command:

# fuser <filename>

<filename> is the file being checked for active user processes. If a file is open at the time of removal, only its directory entry is removed. The blocks allocated to the file are not freed until the process holding it open closes it or exits. An open file that has been removed is still counted as "used" in the output from df, but is not visible via 'ls' or 'du'.
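
The behaviour of removed-but-open files can be seen with a short experiment. The sketch below is illustrative only; demo_open_unlinked is a hypothetical name and the shell itself plays the role of the process holding the file open.

```shell
# Show that a removed file stays alive while a process holds it open.
# Argument: path of a scratch file to create and remove.
demo_open_unlinked() {
    echo "still here" > "$1"
    exec 3< "$1"                # this shell now holds the file open
    rm "$1"                     # the name disappears from the directory
    [ ! -e "$1" ] || return 1   # 'ls' would no longer show it
    cat <&3                     # the data is still readable via descriptor 3
    exec 3<&-                   # closing the descriptor lets the blocks be freed
}

# Example:
# demo_open_unlinked /tmp/scratch.$$
```

The same thing happens when a log file is removed while its daemon is still running: the space is released only when the daemon closes the file or is stopped.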

* Application or command core dumps

Core dumps can be very large. Use the find command to search for them:

# find /mtpt -xdev -name 'core*' -ls

* Filesystem mounted on non-empty directory

Sometimes an application or subsystem will be started before the filesystem required to hold the output from it is mounted. For example if auditing is set to write trail files to the /audit filesystem, but /audit is not mounted when auditing is started, it will write them to the *directory* /audit in the root filesystem. Later mounting /audit will obscure the trail files that are being used and may eventually fill up the root filesystem.

The best solution is to unmount filesystems by hand and check the contents of their mount point directories. If this cannot be done in multiuser mode because applications are running, boot into single-user mode, where only /, /usr, /var and /tmp are mounted. Technote T1011796 - Booting AIX in Single-User Mode can be used as a guide.


Specific AIX filesystems

If root (/) is full
    * Check for filesystems mounted over directories containing data

    Files or data may have been copied into a directory before the filesystem was mounted on it; once the filesystem is mounted, it hides those files. In this case du run on the filesystem will show a very low number, but df will report the real available space in the filesystem. For example:

    We create a filesystem for oracle data:
    # crfs -v jfs2 -g oravg -m /oracle -a size=200M

    but forget to mount it before copying data into it:
    # cp /lots_of_data /oracle

    Then mount it afterwards:
    # mount /oracle

    To fix this, you must unmount /oracle and remove the files in the /oracle directory. This can affect any filesystem mounted over another, but most often affects the root filesystem.
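
    A quick way to spot this situation is to compare what df counts as used with what du can actually see. The following is a rough sketch, not from the original technote: df_du_gap is a hypothetical name, and it assumes 'df -kP' prints the POSIX single-line format with used kilobytes in the third column.

```shell
# Print the difference, in KB, between the space df reports as used
# for the filesystem holding PATH and the space du can account for
# under PATH. (df_du_gap is a hypothetical helper name.)
df_du_gap() {
    used_kb=$(df -kP "$1" | awk 'NR == 2 { print $3 }')
    du_kb=$(du -sxk "$1" 2>/dev/null | awk '{ print $1 }')
    echo $(( ${used_kb:-0} - ${du_kb:-0} ))
}

# Example: run against a mount point such as the root filesystem.
# df_du_gap /
```

    A small gap is normal because of filesystem metadata. A large gap suggests files hidden under a mount point, or files that were removed while still open.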

    * Check the /etc/security/failedlogin file
    Use the following command to read the contents of the file.

# who /etc/security/failedlogin
    The condition of TTYs respawning too rapidly will create failed login entries. To clear the file after reading or saving the output, execute the following command:

# cp /dev/null /etc/security/failedlogin
    * Check the /dev directory
    If a device name is typed incorrectly, as in rmto instead of rmt0, a file will be created in /dev called rmto. The command will normally proceed until the entire root file system is filled before failing. /dev is part of the root (/) file system. Look for entries that are not devices (that do not have a major or minor number).
    Execute the following:

# cd /dev
# ls -l | more
    Where an ordinary file shows its size, a device file shows two numbers (the major and minor numbers) separated by a comma.
    Example:

crw-rw-rw- 1 root system 12, 0 Oct 25 10:19 rmt0
    If the output looks like the following, with a single large size in place of the major and minor numbers, the file should be removed.

-rw-rw-rw- 1 root system 9375473 Oct 25 10:19 rmto
      NOTE: The /dev directory has some valid file names. Look for a file that has a large size (larger than 500 bytes).
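
    Instead of paging through 'ls -l', find can pick out the candidates directly. This is a sketch with the hypothetical name suspicious_dev_files; it lists regular files larger than 500 bytes, matching the rule of thumb in the note above.

```shell
# List regular (non-device) files larger than 500 bytes in a directory,
# /dev by default. The 'c' suffix makes -size count bytes.
# (suspicious_dev_files is a hypothetical helper name.)
suspicious_dev_files() {
    find "${1:-/dev}" -xdev -type f -size +500c -print
}

# Example:
# suspicious_dev_files
```

    Review each file it reports before removing anything.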
    * If system auditing is running, the /audit directory (default) may rapidly fill up and require attention.

    * Check for very large files in / with the find commands above.


If /var is full
    * In /var/tmp, check for old leftover files.
    * Check for a large wtmp file

    /var/adm/wtmp is a file that is used to log all logins, rlogins and telnet sessions. Unless system accounting is running, which clears it out nightly, the file will grow indefinitely if it is not monitored. /var/adm/wtmp can either be cleared out or edited to remove old and unwanted information.
    To clear /var/adm/wtmp, execute the following:

# cp /dev/null /var/adm/wtmp
    To edit the file and remove unwanted entries, execute the following:

# /usr/sbin/acct/fwtmp < /var/adm/wtmp >/tmp/out
    Edit the /tmp/out file to remove unwanted entries then put the edited version back in wtmp by executing the following command:

# /usr/sbin/acct/fwtmp -ic < /tmp/out > /var/adm/wtmp
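
    Clearing a log with 'cp /dev/null', as shown for failedlogin and wtmp, empties the file in place instead of replacing it, so its inode, owner and permissions are kept. The sketch below illustrates this on a scratch file; the names empty_in_place and inode_of are hypothetical. It does not apply to errlog, which must not be cleared this way.

```shell
# Empty a file in place; the inode, owner, and mode stay the same.
# (empty_in_place and inode_of are hypothetical helper names.)
empty_in_place() {
    cp /dev/null "$1"
}

# Print a file's inode number, to compare before and after.
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

# Example:
# inode_of /var/adm/wtmp; empty_in_place /var/adm/wtmp; inode_of /var/adm/wtmp
```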
    * In the /var/adm/ras directory, clear the error log

    This directory contains the error log, errlog. It is never cleared unless it is cleared manually. DO NOT cp /dev/null to it, as that will disable the error logging functions of the system. In that case the zero (0) length errlog file must be replaced from a backup tape.
    First, stop the error daemon by entering:

# /usr/lib/errstop
    Second, remove the following file, or move it to a different filesystem:

/var/adm/ras/errlog

    NOTE: The historical error data is deleted if you remove the errlog file.

    Third, restart the error daemon by entering:

# /usr/lib/errdemon
    * Check for any trace files

    There may be a trace file in /var/adm/ras. The trcfile file in this directory may be large because of a previous trace run. The file can be removed by executing the following:

# rm /var/adm/ras/trcfile
    * Check for vmcore files

    You may also have vmcore* files in the /var/adm/ras directory if your dump device is set to hd6 (which is the default). If these files are old or you do not wish to pursue them, you may remove them.
    * Check for spool files

    The /var/spool directory contains the queueing subsystem files. Clear the queueing subsystem by executing the following commands:
      1. # stopsrc -s qdaemon
      2. # rm /var/spool/lpd/qdir/*
      3. # rm /var/spool/lpd/stat/*
      4. # rm /var/spool/qdaemon/*
      5. # startsrc -s qdaemon
    * Check for accounting files
    The /var/adm/acct directory contains accounting records. If accounting is running, this directory may contain several large files. Information on how to manage these files can be found in System Management Guide Chapter 14 (SC23-2457-01).
    * Terminated vi session files

    The /var/preserve directory contains terminated vi sessions. Old vi sessions can be used to recover files from sessions that ended abnormally, but otherwise these files can be deleted. You may want to keep some of the newer ones in case users need to recover files. To recover a file, execute the following:

$ vi -r <filename>
    Running vi -r without a filename lists all files that are recoverable.
    * Modify /var/adm/sulog
    This file records each attempted use of su and whether it was successful. It is a flat file and can be viewed and modified with any text editor. If it is removed, it will be recreated by the next attempted su.
    * Modify /var/tmp/snmpd.log
    This is used by the snmpd daemon as a log. If the file is removed it will be recreated by the snmpd daemon.
    To keep this file from growing indefinitely, limit its size by editing the size setting in the /etc/snmpd.conf file. The value is in bytes.
    * Check for large mailboxes
    Files in /var/spool/mail/ are flat text files that serve as users' mailboxes. You can move them out of the way or zero them out if you are sure the users no longer need the mail.
Use the skulker utility
AIX provides a general system cleanup script called skulker located in the /usr/sbin directory. Before attempting to run the skulker command, look at the skulker entry in the product documentation. Read the script for details to determine what files it will delete and what time frame it will allow files to exist before deletion.

skulker may be run as a cron job using the following crontab entry:

0 3 * * * /usr/sbin/skulker

Consider limiting the errlog by running these entries in cron:

0 11 * * * /usr/bin/errclear -d S,O 30
0 12 * * * /usr/bin/errclear -d H 90
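
For monitoring filesystems for space, a simple threshold check can also be run from cron. The following sketch is not from the original technote: over_threshold is a hypothetical name, and it assumes 'df -kP' output in the POSIX format, with the capacity percentage in the fifth column and the mount point in the sixth (mount points containing spaces are not handled).

```shell
# Read 'df -kP' output on standard input and print the mount point and
# capacity of every filesystem at or above the given percentage.
# (over_threshold is a hypothetical helper name.)
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # strip the percent sign
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Example: report filesystems that are 90% full or more.
# df -kP | over_threshold 90
```

When run from cron, any output is normally mailed to the owner of the crontab, so an empty result produces no mail.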
