After yesterday’s Ubuntu upgrade process (once again) broke the Ubuntu installation on my old laptop, I had the chance to back up the most important folders before reinstalling. While doing so, I realized that the folders were much larger than expected—mostly due to unnecessary files I hadn’t touched in ages.

I was struck by a cleaning frenzy and started looking for a simple tool that would show me which folders were most cluttered with “file corpses.” I could then exclude these from the backup and delete them.

File corpses = files downloaded ages ago and never touched since. In other words, access_time(corpse) < now – x_days.
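The definition above can be tried on a single file. This is only a minimal sketch: the function name is_corpse is made up for illustration, and it assumes GNU stat, whose -c %X prints the last access time in seconds since the epoch.

```shell
# is_corpse FILE DAYS: succeeds if FILE was last accessed more than DAYS days ago.
# Hypothetical helper; assumes GNU stat (-c %X = access time as epoch seconds).
is_corpse() {
    local file=$1 days=$2
    local now atime
    now=$(date +%s)
    atime=$(stat -c %X -- "$file") || return 2
    # access_time(file) < now - x_days, with x_days converted to seconds
    [ "$atime" -lt $(( now - days * 86400 )) ]
}
```

Called as is_corpse somefile 100, it returns success only if the last access is more than 100 days back. On filesystems mounted with noatime, access times are not updated, so the result says little there.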

I want to find these corpses so I can delete them afterwards (after manually checking them). They’re usually outdated installation files for some software (which I rarely need anymore) or results of old, finished analyses. Generally, lots of junk that I can safely get rid of.

Whitelist

Sometimes, however, there are also files/folders I’d like to keep for later (“just in case I need it again”—hehehe). That means I want to maintain a whitelist of files and folders to exclude from the search, since I consider them important.

Directory grouping

I also quickly noticed that things get messy if I just print all those files to STDOUT. So I wanted the files grouped by directory and only the number of corpses per directory printed.

Since I only found very bare-bones scripts online without whitelist functionality, I went ahead and wrote a small Bash script myself, based on find, grep, and gawk.

  • First, find searches the given directory tree for all files (not folders) that haven’t been accessed in at least x days.
  • Then, grep checks which of these files match at least one of the whitelist patterns and filters those out.
  • Finally, gawk groups the corpses per directory. It iterates over each file corpse and increments the corpse count by one for each of its parent directories up to root.
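The first two steps can be tried in a throwaway directory before pointing them at real data. The sketch below is an illustration, not the script itself: the file names are invented, it assumes GNU touch (-a -d backdates the access time), and it treats whitelist entries as literal strings via grep -F.

```shell
# Build a small test tree; touch -a -d (GNU touch) backdates the access time.
demo=$(mktemp -d)
mkdir -p "$demo/keep" "$demo/junk"
touch "$demo/junk/fresh.txt"
touch -a -d '200 days ago' "$demo/junk/old.txt" "$demo/keep/old.txt"

# Whitelist: one literal string per line; everything under keep/ is protected.
printf '%s\n' "$demo/keep/" > "$demo/whitelist"

# Step 1: regular files whose last access is more than 100 days ago.
# Step 2: drop every path that contains a whitelist entry.
corpses=$(find "$demo" -type f -atime +100 | grep -vFf "$demo/whitelist")
echo "$corpses"

rm -rf "$demo"
```

Only junk/old.txt is reported: fresh.txt is too recent, and keep/old.txt is removed by the whitelist. Note that find counts whole 24-hour periods, so -atime +100 means "more than 100 full days".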

Example:

If the file /home/user/Downloads/test.txt is a corpse, the count is incremented for:

  • /
  • /home/
  • /home/user/
  • /home/user/Downloads/

At the end, gawk writes the counts per directory to STDOUT, sorted by directory.
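The counting step can be looked at in isolation. The following sketch is a variant, not the code from the script: it uses plain awk plus sort instead of gawk's asorti, the function name count_dirs is hypothetical, and the second input path is invented to show a count above one.

```shell
# count_dirs: read file paths on stdin and print "count<TAB>directory" for
# every parent directory, sorted by directory name.
count_dirs() {
    awk 'BEGIN { OFS = "\t" }
    {
        # every prefix of the path that ends in "/" is a parent directory
        n = length($0)
        for (i = 1; i <= n; i++)
            if (substr($0, i, 1) == "/")
                counts[substr($0, 1, i)]++
    }
    END { for (dir in counts) print counts[dir], dir }' |
        LC_ALL=C sort -t "$(printf '\t')" -k2
}

printf '%s\n' /home/user/Downloads/test.txt /home/user/Downloads/old/a.iso | count_dirs
```

Both files are counted for /, /home/, /home/user/ and /home/user/Downloads/, so those lines show 2, while /home/user/Downloads/old/ shows 1.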

The script

#!/bin/bash

# abort unless both arguments (base directory, number of days) are given
[ $# -eq 2 ] || { echo "Two arguments expected" >&2; exit 1; }

printf "Finding subfolders in %s with files not accessed at least %d days\n" "$1" "$2"
touch declutter.whitelist
# look for files in given folder which are older than given number of days
find "$1" -atime +"$2" -type f > declutter.result
# filter files based on whitelist (-F: entries are literal strings, not regexes)
grep -vFf declutter.whitelist declutter.result > declutter.filtered
# print number of such files per folder
printf "#files\tIn Directory\n"
gawk 'BEGIN { OFS = "\t" }
{   # count the file once for each parent directory (every prefix ending in "/")
    i = 0
    while ((newPos = index(substr($0, i + 1), "/")) > 0) { i += newPos; counts[substr($0, 1, i)]++ }
}
END {   # print the counts sorted by directory
    n = asorti(counts, sorted)
    for (i = 1; i <= n; i++) print counts[sorted[i]], sorted[i]
}' declutter.filtered

The script takes two arguments:

  • the base directory to search in
  • the number of days: a file counts as a corpse if its last access is more than this many days ago

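The whitelist file, declutter.whitelist, is read from the directory the script is run from (the script creates an empty one there if it is missing). It contains one path or path fragment per line, and every file whose path contains one of these entries is skipped: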

/.
/usr/
/opt/

The first line ensures that no hidden files/folders are considered. It’s also important to specify folders with a trailing /, otherwise folders whose names merely start with the same characters would be excluded too: e.g., the entry /home/user/test would also exclude everything in /home/user/test2.
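The effect of the trailing slash is easy to check by hand. A small sketch, assuming the entries are matched as literal strings (grep -F); the two file names are invented for the demonstration.

```shell
paths='/home/user/test/a.txt
/home/user/test2/b.txt'

# Without the trailing slash, both paths contain the entry and are dropped.
without=$(printf '%s\n' "$paths" | grep -vF '/home/user/test' || true)

# With the trailing slash, only the file inside test/ is dropped.
with=$(printf '%s\n' "$paths" | grep -vF '/home/user/test/')

echo "without slash: ${without:-<nothing left>}"
echo "with slash:    $with"
```

The first variant leaves nothing over, the second keeps /home/user/test2/b.txt.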

In my downloads directory, the script finds the following corpses (no access for more than 100 days) when run with bash declutter.sh /home/USERNAME/Downloads 100:

Finding subfolders in /home/USERNAME/Downloads with files not accessed at least 100 days
#files	In Directory
53	/
53	/home/
53	/home/USERNAME/
53	/home/USERNAME/Downloads/
10	/home/USERNAME/Downloads/RRW/
1	/home/USERNAME/Downloads/RRW/META-INF/
28	/home/USERNAME/Downloads/freduce/
3	/home/USERNAME/Downloads/freduce/R.packages/
25	/home/USERNAME/Downloads/freduce/bin/
2	/home/USERNAME/Downloads/gecko-1.2.1/
2	/home/USERNAME/Downloads/kual/
4	/home/USERNAME/Downloads/ncbi-blast-2.2.29+/
2	/home/USERNAME/Downloads/ncbi-blast-2.2.29+/bin/
1	/home/USERNAME/Downloads/ncbi-blast-2.2.29+/doc/
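For the manual check before deleting, the individual files behind one of these counts can be pulled from declutter.filtered, which the script leaves behind. This is a hedged sketch; the helper name corpses_in is made up and not part of the script.

```shell
# corpses_in DIR [LIST]: print the entries of LIST (default: declutter.filtered)
# whose path starts with DIR. Hypothetical helper for manual inspection.
corpses_in() {
    # pass the directory prefix to awk via the environment, with one trailing "/"
    dir="${1%/}/" awk 'index($0, ENVIRON["dir"]) == 1' "${2:-declutter.filtered}"
}
```

For example, corpses_in /home/USERNAME/Downloads/kual lists the files behind the kual/ line above. The match is anchored at the start of the path, so a sibling folder such as kual2/ is not included.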
