Find and delete duplicate files in Linux


Shell script for managing duplicate files

Duplicate files are redundant copies of the same content, so we often need to remove them and keep just a single copy. Screening files and folders to identify duplicates is a challenging but rewarding task, and it can be done using a combination of shell utilities. This tutorial deals with finding duplicate files and performing operations based on the result.

We can identify duplicate files by comparing file content. Checksums are ideal for this task, since files with exactly the same content produce the same checksum values. Once the duplicates are identified, we can proceed with deleting them. The removal can be done manually (for example, by moving all duplicates into a designated trash folder and deleting that folder yourself) or automatically via a shell script. In this tutorial we show you an automated method, using a shell script to locate and remove redundant files. Here is a good course to learn more about shell scripting in Linux.
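For example, a quick manual check with md5sum makes the idea concrete (the filenames here are just placeholders):

$ md5sum fileA fileB
# If both lines show the same 32-character hash, the two files have
# identical content, and one of them can be treated as a redundant copy.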

Automated method

Here are steps for locating and removing duplicate files automatically.

1. Generate some test files as follows:

$ echo "hello" > test ; cp test test_copy1 ; cp test test_copy2;
$ echo "next" > other;
# test_copy1 and test_copy2 are copy of test

2. The code for the script to remove the duplicate files is as follows:

#!/bin/bash
#Filename: remove_duplicates.sh
#Description: Locate and remove duplicate files and keep one sample of each file.

ls -lS --time-style=long-iso | awk 'BEGIN {
  # Skip the "total" line, then read ahead the first file entry.
  getline; getline;
  name1=$8; size=$5
}
{
  name2=$8; size2=$5;
  if (size==size2)
  {
    # "cmd" | getline replaces $0 with the command output, so the
    # checksum lands in $1; close() lets the command be rerun later.
    "md5sum "name1 | getline; csum1=$1; close("md5sum "name1);
    "md5sum "name2 | getline; csum2=$1; close("md5sum "name2);
    if ( csum1==csum2 )
    {
      print name1; print name2
    }
  }
  size=size2; name1=name2
}' | sort -u > duplicate_files


cat duplicate_files | xargs -I {} md5sum {} | sort | uniq -w 32 | awk '{ print $2 }' | sort -u > duplicate_sample

echo Removing..


comm -2 -3 duplicate_files duplicate_sample | tee /dev/stderr | xargs rm
echo Removed duplicate files successfully.

Note: You may need to make minor adjustments depending on your Linux distribution.

3. Run it as:

$ ./remove_duplicates.sh
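If the script is not yet executable, mark it so with chmod first. With the test files from step 1, a run should look roughly like this (the echoed filenames come from tee writing to stderr, as explained below):

$ chmod +x remove_duplicates.sh
$ ./remove_duplicates.sh
Removing..
test_copy1
test_copy2
Removed duplicate files successfully.
$ ls
duplicate_files  duplicate_sample  other  remove_duplicates.sh  test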

Decoding the script

The preceding commands will find the copies of the same file in a directory and remove all except one copy of the file. Let us go through the code and see how it works.

ls -lS lists the details of the files in the current directory, sorted by file size. --time-style=long-iso tells ls to print dates in the ISO format. awk reads the output of ls -lS and performs comparisons on the columns and rows of the input text to find the duplicate files. Here is a good article to learn more about how the Linux file system works.

The logic behind the code is as follows:

  • We list the files sorted by size so that files of the same size are grouped together. Identifying files with the same size is the first step in finding files that are identical. Next, we calculate the checksums of those files. If the checksums match, the files are duplicates, and all but one copy of each set of duplicates is removed.
  • The BEGIN{} block of awk is executed first, before any lines are read from the input. Lines are read and processed in the {} block, and after all of them have been processed, the END{} block statements are executed. The output of ls -lS is:
total 16
-rw-r--r-- 1 slynux slynux 6 2020-06-29 11:50 test
-rw-r--r-- 1 slynux slynux 6 2020-06-29 11:50 test_copy1
-rw-r--r-- 1 slynux slynux 6 2020-06-29 11:50 test_copy2
-rw-r--r-- 1 slynux slynux 5 2020-06-29 11:50 other

The first line of the output reports the total number of disk blocks used, which is of no use to us, so we read it with getline and discard it. We need to compare the size on each line with the size on the following line. For that, we read the first file entry explicitly using another getline and store its name and size (the eighth and fifth columns). Hence, a line is read ahead using getline. Now, when awk enters the {} block (in which the rest of the lines are read), that block is executed once for every line read.
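The column positions can be checked in isolation with a stripped-down one-liner (a toy example, not part of the script):

$ ls -lS --time-style=long-iso | awk 'BEGIN { getline } { print $5, $8 }'
# The single getline in BEGIN discards the "total" line; the main block then
# prints the size (column 5) and filename (column 8) of every remaining entry.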

It compares the size obtained from the current line and the previously stored size kept in the size variable. If they are equal, it means two files are duplicates by size. Hence, they are to be further checked by md5sum.

We have played some tricks on the way to the solution. The external command output can be read inside awk as:

"cmd"| getline

Then, we receive the output in $0, and each column of the output can be accessed through $1, $2, …, $n. Here, we read the md5sum checksums of the files into the csum1 and csum2 variables. The variables name1 and name2 store consecutive filenames. If the checksums of two files are the same, they are confirmed to be duplicates and are printed.
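As a standalone illustration of this mechanism (assuming the test file from step 1 is in the current directory):

$ awk 'BEGIN { "md5sum test" | getline; print "checksum:", $1, "file:", $2 }'
# $0 now holds the single line that md5sum printed, so $1 is the hash
# and $2 is the filename.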

We need to keep one file from each group of duplicates so that we can remove all the others. We calculate the md5sum values of the duplicates and print one file from each group by finding unique lines, comparing only the md5sum portion of each line using -w 32 (the first 32 characters of the md5sum output; the md5sum output consists of a 32-character hash followed by the filename). In this way, one sample from each group of duplicates is written to duplicate_sample.

Now, we need to remove all the files listed in duplicate_files, excluding the files listed in duplicate_sample. The comm command prints the files that appear in duplicate_files but not in duplicate_sample.
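With the three test copies from step 1, the intermediate files end up looking roughly like this (which file is kept as the sample simply depends on sort order):

$ cat duplicate_files
test
test_copy1
test_copy2
$ cat duplicate_sample
test
$ comm -2 -3 duplicate_files duplicate_sample
test_copy1
test_copy2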

comm only accepts sorted input. Therefore, sort -u is used as a filter before the results are redirected to duplicate_files and duplicate_sample.

Here the tee command is used to perform a trick: it passes the filenames on to the rm command while also printing them. tee writes the lines it receives on stdin to a file and sends them to stdout at the same time. By making that file /dev/stderr, we also get the text printed on the terminal.

/dev/stderr is the device file corresponding to stderr (standard error). By redirecting to it, the text that arrives on stdin is also printed on the terminal as standard error.
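A minimal demonstration of the same trick, independent of the duplicate-file workflow:

$ echo "hello" | tee /dev/stderr | wc -c
hello
6
# "hello" reaches the terminal via stderr, while wc -c receives the same
# line on stdout and counts its 6 bytes (including the trailing newline).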

Summary

In this tutorial, we learned how to use a shell script to locate and remove duplicate or redundant files on a Linux system. As a system administrator, you can expand on this by creating a cron job so that the script runs routinely on your machine. You can also add a file-extension filter so that only certain files are affected and you don't accidentally remove system files. Lastly, better safe than sorry: as always, make a backup before running and testing your scripts.
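For instance, a crontab entry along these lines (the paths are placeholders you would adapt) runs the script every night at 2 a.m. against a specific directory, since the script operates on its current working directory:

# Add via: crontab -e
# m  h  dom mon dow  command
0 2 * * * cd /path/to/photos && /path/to/remove_duplicates.sh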

 
