How To Delete Duplicate E-mails (Particularly During IMAP Move)

This section contains user-submitted tutorials.
Post Reply
matthew1471
New user
New user
Posts: 17
Joined: 2010-06-20 21:59

How To Delete Duplicate E-mails (Particularly During IMAP Move)

Post by matthew1471 » 2019-03-07 16:01

So I moved a load of e-mails into a folder. I hadn't appreciated an IMAP Move is basically a copy and then a delete. Thunderbird I did not realise times-out quite easily (especially on large move operations). This meant I had a LOT of duplicate e-mails.

Here is a C# program to fix that (provided you are using MySQL as the database):

Code: Select all

using MySql.Data.MySqlClient;
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

namespace hMailServer_Deduplicate
{
    class Program
    {
        static void Main(string[] args)
        {
            // This code is not designed to win any awards.
            Console.WriteLine("Enter Database Password");
            string databasePassword = Console.ReadLine();

            // Get a connection to the database.
            using (MySqlConnection connection = new MySqlConnection("server=DATABASE;user=mail;database=hmailserver;port=3306;password=" + databasePassword))
            using (MySqlConnection connection2 = new MySqlConnection("server=DATABASE;user=mail;database=hmailserver;port=3306;password=" + databasePassword))
            {
                // Perform the connections to the database.
                connection.Open();
                connection2.Open();

                // We store a list of the duplicated messages.
                List<uint[]> duplicateMessageIDsList = new List<uint[]>();

                // Get a command to run against the database.
                using (MySqlCommand command = new MySqlCommand("SELECT GROUP_CONCAT(messageid ORDER BY messageid ASC) FROM hm_messages WHERE (messageflags & 2 = 0) GROUP BY messageaccountid, messagetype, messagefrom, messagesize, messagecurnooftries, messagenexttrytime, messagecreatetime, messagelocked HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC;", connection))
                {
                    // Execute the command against the database to load candidate messageIDs.
                    using (MySqlDataReader reader = command.ExecuteReader())
                    {
                        // While there are rows to iterate through.
                        while (reader.Read())
                        {
                            // The split function returns a string.
                            string[] temporarySplitNumbers = reader[0].ToString().Split(',');

                            // We want to store the numbers in a uint array.
                            uint[] duplicateMessageIDArray = new uint[temporarySplitNumbers.Length];

                            // Parse each of the uints in the string (which is also another form of validation)
                            for (uint count = 0; count < temporarySplitNumbers.Length; count++)
                            {
                                duplicateMessageIDArray[count] = uint.Parse(temporarySplitNumbers[count]);
                            }

                            // Add this to our List of uints to handle.
                            duplicateMessageIDsList.Add(duplicateMessageIDArray);
                        }
                    }
                }

                uint deleteCount = 0;

                // Now we can handle each of the elements in our list.
                foreach (uint[] duplicateMessageIDs in duplicateMessageIDsList)
                {
                    const string dataPath = @"D:\Data\";

                    string sqlStatement = "SELECT messageid, CONCAT('" + MySqlHelper.EscapeString(dataPath) + "', domainname, '\\\\', LEFT(accountaddress, INSTR(accountaddress, '@')-1),'\\\\', SUBSTRING(messagefilename,2,2), '\\\\', messagefilename) AS filepath ";
                    sqlStatement += "FROM hm_messages ";
                    sqlStatement += "LEFT JOIN hm_accounts ON hm_messages.messageaccountid = hm_accounts.accountid ";
                    sqlStatement += "LEFT JOIN hm_domains ON hm_accounts.accountdomainid = hm_domains.domainid ";
                    sqlStatement += "WHERE messageflags & 2 = False AND messageid IN (" + String.Join(",", duplicateMessageIDs) + ") ";
                    sqlStatement += "ORDER BY messageid;";

                    // Get a command to run against the database.
                    using (MySqlCommand command = new MySqlCommand(sqlStatement, connection))
                    {
                        // Execute the command against the database to load metadata for our perceived duplicate messages.
                        using (MySqlDataReader reader = command.ExecuteReader())
                        {
                            // The first messageID is our original message we would like to preserve.
                            reader.Read();
                            uint originalMessageID = reader.GetUInt32(0);
                            string originalMessageFilename = reader.GetString(1);
                            string originalMessageHash = GetEmailContentHash(originalMessageFilename);

                            // Will skip the first message (as already read above) as that is our original.
                            while (reader.Read())
                            {
                                // Details for this instances.
                                uint newMessageID = reader.GetUInt32(0);
                                string newMessageFilename = reader.GetString(1);
                                string newMessageHash = GetEmailContentHash(newMessageFilename);

                                // Are the files even cryptographically the same?
                                if (newMessageHash.Equals(originalMessageHash))
                                {
                                    Console.WriteLine("Same : \"" + newMessageFilename + "\" (" + newMessageID + ", " + newMessageHash + ") and \"" + originalMessageFilename + "\" (" + originalMessageID + ", " + originalMessageHash + ").");

                                    // Delete the duplicate.
                                    using (MySqlCommand deleteCommand = new MySqlCommand("DELETE FROM hm_messages WHERE messageid = " + newMessageID, connection2))
                                    {
                                        deleteCommand.ExecuteNonQuery();
                                        File.Delete(newMessageFilename);
                                    }

                                    deleteCount++;
                                }
                                else
                                {
                                    Console.WriteLine("Different : \"" + newMessageFilename + "\" (" + newMessageID + ", " + newMessageHash + ") and \"" + originalMessageFilename + "\" (" + originalMessageID + ", " + originalMessageHash + ").");
                                }
                            }
                        }
                    }
                }

                // Output stats.
                Console.WriteLine("Deleted " + deleteCount + " duplicates of " + duplicateMessageIDsList.Count + " messages.");

            }
        }

        static string GetEmailContentHash(string messageFilename)
        {
            using (FileStream stream = File.OpenRead(messageFilename))
            {
                SHA256Managed sha = new SHA256Managed();
                byte[] checksum = sha.ComputeHash(stream);
                return BitConverter.ToString(checksum).Replace("-", String.Empty);
            }
        }
    }
}
The program opens 2 database connections.

It loads a list of messageids where the e-mail has not been deleted but the messageaccountid, message type, sender, file size, number of retries, the next try time, message creation time and lock status is the same and there's more than 1 that's the same (i.e. something to compare against!).

It then asks the database for the file path for each candidate duplicate ID and goes through the Data directory checking both files have exactly the same file contents (rather than just the size being identical) using SHA256 hashing. If it's finally convinced these are true duplicates it will delete it from the database followed by deleting the raw file on disk.

Finally it outputs a count of how many it has deleted.

This program is not designed to delete messages where the sending server sent it multiple times (the hMailServer mail header will be for a different time), just IMAP Move mess-ups :).

You will want to change the hard-coded paths, usernames, hostnames etc. but then again you were going to read the code before just blindly running it anyway right :)?

Post Reply