To check whether duplicate files exist across multiple storage directories, you can compare MD5 checksums with the following steps:
1. Extract file paths
- First, collect the paths of every file in your directory tree. You can use the find command to list them recursively:

  find /traixxxnent/zpxxxxx -type f > file_list.txt
  find /yfxxxmanent/zpxxxx -type f >> file_list.txt
  # Repeat the command for every directory in the table, appending to the same file_list.txt
2. Compute the MD5 checksum of each file
- Use the following script to compute the MD5 checksum of every file in the list:

  while read -r filepath; do
      md5sum "$filepath" >> md5_checksums.txt
  done < file_list.txt
- The resulting md5_checksums.txt file contains the MD5 checksum and path of each file.
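For reference, md5sum writes the 32-character hash first and the path second, which is why the next step extracts the first column. A minimal illustration of one line of output (the hash shown is simply the MD5 of empty input, and the file name example.dat is hypothetical):

  d41d8cd98f00b204e9800998ecf8427e  /traixxxnent/zpxxxxx/example.dat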
3. Find duplicate files
- Use the following command to find MD5 values that occur more than once (i.e. duplicates):

  awk '{print $1}' md5_checksums.txt | sort | uniq -d > duplicate_md5.txt
- Use the following command to list the paths of the duplicated files:

  grep -Ff duplicate_md5.txt md5_checksums.txt > duplicate_files.txt
- duplicate_files.txt will then list the paths of all duplicate files.
4. Output the results
- If you need to report the duplicate files or paths, format the results to suit your needs, for example as sketched below.
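As one possible formatting sketch (assuming duplicate_files.txt from step 3, where each line has the form "<md5>  <path>"), you can group copies of the same file together, or count how many paths share each checksum:

  # Sort by hash so that copies of the same file appear on adjacent lines
  sort -k1,1 duplicate_files.txt

  # Or count how many paths share each MD5 value
  awk '{print $1}' duplicate_files.txt | sort | uniq -c | sort -rn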
The whole procedure can also be done in a single Python script:
[root@rg2-bgw-prometheus001 mmwei3]# cat test_file_md5_compare_v2.py
import hashlib
import os

# List of directories
directories = [
    "/train33/asrmlg/permanent/zpxie2",
    "/yfw-b3-mix01/asrmlg/permanent/zpxie2",
    # Add more directories here
]

# Map each MD5 value to the file paths that share it
md5_dict = {}

# Compute the MD5 value of a file, reading it in chunks
def calculate_md5(file_path):
    hasher = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Walk the directories and hash every file
for directory in directories:
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            file_md5 = calculate_md5(file_path)
            if file_md5 in md5_dict:
                md5_dict[file_md5].append(file_path)
            else:
                md5_dict[file_md5] = [file_path]

# Print duplicate files
print("Duplicate file paths:")
for md5, paths in md5_dict.items():
    if len(paths) > 1:
        print(f"MD5: {md5}")
        for path in paths:
            print(f"  {path}")
5. Of course, for a huge number of small files we can switch comparison strategies: for example, first group files by size and skip anything whose size is unique, since files of different sizes cannot be duplicates.
[root@rg2-bgw-prometheus001 mmwei3]# cat test_file_md5_compare.py
import hashlib
import os
from collections import defaultdict

# List of directories to check
directories = [
    "/train33/asrmlg/permanent/zpxie2",
    "/yfw-b3-mix01/asrmlg/permanent/zpxie2",
    # Add more directories as needed
]

size_dict = defaultdict(list)
md5_dict = {}

# Group files by size
for directory in directories:
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(file_path)
                size_dict[file_size].append(file_path)
            except OSError:
                continue  # Skip files that cannot be accessed

# Compute the MD5 hash for files with the same size
def calculate_md5(file_path):
    hasher = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

for size, files in size_dict.items():
    if len(files) > 1:  # Only calculate MD5 for files with the same size
        for file_path in files:
            file_md5 = calculate_md5(file_path)
            if file_md5 in md5_dict:
                md5_dict[file_md5].append(file_path)
            else:
                md5_dict[file_md5] = [file_path]

# Print duplicate files
print("Duplicate file paths:")
for md5, paths in md5_dict.items():
    if len(paths) > 1:
        print(f"MD5: {md5}")
        for path in paths:
            print(f"  {path}")
6. Alternatively, you can use concurrent.futures to compute the MD5 hashes with a pool of worker threads.
[root@rg2-bgw-prometheus001 mmwei3]# cat test_file_md5_compare_v3.py
from concurrent.futures import ThreadPoolExecutor
import hashlib
import os
from collections import defaultdict

# List of directories to check
directories = [
    "/train33/asrmlg/permanent/zpxie2",
    "/yfw-b3-mix01/asrmlg/permanent/zpxie2",
    # Add more directories as needed
]

size_dict = defaultdict(list)
md5_dict = {}

# Group files by size
for directory in directories:
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(file_path)
                size_dict[file_size].append(file_path)
            except OSError:
                continue  # Skip files that cannot be accessed

# Function to calculate the MD5 hash of a file
def calculate_md5(file_path):
    hasher = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return file_path, hasher.hexdigest()

# Use multithreading to compute MD5 hashes
with ThreadPoolExecutor(max_workers=8) as executor:
    for size, files in size_dict.items():
        if len(files) > 1:  # Only process files with the same size
            futures = {executor.submit(calculate_md5, file): file for file in files}
            for future in futures:
                file_path, file_md5 = future.result()
                if file_md5 in md5_dict:
                    md5_dict[file_md5].append(file_path)
                else:
                    md5_dict[file_md5] = [file_path]

# Print duplicate files
print("Duplicate file paths:")
for md5, paths in md5_dict.items():
    if len(paths) > 1:
        print(f"MD5: {md5}")
        for path in paths:
            print(f"  {path}")
7. You can also refer to fdupes:
https://github.com/adrianlopezroche/fdupes.git
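A typical invocation might look like the following sketch (assuming fdupes is installed; the directories are the same example paths used in the scripts above):

  # -r recurses into subdirectories; add -m for a summary or -d to interactively delete duplicates
  fdupes -r /train33/asrmlg/permanent/zpxie2 /yfw-b3-mix01/asrmlg/permanent/zpxie2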