[Solution] Consistency auditing of large files (with a custom parallel MD5 computation)
2023-01-19
thirsd
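In day-to-day work we often need to check whether large amounts of data are consistent between two places.
For large data volumes, there are two kinds of scenario:
Scenario 1: comparing a large number of files under a directory;
Scenario 2: comparing a single very large file.
I. Scenario 1: hashing many files in parallel
1.1 Use xargs to compute the hashes
When there are many files, xargs can run the hashing on several cores at once: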
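find ./ -type f ! -name output.md5 -print0 | xargs -0 -P 8 -I{} sh -c 'printf "%s  %s\n" "$(head -c 1M "$1" | md5sum | cut -d" " -f1)" "$1"' _ {} > output.md5
Here -P 8 runs up to 8 jobs at a time (without -P, xargs runs them one after another), and each file produces exactly one "hash  path" line in output.md5. Note that head -c 1M hashes only the first 1 MiB of each file, which makes this a quick check; drop it to hash whole files.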
Alternatively, see mmd5sum, a Python implementation that supports directories and resuming an interrupted run.
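The same idea can be sketched in Python. This is a minimal illustration, not the mmd5sum implementation: the function names md5_of_file, md5_manifest and diff_manifests are made up for this example, and it hashes whole files rather than only the first 1 MiB.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_manifest(root, workers=8):
    """Map each file's path (relative to root) to its MD5."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Files are hashed concurrently; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(md5_of_file, paths))
    return {os.path.relpath(p, root): d for p, d in zip(paths, digests)}


def diff_manifests(left, right):
    """Return the sorted relative paths that are missing or differ."""
    keys = set(left) | set(right)
    return sorted(k for k in keys if left.get(k) != right.get(k))


if __name__ == "__main__":
    # Demo on two throwaway directories; b has one extra file.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        for root in (a, b):
            with open(os.path.join(root, "x.bin"), "wb") as fobj:
                fobj.write(b"same")
        with open(os.path.join(b, "extra.bin"), "wb") as fobj:
            fobj.write(b"only in b")
        print(diff_manifests(md5_manifest(a), md5_manifest(b)))
```

Building one manifest per side and diffing the two dictionaries reports missing files as well as changed ones, which a plain list of hashes does not make obvious.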
1.2 Use rsync's real-time synchronization to keep the two sides consistent
https://www.geeksforgeeks.org/rsync-command-in-linux-with-examples/
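As a rough sketch of how rsync can also be used to check consistency rather than only to copy, the snippet below builds a report-only rsync invocation. The helper names rsync_check_command and changed_items are hypothetical, and the choice of options (-a, --checksum, --dry-run, --itemize-changes) is an assumption about what suits this use case, not something taken from the linked article.

```python
import shutil
import subprocess


def rsync_check_command(src_dir, dst_dir):
    """Build an rsync command that reports differences without copying."""
    return [
        "rsync",
        "-a",                 # archive mode: recurse and preserve metadata
        "--checksum",         # decide by file checksum, not size and mtime
        "--dry-run",          # report only; transfer nothing
        "--itemize-changes",  # one output line per item rsync would update
        src_dir.rstrip("/") + "/",  # trailing slash: compare the contents
        dst_dir.rstrip("/") + "/",
    ]


def changed_items(src_dir, dst_dir):
    """Run the check and return rsync's itemized lines."""
    if shutil.which("rsync") is None:
        raise RuntimeError("rsync is not installed")
    result = subprocess.run(
        rsync_check_command(src_dir, dst_dir),
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder paths; replace with real directories before running.
    print(" ".join(rsync_check_command("/data/source", "/data/mirror")))
```

Because -a also compares metadata, the itemized output can list entries whose contents match but whose timestamps or permissions differ.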
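II. Scenario 2: a single very large file
For example, for a single 80G file, running md5sum directly takes several minutes.
2.1 Comparing md5 with crc32 shows that md5 performs better.
On the same 2.3G file, crc32 (cksum) took about 10 seconds, while md5 (md5sum) took only about 5.5 seconds.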
-rw-r--r-- 1 trade gapp 2.3G Nov 4 15:31 option.con
-rw-r--r-- 1 trade gapp 52G Nov 4 16:45 mdbpub.con
riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time cksum option.con
932951796 2443715342 option.con
real 0m9.904s
user 0m8.768s
sys 0m1.130s
riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time md5sum option.con
69898568c0ad34218844747ba73814ee option.con
real 0m5.513s
user 0m5.001s
sys 0m0.511s
On the same 52G file, crc32 (cksum) took about 3 minutes 30 seconds, while md5 (md5sum) took only about 2 minutes 4 seconds.
riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time cksum mdbpub.con
278810009 55811165410 mdbpub.con
real 3m32.652s
user 3m15.799s
sys 0m16.383s
riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time md5sum mdbpub.con
b7af233dc22a80437cf8c35440a76187 mdbpub.con
real 2m4.491s
user 1m53.574s
sys 0m10.852s
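Conclusion: md5sum was faster than cksum on both the small and the large file, so MD5 is the recommended hash for this kind of check regardless of file size.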
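To repeat this kind of comparison on another machine, a small timing harness along the following lines may help. It is only a sketch: it uses Python's hashlib.md5 and zlib.crc32 on an in-memory buffer, and zlib.crc32 is a different implementation (and a different CRC variant) from cksum, so its numbers need not match the measurements above. The function names and the buffer size are arbitrary choices for the example.

```python
import hashlib
import os
import time
import zlib


def time_md5(data):
    """Return (elapsed seconds, hex digest) for one MD5 pass over data."""
    start = time.perf_counter()
    digest = hashlib.md5(data).hexdigest()
    return time.perf_counter() - start, digest


def time_crc32(data):
    """Return (elapsed seconds, checksum) for one zlib CRC-32 pass over data."""
    start = time.perf_counter()
    checksum = zlib.crc32(data) & 0xFFFFFFFF
    return time.perf_counter() - start, checksum


if __name__ == "__main__":
    payload = os.urandom(16 * 1024 * 1024)  # 16 MiB of random bytes
    md5_seconds, _ = time_md5(payload)
    crc_seconds, _ = time_crc32(payload)
    print("md5:   %.3fs" % md5_seconds)
    print("crc32: %.3fs" % crc_seconds)
```

Timing an in-memory buffer isolates the hash cost from disk reads; the measurements above with time include both.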
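2.2 Parallel MD5 computation (custom version)
The idea: split the large file into several parts, compute the MD5 of each part independently, then combine the per-part MD5 values into a single value that is used to compare files.
Note that the result is not compatible with the file's standard md5 value (for example, the output of md5sum), and it changes with the part size, so both sides of a comparison must use the same part size.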
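The incompatibility is easy to see on a small in-memory example. The sketch below assumes the combining rule is "MD5 over the concatenated hex digests of the parts, in order"; combined_md5 is a name made up for this illustration.

```python
import hashlib


def combined_md5(data, part_size):
    """MD5 over the concatenated hex MD5s of consecutive parts of data."""
    outer = hashlib.md5()
    for offset in range(0, len(data), part_size):
        part = data[offset:offset + part_size]
        outer.update(hashlib.md5(part).hexdigest().encode())
    return outer.hexdigest()


if __name__ == "__main__":
    data = bytes(range(256)) * 64  # 16 KiB of sample bytes
    print("plain md5  :", hashlib.md5(data).hexdigest())
    print("4 KiB parts:", combined_md5(data, 4096))
    print("8 KiB parts:", combined_md5(data, 8192))
```

All three printed values differ, which is why the part size has to be treated as part of the checksum definition.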
A Python demo implementation:
    import hashlib
    import math
    import os
    from concurrent.futures import ThreadPoolExecutor

    default_size = 4096 * 16  # read buffer size: 64 KiB


    def get_file_md5(file_path, offset=0, length=-1, idx=0):
        # executor.map passes each work item as a single tuple; unpack it here.
        if isinstance(file_path, tuple):
            file_path, offset, length, idx = file_path
        m = hashlib.md5()
        with open(file_path, 'rb') as fobj:
            if length == -1:
                # Hash the whole file.
                while True:
                    data = fobj.read(default_size)
                    if not data:
                        break
                    m.update(data)
            else:
                # Hash only the byte range [offset, offset + length).
                fobj.seek(offset)
                remaining = length
                while remaining > 0:
                    # Never read past the end of this part.
                    data = fobj.read(min(default_size, remaining))
                    if not data:
                        break
                    m.update(data)
                    remaining -= len(data)
        return m.hexdigest(), idx


    def get_file_md5_parallel(file_path, parallel=8):
        split_maxlen = 1024 * 1024 * 1024  # 1G per part
        file_size = os.path.getsize(file_path)
        split_cnt = int(math.ceil(float(file_size) / split_maxlen))
        work_list = [(file_path, i * split_maxlen, split_maxlen, i) for i in range(split_cnt)]
        # Keep at least one worker, even for an empty file (split_cnt == 0).
        parallel = max(1, min(parallel, split_cnt))
        with ThreadPoolExecutor(max_workers=parallel, thread_name_prefix='md5sum_') as executor:
            work_results = list(executor.map(get_file_md5, work_list, timeout=300))
        # Combine the per-part digests, in part order, into one final digest.
        all_md5 = hashlib.md5()
        for part_md5, _ in sorted(work_results, key=lambda y: y[1]):
            all_md5.update(part_md5.encode())
        return all_md5.hexdigest()


    if __name__ == "__main__":
        # print(get_file_md5("./option.con"))
        print("md5: %s" % get_file_md5_parallel("./option.con"))
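To make the splitting concrete, the following sketch lists the byte ranges that a 1G part size produces for option.con, using the size in bytes that cksum reported for it earlier (2443715342). The helper plan_parts is hypothetical; unlike the demo, which hands every worker the nominal part length and relies on read() stopping at end of file, it trims the last part explicitly.

```python
import math


def plan_parts(file_size, part_size):
    """Return (offset, length) for each part; the last part may be shorter."""
    count = int(math.ceil(float(file_size) / part_size))
    return [
        (i * part_size, min(part_size, file_size - i * part_size))
        for i in range(count)
    ]


if __name__ == "__main__":
    # 2443715342 is the byte size cksum reported for option.con.
    for offset, length in plan_parts(2443715342, 1024 ** 3):
        print(offset, length)
    # Prints:
    # 0 1073741824
    # 1073741824 1073741824
    # 2147483648 296231694
```

With only three parts, at most three of the eight workers have anything to do for this file, so the benefit of the parallel version grows with file size.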
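Performance test:
Tested on a 65G compressed file, the parallel version is roughly 8 times faster.
With the stock md5sum, three runs each took between 2 minutes 48 seconds and 2 minutes 51 seconds.
With chk_md5.py, a run took about 21 seconds with 1G parts, about 22.5 seconds with 2G parts, and about 21 seconds with 512M parts.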
config3@ZJ/thirsd/backup/database_dump/bizdb$ll -lh /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
-rw-r--r-- 1 oracle dba 65G Dec 24 04:16 /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
# Three md5sum runs, each taking between 2m48s and 2m51s
config3@ZJ/thirsd/backup/database_dump/bizdb$time md5sum /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
7eff9098d0fdc7d1c2db5bcb6ad7eb6c /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
real 2m48.090s
user 2m22.975s
sys 0m24.903s
config3@ZJ/thirsd/backup/database_dump/bizdb$time md5sum /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
7eff9098d0fdc7d1c2db5bcb6ad7eb6c /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
real 2m48.860s
user 2m22.845s
sys 0m25.802s
config3@ZJ/thirsd/backup/database_dump/bizdb$time md5sum /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
7eff9098d0fdc7d1c2db5bcb6ad7eb6c /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar
real 2m50.899s
user 2m22.333s
sys 0m28.335s
## Timings of the chk_md5.py script with part sizes of 1G, 2G and 512M
## 1G parts
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7
real 0m20.954s
user 2m6.848s
sys 0m34.007s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7
real 0m20.931s
user 2m6.683s
sys 0m33.405s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7
real 0m20.962s
user 2m7.134s
sys 0m32.792s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7
real 0m20.770s
user 2m7.199s
sys 0m32.560s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7
real 0m20.938s
user 2m7.211s
sys 0m33.177s
config3@ZJ/thirsd/trade/tmp/20221229$
## 2G parts
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: dbf562994612b346fb48a2b36ebb3e02
real 0m22.544s
user 2m7.215s
sys 0m33.665s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: dbf562994612b346fb48a2b36ebb3e02
real 0m22.584s
user 2m6.656s
sys 0m33.762s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: dbf562994612b346fb48a2b36ebb3e02
real 0m22.468s
user 2m7.232s
sys 0m33.300s
config3@ZJ/thirsd/trade/tmp/20221229$
## 512M parts
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d
real 0m21.050s
user 2m7.214s
sys 0m33.853s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d
real 0m21.044s
user 2m7.175s
sys 0m33.585s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d
real 0m20.958s
user 2m7.165s
sys 0m33.674s
config3@ZJ/thirsd/trade/tmp/20221229$
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d
real 0m20.995s
user 2m7.041s
sys 0m33.679s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d
real 0m21.040s
user 2m7.359s
sys 0m33.478s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d
real 0m21.106s
user 2m7.475s
sys 0m33.738s
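The speedup can be worked out directly from the logs. The short calculation below takes the first md5sum run and the first 1G-part run as input; to_seconds is a helper written for this example, and "average busy cores" is simply (user + sys) / real, used here as a rough indicator.

```python
import re


def to_seconds(stamp):
    """Convert a `time` value such as '2m48.090s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError("unexpected time format: %r" % stamp)
    return int(match.group(1)) * 60 + float(match.group(2))


# First md5sum run and first 1G-part run, copied from the logs.
serial = {"real": "2m48.090s", "user": "2m22.975s", "sys": "0m24.903s"}
parallel = {"real": "0m20.954s", "user": "2m6.848s", "sys": "0m34.007s"}

speedup = to_seconds(serial["real"]) / to_seconds(parallel["real"])
busy_cores = (
    to_seconds(parallel["user"]) + to_seconds(parallel["sys"])
) / to_seconds(parallel["real"])

print("wall-clock speedup: %.2fx" % speedup)
print("average busy cores: %.2f" % busy_cores)
```

The result is a speedup of about 8.0x with roughly 7.7 cores busy on average, consistent with the default of 8 workers. Total CPU time (user + sys) stays close to that of the single md5sum run, so the gain comes from spreading the same work across cores rather than from doing less of it.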