[Solution] Consistency auditing of large files (with custom parallel MD5 computation)

2023-01-19  thirsd

In day-to-day work we often need to check large volumes of data for consistency.

Large-volume data falls into two scenarios:
Scenario 1: comparing a large number of files under a directory;
Scenario 2: checking the consistency of a single very large file.

1. Scenario 1: hashing many files in parallel

1.1 Using xargs

For a large number of files, xargs can fan the hashing out across multiple cores (note the -P flag; without it xargs runs its jobs sequentially):

find ./ -type f -print0 | xargs -0 -P 8 -I{} sh -c 'echo "{}  $(head -c 1M "{}" | md5sum)"' >> output.md5

Note that head -c 1M hashes only the first 1 MiB of each file as a fast sample; drop the head stage to hash full contents.

Alternatively, see mmd5sum, a Python implementation that supports directories and resuming an interrupted run.
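
As a rough illustration of what such a tool does (this sketch is mine, not mmd5sum itself; the target directory and worker count are assumptions), each file can be hashed in its own worker process:

import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def md5_of_file(path, block_size=64 * 1024):
    # Stream one file through MD5 in fixed-size blocks.
    digest = hashlib.md5()
    with open(path, 'rb') as fobj:
        for block in iter(lambda: fobj.read(block_size), b''):
            digest.update(block)
    return path, digest.hexdigest()

def md5_of_tree(root, workers=8):
    # Hash every regular file under root, one file per worker process.
    paths = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(root)
             for name in names]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for path, digest in pool.map(md5_of_file, paths):
            print("%s  %s" % (digest, path))

if __name__ == "__main__":
    md5_of_tree("./")  # directory is an assumption; point it at your data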

1.2 Using rsync synchronization as a guarantee

https://www.geeksforgeeks.org/rsync-command-in-linux-with-examples/
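
For example, a dry run in checksum mode lists any files whose contents differ (paths and host below are placeholders):

# -c compares full file contents by checksum instead of size+mtime;
# -n (dry run) plus -i (itemize changes) reports differences without copying.
rsync -avcni /path/to/source/ user@remote:/path/to/target/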

2. Scenario 2: a single very large file

For example, running md5sum directly over an 80 GB file takes several minutes.

2.1 Comparing md5 and crc32: md5 turns out faster

For the same 2.3 GB file, crc32 (cksum) takes about 10 seconds while md5 (md5sum) takes only about 5 seconds:

-rw-r--r-- 1 trade gapp 2.3G Nov  4 15:31 option.con
-rw-r--r-- 1 trade gapp  52G Nov  4 16:45 mdbpub.con

riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time cksum option.con
932951796 2443715342 option.con

real    0m9.904s
user    0m8.768s
sys 0m1.130s
riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time md5sum option.con
69898568c0ad34218844747ba73814ee  option.con

real    0m5.513s
user    0m5.001s
sys 0m0.511s

For the same 52 GB file, crc32 (cksum) takes about 3 minutes 30 seconds while md5 (md5sum) takes only about 2 minutes 4 seconds:

riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time cksum mdbpub.con
278810009 55811165410 mdbpub.con

real    3m32.652s
user    3m15.799s
sys 0m16.383s
riskpubsvr1@ZJ/thirsd/trade/backup/gwpubserver.20221202_192208/flow$time md5sum mdbpub.con
b7af233dc22a80437cf8c35440a76187  mdbpub.con

real    2m4.491s
user    1m53.574s
sys 0m10.852s

Conclusion: for files large and small alike, md5 is the recommended hash.

2.2 Parallel MD5 computation (custom scheme)

The idea: split the large file into fixed-size parts, compute an MD5 for each part independently, then hash the concatenated part digests to produce one combined value for comparison.
Note that this combined value is NOT compatible with a plain md5sum of the whole file, and it also depends on the part size, so both sides of a comparison must use the same split size (the benchmark below yields a different digest for each of the 1 GiB, 2 GiB, and 512 MiB splits).
A Python demo implementation:

import hashlib
import math
import os
from concurrent.futures import ThreadPoolExecutor

default_size = 4096 * 16  # 64 KiB read buffer


def get_file_md5(file_path, offset=0, length=-1, idx=0):
    # executor.map passes each work item as a single tuple; unpack it.
    if isinstance(file_path, tuple):
        file_path, offset, length, idx = file_path

    m = hashlib.md5()
    with open(file_path, 'rb') as fobj:
        if length == -1:
            # Hash the whole file.
            while True:
                data = fobj.read(default_size)
                if not data:
                    break
                m.update(data)
        else:
            # Hash only the byte range [offset, offset + length).
            fobj.seek(offset)
            remaining = length
            while remaining > 0:
                # Never read past this part's boundary.
                data = fobj.read(min(default_size, remaining))
                if not data:
                    break  # EOF: the last part may be shorter than `length`
                m.update(data)
                remaining -= len(data)

    return m.hexdigest(), idx


def get_file_md5_parallel(file_path, parallel=8):
    split_maxlen = 1024 * 1024 * 1024  # 1 GiB per part
    split_cnt = int(math.ceil(float(os.path.getsize(file_path)) / split_maxlen))
    work_list = [(file_path, i * split_maxlen, split_maxlen, i)
                 for i in range(split_cnt)]

    # hashlib releases the GIL during large updates, so threads scale here.
    parallel = min(parallel, split_cnt)
    with ThreadPoolExecutor(max_workers=parallel, thread_name_prefix='md5sum_') as executor:
        work_results = list(executor.map(get_file_md5, work_list, timeout=300))
        # Combine the part digests in offset order into one final MD5.
        all_md5 = hashlib.md5()
        for part_md5, _ in sorted(work_results, key=lambda r: r[1]):
            all_md5.update(part_md5.encode())
        return all_md5.hexdigest()


if __name__ == "__main__":
    # print(get_file_md5("./option.con"))
    print("md5: %s" % get_file_md5_parallel("./option.con"))
    

Performance test results:
On a 65 GB compressed file, the speedup is more than 8x.
With stock md5sum, three runs all land between 2m48s and 2m50s;
with chk_md5.py, 1 GiB parts take about 21 seconds, 2 GiB parts about 23 seconds, and 512 MiB parts about 21 seconds. As expected, each split size yields a different combined digest.

config3@ZJ/thirsd/backup/database_dump/bizdb$ll -lh /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar 
-rw-r--r-- 1 oracle dba 65G Dec 24 04:16 /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar

# Stock md5sum, three runs, each between 2m48s and 2m50s
config3@ZJ/thirsd/backup/database_dump/bizdb$time md5sum /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar 
7eff9098d0fdc7d1c2db5bcb6ad7eb6c  /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar

real    2m48.090s
user    2m22.975s
sys 0m24.903s
config3@ZJ/thirsd/backup/database_dump/bizdb$time md5sum /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar 
7eff9098d0fdc7d1c2db5bcb6ad7eb6c  /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar

real    2m48.860s
user    2m22.845s
sys 0m25.802s
config3@ZJ/thirsd/backup/database_dump/bizdb$time md5sum /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar 
7eff9098d0fdc7d1c2db5bcb6ad7eb6c  /thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar

real    2m50.899s
user    2m22.333s
sys 0m28.335s

## Performance of chk_md5.py with 1 GiB, 2 GiB, and 512 MiB split sizes

## 1 GiB split
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7

real    0m20.954s
user    2m6.848s
sys 0m34.007s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7

real    0m20.931s
user    2m6.683s
sys 0m33.405s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7

real    0m20.962s
user    2m7.134s
sys 0m32.792s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7

real    0m20.770s
user    2m7.199s
sys 0m32.560s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: a316d0ee149f03c915591825ed72def7

real    0m20.938s
user    2m7.211s
sys 0m33.177s
config3@ZJ/thirsd/trade/tmp/20221229$


## 2 GiB split
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: dbf562994612b346fb48a2b36ebb3e02

real    0m22.544s
user    2m7.215s
sys 0m33.665s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: dbf562994612b346fb48a2b36ebb3e02

real    0m22.584s
user    2m6.656s
sys 0m33.762s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: dbf562994612b346fb48a2b36ebb3e02

real    0m22.468s
user    2m7.232s
sys 0m33.300s
config3@ZJ/thirsd/trade/tmp/20221229$

## 512 MiB split
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d

real    0m21.050s
user    2m7.214s
sys 0m33.853s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d

real    0m21.044s
user    2m7.175s
sys 0m33.585s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d

real    0m20.958s
user    2m7.165s
sys 0m33.674s
config3@ZJ/thirsd/trade/tmp/20221229$
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d

real    0m20.995s
user    2m7.041s
sys 0m33.679s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d

real    0m21.040s
user    2m7.359s
sys 0m33.478s
config3@ZJ/thirsd/trade/tmp/20221229$time python chk_md5.py 
/thirsd/backup/database_dump/bizdb/bizdb1_20221224_expdp.gz.tar md5: b887a4bfc2190146eb4484b40d8e749d

real    0m21.106s
user    2m7.475s
sys 0m33.738s
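
Where Python is unavailable, roughly the same split-and-hash scheme can be put together from coreutils alone. The sketch below is an illustration under stated assumptions (file name, 1 GiB parts, 8-way parallelism), and its combined value matches neither plain md5sum nor the Python script above:

FILE=./option.con
SIZE=$(stat -c %s "$FILE")
CHUNK=$((1024 * 1024 * 1024))            # 1 GiB per part
PARTS=$(( (SIZE + CHUNK - 1) / CHUNK ))  # ceil(SIZE / CHUNK)

# Hash each 1 GiB slice in parallel: slice i starts at i*1024 blocks of 1M.
seq 0 $((PARTS - 1)) | xargs -P 8 -I{} sh -c \
    "dd if='$FILE' bs=1M skip=\$(( {} * 1024 )) count=1024 2>/dev/null | md5sum | cut -d' ' -f1 > part_{}.md5"

# Concatenate the per-slice digests in order and hash them into one value.
for i in $(seq 0 $((PARTS - 1))); do cat "part_$i.md5"; done | md5sum
rm -f part_*.md5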

