Interview question: efficiently batch-processing a massive number of files in Python

Design approach

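File type identification: identify each file's type from its extension or, more reliably, from its content with the magic library.
Chunked processing: read large files in chunks to keep memory usage low. For a video file, reading the duration only requires the relevant part of the file (for example the container metadata), not every frame.
Checkpoint and resume: record the files that have already been processed and persist that progress to a state file. On startup, the program reads the state file first and continues from where the previous run stopped.
Parallel processing: use the multiprocessing library to process files in parallel and raise throughput. Pay attention to inter-process communication and resource management so that workers do not contend for the same resources.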
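
To make the chunked-processing idea concrete, here is a minimal sketch that counts words while holding only one fixed-size chunk in memory. The function name `count_words_chunked` and the 1 MiB default chunk size are assumptions chosen for illustration; they are not part of the original answer. The one subtle point is a word that straddles two chunks, which would otherwise be counted twice.

```python
import io


def count_words_chunked(stream, chunk_size=1 << 20):
    """Count whitespace-separated words in a text stream, one chunk at a time."""
    total = 0
    in_word = False  # True when the previous chunk ended in the middle of a word
    while True:
        chunk = stream.read(chunk_size)  # text mode: reads up to chunk_size characters
        if not chunk:
            break
        total += len(chunk.split())
        # A word split across two chunks was counted once in each half;
        # subtract one so it is counted as a single word.
        if in_word and not chunk[0].isspace():
            total -= 1
        in_word = not chunk[-1].isspace()
    return total


if __name__ == "__main__":
    sample = io.StringIO("hello world  foo\nbar baz")
    print(count_words_chunked(sample, chunk_size=4))
```

With a real file, the stream would be opened in text mode, for example `open(path, 'r', encoding='utf-8')`, so that a chunk boundary never falls inside a multi-byte character.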

Core code example

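import os
import time
import magic
import cv2
import moviepy.editor as mp  # moviepy 1.x import path; moviepy 2.x removed moviepy.editor
import multiprocessing
import json


def process_image(file_path):
    """Return the resolution and an approximate capture time for an image."""
    try:
        img = cv2.imread(file_path)
        if img is None:
            # cv2.imread signals an unreadable file by returning None, not by raising
            raise ValueError("cv2.imread could not decode the file")
        height, width = img.shape[:2]
        # Simplification: use the file's modification time as the capture time
        capture_time = time.ctime(os.path.getmtime(file_path))
        return {
            'file_path': file_path,
            'resolution': f'{width}x{height}',
            'capture_time': capture_time
        }
    except Exception as e:
        print(f"Error processing image {file_path}: {e}")
        return None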


def process_video(file_path):
    try:
        clip = mp.VideoFileClip(file_path)
        duration = clip.duration
        clip.close()
        return {
            'file_path': file_path,
            'duration': duration
        }
    except Exception as e:
        print(f"Error processing video {file_path}: {e}")
        return None


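def process_text(file_path):
    """Return the number of whitespace-separated words in a UTF-8 text file."""
    try:
        word_count = 0
        with open(file_path, 'r', encoding='utf-8') as f:
            # Iterate line by line so the whole file is never held in memory at once
            for line in f:
                word_count += len(line.split())
        return {
            'file_path': file_path,
            'word_count': word_count
        }
    except Exception as e:
        print(f"Error processing text {file_path}: {e}")
        return None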


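def process_file(file_path):
    """Dispatch a file to the matching handler based on its MIME type."""
    try:
        file_type = magic.from_file(file_path, mime=True)
    except Exception as e:
        # An exception escaping a worker would propagate to the parent and abort the batch
        print(f"Could not identify the type of {file_path}: {e}")
        return None
    if file_type.startswith('image'):
        return process_image(file_path)
    elif file_type.startswith('video'):
        return process_video(file_path)
    elif file_type.startswith('text'):
        return process_text(file_path)
    else:
        print(f"Unsupported file type: {file_path}")
        return None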


def save_progress(processed_files, progress_file='progress.json'):
    with open(progress_file, 'w') as f:
        json.dump(processed_files, f)


def load_progress(progress_file='progress.json'):
    if not os.path.exists(progress_file):
        return []
    with open(progress_file, 'r') as f:
        return json.load(f)


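def main():
    storage_dir = 'your_storage_directory'
    progress_file = 'progress.json'
    processed_files = load_progress(progress_file)
    done = set(processed_files)  # set lookup is O(1); scanning a list is O(n) per file
    all_files = [os.path.join(storage_dir, name) for name in os.listdir(storage_dir)
                 if os.path.isfile(os.path.join(storage_dir, name))]
    unprocessed_files = [path for path in all_files if path not in done]

    # imap_unordered yields each result as soon as it is ready, so progress can be
    # checkpointed during the run instead of only after every file has finished.
    with multiprocessing.Pool() as pool:
        for count, result in enumerate(pool.imap_unordered(process_file, unprocessed_files), 1):
            if result:
                processed_files.append(result['file_path'])
            if count % 100 == 0:  # checkpoint interval; rewriting the file per result is O(n^2)
                save_progress(processed_files, progress_file)
    save_progress(processed_files, progress_file)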


if __name__ == "__main__":
    main()
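
The checkpoint-and-resume idea can also be sketched with an append-only log instead of rewriting one JSON file. This is an alternative under stated assumptions, not the approach of the listing above: the helper names `load_done` and `mark_done` and the one-JSON-object-per-line layout are hypothetical. Appending one line per finished file keeps each checkpoint cheap, and a line truncated by a crash is simply skipped on the next start, so that file is processed again.

```python
import json
import os


def load_done(log_path):
    """Return the set of file paths recorded in a JSON-lines progress log."""
    done = set()
    if not os.path.exists(log_path):
        return done
    with open(log_path, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                done.add(json.loads(line)['file_path'])
            except (ValueError, KeyError, TypeError):
                # Blank, truncated, or malformed line: ignore it so the file is retried
                continue
    return done


def mark_done(log_file, file_path):
    """Append one record and flush it to the operating system immediately."""
    log_file.write(json.dumps({'file_path': file_path}) + '\n')
    log_file.flush()
```

In use, the parent process would open the log once in append mode (`open(log_path, 'a', encoding='utf-8')`) and call `mark_done` for each result it receives from the pool, so only one process ever writes to the log.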

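Notes:

The cv2 library (the opencv-python package) handles images, moviepy handles videos, and magic (the python-magic package) identifies file types. Make sure all three are installed.
In a real application, obtaining the capture time may require more sophisticated metadata parsing.
For very large files, moviepy may not be the most efficient way to obtain the duration; consider using the ffmpeg command-line tools instead. The code here is only a simple example.
Tune the parallel-processing part to the machine's actual resources, such as the number of CPU cores, to avoid exhausting them.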
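
As an illustration of the ffmpeg suggestion in the notes, the sketch below asks ffprobe (shipped with ffmpeg) for the container duration through a subprocess. It assumes ffprobe is installed and on the PATH; the function names and the 30-second timeout are hypothetical choices, and any failure is reported as None rather than raised.

```python
import subprocess


def ffprobe_command(file_path):
    """Build an ffprobe call that prints only the container duration in seconds."""
    return [
        'ffprobe', '-v', 'error',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        file_path,
    ]


def parse_duration(output):
    """Convert ffprobe's text output to float seconds, or None if it is not a number."""
    try:
        return float(output.strip())
    except ValueError:
        return None  # e.g. empty output or "N/A"


def probe_duration(file_path, timeout=30):
    """Return the duration in seconds, or None if ffprobe is missing, fails, or times out."""
    try:
        completed = subprocess.run(
            ffprobe_command(file_path),
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_duration(completed.stdout)
```

Keeping command construction and output parsing in separate functions makes both easy to check without a video file or an ffmpeg installation.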
