Interview question: optimizing Python regular expressions for large-scale URL data

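A text file contains one million URLs, one per line. Using Python regular expressions, extract every resource URL that ends in '.jpg' or '.png' in as little time as possible, and count how many times each one occurs. Design an efficient algorithm, implement it in code, and explain the optimization strategies you used.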

Programming language: Python

import re
from collections import Counter

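def extract_and_count_images(file_path):
    # Match only the suffix: a non-capturing group anchored at the end of the
    # string avoids the lazy '.*?' scan and the extra capture groups.
    pattern = re.compile(r'\.(?:jpg|png)$')
    search = pattern.search  # bind the method once, outside the hot loop
    image_count = Counter()
    with open(file_path, 'r', encoding='utf-8') as file:
        # Iterating over the file object reads one line at a time.
        for line in file:
            url = line.rstrip('\n')  # drop the newline so '$' sees the URL end
            if search(url):
                image_count[url] += 1
    return image_count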

file_path = 'your_file.txt'
result = extract_and_count_images(file_path)
for image_url, count in result.items():
    print(f'{image_url}: {count}')
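
The matching rule itself is worth checking on a few sample lines before running it over a million URLs. The following is a minimal sketch, assuming a suffix-only pattern that accepts the same lines as the solution's pattern; the example.com URLs are made up for illustration. It shows that URLs with a query string or an uppercase extension are not counted under the rule as stated in the question.

```python
import re

# Assumed suffix-only pattern: the URL must end in a literal .jpg or .png.
suffix_pattern = re.compile(r'\.(?:jpg|png)$')

# Hypothetical sample lines, as they would come from iterating over a file.
sample_lines = [
    'https://example.com/a.jpg\n',
    'https://example.com/b.png\n',
    'https://example.com/c.jpg?size=large\n',  # query string: not a match
    'https://example.com/d.JPG\n',             # uppercase: not a match
    'https://example.com/index.html\n',
]

matched = []
for line in sample_lines:
    url = line.rstrip('\n')
    if suffix_pattern.search(url):
        matched.append(url)

print(matched)
# ['https://example.com/a.jpg', 'https://example.com/b.png']
```

If the real data contains such variants and they should count, the pattern would need to change (for example by passing re.IGNORECASE); the question as written does not ask for that.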

Optimization strategies

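Use re.compile: compile the regular expression once, before the loop, and reuse the compiled pattern object for every line. The re module already caches recently used patterns, so the gain comes mainly from skipping that per-call cache lookup, which adds up over a million lines.
Read the file line by line: iterating over the file object avoids loading the entire million-line file into memory at once, keeps memory usage low, and lets processing start as soon as the first line has been read.
Use Counter: collections.Counter is a dict subclass built for counting hashable items, so tallying each URL is a single increment and the code stays short.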
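
One possible further step, not part of the answer above, is to read the file in large blocks and let a single findall call scan many lines at once, which moves the per-line loop out of Python bytecode and into the regex engine. The following is a sketch under stated assumptions: the function name count_images_chunked, the 1 MiB default chunk_size, and the line pattern are all hypothetical choices, and whether it is actually faster depends on the data, so it should be measured rather than assumed.

```python
import re
from collections import Counter

# Assumed pattern: a whole line ending in .jpg or .png.
# [^\n]* keeps each match inside one line; re.MULTILINE makes ^ and $
# match at line boundaries instead of only at the ends of the string.
LINE_PATTERN = re.compile(r'^[^\n]*\.(?:jpg|png)$', re.MULTILINE)

def count_images_chunked(file_path, chunk_size=1 << 20):
    """Count image URLs by scanning the file in blocks of chunk_size characters."""
    counts = Counter()
    tail = ''  # incomplete last line carried over from the previous block
    with open(file_path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = tail + chunk
            # Only scan up to the last newline; the rest may be a partial line.
            cut = text.rfind('\n') + 1
            tail = text[cut:]
            counts.update(LINE_PATTERN.findall(text[:cut]))
    # The file may not end with a newline, so scan whatever is left.
    if tail:
        counts.update(LINE_PATTERN.findall(tail))
    return counts
```

Memory stays bounded by the chunk size plus one line, so the approach keeps the benefit of not loading the whole file. The returned Counter can be printed in descending order of frequency with counts.most_common().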

