import re
from collections import Counter


def extract_and_count_images(file_path):
    # Compile once: matches a whole line that ends in .jpg or .png
    pattern = re.compile(r'^(.*?\.(jpg|png))$', re.MULTILINE)
    image_count = Counter()
    # Iterate over the file object so only one line is in memory at a time
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            match = pattern.search(line)
            if match:
                image_count[match.group(1)] += 1
    return image_count


file_path = 'your_file.txt'
result = extract_and_count_images(file_path)
for image_url, count in result.items():
    print(f'{image_url}: {count}')
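Optimization strategies
- Use re.compile: compiling the regular expression once up front avoids paying the compilation cost on every match and makes matching more efficient.
- Read the file line by line: this avoids loading the entire million-line file into memory at once, which keeps memory usage low, and processing can start as soon as the first line is read.
- Use the Counter class: collections.Counter counts how often each URL appears; it is built specifically for counting, so the implementation is both efficient and concise.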
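
The three strategies can be tried on a small input before running against the real file. The following is a minimal sketch, not a definitive implementation: the helper name count_image_lines, the example.com URLs, and the temporary sample file are all assumptions made for illustration. It keeps the same pattern as the answer, compiles it once at module level, and feeds Counter from a generator so that lines are consumed one at a time.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Compiled once at import time; same pattern as in the answer.
IMAGE_LINE = re.compile(r'^(.*?\.(jpg|png))$')


def count_image_lines(lines):
    """Count lines that consist of a .jpg or .png URL (hypothetical helper)."""
    # Counter consumes the generator lazily, one line at a time.
    return Counter(
        m.group(1)
        for m in (IMAGE_LINE.search(line) for line in lines)
        if m
    )


# Hypothetical sample data standing in for the real million-line file.
sample = "\n".join([
    "http://example.com/a.jpg",
    "http://example.com/b.png",
    "http://example.com/a.jpg",
    "http://example.com/page.html",
]) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample, encoding="utf-8")
    with path.open("r", encoding="utf-8") as f:
        # The file object yields one line per iteration.
        counts = count_image_lines(f)

# most_common() lists the URLs from most to least frequent.
for url, n in counts.most_common():
    print(f"{url}: {n}")
```

Because the helper accepts any iterable of lines, it works equally well with an open file, a list, or a generator. Note that the pattern is case-sensitive and only matches lines that end in the extension, so a line such as x.PNG or a URL with a query string is not counted.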