面试题：Python复杂文件内容处理与算法优化

算法思路

滑动窗口：由于只关心过去一小时内的访问记录，我们可以使用滑动窗口的概念。维护一个有序的记录列表，按照时间戳排序，每当新的记录到来，移除时间戳超过当前时间一小时之前的记录。
字典统计：使用字典来存储每个IP地址对特定页面且状态码为200的访问次数，这样可以快速查找和更新统计数据。

优化策略

逐行读取：对于大文件，逐行读取而不是一次性加载到内存中，避免内存溢出。
时间戳排序：使用堆数据结构（如heapq）来维护有序的记录列表，这样在添加新记录和移除过期记录时都能保持高效。
缓存：如果可能，可以使用缓存来存储最近处理过的IP地址的统计信息，减少重复计算。

Python 代码实现

import heapq
from collections import defaultdict


def count_ips_in_last_hour(log_file_path, target_page):
    ip_count = defaultdict(int)
    record_heap = []
    with open(log_file_path, 'r') as f:
        for line in f:
            timestamp, ip, page, status_code = line.strip().split(',')
            timestamp = float(timestamp)
            # 移除一小时前的记录
            while record_heap and record_heap[0][0] <= timestamp - 3600:
                _, ip_to_remove, _, _ = heapq.heappop(record_heap)
                if page == target_page and status_code == '200':
                    ip_count[ip_to_remove] -= 1
            if page == target_page and status_code == '200':
                ip_count[ip] += 1
                heapq.heappush(record_heap, (timestamp, ip, page, status_code))
    return ip_count


# 示例调用
log_file_path = 'your_log_file.log'
target_page = 'your_target_page'
result = count_ips_in_last_hour(log_file_path, target_page)
for ip, count in result.items():
    print(f'IP: {ip}, Count: {count}')

面试题：Python复杂文件内容处理与算法优化

知识考点

面试题答案

算法思路

优化策略

Python 代码实现