Interview question: encoding optimization strategies for reading large files in Python

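Chunked reading:

Open the file with open in binary mode ('rb'). Binary mode returns raw bytes, so no implicit character decoding happens during the read and decoding can be done explicitly afterwards. For example: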

with open('large_file_utf16', 'rb') as f:
    while True:
        chunk = f.read(1024 * 1024)  # Read 1 MB at a time
        if not chunk:
            break
        # Process the chunk that was just read here
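
The loop above can also be packaged as a generator so that callers simply iterate over chunks. The following is a minimal sketch, not a definitive implementation: the helper name iter_chunks and the 10-byte sample payload are assumptions made for illustration.

```python
import os
import tempfile


def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield successive blocks of at most chunk_size bytes from path."""
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b'' at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            yield chunk


# Small self-check on a throwaway file (the payload is arbitrary).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'0123456789')
chunks = list(iter_chunks(tmp.name, chunk_size=4))
os.remove(tmp.name)
```

Only one chunk is held in memory at a time when the generator is consumed in a for loop; list() is used here only to inspect the result.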

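Encoding conversion:

The bytes read from the file are UTF-16 encoded and must be decoded into strings. Calling chunk.decode('utf-16') on each chunk independently is fragile: only the first chunk carries the BOM that identifies the byte order, and a chunk boundary can fall in the middle of a surrogate pair. An incremental decoder obtained from codecs.getincrementaldecoder('utf-16') keeps that state between chunks. The complete code:

import codecs

# The incremental decoder remembers the byte order from the BOM and buffers
# any character that is split across a chunk boundary.
decoder = codecs.getincrementaldecoder('utf-16')()
try:
    with open('large_file_utf16', 'rb') as f:
        while True:
            chunk = f.read(1024 * 1024)
            # final=True on the empty last read reports truncated input
            text = decoder.decode(chunk, final=not chunk)
            if not chunk:
                break
            # Process the decoded text
except UnicodeDecodeError as e:
    print(f"Decoding error: {e}")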
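
One caveat with decoding each chunk on its own is that a chunk can end in the middle of a character. The sketch below reproduces this on a small in-memory sample and shows an incremental decoder from the standard codecs module handling the same split. The sample string and the cut position are assumptions chosen to force the split inside a surrogate pair.

```python
import codecs

original = 'ab\U0001F600cd'          # U+1F600 needs a surrogate pair in UTF-16
data = original.encode('utf-16')     # BOM followed by 12 bytes of text
first, second = data[:8], data[8:]   # the cut lands inside the surrogate pair

# Decoding the first piece on its own fails: it ends with half a character.
try:
    first.decode('utf-16')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# The incremental decoder buffers the dangling bytes until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-16')()
pieces = [decoder.decode(first), decoder.decode(second, final=True)]
restored = ''.join(pieces)
```

With a 1 MB chunk size the same situation arises whenever a character outside the Basic Multilingual Plane happens to straddle a chunk boundary.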

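Suitable data structures:

If the data needs further processing, such as counting word occurrences, a structure like collections.Counter is a good fit. Assume the decoded text is split into words on whitespace. Because a chunk can end in the middle of a word, the last partial word is carried over to the next chunk, and an incremental decoder keeps the UTF-16 state between chunks:

import codecs
from collections import Counter

word_counter = Counter()
decoder = codecs.getincrementaldecoder('utf-16')()
tail = ''  # Partial word carried over from the previous chunk
try:
    with open('large_file_utf16', 'rb') as f:
        while True:
            chunk = f.read(1024 * 1024)
            text = tail + decoder.decode(chunk, final=not chunk)
            if not chunk:
                word_counter.update(text.split())
                break
            words = text.split()
            # If the chunk ends mid-word, hold the last piece back
            tail = words.pop() if words and not text[-1].isspace() else ''
            word_counter.update(words)
except UnicodeDecodeError as e:
    print(f"Decoding error: {e}")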

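Iterators:
- Wrap a file opened in binary mode with io.TextIOWrapper to get an iterator that reads the file line by line. This suits line-oriented processing and keeps memory use low, because the file is decoded incrementally instead of being loaded in full.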
```
import io
with open('large_file_utf16', 'rb') as binary_file:
    text_file = io.TextIOWrapper(binary_file, encoding='utf-16')
    for line in text_file:
        # Process each line
        pass
```
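
Since open in text mode already returns an io.TextIOWrapper, passing encoding='utf-16' to open directly is an equivalent shortcut. The following sketch checks this on a small temporary file; the file contents are arbitrary sample data, not part of the original question.

```python
import os
import tempfile

# Write a small UTF-16 sample file (contents are arbitrary test data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write('first line\nsecond line\n'.encode('utf-16'))

# Text mode with an explicit encoding decodes incrementally and consumes the BOM.
with open(tmp.name, 'r', encoding='utf-16') as text_file:
    # Collected into a list only for this check; for a large file,
    # handle each line inside the loop instead.
    lines = [line.rstrip('\n') for line in text_file]

os.remove(tmp.name)
```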
