- Parsing a single record with `StringTokenizer`:
  - Since each line of the file has the format `key1=value1;key2=value2;...`, you can use a `StringTokenizer` to split the line on `;`, and then split each resulting token on `=` to obtain the key-value pairs.
  - Example code:
```java
import java.util.StringTokenizer;

public class TextParser {
    public static void main(String[] args) {
        String line = "key1=value1;key2=value2";
        StringTokenizer outerTokenizer = new StringTokenizer(line, ";");
        while (outerTokenizer.hasMoreTokens()) {
            String pair = outerTokenizer.nextToken();
            StringTokenizer innerTokenizer = new StringTokenizer(pair, "=");
            if (innerTokenizer.countTokens() == 2) {
                String key = innerTokenizer.nextToken();
                String value = innerTokenizer.nextToken();
                System.out.println(key + " : " + value);
            }
        }
    }
}
```
- Strategies for avoiding memory overflow:
  - Read line by line: use a `BufferedReader` to read the large file one line at a time instead of loading the whole file into memory at once. Example code:
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

public class BigFileParser {
    public static void main(String[] args) {
        try (BufferedReader br = new BufferedReader(new FileReader("largeFile.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Parse each line here
                StringTokenizer outerTokenizer = new StringTokenizer(line, ";");
                while (outerTokenizer.hasMoreTokens()) {
                    String pair = outerTokenizer.nextToken();
                    StringTokenizer innerTokenizer = new StringTokenizer(pair, "=");
                    if (innerTokenizer.countTokens() == 2) {
                        String key = innerTokenizer.nextToken();
                        String value = innerTokenizer.nextToken();
                        System.out.println(key + " : " + value);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
  - Release resources promptly: make sure the objects created for each line, such as the `StringTokenizer` instances, become eligible for reclamation as soon as the line has been processed. In the code above, each `StringTokenizer` is scoped to a single loop iteration, so once a line is finished it is unreachable and the garbage collector can reclaim its memory.
- Optimizations for a multithreaded environment:
  - Task partitioning: split the large file into batches of lines and let each thread handle a portion of them. For example, multithreaded processing can be implemented with an `ExecutorService` and `Callable` tasks.
  - Example code:
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.concurrent.*;

public class MultithreadedBigFileParser {
    public static void main(String[] args) {
        int numThreads = 4;
        int batchSize = 1000; // number of lines handed to each task
        ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
        List<Future<List<String>>> futures = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("largeFile.txt"))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
                if (lines.size() == batchSize) {
                    // Submit the current batch as a task and start collecting a new one
                    futures.add(executorService.submit(new ParseTask(lines)));
                    lines = new ArrayList<>();
                }
            }
            if (!lines.isEmpty()) {
                // Submit any remaining lines
                futures.add(executorService.submit(new ParseTask(lines)));
            }
            // Collect and print the results in submission order
            for (Future<List<String>> future : futures) {
                try {
                    for (String res : future.get()) {
                        System.out.println(res);
                    }
                } catch (InterruptedException | ExecutionException e) {
                    e.printStackTrace();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            executorService.shutdown();
        }
    }

    static class ParseTask implements Callable<List<String>> {
        private final List<String> lines;

        public ParseTask(List<String> lines) {
            this.lines = lines;
        }

        @Override
        public List<String> call() {
            List<String> results = new ArrayList<>();
            for (String line : lines) {
                StringTokenizer outerTokenizer = new StringTokenizer(line, ";");
                while (outerTokenizer.hasMoreTokens()) {
                    String pair = outerTokenizer.nextToken();
                    StringTokenizer innerTokenizer = new StringTokenizer(pair, "=");
                    if (innerTokenizer.countTokens() == 2) {
                        String key = innerTokenizer.nextToken();
                        String value = innerTokenizer.nextToken();
                        results.add(key + " : " + value);
                    }
                }
            }
            return results;
        }
    }
}
```
- Data consistency and parsing accuracy:
  - Synchronize access to shared state: if the threads share mutable state (for example, a running total of parsed results), guard it with the `synchronized` keyword or with the atomic classes in the `java.util.concurrent.atomic` package to keep the data consistent, as in the sketch below.
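  - As an illustration of the atomic approach, here is a minimal sketch that counts parsed key-value pairs across threads with a shared `AtomicLong`; the class name `AtomicCountExample`, the in-memory sample batches, and the lambda tasks are assumptions made for this example rather than part of the parser above (requires Java 9+ for `List.of`):

```java
import java.util.List;
import java.util.StringTokenizer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AtomicCountExample {
    // Shared counter; AtomicLong makes concurrent increments thread-safe without locks
    private static final AtomicLong parsedCount = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Hypothetical batches of lines standing in for slices of the real file
        List<List<String>> batches = List.of(
                List.of("a=1;b=2", "c=3"),
                List.of("d=4;e=5;f=6"));
        for (List<String> batch : batches) {
            pool.submit(() -> {
                for (String line : batch) {
                    StringTokenizer outer = new StringTokenizer(line, ";");
                    while (outer.hasMoreTokens()) {
                        StringTokenizer inner = new StringTokenizer(outer.nextToken(), "=");
                        if (inner.countTokens() == 2) {
                            parsedCount.incrementAndGet(); // lock-free, thread-safe update
                        }
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("Parsed pairs: " + parsedCount.get()); // prints 6 for this input
    }
}
```

  - `AtomicLong.incrementAndGet()` performs the update atomically, so no `synchronized` block is needed for a simple counter; more complex shared structures would still need a `synchronized` block or a concurrent collection.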
  - Parsing accuracy: process the data strictly according to the file format and validate the result of each `StringTokenizer`, as the code above does with the `innerTokenizer.countTokens() == 2` check, so that only well-formed key-value pairs are accepted; a small sketch of this check on malformed input follows.
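  - To make the effect of that check concrete, the following standalone sketch (the sample line with deliberately malformed fragments is invented for illustration) shows which fragments pass the `countTokens() == 2` validation and which are skipped:

```java
import java.util.StringTokenizer;

public class ValidationExample {
    public static void main(String[] args) {
        // Deliberately malformed sample line: only "key1=value1" and "key3=ok" are well-formed
        String line = "key1=value1;brokenpair;key2=;=value2;key3=ok";
        StringTokenizer outer = new StringTokenizer(line, ";");
        while (outer.hasMoreTokens()) {
            String pair = outer.nextToken();
            StringTokenizer inner = new StringTokenizer(pair, "=");
            if (inner.countTokens() == 2) {
                System.out.println("accepted: " + inner.nextToken() + " : " + inner.nextToken());
            } else {
                System.out.println("skipped malformed fragment: " + pair);
            }
        }
    }
}
```

  - Note that a fragment such as `key=value=extra` produces three tokens and is also skipped by this check; if values may legitimately contain `=`, the pair would need to be split only on the first `=` instead.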