

wzjs 2025/9/5 22:31:41


A common example

When we need to split a text file into chunks of a maximum size, we might be tempted to write code like this:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TestSplit {

    private static final long maxFileSizeBytes = 10 * 1024 * 1024; // default: 10 MB

    public void split(Path inputFile, Path outputDir) throws IOException {
        if (!Files.exists(inputFile)) {
            throw new IOException("Input file does not exist: " + inputFile);
        }
        if (Files.size(inputFile) == 0) {
            throw new IOException("Input file is empty: " + inputFile);
        }
        Files.createDirectories(outputDir);

        try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
            int fileIndex = 0;
            long currentSize = 0;
            BufferedWriter writer = null;
            try {
                writer = newWriter(outputDir, fileIndex++);
                String line;
                while ((line = reader.readLine()) != null) {
                    // The writer encodes as UTF-8, so measure the line in UTF-8 as well
                    byte[] lineBytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
                    // Start a new part once the current one would exceed the limit
                    if (currentSize + lineBytes.length > maxFileSizeBytes) {
                        if (writer != null) {
                            writer.close();
                        }
                        writer = newWriter(outputDir, fileIndex++);
                        currentSize = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    currentSize += lineBytes.length;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }

    private BufferedWriter newWriter(Path dir, int index) throws IOException {
        Path filePath = dir.resolve("part_" + index + ".txt");
        return Files.newBufferedWriter(filePath);
    }

    public static void main(String[] args) {
        // Backslashes must be escaped inside Java string literals
        String inputFilePath = "C:\\Users\\fei\\Desktop\\testTwo.txt";
        String outputDirPath = "C:\\Users\\fei\\Desktop\\testTwo";
        TestSplit splitter = new TestSplit();
        try {
            long startTime = System.currentTimeMillis();
            splitter.split(Paths.get(inputFilePath), Paths.get(outputDirPath));
            long endTime = System.currentTimeMillis();
            long duration = endTime - startTime;
            System.out.println("File split complete!");
            System.out.printf("Total time: %d ms%n", duration);
        } catch (IOException e) {
            System.out.println("An error occurred while splitting the file: " + e.getMessage());
        }
    }
}

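Efficiency analysis

This code is technically fine, but it is very inefficient at splitting a large file into chunks. Specifically:

It performs many heap allocations (for every line), creating and discarding a large number of temporary objects (strings, byte arrays).
There is also a less obvious problem: it copies the data through several buffers and performs context switches between user mode and kernel mode.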

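Detailed code analysis

BufferedReader:

Calls read() on the underlying FileReader or InputStreamReader.
The data is copied from kernel space → user-space buffer.
It is then decoded into Java Strings (heap allocation).

getBytes():

Converts the String into a new byte[] → more heap allocation.

BufferedWriter:

Takes byte/char data from user space.
Calls write(), which in turn copies the data from user space → kernel space.
The data is eventually flushed to disk.

As a result, the data moves back and forth between kernel and user space several times, with extra heap churn along the way. Besides the garbage-collection pressure, this has the following consequences:

Memory bandwidth is wasted on copying between buffers.
CPU utilization is high for what is essentially a disk-to-disk transfer.
The operating system could have handled the bulk copy directly (via DMA or optimized I/O), but the Java code gets in the way of that efficiency by introducing user-space logic.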

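Solution

So how do we avoid the problems above?

The answer is to use zero copy wherever possible, that is, to avoid leaving kernel space as much as we can. In Java this can be done with the FileChannel method long transferTo(long position, long count, WritableByteChannel target). Where the platform supports it, this is a direct disk-to-disk transfer that also benefits from some of the operating system's I/O optimizations.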
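As a minimal sketch of the call itself, the following copies a whole file with FileChannel.transferTo. The class and method names (TransferToExample, copy) are hypothetical and only serve the illustration. The loop is there because transferTo returns the number of bytes actually transferred, which may be smaller than the count requested.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, used only to illustrate FileChannel.transferTo.
class TransferToExample {

    // Copies the whole source file to target and returns the number of bytes copied.
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may transfer fewer bytes than requested, so repeat until done
            while (position < size) {
                long transferred = in.transferTo(position, size - position, out);
                if (transferred <= 0) {
                    break; // no progress, e.g. the source was truncated meanwhile
                }
                position += transferred;
            }
            return position;
        }
    }
}
```

A call such as TransferToExample.copy(Paths.get("in.txt"), Paths.get("out.txt")) would copy the file without the bytes ever being turned into Java strings or byte arrays.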

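The catch is that this method operates on blocks of bytes and may therefore break lines apart. To solve this, we need a strategy that keeps lines intact even though the file is processed by moving byte ranges.

Without that constraint the job would be easy: call transferTo for each chunk, advancing position by position = position + maxFileSize, until no more data can be transferred.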
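The naive variant just described could look roughly like the sketch below. It ignores line boundaries on purpose, so a line may be cut in the middle of a part. The class name FixedSizeSplit, the part_N.bin naming, and the parameter names are assumptions made for the illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class: splits a file into fixed-size parts, ignoring line boundaries.
class FixedSizeSplit {

    // Returns the number of part files written to outputDir.
    static int split(Path input, Path outputDir, long maxFileSize) throws IOException {
        if (maxFileSize <= 0) {
            throw new IllegalArgumentException("maxFileSize must be positive");
        }
        Files.createDirectories(outputDir);
        int parts = 0;
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                long chunkSize = Math.min(maxFileSize, size - position);
                Path part = outputDir.resolve("part_" + parts + ".bin");
                try (FileChannel out = FileChannel.open(part,
                        StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE,
                        StandardOpenOption.TRUNCATE_EXISTING)) {
                    // transferTo may move fewer bytes than requested, so loop per chunk
                    long written = 0;
                    while (written < chunkSize) {
                        long n = in.transferTo(position + written, chunkSize - written, out);
                        if (n <= 0) {
                            throw new IOException("transferTo made no progress at offset " + (position + written));
                        }
                        written += n;
                    }
                }
                position += chunkSize; // position = position + maxFileSize, capped at end of file
                parts++;
            }
        }
        return parts;
    }
}
```

For a 25-byte input and maxFileSize = 10 this writes three parts of 10, 10 and 5 bytes.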

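To keep lines intact, we need to determine where the last complete line ends within each chunk of bytes. To do that, we first seek to the intended end of the chunk and then scan backwards to find the preceding newline. This gives us the exact byte count for the chunk, ensuring it ends with a final, unbroken line. This is the only part of the code that allocates a buffer and copies data, and since these operations should be minimal, the performance impact is expected to be negligible.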

// Requires: java.io.FileOutputStream, java.io.IOException, java.io.RandomAccessFile,
//           java.nio.channels.FileChannel, java.nio.file.Path

private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

private void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
         FileChannel inputChannel = raf.getChannel()) {

        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;

        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);

            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
            }

            long chunkSize = endPosition - position;
            var outputFilePath = tempDir.resolve("_part" + fileCounter);
            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                 FileChannel outputChannel = fos.getChannel()) {
                // transferTo may move fewer bytes than requested, so loop until the chunk is complete
                long transferred = 0;
                while (transferred < chunkSize) {
                    long n = inputChannel.transferTo(position + transferred, chunkSize - transferred, outputChannel);
                    if (n <= 0) {
                        throw new IOException("transferTo made no progress for " + outputFilePath);
                    }
                    transferred += n;
                }
            }

            position = endPosition;
            fileCounter++;
        }
    }
}

private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition, long maxPosition)
        throws IOException {
    long originalPosition = raf.getFilePointer();

    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;

        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }

        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;

        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);

            long readStartPos = searchPos - bytesToRead;
            raf.seek(readStartPos);

            // readFully fills the whole window; a plain read() may return fewer bytes
            raf.readFully(buffer, 0, bytesToRead);

            // Search backwards through the buffer for newline
            for (int i = bytesToRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    return readStartPos + i + 1;
                }
            }

            searchPos -= bytesToRead;
        }

        // fileToSplit is not in scope here, so report the searched byte range instead
        throw new IllegalArgumentException("File cannot be split. No newline found between offsets "
                + startPosition + " and " + maxPosition + ".");
    } finally {
        raf.seek(originalPosition);
    }
}

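The findLastLineEndBeforePosition method has some limitations. Specifically, it only looks for the \n byte, so it is aimed at Unix-style line endings (a file using \r\n is still cut after the \n, but a lone \r is not recognized as a line ending); very long lines can cause many backward-read iterations; and a file containing a line longer than maxSizePerFileInBytes cannot be split. It is, however, well suited to scenarios such as splitting access log files, which typically have short lines and a large number of entries.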
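These limits are easier to see on a small in-memory version of the backward scan. The sketch below is not the file-based method above; LineEndScan and lastLineEnd are hypothetical names, and the method returns -1 where the file-based version throws.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory illustration of the backward newline scan.
class LineEndScan {

    // Returns the offset just past the last '\n' in data[start, max), or -1 if there is none.
    static int lastLineEnd(byte[] data, int start, int max) {
        for (int i = max - 1; i >= start; i--) {
            if (data[i] == '\n') {
                return i + 1;
            }
        }
        return -1;
    }

    // Convenience wrapper for ASCII test strings.
    static int lastLineEnd(String text, int start, int max) {
        return lastLineEnd(text.getBytes(StandardCharsets.US_ASCII), start, max);
    }
}
```

With the input "ab\ncd\nef", scanning the range [0, 8) yields 6, so the chunk ends right after "cd\n" and "ef" moves to the next chunk. With "ab\r\ncd" the result is 4, because the \n of the \r\n pair is still found. With "ab\rcd" or "abcdef" the result is -1, which corresponds to the "no newline found" failure.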

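Performance analysis

In theory, our zero-copy split should be faster than the common approach; now it is time to measure how much faster. To that end, I ran some benchmarks against both implementations, and these are the results.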

Benchmark                                                    Mode  Cnt           Score      Error   Units
FileSplitterBenchmark.splitFile                              avgt   15        1179.429 ±   54.271   ms/op
FileSplitterBenchmark.splitFile:·gc.alloc.rate               avgt   15        1349.613 ±   60.903  MB/sec
FileSplitterBenchmark.splitFile:·gc.alloc.rate.norm          avgt   15  1694927403.481 ± 6060.581    B/op
FileSplitterBenchmark.splitFile:·gc.count                    avgt   15         718.000             counts
FileSplitterBenchmark.splitFile:·gc.time                     avgt   15         317.000                 ms
FileSplitterBenchmark.splitFileZeroCopy                      avgt   15          77.352 ±    1.339   ms/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate       avgt   15          23.759 ±    0.465  MB/sec
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate.norm  avgt   15     2555608.877 ± 8644.153    B/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.count            avgt   15          10.000             counts
FileSplitterBenchmark.splitFileZeroCopy:·gc.time             avgt   15           5.000                 ms

Below are the benchmark code and the file size used for the results above.

int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

public void setup() throws Exception {
    inputFile = Paths.get("/tmp/large_input.txt");
    outputDir = Paths.get("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

// Baseline: the BufferedReader/BufferedWriter implementation
public void splitFile() throws Exception {
    splitter.split(inputFile, outputDir);
}

// Zero-copy implementation based on FileChannel.transferTo
public void splitFileZeroCopy() throws Exception {
    zeroCopySplitter.split(inputFile);
}

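The zero-copy version shows a considerable speedup: it took only about 77 ms, whereas the common approach needed about 1179 ms in this particular case, roughly 15 times longer. It also allocated far less memory (about 2.6 MB versus about 1.7 GB per operation). When processing large amounts of data or many files, this performance advantage can be critical.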