<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Dataset Chunking Pitfalls</title>
</head>

<body>
<h1>Dataset Chunking Pitfalls</h1>

<h2>Table of Contents</h2>

<ul>
<li><a href="#S1">1. Introduction</a>
<li><a href="#S2">2. The Raw Data Chunk Cache</a>
<li><a href="#S3">3. Cache Efficiency</a>
<li><a href="#S4">4. Fragmentation</a>
<li><a href="#S5">5. File Storage Overhead</a>
</ul>

<h2><a name="S1">1. Introduction</a></h2>

<p><em>Chunking</em> refers to a storage layout in which a dataset is
partitioned into fixed-size multi-dimensional chunks. The chunks cover
the dataset, but the dataset's dimensions need not be integral multiples
of the chunk dimensions. If no data is ever written to a chunk, that
chunk is never allocated on disk. Figure 1 shows a 25x48 element
dataset covered by nine 10x20 chunks, with 11 data points written to
the dataset. No data was written to the region covered by three of the
chunks, so those chunks were never allocated in the file. The other
six chunks are allocated at independent locations in the file and
written in their entirety.

<center><img src="Chunk_f1.gif"><br><b>Figure 1</b></center>

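<p>The covering arithmetic behind Figure 1 can be checked with a few
lines of C. This is only an illustrative sketch: the function names
<code>chunks_along</code> and <code>chunks_covering</code> are made up
for this example and are not part of the HDF5 API. The count is rounded
up along each dimension, so a 25x48 dataset with 10x20 chunks needs
3 x 3 = 9 chunks.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover `extent` elements along one
 * dimension when each chunk spans `chunk` elements (ceiling division).
 * Hypothetical helper, not an HDF5 function. */
size_t chunks_along(size_t extent, size_t chunk)
{
    assert(chunk > 0);
    return (extent + chunk - 1) / chunk;
}

/* Total number of chunks covering a dataset of the given rank:
 * the product of the per-dimension chunk counts. */
size_t chunks_covering(const size_t *dims, const size_t *chunk_dims, int rank)
{
    size_t total = 1;
    for (int i = 0; i < rank; i++)
        total *= chunks_along(dims[i], chunk_dims[i]);
    return total;
}
```

<p>Calling <code>chunks_covering()</code> with dimensions {25, 48} and
chunk dimensions {10, 20} gives the nine chunks of Figure 1. How many
of them are actually allocated depends on where data has been written.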
<p>The HDF5 library treats chunks as atomic objects: disk I/O is
always in terms of complete chunks<a href="#fn1">(1)</a>. This
allows the application to define data filters that perform
tasks such as compression, encryption, and checksumming
on entire chunks. As shown in Figure 2, if
<code>H5Dwrite()</code> touches only a few bytes of a chunk,
the entire chunk is read from the file, the data passes upward
through the filter pipeline, the few bytes are modified, the
data passes downward through the filter pipeline, and the entire
chunk is written back to the file.

<center><img src="Chunk_f2.gif"><br><b>Figure 2</b></center>

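<p>The cost of this read-modify-write cycle can be estimated with a
rough model. The sketch below is not HDF5 code; the function names are
hypothetical, and it assumes that no filter changes the stored size of
the chunk (for example, no compression) and that nothing is cached.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved between memory and the file for one uncached partial
 * write: the whole chunk is read, modified in memory, and written back.
 * Assumes the stored chunk size equals the in-memory chunk size. */
size_t rmw_bytes_transferred(size_t chunk_bytes)
{
    return 2 * chunk_bytes;
}

/* Ratio of file traffic to the bytes the application actually changed. */
double write_amplification(size_t chunk_bytes, size_t bytes_modified)
{
    assert(bytes_modified > 0 && bytes_modified <= chunk_bytes);
    return (double)rmw_bytes_transferred(chunk_bytes) / (double)bytes_modified;
}
```

<p>For a 10x20 chunk of 4-byte elements (the element size is an
assumption), the chunk occupies 800 bytes, so changing 8 bytes moves
1600 bytes to and from the file under this model.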
<h2><a name="S2">2. The Raw Data Chunk Cache</a></h2>

<p>Figure 2 makes it clear that calling <code>H5Dwrite()</code>
many times from the application would result in poor performance
even if all the data being written falls within a single chunk.
To avoid this, a raw data chunk cache layer was added between the top
of the filter stack and the bottom of the byte modification layer<a
href="#fn2">(2)</a>. By default, the chunk cache holds up to 521
chunks or 1MB of data (whichever is less), but these values can
be modified with <code>H5Pset_cache()</code>.

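<p>The "whichever is less" rule can be illustrated with a small
calculation. The sketch below is not library code: the function name
and the macros are made up for this example, and "1MB" is assumed to
mean 2^20 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_MAX_CHUNKS 521u            /* default chunk limit, from the text */
#define DEFAULT_MAX_BYTES  (1024u * 1024u) /* "1MB", assumed to be 2^20 bytes */

/* Number of equally sized chunks the cache can hold at once: the
 * smaller of the chunk-count limit and the byte limit divided by the
 * chunk size. A result of zero means a single chunk exceeds the byte
 * limit. Hypothetical helper, not an HDF5 function. */
size_t chunks_that_fit(size_t chunk_bytes, size_t max_chunks, size_t max_bytes)
{
    assert(chunk_bytes > 0);
    size_t by_bytes = max_bytes / chunk_bytes;
    return by_bytes < max_chunks ? by_bytes : max_chunks;
}
```

<p>With 40000-byte chunks (100x100 elements of an assumed 4 bytes
each), the byte limit is the binding one and only 26 chunks fit. With
1024-byte chunks, the 521-chunk limit is reached first.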
<p>The preemption policy for the cache favors certain chunks and
tries not to preempt them.

<ul>
<li>Chunks that have been accessed frequently in the recent past
are favored.
<li>A chunk that has just entered the cache is favored.
<li>A chunk that has been completely read or completely written
(rather than only partially read or written) is penalized according to
an application-specified weighting between zero and one.
<li>A chunk that is larger than the maximum cache size is not
eligible for caching.
</ul>

<h2><a name="S3">3. Cache Efficiency</a></h2>

<p>Now for some real numbers. A 2000x2000 element dataset is
created and covered by a 20x20 array of chunks (each chunk is 100x100
elements). The raw data cache is adjusted to hold at most 25 chunks by
setting the maximum number of bytes to 25 times the chunk size in
bytes. The application then creates a square, two-dimensional memory
buffer and uses it as a window into the dataset, first reading and then
rewriting the dataset in row-major order by moving the window across it
(the read and write tests both start with a cold cache).

<p>The measure of efficiency in Figure 3 is the number of bytes requested
by the application divided by the number of bytes transferred from the
file. There are at least two ways to estimate cache
performance. One is to turn on <a href="Debugging.html">cache
debugging</a> and look at the number of cache misses. A more accurate
and specific way is to register a data filter whose sole purpose is to
count the number of bytes that pass through it (that is the method used
for the figures in this section).

<center><img src="Chunk_f3.gif"><br><b>Figure 3</b></center>

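<p>The efficiency measure itself is a simple ratio. The sketch below
shows the bookkeeping only; the structure and function names are
hypothetical, and it does not use the HDF5 filter interface. In a real
measurement, the transferred count would be updated by the
byte-counting filter described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running totals for one test run. */
struct io_counts {
    unsigned long long bytes_requested;   /* asked for by the application */
    unsigned long long bytes_transferred; /* moved to or from the file */
};

/* Record an application-level request of nbytes. */
void record_request(struct io_counts *c, size_t nbytes)
{
    c->bytes_requested += nbytes;
}

/* Record nbytes passing between the library and the file. */
void record_transfer(struct io_counts *c, size_t nbytes)
{
    c->bytes_transferred += nbytes;
}

/* Efficiency as defined in the text: requested bytes divided by
 * transferred bytes. */
double efficiency(const struct io_counts *c)
{
    assert(c->bytes_transferred > 0);
    return (double)c->bytes_requested / (double)c->bytes_transferred;
}
```

<p>For example, if the application requests 10000 bytes and a whole
40000-byte chunk has to be read to satisfy the request, the efficiency
is 0.25.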
<p>The read efficiency is less than one for two reasons: collisions in the
cache are handled by preempting one of the colliding chunks, and the
preemption algorithm occasionally preempts a chunk that hasn't been
referenced for a long time but is about to be referenced again.

<p>The write test results in lower efficiency for most window
sizes because HDF5 is unaware that the application is about to
overwrite the entire dataset and must read in most chunks before
modifying parts of them.

<p>There is a simple way to improve efficiency for this example.
It turns out that any chunk that has been completely read or
written is a good candidate for preemption. Increasing the
penalty for such chunks from the default 0.75 to the maximum
1.00 improves efficiency.

<center><img src="Chunk_f4.gif"><br><b>Figure 4</b></center>

<p>The read efficiency is still less than one because of
collisions in the cache. The number of collisions can often be
reduced by increasing the number of slots in the cache. Figure
5 shows what happens when the maximum number of slots is
increased by an order of magnitude from the default (this change
has no major effect on the memory used by the test since the byte
limit for the cache was not increased).

<center><img src="Chunk_f5.gif"><br><b>Figure 5</b></center>

<p>Although the application eventually overwrites every chunk
completely, the library has no way of knowing this beforehand
since most calls to <code>H5Dwrite()</code> modify only a
portion of any given chunk. Therefore, the first modification of
a chunk causes the chunk to be read from disk into the chunk
buffer through the filter pipeline. Eventually HDF5 might
provide a dataset transfer property that turns off this read
operation, resulting in write efficiency equal to read
efficiency.

<h2><a name="S4">4. Fragmentation</a></h2>

<p>Even if the application transfers the entire dataset contents with a
single call to <code>H5Dread()</code> or <code>H5Dwrite()</code>, the
library may break the request into smaller, more manageable
pieces. This is almost certainly the case if the data
transfer includes a type conversion.

<center><img src="Chunk_f6.gif"><br><b>Figure 6</b></center>

<p>By default the size of each such piece (the strip size) is 1MB, but
it can be changed by calling <code>H5Pset_buffer()</code>.

<h2><a name="S5">5. File Storage Overhead</a></h2>

<p>The chunks of the dataset are allocated at independent
locations throughout the HDF5 file, and a B-tree maps the
N-dimensional address of each chunk to a file address. The more chunks
that are allocated for a dataset, the larger the B-tree. Large B-trees
have two disadvantages:

<ul>
<li>The file storage overhead is higher and more disk I/O is
required to traverse the tree from root to leaves.
<li>The increased number of B-tree nodes results in higher
contention for the metadata cache.
</ul>

<p>There are three ways to reduce the number of B-tree nodes. The
obvious way is to reduce the number of chunks by choosing a larger chunk
size (doubling the chunk size cuts the number of B-tree nodes in
half). Another method is to adjust the split ratios for the B-tree by
calling <code>H5Pset_split_ratios()</code>, but this method typically
results in only a slight improvement over the default settings.
Finally, the out-degree of each node can be increased by calling
<code>H5Pset_istore_k()</code> (increasing the out-degree actually
increases file overhead while decreasing the number of nodes).

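<p>The effect of chunk size and node out-degree on the size of the
B-tree can be illustrated with a deliberately simplified model. The
sketch below is not how the library lays out its B-tree: it assumes
every leaf node is completely full, and both the function name and the
<code>entries_per_node</code> parameter are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound on the number of leaf nodes needed to index `nchunks`
 * chunks when every leaf is full and holds `entries_per_node` chunk
 * entries. Simplified model for illustration only. */
size_t min_leaf_nodes(size_t nchunks, size_t entries_per_node)
{
    assert(entries_per_node > 0);
    return (nchunks + entries_per_node - 1) / entries_per_node;
}
```

<p>A 2000x2000 dataset with 100x100 chunks has 400 chunks. With a
hypothetical 50 entries per node, that is at least 8 leaf nodes.
Doubling the chunk size to 100x200 leaves 200 chunks and at least 4
leaf nodes, and doubling the node capacity instead has the same effect
on the node count under this model.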
<hr>

<p><a name="fn1">Footnote 1:</a> Parallel versions of the library
can access individual bytes of a chunk when the underlying file
uses MPI-IO.

<p><a name="fn2">Footnote 2:</a> The raw data chunk cache was
added before the second alpha release.

<hr>
<address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
<!-- Created: Tue Oct 20 12:38:40 EDT 1998 -->
<!-- hhmts start -->
Last modified: Fri Oct 23 10:30:52 EDT 1998
<!-- hhmts end -->
</body>
</html>