mirror of
https://github.com/HDFGroup/hdf5.git
synced 2025-01-18 15:15:56 +08:00
365dac33e3
---------------------- ./doc/html/H5.format.html ./src/H5Gent.c ./src/H5Gprivate.h ./src/H5Oattr.c ./src/H5Oprivate.h ./src/H5Oshared.c ./src/H5HG.c ./src/H5HGprivate.h Added padding fields in symbol table entries, attribute messages, shared messages, and global heap objects to insure that things are aligned on 8-byte boundaries in the file, and thus in memory. Otherwise some little endian machines complain (DEC Alpha) during encoding/decoding of file meta data. I chose to add alignment to the file rather than rewriting the ENCODE/DECODE macros for the little endian case. Completely rewrote the section on attribute messages. More alignment stuff will follow. ./src/H5detect.c Fixed a typo `nd'->`dn' ./test/dtypes.c Commented out conversion tests to/from `long double' on machines where it's the same size as `double' to get rid of compiler warnings. ./doc/html/Big.html Fixed a couple typos.
123 lines
4.7 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Big Datasets on Small Machines</title>
</head>

<body>
<h1>Big Datasets on Small Machines</h1>

<h2>1. Introduction</h2>

<p>The HDF5 library is able to handle files larger than the
operating system's maximum file size, and datasets larger than the
maximum memory size. For instance, a machine where
<code>sizeof(off_t)</code> and <code>sizeof(size_t)</code> are both
four bytes can handle datasets and files as large as 18x10^18 bytes
(roughly 2^64). However, most Unix systems limit the number of
concurrently open files, so a practical file size limit is closer
to 512GB or 1TB.

<p>Two "tricks" must be employed on these small systems in order
to store large datasets. The first trick circumvents the
<code>off_t</code> file size limit and the second circumvents
the <code>size_t</code> main memory limit.

<h2>2. File Size Limits</h2>

<p>Systems that have 64-bit file addresses will be able to access
large files automatically. One should see the following output
from configure:

<p><code><pre>
checking size of off_t... 8
</pre></code>
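
<p>The same check can be made from inside a program. The following
is a minimal sketch, not part of HDF5, which assumes a POSIX system
that provides <code>off_t</code> in <code>sys/types.h</code>. The
function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <sys/types.h>

/* Number of bits in off_t on the platform this is compiled for. */
static unsigned off_t_bits(void)
{
    return (unsigned)(sizeof(off_t) * CHAR_BIT);
}

/* Largest value of a signed integer type that is `bits' wide,
 * that is, 2^(bits-1) - 1.  The value is built in two steps so
 * that no intermediate result overflows. */
static unsigned long long max_signed_offset(unsigned bits)
{
    assert(bits >= 2 && bits <= 64);
    return ((1ULL << (bits - 2)) - 1) * 2 + 1;
}

/* Non-zero when a plain off_t is wider than 32 bits and can
 * therefore address files of 2GB and beyond. */
static int has_large_off_t(void)
{
    return off_t_bits() > 32;
}
```

<p>On a system where configure reports an <code>off_t</code> size
of 8, <code>has_large_off_t()</code> returns non-zero. With a
four-byte <code>off_t</code>, <code>max_signed_offset(32)</code>
gives 2,147,483,647, the 2GB boundary discussed in this section.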

<p>Also, some 32-bit operating systems have special file systems
that can support large (>2GB) files, and HDF5 will detect
these and use them automatically. If this is the case, the
output from configure will show:

<p><code><pre>
checking for lseek64... yes
checking for fseek64... yes
</pre></code>

<p>Otherwise one must use an HDF5 file family. Such a family is
created by setting file family properties in a file access
property list and then supplying a file name that includes a
<code>printf</code>-style integer format. For instance:
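
<p><code><pre>
hid_t plist, file;

/* File access property list that carries the family settings. */
plist = H5Pcreate (H5P_FILE_ACCESS);
/* Each family member holds 2^30 bytes (1GB). */
H5Pset_family (plist, 1<<30, H5P_DEFAULT);
/* The %03d in the name is the printf-style integer format. */
file = H5Fcreate ("big%03d.h5", H5F_ACC_TRUNC, H5P_DEFAULT, plist);
</pre></code>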

<p>The second argument (<code>1<<30</code>) to
<code>H5Pset_family()</code> indicates that the family members
are to be 2^30 bytes (1GB) each, although we could have used any
reasonably large value. In general, family members cannot be
2GB because writes to byte number 2,147,483,647 will fail (the
resulting file size would not fit in a signed four-byte
<code>off_t</code>), so the largest safe value for a family member
is 2,147,483,647 bytes. HDF5 will create family members on demand
as the HDF5 address space increases, but since most Unix systems
limit the number of concurrently open files, the effective maximum
size of the HDF5 address space will be limited. The system on which
this was developed allows 1024 open files, so if each family member
is approximately 2GB then the largest HDF5 file is approximately 2TB.
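
<p>The arithmetic behind a file family can be sketched in plain C.
The helpers below are hypothetical (they are not HDF5 functions) and
assume, as a simplification, that HDF5 addresses are laid out
linearly across equally sized members. They show which member a
given address would fall in, how the <code>printf</code>-style name
expands, and how the open-file limit bounds the address space.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Index of the family member that would hold byte `addr' when
 * every member is `member_size' bytes (assumed linear layout). */
static unsigned long long family_member_index(unsigned long long addr,
                                              unsigned long long member_size)
{
    assert(member_size > 0);
    return addr / member_size;
}

/* Offset of byte `addr' inside the member that holds it. */
static unsigned long long family_member_offset(unsigned long long addr,
                                               unsigned long long member_size)
{
    assert(member_size > 0);
    return addr % member_size;
}

/* Expand a family name pattern such as "big%03d.h5" for one member.
 * Returns what snprintf returns: the length of the expanded name. */
static int family_member_name(char *buf, size_t size,
                              const char *pattern, int member)
{
    assert(buf != NULL && pattern != NULL && member >= 0);
    return snprintf(buf, size, pattern, member);
}

/* Upper bound on the address space when at most `max_open_files'
 * members can be open at once. */
static unsigned long long family_capacity(unsigned long long max_open_files,
                                          unsigned long long member_size)
{
    assert(member_size > 0);
    assert(max_open_files <= ULLONG_MAX / member_size); /* no overflow */
    return max_open_files * member_size;
}
```

<p>With 1024 open files and members of 2,147,483,647 bytes,
<code>family_capacity()</code> returns 2,199,023,254,528 bytes,
which is the "approximately 2TB" figure quoted above. With 1GB
members, address 3*2^30 falls at offset 0 of member 3, whose name
expands from <code>big%03d.h5</code> to <code>big003.h5</code>.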

<p>If the effective HDF5 address space is limited, then one may be
able to store datasets as external datasets, each spanning
multiple files of any length, since HDF5 opens external dataset
files one at a time. To arrange storage for a 5TB dataset split
among 1GB files one could say:

<p><code><pre>
hid_t plist = H5Pcreate (H5P_DATASET_CREATE);
char name[32];
int i;

/* 5*1024 files of 2^30 bytes (1GB) each hold 5TB in total. */
for (i=0; i<5*1024; i++) {
    sprintf (name, "velocity-%04d.raw", i);
    H5Pset_external (plist, name, 0, (size_t)1<<30);
}
</pre></code>
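
<p>The file count used in the loop follows from dividing the
dataset size by the size of one external file. The sketch below
uses hypothetical helper names that are not part of HDF5. It also
illustrates why the total must be computed in a wide type: each 1GB
segment length fits in a four-byte <code>size_t</code>, but the 5TB
total does not.

```c
#include <assert.h>

/* Number of external files of `file_bytes' each that are needed
 * to hold `total_bytes'.  A partial last file counts as one file. */
static unsigned long long external_file_count(unsigned long long total_bytes,
                                              unsigned long long file_bytes)
{
    assert(file_bytes > 0);
    return total_bytes / file_bytes + (total_bytes % file_bytes != 0);
}

/* Non-zero when `value' can be stored in a four-byte unsigned
 * size_t, as found on the small machines discussed here. */
static int fits_in_32_bits(unsigned long long value)
{
    return value <= 0xFFFFFFFFULL;
}
```

<p>For a 5TB dataset (5*2^40 bytes) and 1GB files (2^30 bytes),
<code>external_file_count()</code> gives 5120, which matches the
loop bound <code>5*1024</code>. The largest index, 5119, still fits
the four-digit <code>%04d</code> field in the file name.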

<h2>3. Dataset Size Limits</h2>

<p>The second limit which must be overcome is that of
<code>sizeof(size_t)</code>. HDF5 defines a data type called
<code>hsize_t</code> which is used for sizes of datasets and is,
by default, defined as <code>unsigned long long</code>.

<p>To create a dataset with 8*2^30 4-byte integers, for a total of
32GB, one first creates the dataspace. We give two examples
here: a 4-dimensional dataset whose dimension sizes are smaller
than the maximum value of a <code>size_t</code>, and a
1-dimensional dataset whose dimension size is too large to fit
in a <code>size_t</code>.

<p><code><pre>
hsize_t size1[4] = {8, 1024, 1024, 1024};
hid_t space1 = H5Screate_simple (4, size1, size1);

hsize_t size2[1] = {8589934592LL};
hid_t space2 = H5Screate_simple (1, size2, size2);
</pre></code>

<p>However, the <code>LL</code> suffix is not portable, so it may
be better to replace the number with
<code>(hsize_t)8*1024*1024*1024</code>.
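
<p>The cast works because it applies to the first operand, so
every multiplication that follows is carried out in the wide type.
The sketch below illustrates this without HDF5:
<code>sketch_hsize_t</code> is a hypothetical stand-in for
<code>hsize_t</code>, and <code>element_count()</code> is an
illustrative helper, not a library function.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for hsize_t, using the default type
 * named in the text. */
typedef unsigned long long sketch_hsize_t;

/* Multiply the `rank' dimension sizes in `dims' using the wide
 * type.  Returns 1 and stores the product in *count, or returns 0
 * (leaving *count untouched) if the product would overflow. */
static int element_count(const sketch_hsize_t *dims, int rank,
                         sketch_hsize_t *count)
{
    sketch_hsize_t n = 1;
    int i;

    assert(dims != NULL && count != NULL && rank > 0);
    for (i = 0; i < rank; i++) {
        if (dims[i] != 0 && n > ULLONG_MAX / dims[i])
            return 0;
        n *= dims[i];
    }
    *count = n;
    return 1;
}
```

<p>Both dataspaces from the example describe the same number of
elements: <code>{8, 1024, 1024, 1024}</code> and
<code>{(sketch_hsize_t)8*1024*1024*1024}</code> each give
8,589,934,592 elements, or 34,359,738,368 bytes (32GB) of 4-byte
integers.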

<p>For compilers that don't support <code>long long</code>, large
datasets will not be possible. The library performs too much
arithmetic on <code>hsize_t</code> types to make the use of a
struct feasible.

<hr>
<address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
<!-- Created: Fri Apr 10 13:26:04 EDT 1998 -->
<!-- hhmts start -->
Last modified: Sun Jul 19 11:37:25 EDT 1998
<!-- hhmts end -->
</body>
</html>