mirror of
https://github.com/HDFGroup/hdf5.git
synced 2024-11-27 02:10:55 +08:00
063e4b2e2e
Description:

honest3 v1.8 failed in the parallel test. It got stuck in the same testpar/testphdf5 subtest (cbhsssdrpio). This is an old problem.

Upon closer inspection, testphdf5, when terminated, had clocked up 1 hr 9 min 46 sec of wall-clock time. The Honest1 system also sent a message that an MPI process had used up 30+ CPU minutes, which exceeded their login node CPU time limit, and they killed the process. I also did a hand-run of testphdf5. All subtests before cbhsssdrpio completed in a few minutes. Therefore, it is safe to say that the majority of the 70 minutes of wall-clock time was spent in the subtest cbhsssdrpio. It also used up lots of CPU time. cbhsssdrpio is likely looping infinitely.

Since MPI applications are prone to infinite looping due to message deadlock, testphdf5 has a built-in protection that gives each subtest at most 20 minutes of wall-clock time to run. When the 20 minutes of wall-clock time is exceeded, testphdf5 will attempt to terminate itself. This prevents unnecessary CPU time consumption in infinite looping. But that clock limit was changed to 30 and then 60 minutes. I should have noticed the change mentioned by Quincey, but failed to.

IMO, 20 minutes of wall-clock time is more than sufficient for each subtest of testphdf5 to complete. If a subtest takes longer than 20 minutes, it is likely looping infinitely. Giving it more time will not help. If a subtest of testphdf5 takes more than 20 minutes, it should be broken down into smaller tests that finish well under 20 minutes, so that it is much easier to see progress and identify any deadlock problems.

In view of this, I am changing the testphdf5 time limit back to 20 minutes. This will at least stop the CPU time from exceeding limits and annoying the system administrators.

Maybe there could be a provision, such as an environment variable like $HDF5_ALARM_SECOND, to modify the alarm duration for an individual execution. Even so, that should be used only temporarily, to see if an execution just needs a little more time.

Tested: just eyeballed, as the change is trivial.
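
The provision suggested at the end of the description could look roughly like the following. This is only a sketch under stated assumptions, not the code in testphdf5: the function names `alarm_seconds_from_string` and `arm_subtest_alarm` and the macro `DEFAULT_ALARM_SECONDS` are hypothetical, and `$HDF5_ALARM_SECOND` is the variable name floated in the commit message, not an existing setting. It parses an optional override, falls back to the 20-minute default on any invalid value, and arms a POSIX `alarm()`.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/* 20 minutes, the per-subtest limit the commit message restores. */
#define DEFAULT_ALARM_SECONDS 1200u

/*
 * Parse an override string such as the value of $HDF5_ALARM_SECOND
 * (a hypothetical variable). Returns `fallback` when the string is NULL,
 * empty, not a plain decimal number, zero, or too large for alarm().
 */
unsigned int
alarm_seconds_from_string(const char *value, unsigned int fallback)
{
    char         *end = NULL;
    unsigned long parsed;

    /* Require a leading digit: this also rejects "-5", which strtoul()
     * would otherwise silently wrap to a huge positive number. */
    if (value == NULL || !isdigit((unsigned char)value[0]))
        return fallback;

    errno  = 0;
    parsed = strtoul(value, &end, 10);
    if (errno != 0 || *end != '\0')
        return fallback; /* overflow or trailing junk such as "30min" */

    /* alarm(0) would cancel the alarm, so zero is treated as invalid. */
    if (parsed == 0 || parsed > UINT_MAX)
        return fallback;

    return (unsigned int)parsed;
}

/*
 * Arm the wall-clock limit for one subtest and return the number of
 * seconds used. With the default SIGALRM disposition, the process is
 * terminated when the alarm fires.
 */
unsigned int
arm_subtest_alarm(void)
{
    unsigned int seconds =
        alarm_seconds_from_string(getenv("HDF5_ALARM_SECOND"),
                                  DEFAULT_ALARM_SECONDS);

    alarm(seconds);
    return seconds;
}
```

Under these assumptions, a test driver would call `arm_subtest_alarm()` before each subtest and `alarm(0)` after it returns, and a one-off run needing more time could be started with the variable set to, say, `1800`. Falling back to the default on bad input keeps a typo in the environment from disabling the protection.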

| | | |
---|---|---|
bin | ||
c++ | ||
config | ||
examples | ||
fortran | ||
hl | ||
perform | ||
release_docs | ||
src | ||
test | ||
testpar | ||
tools | ||
vms | ||
windows | ||
.autom4te.cfg | ||
.h5chkright.ini | ||
ACKNOWLEDGMENTS | ||
aclocal.m4 | ||
acsite.m4 | ||
CMakeLists.txt | ||
configure | ||
configure.in | ||
COPYING | ||
CTestConfig.cmake | ||
Makefile.am | ||
Makefile.dist | ||
Makefile.in | ||
MANIFEST | ||
README.txt | ||

HDF5 version 1.9.75 currently under development
Please refer to the release_docs/INSTALL file for installation instructions.
------------------------------------------------------------------------------

This release is fully functional for the API described in the documentation.
See the RELEASE.txt file in the release_docs/ directory for information
specific to this release of the library. Several INSTALL* files can also be
found in the release_docs/ directory: INSTALL contains instructions for
compiling and installing the library; INSTALL_parallel contains instructions
for installing the parallel version of the library; similarly-named files
contain instructions for VMS and several environments on MS Windows systems.

Documentation for this release can be found at the following URL:
    http://www.hdfgroup.org/HDF5/doc/.

The following mailing lists are currently set up for HDF5 Library users:

    news       - For announcements of HDF5 related developments,
                 not a discussion list.

    hdf-forum  - For general discussion of the HDF5 library with
                 other users.

    hdf5dev    - For discussion of the HDF5 library development
                 with developers and other interested parties.

To subscribe to a list, send mail to "<list>-subscribe@hdfgroup.org",
where <list> is the name of the list. For example, send a request
to subscribe to the 'news' mail list to the following address:
    news-subscribe@hdfgroup.org

Messages to be sent to the list should be sent to "<list>@hdfgroup.org".

Periodic code snapshots are provided at the following URL:
    ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/hdf5/snapshots
Please read the README.txt file in that directory before working with a
library snapshot.

The HDF5 website is located at http://hdfgroup.org/HDF5/

Bugs should be reported to help@hdfgroup.org.