Denis Davydov
2015-11-03 08:07:15 UTC
Dear all,
I am experiencing strange convergence problems in SLEPc for a GHEP with Krylov-Schur and CG + GAMG.
The issue appears to be contingent on the number of MPI cores used:
say, with 8 cores there is no issue, while with 4 cores there is.
When I substitute Jacobi for GAMG at the problematic core count, everything works.
To be more specific, I solve Ax = \lambda Bx for a sequence of A's, where A is a function of the eigenvectors.
Currently, on each iteration step the eigensolver EPS is initialised from scratch, and so are the underlying
ST, KSP, and PC objects: -st_ksp_type cg -st_pc_type gamg -st_ksp_rtol 1e-12.
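For reference, the per-iteration setup is roughly equivalent to the following (a simplified sketch of what I described, not my actual deal.II code; A and B stand for the current operator and the mass matrix):

  PetscErrorCode ierr;
  EPS            eps;
  /* a fresh EPS is created on every iteration, so the ST/KSP/PC objects
     (and hence the GAMG hierarchy) are rebuilt from scratch as well */
  ierr = EPSCreate(PETSC_COMM_WORLD, &eps);CHKERRQ(ierr);
  ierr = EPSSetOperators(eps, A, B);CHKERRQ(ierr);  /* A changes each iteration, B does not */
  ierr = EPSSetProblemType(eps, EPS_GHEP);CHKERRQ(ierr);
  ierr = EPSSetType(eps, EPSKRYLOVSCHUR);CHKERRQ(ierr);
  /* picks up -st_ksp_type cg -st_pc_type gamg -st_ksp_rtol 1e-12 */
  ierr = EPSSetFromOptions(eps);CHKERRQ(ierr);
  ierr = EPSSolve(eps);CHKERRQ(ierr);
  ierr = EPSDestroy(&eps);CHKERRQ(ierr);            /* everything is discarded here */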
For these particular matrices the issue appears on the 4th iteration, even though the matrix to be inverted
(the mass/overlap matrix) is the same and does not change.
From my debugging info, the A matrix has the same norms in the CG + GAMG and CG + Jacobi cases:
DEAL:: frobenius_norm = 365.7
DEAL:: linfty_norm = 19.87
DEAL:: l1_norm = 19.87
Just to be sure that there are no bugs on my side that would result in different mass matrices, I check that
it has the same norms for CG + GAMG vs CG + Jacobi BEFORE I start the iteration:
DEAL:: frobenius_norm = 166.4
DEAL:: linfty_norm = 8.342
DEAL:: l1_norm = 8.342
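(In PETSc terms, the check amounts to something like the following; in my code the numbers above actually come from deal.II's wrappers:)

  PetscReal fro, linf, l1;
  ierr = MatNorm(B, NORM_FROBENIUS, &fro);CHKERRQ(ierr);
  ierr = MatNorm(B, NORM_INFINITY,  &linf);CHKERRQ(ierr);
  ierr = MatNorm(B, NORM_1,         &l1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "frobenius=%g linfty=%g l1=%g\n",
                     (double)fro, (double)linf, (double)l1);CHKERRQ(ierr);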
All the dependent scalar quantities I calculate on each iteration are identical in the two cases, which makes me believe
that the solution path is the same up to a certain tolerance.
The only outputs that differ slightly are the number of iterations for convergence in EPS (e.g. 113 vs 108) and the
resulting maximum EPSComputeResidualNorm: 4.1524e-07 vs 2.9639e-08.
Any ideas what the issue could be, especially given that it works for some numbers of cores and not for others?
Perhaps some default settings in the GAMG preconditioner? Although that would not explain why it works for the first 3
iterations and fails on the 4th, since the mass matrix is unchanged...
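(If it turns out to be a GAMG-defaults issue, these are the sort of knobs I could experiment with; just guesses on my side, not a known fix. Note the st_ prefix, since the PC lives under the spectral transformation:)

  -st_ksp_monitor_true_residual
  -st_pc_gamg_threshold 0.02
  -st_pc_gamg_agg_nsmooths 1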
Lastly, I suppose that ideally I should keep the eigensolver context between iterations and just update the matrices
with EPSSetOperators. Is it correct to assume that, since the B matrix does not change between iterations and I use
the default shift transformation with zero shift (the operator is B^{-1}A), the GAMG preconditioner will not be
re-initialised, and thus I should save some time?
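Sketched, the reuse I have in mind would look like this (again just a sketch; A_new, prev_evecs, and nconv are placeholder names, and whether GAMG is actually kept is exactly my question):

  /* eps created and configured once, outside the loop; then per iteration: */
  ierr = EPSSetOperators(eps, A_new, B);CHKERRQ(ierr);  /* B unchanged; ST operator is B^{-1}A */
  /* optionally warm-start from the previous eigenvectors:
     EPSSetInitialSpace(eps, nconv, prev_evecs); */
  ierr = EPSSolve(eps);CHKERRQ(ierr);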
P.S. The relevant error message is below. I have the same issue on a CentOS cluster, so it is not related to OS X.
Kind regards,
Denis
===
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: INSTEAD the line number of the start of the function
[0]PETSC ERROR: is given.
[0]PETSC ERROR: [0] KSPSolve line 510 /private/tmp/petsc20151102-50378-1t7b3in/petsc-3.6.2/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: [0] STMatSolve line 148 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/sys/classes/st/interface/stsles.c
[0]PETSC ERROR: [0] STApply_Shift line 33 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/sys/classes/st/impls/shift/shift.c
[0]PETSC ERROR: [0] STApply line 50 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/sys/classes/st/interface/stsolve.c
[0]PETSC ERROR: [0] EPSGetStartVector line 726 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/eps/interface/epssolve.c
[0]PETSC ERROR: [0] EPSSolve_KrylovSchur_Symm line 41 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/eps/impls/krylov/krylovschur/ks-symm.c
[0]PETSC ERROR: [0] EPSSolve line 83 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/eps/interface/epssolve.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.6.2, Oct, 02, 2015
[0]PETSC ERROR: /Users/davydden/Desktop/work/C++/deal.ii-dft/build_debug~/dft on a real named MBP-Denis.fritz.box by davydden Tue Nov 3 07:02:47 2015
[0]PETSC ERROR: Configure options CC=/usr/local/bin/mpicc CXX=/usr/local/bin/mpicxx F77=/usr/local/bin/mpif77 FC=/usr/local/bin/mpif90 --with-shared-libraries=1 --with-pthread=0 --with-openmp=0 --with-debugging=1 --with-ssl=0 --with-superlu_dist-include=/usr/local/opt/superlu_dist/include/superlu_dist --with-superlu_dist-lib="-L/usr/local/opt/superlu_dist/lib -lsuperlu_dist" --with-superlu-include=/usr/local/Cellar/superlu43/4.3/include/superlu --with-superlu-lib="-L/usr/local/Cellar/superlu43/4.3/lib -lsuperlu" --with-fftw-dir=/usr/local/opt/fftw --with-netcdf-dir=/usr/local/opt/netcdf --with-suitesparse-dir=/usr/local/opt/suite-sparse --with-hdf5-dir=/usr/local/opt/hdf5 --with-metis-dir=/usr/local/opt/metis --with-parmetis-dir=/usr/local/opt/parmetis --with-scalapack-dir=/usr/local/opt/scalapack --with-mumps-dir=/usr/local/opt/mumps --with-x=0 --prefix=/usr/local/Cellar/petsc/3.6.2/real --with-scalar-type=real --with-hypre-dir=/usr/local/opt/hypre --with-sundials-dir=/usr/local/opt/sundials --with-hwloc-dir=/usr/local/opt/hwloc
[0]PETSC ERROR: #8 User provided function() line 0 in unknown file
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 96754 on node MBP-Denis exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------