Discussion:
[petsc-users] CG+GAMG convergence issues in GHEP Krylov-Schur for some MPI runs
Denis Davydov
2015-11-03 08:07:15 UTC
Permalink
Dear all,

I experience strange convergence problems in SLEPc for GHEP with Krylov-Schur and CG + GAMG.
The issue appears to be contingent on the number of MPI cores used.
Say for 8 cores there is no issue and for 4 cores there is an issue.
When I substitute GAMG with Jacobi for the problematic number of cores -- all works.

To be more specific, I solve Ax = \lambda Bx for a sequence of A's, where A is a function of the eigenvectors.
On each iteration step the eigensolver EPS is currently initialised from scratch, and so are the underlying
ST, KSP, PC objects: -st_ksp_type cg -st_pc_type gamg -st_ksp_rtol 1e-12.
For these particular matrices the issue appears on the 4th iteration, even though the matrix to be inverted (the mass/overlap matrix)
is the same and does not change!
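Roughly, what I do on every iteration looks like this (a minimal sketch with the SLEPc C API; assembly, error checking and the actual update of A are omitted):

  EPS eps;
  PetscErrorCode ierr;
  ierr = EPSCreate(PETSC_COMM_WORLD, &eps);CHKERRQ(ierr);
  ierr = EPSSetOperators(eps, A, B);CHKERRQ(ierr);        /* A depends on the current eigenvectors, B is the mass matrix */
  ierr = EPSSetProblemType(eps, EPS_GHEP);CHKERRQ(ierr);  /* generalized Hermitian eigenproblem */
  ierr = EPSSetFromOptions(eps);CHKERRQ(ierr);            /* picks up -st_ksp_type cg -st_pc_type gamg -st_ksp_rtol 1e-12 */
  ierr = EPSSolve(eps);CHKERRQ(ierr);
  ierr = EPSDestroy(&eps);CHKERRQ(ierr);                  /* currently destroyed and re-created on every iteration */
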
From my debugging info the A matrix has the same norms for the CG + GAMG and CG + Jacobi cases:
DEAL:: frobenius_norm = 365.7
DEAL:: linfty_norm = 19.87
DEAL:: l1_norm = 19.87
Just to be sure that there are no bugs on my side which would result in different mass matrices, I check that
it has the same norms for CG + GAMG and CG + Jacobi BEFORE I start the iteration:
DEAL:: frobenius_norm = 166.4
DEAL:: linfty_norm = 8.342
DEAL:: l1_norm = 8.342
All the dependent scalar quantities I calculate on each iteration are identical for the two cases, which makes me believe that
the solution path is the same up to a certain tolerance.
The only outputs which differ slightly are the number of iterations for convergence in EPS (e.g. 113 vs 108) and the
resulting maximum EPSComputeResidualNorm: 4.1524e-07 vs 2.9639e-08.


Any ideas what the issue could be, especially given the fact that it works for some numbers of cores and not for others?
Perhaps some default settings in the GAMG preconditioner? Although that does not explain why it works for the first 3 iterations
and not on the 4th, as the mass matrix is unchanged...

Lastly, I suppose ideally I should keep the eigensolver context between the iterations and just update the matrices with EPSSetOperators.
Is it correct to assume that, since the B matrix does not change between iterations and I use the default shift transformation with zero shift
(the operator is B^{-1}A), the GAMG preconditioner will not be re-initialised and thus I should save some time?
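In other words, something along these lines (just a sketch of what I mean; error checking and the update of A are omitted):

  EPS eps;
  ierr = EPSCreate(PETSC_COMM_WORLD, &eps);CHKERRQ(ierr);
  ierr = EPSSetProblemType(eps, EPS_GHEP);CHKERRQ(ierr);
  ierr = EPSSetFromOptions(eps);CHKERRQ(ierr);
  for (PetscInt it = 0; it < max_it; ++it) {
    /* reassemble A from the current eigenvectors; B stays fixed */
    ierr = EPSSetOperators(eps, A, B);CHKERRQ(ierr);  /* update operators, keep the same EPS/ST/KSP/PC */
    ierr = EPSSolve(eps);CHKERRQ(ierr);
    /* extract eigenpairs with EPSGetEigenpair() and update A */
  }
  ierr = EPSDestroy(&eps);CHKERRQ(ierr);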

p.s. the relevant error message is below. I have the same issues on a CentOS cluster, so it is not related to OS X.

Kind regards,
Denis


===
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: INSTEAD the line number of the start of the function
[0]PETSC ERROR: is given.
[0]PETSC ERROR: [0] KSPSolve line 510 /private/tmp/petsc20151102-50378-1t7b3in/petsc-3.6.2/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: [0] STMatSolve line 148 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/sys/classes/st/interface/stsles.c
[0]PETSC ERROR: [0] STApply_Shift line 33 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/sys/classes/st/impls/shift/shift.c
[0]PETSC ERROR: [0] STApply line 50 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/sys/classes/st/interface/stsolve.c
[0]PETSC ERROR: [0] EPSGetStartVector line 726 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/eps/interface/epssolve.c
[0]PETSC ERROR: [0] EPSSolve_KrylovSchur_Symm line 41 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/eps/impls/krylov/krylovschur/ks-symm.c
[0]PETSC ERROR: [0] EPSSolve line 83 /private/tmp/slepc20151102-3081-1xln4h0/slepc-3.6.1/src/eps/interface/epssolve.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.6.2, Oct, 02, 2015
[0]PETSC ERROR: /Users/davydden/Desktop/work/C++/deal.ii-dft/build_debug~/dft on a real named MBP-Denis.fritz.box by davydden Tue Nov 3 07:02:47 2015
[0]PETSC ERROR: Configure options CC=/usr/local/bin/mpicc CXX=/usr/local/bin/mpicxx F77=/usr/local/bin/mpif77 FC=/usr/local/bin/mpif90 --with-shared-libraries=1 --with-pthread=0 --with-openmp=0 --with-debugging=1 --with-ssl=0 --with-superlu_dist-include=/usr/local/opt/superlu_dist/include/superlu_dist --with-superlu_dist-lib="-L/usr/local/opt/superlu_dist/lib -lsuperlu_dist" --with-superlu-include=/usr/local/Cellar/superlu43/4.3/include/superlu --with-superlu-lib="-L/usr/local/Cellar/superlu43/4.3/lib -lsuperlu" --with-fftw-dir=/usr/local/opt/fftw --with-netcdf-dir=/usr/local/opt/netcdf --with-suitesparse-dir=/usr/local/opt/suite-sparse --with-hdf5-dir=/usr/local/opt/hdf5 --with-metis-dir=/usr/local/opt/metis --with-parmetis-dir=/usr/local/opt/parmetis --with-scalapack-dir=/usr/local/opt/scalapack --with-mumps-dir=/usr/local/opt/mumps --with-x=0 --prefix=/usr/local/Cellar/petsc/3.6.2/real --with-scalar-type=real --with-hypre-dir=/usr/local/opt/hypre --with-sundials-dir=/usr/local/opt/sundials --with-hwloc-dir=/usr/local/opt/hwloc
[0]PETSC ERROR: #8 User provided function() line 0 in unknown file
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 96754 on node MBP-Denis exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
Jose E. Roman
2015-11-03 11:20:43 UTC
Permalink
I am answering the SLEPc-related questions:
- Having different number of iterations when changing the number of processes is normal.
- Yes, if you do not destroy the EPS solver, then the preconditioner would be reused.

Regarding the segmentation fault, I have no clue. Not sure if this is related to GAMG or not. Maybe running under valgrind could provide more information.

Jose
Denis Davydov
2015-11-03 12:32:27 UTC
Permalink
Hi Jose,
Post by Jose E. Roman
- Having different number of iterations when changing the number of processes is normal.
the change in iterations I mentioned is for different preconditioners, but the same number of MPI processes.
Post by Jose E. Roman
- Yes, if you do not destroy the EPS solver, then the preconditioner would be reused.
Regarding the segmentation fault, I have no clue. Not sure if this is related to GAMG or not. Maybe running under valgrind could provide more information.
will try that.

Denis.
Mark Adams
2015-11-03 13:47:39 UTC
Permalink
BTW, I think that our advice for a segv is to use a debugger. DDT or Totalview,
and gdb if need be, will get you right to the source code and will get 90%
of bugs diagnosed. Valgrind is noisy and cumbersome to use but can
diagnose 90% of the other 10%.
Barry Smith
2015-11-03 16:55:59 UTC
Permalink
I am more optimistic about valgrind than Mark. I first try valgrind and, if that fails to be helpful, then use the debugger. Valgrind has the advantage that it finds the FIRST place where something is wrong, while in the debugger you only see the state at the crash, which is often too late.

Valgrind should not be noisy; if it is, then the applications/libraries should be cleaned up so that they are valgrind-clean, and then valgrind is useful.
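For example, something along these lines (the executable name and solver options are just those from the original report):

  mpirun -n 4 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p \
         ./dft -st_ksp_type cg -st_pc_type gamg -st_ksp_rtol 1e-12

valgrind expands %p to the process id, so each MPI rank writes its own log.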

Barry
Mark Adams
2015-11-10 13:51:48 UTC
Permalink
I ran an 8-processor job on Edison with a small code for a short run (just a
linear solve) and got 37 MB of output!

Here is a 'Petsc' grep.

Perhaps we should build an ignore file for things that we believe are
false positives.
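Something with roughly this shape (the name and frame below are placeholders; the real entries would be generated from the reported stacks and passed to valgrind via --suppressions=<file>):

  {
     petsc-known-false-positive
     Memcheck:Cond
     fun:SomePetscFunction
     ...
  }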
Post by Barry Smith
I am more optimistic about valgrind than Mark. I first try valgrind and
if that fails to be helpful then use the debugger. valgrind has the
advantage that it finds the FIRST place that something is wrong, while in
the debugger it is kind of late at the crash.
Valgrind should not be noisy, if it is then the applications/libraries
should be cleaned up so that they are valgrind clean and then valgrind is
useful.
Barry
Post by Mark Adams
BTW, I think that our advice for segv is use a debugger. DDT or
Totalview, and gdb if need be, will get you right to the source code and
will get 90% of bugs diagnosed. Valgrind is noisy and cumbersome to use
but can diagnose 90% of the other 10%.
Post by Mark Adams
Hi Jose,
Post by Jose E. Roman
- Having different number of iterations when changing the number of
processes is normal.
Post by Mark Adams
the change in iterations i mentioned are for different preconditioners,
but the same number of MPI processes.
Post by Mark Adams
Post by Jose E. Roman
- Yes, if you do not destroy the EPS solver, then the preconditioner
would be reused.
Post by Mark Adams
Post by Jose E. Roman
Regarding the segmentation fault, I have no clue. Not sure if this is
related to GAMG or not. Maybe running under valgrind could provide more
information.
Post by Mark Adams
will try that.
Denis.
Mark Adams
2015-11-10 14:20:34 UTC
Permalink
valgrind on Edison does seem to give a lot of false positives. The line
numbers are accurate (not always the case). "assert" triggers it, as does
SETERRQ.
Barry Smith
2015-11-10 16:15:10 UTC
Permalink
Please send me the full output. This is nuts and, once we understand it better, should be reported to NERSC as something to be fixed. When I pay $60 million in taxes to a computing center I expect something that works fine for free on my laptop to also work there.

Barry
Denis Davydov
2015-11-03 18:46:00 UTC
Permalink
Jose,

Even when I have PETSc --with-debugging=1 and SLEPc picks it up during configure,
I don't seem to have debug symbols in the resulting SLEPc lib (make stage):

warning: no debug symbols in executable (-arch x86_64)

Same when starting a debugger:
warning: (x86_64) /usr/local/opt/slepc/real/lib/libslepc.3.6.dylib empty dSYM file detected, dSYM was created with an executable with no debug info.

The C/Fortran flags seem to include debug flags:

Using C/C++ linker: /usr/local/bin/mpicc
Using C/C++ flags: -Wl,-multiply_defined,suppress -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs -Wl,-search_paths_first -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3 -O0
Using Fortran linker: /usr/local/bin/mpif90
Using Fortran flags: -Wl,-multiply_defined,suppress -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs -Wl,-search_paths_first -fPIC -Wall -Wno-unused-variable -ffree-line-length-0 -Wno-unused-dummy-argument -g -O0

Any ideas?

Kind regards,
Denis
Jose E. Roman
2015-11-03 18:54:43 UTC
Permalink
In MacOSX you have to keep the *.o files, and not delete them.
With PETSc's makefiles, this can be done easily with e.g.
$ make ex1 RM=echo

Jose
Denis Davydov
2015-11-06 14:51:52 UTC
Permalink
After running in debug mode it seems that the GAMG solver indeed did not
converge; however, throwing the error leads to SIGABRT (backtrace and frames
are below).
It is still very suspicious why solving for the (unchanged) mass matrix
would not converge inside SLEPc's spectral transformation.

p.s. valgrind takes an enormous amount of time on this problem;
I will try to leave it running over the weekend.

Denis.

===============
Program received signal SIGABRT, Aborted.
0x00007fffea87fcc9 in __GI_raise (sig=***@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007fffea87fcc9 in __GI_raise (sig=***@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fffea8830d8 in __GI_abort () at abort.c:89
#2 0x00007fffeb790c91 in PetscTraceBackErrorHandler (comm=0x2a09bd0,
line=798, fun=0x7fffed0e24b9 <__func__.20043> "KSPSolve",
file=0x7fffed0e1620
"/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/ksp/ksp/interface/itfunc.c",
n=91, p=PETSC_ERROR_INITIAL,
mess=0x7fffffffac30 "KSPSolve has not converged", ctx=0x0)
at
/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/sys/error/errtrace.c:243
#3 0x00007fffeb78b8b9 in PetscError (comm=0x2a09bd0, line=798,
func=0x7fffed0e24b9 <__func__.20043> "KSPSolve",
file=0x7fffed0e1620
"/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/ksp/ksp/interface/itfunc.c",
n=91, p=PETSC_ERROR_INITIAL,
mess=0x7fffed0e1e7a "KSPSolve has not converged")
at
/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/sys/error/err.c:377
#4 0x00007fffec75e1e7 in KSPSolve (ksp=0x367227d0, b=0x35b285c0,
x=0x35d89250)
at
/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/ksp/ksp/interface/itfunc.c:798
#5 0x00007fffe32a8657 in STMatSolve (st=0x3672d820, b=0x35b285c0,
x=0x35d89250)
at
/home/davydden/.hashdist/tmp/slepc-22nb32nbgvhx/src/sys/classes/st/interface/stsles.c:166
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) f 5
#5 0x00007fffe32a8657 in STMatSolve (st=0x3672d820, b=0x35b285c0,
x=0x35d89250)
at
/home/davydden/.hashdist/tmp/slepc-22nb32nbgvhx/src/sys/classes/st/interface/stsles.c:166
166 ierr = KSPSolve(st->ksp,b,x);CHKERRQ(ierr);
(gdb) f 4
#4 0x00007fffec75e1e7 in KSPSolve (ksp=0x367227d0, b=0x35b285c0,
x=0x35d89250)
at
/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/ksp/ksp/interface/itfunc.c:798
798 if (ksp->errorifnotconverged && ksp->reason < 0)
SETERRQ(comm,PETSC_ERR_NOT_CONVERGED,"KSPSolve has not converged");
(gdb) f 3
#3 0x00007fffeb78b8b9 in PetscError (comm=0x2a09bd0, line=798,
func=0x7fffed0e24b9 <__func__.20043> "KSPSolve",
file=0x7fffed0e1620
"/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/ksp/ksp/interface/itfunc.c",
n=91, p=PETSC_ERROR_INITIAL,
mess=0x7fffed0e1e7a "KSPSolve has not converged")
at
/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/sys/error/err.c:377
377 if (!eh) ierr =
PetscTraceBackErrorHandler(comm,line,func,file,n,p,lbuf,0);
(gdb) f 2
#2 0x00007fffeb790c91 in PetscTraceBackErrorHandler (comm=0x2a09bd0,
line=798, fun=0x7fffed0e24b9 <__func__.20043> "KSPSolve",
file=0x7fffed0e1620
"/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/ksp/ksp/interface/itfunc.c",
n=91, p=PETSC_ERROR_INITIAL,
mess=0x7fffffffac30 "KSPSolve has not converged", ctx=0x0)
at
/home/davydden/.hashdist/tmp/petsc-hujktg3j6hq7/src/sys/error/errtrace.c:243
243 abort();
(gdb) f 1
#1 0x00007fffea8830d8 in __GI_abort () at abort.c:89
89 abort.c: No such file or directory.
Hong
2015-11-06 15:09:02 UTC
Permalink
Denis:
Do you use the shift-and-invert method for solving the eigenvalue problem?
If so, the linear problems would be extremely ill-conditioned, for which
a direct solver, such as LU or Cholesky, is usually the only working option.

You may run your petsc/slepc code with the option '-ksp_monitor' to observe
the convergence behavior.
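(For the solve inside SLEPc's spectral transformation the options carry the st_ prefix already used above, e.g. something like the following, where the executable name is only illustrative:

  mpirun -n 4 ./dft -st_ksp_type cg -st_pc_type gamg -st_ksp_monitor_true_residual -st_ksp_converged_reason
)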

Hong

Denis Davydov
2015-11-06 15:15:18 UTC
Permalink
Hi Hong,
Post by Hong
Do you use shift-and-invert method for solving eigenvalue problem?
no, it's just a shift with zero value. So for GHEP one inverts the B matrix.
Post by Hong
If so, the linear problems would be extremely ill-conditioned, for which the direct solver, such LU or Cholesky are usually the only working option.
That depends on the shift, I would say.
In any case the same problem works with the Jacobi preconditioner and no other changes,
so I would not relate it to any settings on the SLEPc side.
Post by Hong
You may run your petsc/slepc code with option '-ksp_monitor' to observe convergence behavior.
Will do, thanks.

Regards,
Denis.
Matthew Knepley
2015-11-06 15:22:03 UTC
Permalink
Is it possible that the matrix is rank deficient? Jacobi will just chug
along and sometimes work, but
AMG will fail spectacularly in that case.

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
Denis Davydov
2015-11-06 15:29:42 UTC
Permalink
Is it possible that the matrix is rank deficient? Jacobi will just chug along and sometimes work, but
AMG will fail spectacularly in that case.
It should not be. It is just a mass (overlap) matrix coming from linear FEs with zero Dirichlet BC, assembled in deal.II.
Due to the elimination of some algebraic constraints on DoFs there are rows with only a diagonal element,
but it should still be SPD.

More interestingly, it does not fail immediately (i.e. the first time it's used in the SLEPc solvers),
but only on the 4th step. So SLEPc worked just fine three times to solve the GHEP with GAMG and zero shift.

Regards,
Denis.
Matthew Knepley
2015-11-06 15:32:59 UTC
Permalink
Then I think it is not doing what you suppose. I am not inclined to believe
that it behaves differently on the
same matrix.

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
Barry Smith
2015-11-06 16:39:50 UTC
Permalink
If it is a true mass matrix in the finite element sense of the word then it should be very well conditioned, and one definitely would not use something like GAMG on it. Jacobi + CG or maybe SSOR + CG should converge rapidly.
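In terms of the options used above that would be something like the following (the SSOR spelling via the SOR symmetric option is my best guess and may need checking):

  -st_ksp_type cg -st_pc_type jacobi
or
  -st_ksp_type cg -st_pc_type sor -st_pc_sor_symmetric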

Barry
Denis Davydov
2015-11-06 17:35:32 UTC
Permalink
Post by Barry Smith
If it is a true mass matrix in the finite element sense of the word then it should be very well conditioned and one definitely would not use something like GAMG on. Jacobi + CG or maybe SSOR + CG should converge rapidly
That I understand and absolutely agree.
It just does not explain why GAMG would fail, especially on 4 cores and not on 8.

Regards,
Denis.
Mark Adams
2015-11-06 18:39:42 UTC
Permalink
You can run with -info and grep on GAMG, and send this.

If you are shifting a matrix then it can/will get indefinite. If it is
just a mass matrix then Jacobi should converge quickly - does it?
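For example (the executable name is illustrative):

  mpirun -n 4 ./dft -info 2>&1 | grep GAMG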
Matthew Knepley
2015-11-03 12:31:46 UTC
Permalink
I assume the issue is the SEGV in the log you posted? I agree with Jose that you need to
run valgrind. An SEGV can result from
memory corruption in a distant part of the code. This seems very likely to
me since it is the same matrix coming in.

Thanks,

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener