Discussion:
[petsc-users] Number of levels of multigrid: 2-3 is sufficient??
Timothée Nicolas
2015-10-14 02:23:17 UTC
Permalink
Dear all,

I have been playing around with multigrid recently, namely with
/ksp/ksp/examples/tutorials/ex42.c, with /snes/examples/tutorials/ex5.c and
with my own implementation of a Laplacian-type problem. In all cases, I
have noted no improvement whatsoever in performance, whether in CPU time
or in KSP iterations, when varying the number of levels of the multigrid
solver. As an example, I have attached the log_summary for ex5.c with
nlevels = 2 to 7, launched by

mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9 -da_refine
6 -pc_type mg -pc_mg_levels # -snes_monitor -ksp_monitor -log_summary

where -pc_mg_levels is set to a number between 2 and 7.
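
(For reference, the whole sweep is essentially the following shell loop; the
log file names here are placeholders, not the exact names of the attached
files.)

  for nlev in 2 3 4 5 6 7; do
    mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9 \
        -da_refine 6 -pc_type mg -pc_mg_levels $nlev \
        -snes_monitor -ksp_monitor -log_summary > mg_${nlev}_levels.log
  done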

There is a noticeable CPU time improvement from 2 levels to 3 levels
(30%), and then no improvement whatsoever. I am surprised because with 6
levels of refinement of the DMDA the fine grid has more than 1200 points in
each direction, so with 3 levels the coarse grid still has more than 300
points in each direction, which is still pretty large (I assume the ratio
between grids is 2). I am wondering how the coarse solver can efficiently
solve the problem on a coarse grid with that many points. Given the
principle of multigrid, which is to erase the smooth part of the error with
relaxation methods that are usually efficient only for high frequencies, I
would expect optimal performance when the coarse grid is just a few points
in each direction. Does anyone know why the performance saturates at a low
number of levels? What happens internally seems to be quite different from
what I would expect...

Best

Timothee
Matthew Knepley
2015-10-14 12:22:59 UTC
Permalink
A performance model that counts only flops is not sophisticated enough to
understand this effect. Unfortunately, nearly all MG books/papers use this
model. What we need is a model that incorporates memory bandwidth (for
pulling down the values), and maybe also memory latency. For instance, your
relaxation pulls down all the values and makes a little progress: it does
few flops, but lots of memory access. An LU solve does a little memory
access and many more flops, but makes a lot more progress. If memory access
is more expensive, then we have a tradeoff, and we can understand using a
coarse grid which is not just a few points.
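
A rough back-of-the-envelope version of this (the constants are illustrative,
not measured): for a 5-point Laplacian on an n x n grid with N = n^2 unknowns,
one smoother sweep does about 10*N flops but streams roughly 60*N bytes of
matrix data (AIJ values plus column indices) and another ~20*N bytes of
vectors, i.e. about 0.1 flops per byte, so it runs at memory-bandwidth speed
and only damps the high-frequency error. A sparse direct factorization of the
same grid costs on the order of N^(3/2) flops, with triangular solves around
N*log(N), but much of that work happens in dense blocks that stay in cache,
and it removes the error on that level entirely. When bandwidth is the scarce
resource, one cache-friendly direct solve on a ~300x300 coarse grid can beat
several more bandwidth-bound smoothing sweeps on additional coarse levels.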

Thanks,

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
Timothée Nicolas
2015-10-14 12:34:04 UTC
Permalink
OK, I see. Does it mean that the coarse grid solver is by default set up
with the options -ksp_type preonly -pc_type lu ? What about the
multiprocessor case ?

Thx

Timothee
Matthew Knepley
2015-10-14 14:50:13 UTC
Permalink
Small scale: We use redundant LU

Large Scale: We use GAMG

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
Dave May
2015-10-14 15:01:33 UTC
Permalink
Is your answer what "you" recommend, or what PETSc does by default?

Your answer gives the impression that PETSc makes a decision regarding the
choice of either redundant/LU or gamg based on something - e.g. the size of
the matrix, the number of cores (or some combination of the two).
Is that really what is happening inside PCMG?
Matthew Knepley
2015-10-14 15:02:35 UTC
Permalink
No, the default is redundant LU, and then you can decide to use GAMG.
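
For example, you can see what PCMG actually set up with -snes_view (or
-ksp_view), and override the coarse solve through the mg_coarse_ options
prefix; something along these lines (exact option names are worth checking
against -help output for your PETSc version):

  # the default-style coarse solve (redundant LU):
  -mg_coarse_ksp_type preonly -mg_coarse_pc_type redundant -mg_coarse_redundant_pc_type lu

  # or switch the coarse level to algebraic multigrid:
  -mg_coarse_pc_type gamg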

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
Barry Smith
2015-10-14 18:02:37 UTC
Permalink
1) Your timings are meaningless! You cannot compare timings when built with all debugging on, PERIOD!

##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run ./configure                #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################

2) Please run with -snes_view .

3) Note that with 7 levels

SNESJacobianEval 21 1.0 2.4364e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 54 0 0 0 0 54 0 0 0 0 0

with 2 levels

SNESJacobianEval 6 1.0 2.2441e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0


The Jacobian evaluation is dominating the time! This will likely be much less the case once you fix the debugging build.

Barry
Timothée Nicolas
2015-10-15 02:32:20 UTC
Permalink
Thank you Barry for pointing this out. Indeed, on a build with no debugging
the Jacobian evaluations no longer dominate the time (less than 10%).
However, the rest is similar, except that the improvement from 2 to 3 levels
is much better. It still saturates after levels = 3. I understand this in
terms of CPU time thanks to Matthew's explanations; what surprises me more
is that the KSP iterations are not more efficient. Even if having more
levels takes more time because of memory issues, I would at least expect
the KSP iterations to converge more rapidly with more levels, but that is
not the case, as you can see. Probably there is a rationale behind this as
well, but I cannot see it easily.

I am sending the new outputs.

Best

Timothee
Barry Smith
2015-10-15 03:37:08 UTC
Permalink
Timothee,

Thank you for reporting this issue; it is indeed disturbing and could be due to a performance regression we may have introduced by being too clever for our own good. Could you please rerun with the additional option -mg_levels_ksp_type richardson and send the same output?
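
For example (everything else unchanged from your original runs):

  mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9 -da_refine 6 \
      -pc_type mg -pc_mg_levels 3 -mg_levels_ksp_type richardson \
      -snes_monitor -ksp_monitor -log_summary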

Thanks

Barry
Timothée Nicolas
2015-10-15 03:53:29 UTC
Permalink
OK,

Richardson is 30-70% faster for these tests, but other than this I don't
see any change.

Timothee
Barry Smith
2015-10-15 04:15:04 UTC
Permalink
Wow, quick response! Yes the times still indicate that after 4 levels you get no improvement in time.

t = [1.5629e+01 , 6.2692e+00, 5.3451e+00, 5.4948e+00, 5.4940e+00, 5.7643e+00 ]

I'll look more closely at the numbers tomorrow, when I am less drunk, to see where the time is going. It is a tradeoff between the work saved in the direct solve vs. the work needed for the coarser levels in the multigrid cycle.

Try refining the grid a couple more times; likely more levels will still help in that case.

Ahh, you should also try -pc_mg_type full
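
For example, something along these lines (the particular -da_refine and
-pc_mg_levels values are just illustrations of "finer grid, more levels"):

  mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9 -da_refine 8 \
      -pc_type mg -pc_mg_type full -pc_mg_levels 6 -mg_levels_ksp_type richardson \
      -snes_monitor -ksp_monitor -log_summary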


Barry
Timothée Nicolas
2015-10-15 05:26:51 UTC
Permalink
OK,

I ran another battery of tests; here are the outputs. It does get a bit
better with more refinement, as you suggested. For instance, with one more
level of refinement, the CPU time saturation occurs at 5 levels instead of 3
previously. However, the number of KSP iterations always tends to
(marginally) increase with the number of levels. At the same time, it always
remains pretty low (fewer than 5 iterations, with extremely good
convergence), so maybe that is not really surprising?

Timothee
Barry Smith
2015-10-15 05:34:43 UTC
Permalink
Yes, this is normal. You should not expect the number of outer KSP iterations to decrease with the number of levels; at best it will stay the same. It will not decrease.

Barry
Barry Smith
2015-10-15 21:01:31 UTC
Permalink
I guess -pc_mg_type full is the best you are going to get. In parallel, the coarser grids are problematic because they have little local work but still require communication.

Barry
Timothée Nicolas
2015-10-16 00:44:04 UTC
Permalink
OK, perfect, thank you.
Post by Barry Smith
I guess the -pc_mg_type full is the best you are going to get. In
parallel the coarser grids are problematic because they have little local
work but still communication.
Barry
On Oct 15, 2015, at 12:26 AM, Timothée Nicolas <
OK,
I ran an other battery of tests, here are the outputs. It seems to get a
bit better when refining more as you suggested. For instance, for one more
level of refinement, the CPU time saturation occurs for 5 levels instead of
3 previously. However the number of KSP iterations always tends to
(marginally) increase with the number of levels. But in the same time, it
always remain pretty low (less than 5 with extremely good convergence) so
maybe it is not really surprising ?
Timothee
Wow, quick response! Yes the times still indicate that after 4 levels
you get no improvement in time.
t = [1.5629e+01 , 6.2692e+00, 5.3451e+00, 5.4948e+00, 5.4940e+00,
5.7643e+00 ]
I'll look more specifically at the numbers to see where the time is
being transformed tomorrow when I am less drunk. It is a trade off between
the work saved in the direct solve vs the work needed for the coarser
levels in the multigrid cycle.
Try refining the grid a couple more times, likely more levels will still
help in that case
Ahh, you should also try -pc_mg_type full
Barry
On Oct 14, 2015, at 10:53 PM, Timothée Nicolas <
OK,
Richardson is 30-70% faster for these tests, but other than this I
don't see any change.
Timothee
Timothee,
Thank you for reporting this issue, it is indeed disturbing and
could be due to a performance regression we may have introduced by being
too clever for our own good. Could you please rerun with the additional
option -mg_levels_ksp_type richardson and send the same output?
Thanks
Barry
On Oct 14, 2015, at 9:32 PM, Timothée Nicolas <
Thank you Barry for pointing this out. Indeed, on a build with no
debugging the Jacobian evaluations no longer dominate the time (less than
10%). The rest is similar, except that the improvement from 2 to 3 levels
is much better. Still, it saturates after levels=3. I understand that in
terms of CPU time thanks to Matthew's explanations; what surprises me more
is that the KSP iterations are not more efficient. At least, even if more
levels take more time because of memory issues, I would expect the KSP
iterations to converge more rapidly with more levels, but that is not the
case as you can see. Probably there is also a rationale behind this, but I
cannot see it easily.
I send the new outputs
Best
Timothee
1) Your timings are meaningless! You cannot compare timings when
built with all debugging on, PERIOD!
##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run ./configure                #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
2) Please run with -snes_view .
3) Note that with 7 levels
SNESJacobianEval 21 1.0 2.4364e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 54 0 0 0 0 54 0 0 0 0 0
with 2 levels
SNESJacobianEval 6 1.0 2.2441e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
The Jacobian evaluation is dominating the time! Likely this will be
less the case once you rebuild without debugging.
Barry
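A minimal non-debug rebuild for the timing runs; the PETSC_ARCH name is
only illustrative, and any other configure options you normally use still
apply:

  ./configure --with-debugging=no PETSC_ARCH=arch-opt
  make PETSC_ARCH=arch-opt all   (plus PETSC_DIR if it is not set in your environment)

then rerun the ex5 command with -log_summary as before.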
On Oct 13, 2015, at 9:23 PM, Timothée Nicolas <
Dear all,
I have been playing around with multigrid recently, namely with
/ksp/ksp/examples/tutorials/ex42.c, with /snes/examples/tutorial/ex5.c and
with my own implementation of a laplacian type problem. In all cases, I
have noted no improvement whatsoever in the performance, whether in CPU
time or KSP iteration, by varying the number of levels of the multigrid
solver. As an example, I have attached the log_summary for ex5.c with
nlevels = 2 to 7, launched by
mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9
-da_refine 6 -pc_type mg -pc_mg_levels # -snes_monitor -ksp_monitor
-log_summary
where -pc_mg_levels is set to a number between 2 and 7.
So there is a noticeable CPU time improvement from 2 levels to 3
levels (30%), and then no improvement whatsoever. I am surprised because
with 6 levels of refinement of the DMDA the fine grid has more than 1200
points so with 3 levels the coarse grid still has more than 300 points
which is still pretty large (I assume the ratio between grids is 2). I am
wondering how the coarse solver efficiently solves the problem on the
coarse grid with such a large number of points ? Given the principle of
multigrid which is to erase the smooth part of the error with relaxation
methods, which are usually efficient only for high frequency, I would
expect optimal performance when the coarse grid is basically just a few
points in each direction. Does anyone know why the performance saturates at
low number of levels ? Basically what happens internally seems to be quite
different from what I would expect...
Best
Timothee
<ex5_2_levels_of_multigrid.log><ex5_3_levels_of_multigrid.log><ex5_4_levels_of_multigrid.log><ex5_5_levels_of_multigrid.log><ex5_6_levels_of_multigrid.log><ex5_7_levels_of_multigrid.log>
<ex5_2_multigrid_levels.log><ex5_3_multigrid_levels.log><ex5_4_multigrid_levels.log><ex5_5_multigrid_levels.log><ex5_6_multigrid_levels.log><ex5_7_multigrid_levels.log>
<ex5_2_multigrid_levels_richardson.log><ex5_3_multigrid_levels_richardson.log><ex5_4_multigrid_levels_richardson.log><ex5_5_multigrid_levels_richardson.log><ex5_6_multigrid_levels_richardson.log><ex5_7_multigrid_levels_richardson.log>
<ex5_2_multigrid_levels_richardson_7_refine.log><ex5_3_multigrid_levels_richardson_7_refine.log><ex5_3_multigrid_levels_richardson_8_refine.log><ex5_4_multigrid_levels_richardson_7_refine.log><ex5_4_multigrid_levels_richardson_8_refine.log><ex5_5_multigrid_levels_richardson_7_refine.log><ex5_5_multigrid_levels_richardson_8_refine.log><ex5_6_multigrid_levels_richardson_7_refine.log><ex5_6_multigrid_levels_richardson_8_refine.log><ex5_7_multigrid_levels_richardson_7_refine.log><ex5_7_multigrid_levels_richardson_8_refine.log><ex5_8_multigrid_levels_richardson_7_refine.log><ex5_8_multigrid_levels_richardson_8_refine.log><ex5_pc_mg_full_2_multigrid_levels_richardson_7_refine.log><ex5_pc_mg_full_3_multigrid_levels_richardson_7_refine.log><ex5_pc_mg_full_4_multigrid_levels_richardson_7_refine.log><ex5_pc_mg_full_5_multigrid_levels_richardson_7_refine.log><ex5_pc_mg_full_6_multigrid_levels_richardson_7_refine.log><ex5_pc_mg_full_7_multigrid_levels_richardson_7_refine.log><ex5_pc_mg_full_8_multigrid_levels_richardson_7_refine.log>
Matthew Knepley
2015-10-15 11:07:24 UTC
Permalink
On Wed, Oct 14, 2015 at 9:32 PM, Timothée Nicolas <
Post by Timothée Nicolas
Thank you Barry for pointing this out. Indeed, on a build with no
debugging the Jacobian evaluations no longer dominate the time (less than
10%). The rest is similar, except that the improvement from 2 to 3 levels
is much better. Still, it saturates after levels=3. I understand that in
terms of CPU time thanks to Matthew's explanations; what surprises me more
is that the KSP iterations are not more efficient. At least, even if more
levels take more time because of memory issues, I would expect the KSP
iterations to converge more rapidly with more levels, but that is not the
case as you can see. Probably there is also a rationale behind this, but I
cannot see it easily.
That conclusion makes no sense to me. Thought experiment:

I have K levels, which means that on the coarsest level K I do a direct
solve. Now I add a level so that I have K+1. On level K-1, instead of
using the result of a direct solve as a starting guess, I now use some
iterative result. I cannot imagine that the iteration count would go down.

Matt
Post by Timothée Nicolas
I send the new outputs
Best
Timothee
Post by Barry Smith
1) Your timings are meaningless! You cannot compare timings when built
with all debugging on, PERIOD!
##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run ./configure                #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
2) Please run with -snes_view .
3) Note that with 7 levels
SNESJacobianEval 21 1.0 2.4364e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 54 0 0 0 0 54 0 0 0 0 0
with 2 levels
SNESJacobianEval 6 1.0 2.2441e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
The Jacobian evaluation is dominating the time! Likely this will be
less the case once you rebuild without debugging.
Barry
On Oct 13, 2015, at 9:23 PM, Timothée Nicolas <
Dear all,
I have been playing around with multigrid recently, namely with
/ksp/ksp/examples/tutorials/ex42.c, with /snes/examples/tutorial/ex5.c and
with my own implementation of a laplacian type problem. In all cases, I
have noted no improvement whatsoever in the performance, whether in CPU
time or KSP iteration, by varying the number of levels of the multigrid
solver. As an example, I have attached the log_summary for ex5.c with
nlevels = 2 to 7, launched by
mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9
-da_refine 6 -pc_type mg -pc_mg_levels # -snes_monitor -ksp_monitor
-log_summary
where -pc_mg_levels is set to a number between 2 and 7.
So there is a noticeable CPU time improvement from 2 levels to 3 levels
(30%), and then no improvement whatsoever. I am surprised because with 6
levels of refinement of the DMDA the fine grid has more than 1200 points so
with 3 levels the coarse grid still has more than 300 points which is still
pretty large (I assume the ratio between grids is 2). I am wondering how
the coarse solver efficiently solves the problem on the coarse grid with
such a large number of points ? Given the principle of multigrid which is
to erase the smooth part of the error with relaxation methods, which are
usually efficient only for high frequency, I would expect optimal
performance when the coarse grid is basically just a few points in each
direction. Does anyone know why the performance saturates at low number of
levels ? Basically what happens internally seems to be quite different from
what I would expect...
Best
Timothee
<ex5_2_levels_of_multigrid.log><ex5_3_levels_of_multigrid.log><ex5_4_levels_of_multigrid.log><ex5_5_levels_of_multigrid.log><ex5_6_levels_of_multigrid.log><ex5_7_levels_of_multigrid.log>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener