Discussion: [petsc-users] Scaling with number of cores
TAY wee-beng
2015-10-31 16:34:44 UTC
Hi,

I understand that, as mentioned in the FAQ, the scaling is not linear due
to memory limitations. So I am trying to write a proposal to use a
supercomputer.

Its specs are:

Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)

8 cores / processor

Interconnect: Tofu (6-dimensional mesh/torus) Interconnect

Each cabinet contains 96 compute nodes.

One of the requirements is to give the performance of my current code
with my current set of data, and there is a formula to calculate the
estimated parallel efficiency when using the new, larger data set.

There are 2 ways to give performance:
1. Strong scaling, which is defined as how the elapsed time varies with
the number of processors for a fixed problem size.
2. Weak scaling, which is defined as how the elapsed time varies with
the number of processors for a fixed problem size per processor.

I ran my cases with 48 and 96 cores on my current cluster, giving 140
and 90 mins respectively. This is classified as strong scaling.

Cluster specs:

CPU: AMD 6234 2.4GHz

8 cores / processor (CPU)

6 CPU / node

So 48 cores / node

Not sure about the memory / node.


The parallel efficiency ‘En’ for a given degree of parallelism ‘n’
indicates how efficiently the program is accelerated by parallel
processing. ‘En’ is given by the following formulae. Although the
derivations differ between strong and weak scaling, the resulting
formulae are the same.

From the estimated time, my parallel efficiency using Amdahl's law on
the current old cluster was 52.7%.

So are my results acceptable?

For the large data set, if using 2205 nodes (2205 x 8 cores), my expected
parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.

Is it possible to get this type of scaling (> 50%) in PETSc when using
17640 (2205 x 8) cores?
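
As a rough cross-check of these numbers: the two timings above imply an
Amdahl serial fraction, which can then be extrapolated. The sketch below
assumes the proposal's efficiency formula has the standard Amdahl form
En = 1 / (1 + s*(n-1)); the actual formula in the proposal may differ.

  /* Sketch: estimate the Amdahl serial fraction s from two strong-scaling
     timings and project the parallel efficiency at a larger core count.
     Assumes E(n) = 1 / (1 + s*(n-1)); the proposal's formula may differ. */
  #include <stdio.h>

  int main(void)
  {
    double n1 = 48.0, t1 = 140.0;   /* measured: 48 cores, 140 min */
    double n2 = 96.0, t2 = 90.0;    /* measured: 96 cores,  90 min */

    /* Amdahl: T(n) = T1*(s + (1-s)/n); take the ratio t1/t2 and solve for s */
    double r = t1 / t2;
    double s = (r / n2 - 1.0 / n1) / ((1.0 - 1.0 / n1) - r * (1.0 - 1.0 / n2));

    double nbig = 2205.0 * 8.0;     /* 17640 cores on the target machine */
    double ebig = 1.0 / (1.0 + s * (nbig - 1.0));

    printf("serial fraction s  ~ %.4f\n", s);
    printf("projected E(17640) ~ %.2f%%\n", 100.0 * ebig);
    return 0;
  }

With the 140/90 min timings this gives a serial fraction of roughly 0.8%
and a projected efficiency well below 1% at 17640 cores, in the same
ballpark as the 0.5% figure above.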

Btw, I do not have access to the system.




Matthew Knepley
2015-10-31 16:47:51 UTC
Post by TAY wee-beng
From the estimated time, my parallel efficiency using Amdahl's law on the
current old cluster was 52.7%.
So are my results acceptable?
For the large data set, if using 2205 nodes (2205 x 8 cores), my expected
parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
The problem with this analysis is that the estimated serial fraction from
Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and
apply it to another without a
model of this dependence.

Weak scaling does model changes with problem size, so I would measure weak
scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not
make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
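
As a rough sketch of what that measurement looks like: with the work per
process held fixed, the weak-scaling efficiency is just the ratio of the
base run time to the larger run time (the timings below are placeholders,
not your data).

  /* Sketch: weak-scaling efficiency from two runs with the same work per
     process, e.g. a half-size grid on 48 cores vs the full grid on 96.
     The timings are placeholders, not measurements from this thread. */
  #include <stdio.h>

  int main(void)
  {
    double t_base = 100.0;  /* minutes: n0 processes, problem size N    */
    double t_big  = 110.0;  /* minutes: 2*n0 processes, problem size 2N */
    printf("weak-scaling efficiency = %.1f%%\n", 100.0 * t_base / t_big);
    return 0;
  }

Measuring this at a few core counts on the current cluster gives the trend
to extrapolate from.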

Thanks,

Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
TAY wee-beng
2015-11-01 01:43:06 UTC
Post by Matthew Knepley
Weak scaling does model changes with problem size, so I would measure
weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does
not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
OK, I checked the results for my weak scaling, and the expected parallel
efficiency is even worse. From the formula used, it's obviously doing
some sort of exponential extrapolation/decrease. So unless I can achieve
close to (or above) 90% speed-up when I double the cores and the problem
size for my current 48/96-core setup, extrapolating from about 96 nodes
to 10,000 nodes will give a much lower expected parallel efficiency for
the new case.

However, the FAQ mentions that, due to memory requirements, it's
impossible to get > 90% speed-up when I double the cores and the problem
size (i.e. a linear increase in performance), which means that I can't
get > 90% speed-up when I double the cores and problem size for my
current 48/96-core setup. Is that so?

So is it fair to say that the main problem does not lie in my
programming skills, but rather in the way the linear equations are solved?

Thanks.
Barry Smith
2015-11-01 02:00:49 UTC
Post by TAY wee-beng
However, the FAQ mentions that, due to memory requirements, it's impossible to get > 90% speed-up when I double the cores and the problem size (i.e. a linear increase in performance), which means that I can't get > 90% speed-up when I double the cores and problem size for my current 48/96-core setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
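
For example, with the executable from this thread and whatever MPI launcher your cluster uses (mpiexec/mpirun), that would look something like

  mpiexec -n 48 ./a.out -ksp_view -log_summary > log48.txt
  mpiexec -n 96 ./a.out -ksp_view -log_summary > log96.txt

-ksp_view reports which Krylov method and preconditioner were actually used, and -log_summary prints the timing table at PetscFinalize().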

Barry
TAY wee-beng
2015-11-01 13:30:50 UTC
Post by Barry Smith
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,

I have attached the output

48 cores: log48
96 cores: log96

There are 2 solvers: the momentum linear eqn uses bcgs, while the
Poisson eqn uses hypre BoomerAMG.

Problem size doubled from 158x266x150 to 158x266x300.
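
For reference, that solver combination corresponds to runtime options along
these lines, assuming the two KSPs use the momentum_/poisson_ options
prefixes that appear later in this thread (the code may instead set the
types directly with KSPSetType/PCSetType):

  -momentum_ksp_type bcgs
  -poisson_pc_type hypre -poisson_pc_hypre_type boomeramg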
Barry Smith
2015-11-01 16:30:52 UTC
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.

Anyway, all the time is being spent in the BoomerAMG algebraic multigrid setup, and it is scaling badly. When you double the problem size and the number of processes, it went from 3.2445e+01 to 4.3599e+02 seconds.

PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11

PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2

Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it at each timestep it is too large.

You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
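
If the Poisson matrix really is the same every timestep, the usual pattern is to create the KSP once, set the operator once, and just call KSPSolve with the new right-hand side each step; the preconditioner is then built on the first solve and reused. A minimal sketch in C with the standard PETSc calls (error checking and assembly omitted; adapt the prefix and names to your code):

  #include <petscksp.h>

  /* Sketch: build the BoomerAMG preconditioner once and reuse it for every
     timestep when only the right-hand side changes. */
  PetscErrorCode solve_poisson_all_steps(Mat A, Vec *rhs, Vec *sol, PetscInt nsteps)
  {
    KSP      ksp;
    PetscInt step;

    PetscFunctionBeginUser;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);                 /* the fixed Poisson matrix, set once */
    KSPSetOptionsPrefix(ksp, "poisson_");
    KSPSetFromOptions(ksp);                     /* e.g. -poisson_pc_type hypre        */
    KSPSetReusePreconditioner(ksp, PETSC_TRUE); /* keep the PC even if A is refilled  */

    for (step = 0; step < nsteps; step++) {
      /* ... fill rhs[step] for this timestep ... */
      KSPSolve(ksp, rhs[step], sol[step]);      /* PCSetUp runs only on the first solve */
    }
    KSPDestroy(&ksp);
    PetscFunctionReturn(0);
  }

Comparing against PETSc's own algebraic multigrid is then just a matter of switching the option to -poisson_pc_type gamg at run time.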

Barry
TAY wee-beng
2015-11-02 01:35:47 UTC
Hi,

Sorry, I forgot and used the old a.out. I have attached the new log for
48 cores (log48), together with the 96-core log (log96).

Why does the number of processes increase so much? Is there something
wrong with my coding?

Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to
reuse the preconditioner, what must I do? Or what must I not do?

Lastly, I only simulated 2 time steps previously. Now I run for 10
timesteps (log48_10). Is it building the preconditioner at every timestep?

Also, what about the momentum eqn? Is it working well?

I will try the gamg later too.

Thank you

Yours sincerely,

TAY wee-beng
Barry Smith
2015-11-02 01:49:36 UTC
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
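
A toy illustration of why the step count matters: with a one-time AMG setup cost, the average cost per timestep only approaches the pure solve cost once there are many steps (the numbers below are made up, not taken from your logs).

  #include <stdio.h>

  /* Sketch: a one-time setup cost amortized over an increasing number of steps. */
  int main(void)
  {
    double setup = 30.0;            /* one-time PCSetUp cost (s), hypothetical   */
    double step  = 5.0;             /* per-timestep solve cost (s), hypothetical */
    int    nsteps[] = {2, 10, 100, 1000};

    for (int i = 0; i < 4; ++i) {
      int    n   = nsteps[i];
      double avg = (setup + n * step) / n;      /* average wall time per step */
      printf("steps = %4d  avg/step = %6.2f s  setup share = %5.1f%%\n",
             n, avg, 100.0 * setup / (setup + n * step));
    }
    return 0;
  }

With only 2 steps the setup dominates the comparison; with 100 or more it barely matters.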

Barry
TAY wee-beng
2015-11-02 04:02:33 UTC
Hi,

I have attached the new run with 100 time steps for 48 and 96 cores.

Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to
reuse the preconditioner, what must I do? Or what must I not do?

Why does the number of processes increase so much? Is there something
wrong with my coding? It seems to be so for my new run too.

Thank you

Yours sincerely,

TAY wee-beng
Barry Smith
2015-11-02 04:27:58 UTC
Run without the -momentum_ksp_view -poisson_ksp_view options and send the new results.


You can see from the log summary that PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.

Barry

Something makes no sense with the output: it gives

KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165

90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your I/O.
TAY wee-beng
2015-11-02 06:19:49 UTC
Hi,

I have attached the new results.

Thank you

Yours sincerely,

TAY wee-beng
Barry Smith
2015-11-02 06:55:35 UTC
Run a (158/2)x(266/2)x(150/2) grid on 8 processes and then the full (158)x(266)x(150) grid on 64 processes, and send the two -log_summary results.

Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
OK, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obviously doing some sort of exponential extrapolation decrease. So unless I can achieve a near >90% speedup when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirements it's impossible to get >90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get >90% speedup when I double the cores and problem size for my current 48/96-core setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
TAY wee-beng
2015-11-02 09:17:06 UTC
Permalink
Hi,

I have attached the 2 files (log8_100 and log64_100).

Thank you

Yours sincerely,

TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your I/O.
Barry Smith
2015-11-02 19:18:37 UTC
Permalink
hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine why the scaling is poor.

If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about which routines are scaling well or poorly.

Barry
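For reference, the two comparison runs could look something like the following (the executable name a.out and the -poisson_ prefix follow what is already used in this thread; adjust to the actual binary, prefixes, and grid setup):

  mpiexec -n 8  ./a.out -poisson_pc_type gamg -log_summary     # (158/2)x(266/2)x(150/2) grid
  mpiexec -n 64 ./a.out -poisson_pc_type gamg -log_summary     # 158x266x150 grid

Because GAMG is built from PETSc's own matrix operations, its setup shows up as individual events (e.g. MatPtAP) in -log_summary instead of a single opaque PCSetUp, which is what makes the comparison with hypre's setup time possible.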
TAY wee-beng
2015-11-03 04:37:12 UTC
Permalink
Hi,

I tried:

1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg

2. -poisson_pc_type gamg

Both options give:

1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9

How can I check what's wrong?

Thank you

Yours sincerely,

TAY wee-beng
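One way to see what that -9 means, assuming it is the KSPConvergedReason the code prints after the momentum ("M") solve: in the PETSc versions of that time, -9 corresponds to KSP_DIVERGED_NANORINF, i.e. a NaN or Inf appeared during the solve, which is consistent with the NaN residuals printed above. A hypothetical fragment, not from the original code (error checking omitted), that prints the reason as text:

  #include <petscksp.h>

  KSPConvergedReason reason;
  PetscInt           its;

  KSPSolve(ksp, b, x);                      /* ksp, b, x: the momentum (or Poisson) solver and its vectors */
  KSPGetConvergedReason(ksp, &reason);
  KSPGetIterationNumber(ksp, &its);
  PetscPrintf(PETSC_COMM_WORLD, "solver finished with reason %s after %D iterations\n",
              KSPConvergedReasons[reason], its);

The same information is available at run time from -poisson_ksp_converged_reason and -momentum_ksp_converged_reason, and -poisson_ksp_monitor_true_residual shows whether the true residual is already NaN at the first iteration.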
Barry Smith
2015-11-03 04:45:03 UTC
Permalink
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_converged_reason
Does your Poisson equation have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix (you shouldn't).

There may be something wrong with your Poisson discretization that was also messing up hypre.
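A quick programmatic check for the diagonal question (a hypothetical fragment, not from the original code, with A the assembled Poisson matrix and b its RHS vector; error checking omitted): pull out the diagonal and look at its smallest-magnitude entry.

  Vec       diag;
  PetscReal dmin;
  PetscInt  row;

  VecDuplicate(b, &diag);                  /* vector with the same layout as the matrix rows */
  MatGetDiagonal(A, diag);
  VecAbs(diag);                            /* interested in |a_ii| */
  VecMin(diag, &row, &dmin);
  PetscPrintf(PETSC_COMM_WORLD, "smallest |diagonal entry| = %g at row %D\n", (double)dmin, row);
  VecDestroy(&diag);

A value at or very near zero points at the discretization or an unassembled boundary row rather than at the solver.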
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
TAY wee-beng
2015-11-03 12:49:06 UTC
Permalink
Hi,

I tried and have attached the log.

Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify a null space, for example with KSPSetNullSpace or MatNullSpaceCreate?

Thank you

Yours sincerely,

TAY wee-beng
Matthew Knepley
2015-11-03 12:52:30 UTC
Permalink
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify a null space, for example with KSPSetNullSpace or MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.

Thanks,

Matt
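A minimal sketch of what "attach the constant null space" can look like in C for the singular Neumann Poisson matrix (PETSc 3.5/3.6-era API; A and poisson_ksp are assumed names for the assembled matrix and its solver, not taken from the poster's code):

  MatNullSpace   nullsp;
  PetscErrorCode ierr;

  ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr); /* "constants are in the null space" */
  ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);                 /* the KSP/PC picks this up during the solve */
  /* ierr = KSPSetNullSpace(poisson_ksp, nullsp);CHKERRQ(ierr); */ /* older alternative, removed in later versions */
  ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);               /* the matrix keeps its own reference */

Depending on the discretization, the right-hand side may also have to be made consistent (e.g. with MatNullSpaceRemove) so that the singular system actually has a solution.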
TAY wee-beng
2015-11-03 12:58:23 UTC
Permalink
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify a null space, for example with KSPSetNullSpace or MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
Thanks,
Matt
OK, so can you point me to a suitable example so that I know which one to use specifically?

Thanks.
Matthew Knepley
2015-11-03 13:01:18 UTC
Permalink
Post by Matthew Knepley
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify
some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
Thanks,
Matt
Ok so can you point me to a suitable example so that I know which one to
use specifically?
https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
Matt
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
Mark Adams
2015-11-03 14:00:20 UTC
Permalink
If you clean the RHS of the null space then you can probably (only) use "-mg_coarse_pc_type svd", but you need this or the LU coarse grid solver will have problems. If your solution starts drifting then you need to set the null space, but this is often not needed.
Also, after you get this working you want to check these PCSetUp times. This setup will not scale perfectly, but the behavior here indicates that something is wrong. Hypre's default parameters are tuned for 2D problems; you have a 3D problem, I assume. GAMG should be fine. As a rule of thumb the PCSetUp should not cost much more than a solve. An easy 3D Poisson solve might require relatively more setup and a hard 2D problem might require relatively less.
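As a concrete illustration, my guess at the spelling of the options, assuming the Poisson solve keeps the "poisson_" prefix used earlier in this thread and that the launcher and core count are only placeholders (a.out is the executable name mentioned above); adjust to your own setup:
mpiexec -n 96 ./a.out -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1 -poisson_mg_coarse_pc_type svd -poisson_ksp_monitor_true_residual -poisson_ksp_converged_reason -log_summary
The PCSetUp and KSPSolve lines of the resulting -log_summary can then be compared directly against the hypre runs quoted earlier.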
TAY wee-beng
2015-11-03 15:04:56 UTC
Permalink
Post by TAY wee-beng
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need
to specify some null space stuff? Like KSPSetNullSpace or
MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
Thanks,
Matt
Ok so can you point me to a suitable example so that I know which
one to use specifically?
https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
Matt
Hi,
Actually, I realised that for my Poisson eqn, I have Neumann and Dirichlet BCs. The Dirichlet BC is at the output grids, by specifying pressure = 0. So do I still need the null space?
My Poisson eqn LHS is fixed but the RHS is changing with every timestep.
If I need to use the null space, how do I know whether the null space contains the constant vector, and what the no. of vectors is? I followed the example given and added:
call MatNullSpaceCreate(MPI_COMM_WORLD,PETSC_TRUE,0,NULL,nullsp,ierr)
call MatSetNullSpace(A,nullsp,ierr)
call MatNullSpaceDestroy(nullsp,ierr)
Is that all?
Before this, I was using the HYPRE geometric solver, and the matrix / vector in the subroutine were written based on HYPRE. It worked pretty well and fast.
However, it's a black box and it's hard to diagnose problems.
I always had the PETSc subroutine to solve my Poisson eqn, but I used KSPBCGS or KSPGMRES with HYPRE's boomeramg as the PC. It worked but was slow.
Matt: Thanks, I will see how it goes using the nullspace and may try "-mg_coarse_pc_type svd" later.
Barry Smith
2015-11-03 17:04:58 UTC
Permalink
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
Thanks,
Matt
Ok so can you point me to a suitable example so that I know which one to use specifically?
https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
Matt
Hi,
Actually, I realised that for my Poisson eqn, I have neumann and dirichlet BC. Dirichlet BC is at the output grids by specifying pressure = 0. So do I still need the null space?
No,
My Poisson eqn LHS is fixed but RHS is changing with every timestep.
call MatNullSpaceCreate(MPI_COMM_WORLD,PETSC_TRUE,0,NULL,nullsp,ierr)
call MatSetNullSpace(A,nullsp,ierr)
call MatNullSpaceDestroy(nullsp,ierr)
Is that all?
Before this, I was using HYPRE geometric solver and the matrix / vector in the subroutine was written based on HYPRE. It worked pretty well and fast.
However, it's a black box and it's hard to diagnose problems.
I always had the PETSc subroutine to solve my Poisson eqn but I used KSPBCGS or KSPGMRES with HYPRE's boomeramg as the PC. It worked but was slow.
Matt: Thanks, I will see how it goes using the nullspace and may try "-mg_coarse_pc_type svd" later.
Thanks.
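A short note on the reasoning behind the "No" above (a standard argument, not part of the original message): for a pure-Neumann discrete Poisson operator the constant vector c = (1, 1, ..., 1)^T satisfies A c = 0, which is exactly why the constant null space would otherwise have to be attached. Fixing pressure = 0 at the outlet, however it is imposed in the discretization, removes the constant from the null space: A c is no longer zero in the rows touched by the Dirichlet condition, the matrix is nonsingular, and no null space needs to be attached.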
TAY wee-beng
2015-11-03 15:16:23 UTC
Permalink
Post by TAY wee-beng
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need
to specify some null space stuff? Like KSPSetNullSpace or
MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
Thanks,
Matt
Ok so can you point me to a suitable example so that I know which
one to use specifically?
https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
Matt
Oh ya,
How do I call:
call MatNullSpaceCreate(MPI_COMM_WORLD,PETSC_TRUE,0,NULL,nullsp,ierr)
But it says NULL is not defined. How do I define it?
Thanks
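The usual fix, sketched under the assumption of a PETSc release from around the time of this thread: the Fortran interface cannot take a bare NULL, so PETSc provides named placeholder objects. In those releases the placeholder for an omitted object argument is PETSC_NULL_OBJECT; later releases use type-specific names such as PETSC_NULL_VEC for a Vec argument.
! pass a PETSc null object, not NULL, for the "no extra null-space vectors" argument
call MatNullSpaceCreate(MPI_COMM_WORLD, PETSC_TRUE, 0, PETSC_NULL_OBJECT, nullsp, ierr)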
TAY wee-beng
2015-11-03 15:21:20 UTC
Permalink
Post by TAY wee-beng
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need
to specify some null space stuff? Like KSPSetNullSpace or
MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
Thanks,
Matt
Ok so can you point me to a suitable example so that I know which
one to use specifically?
https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
Matt
OK, I did a search and found the answer for the MatNullSpaceCreate call:
http://petsc-users.mcs.anl.narkive.com/jtIlVll0/pass-petsc-null-integer-to-dynamic-array-of-vec-in-frotran90
Post by TAY wee-beng
Thanks.
Post by TAY wee-beng
Thank you
Yours sincerely,
TAY wee-beng
On Nov 2, 2015, at 10:37 PM, TAY
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual
-poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
hypre is just not scaling well here. I do not know why. Since hypre is a black box for us there is no way to determine why the scaling is poor.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
On Nov 2, 2015, at 3:17 AM, TAY
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
On Nov 2, 2015, at 12:19 AM, TAY
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.
On Nov 1, 2015, at 10:02 PM,
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
On 2/11/2015 9:49 AM, Barry
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
On Nov 1, 2015, at 7:35 PM, TAY
Hi,
Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about the momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
On 2/11/2015 12:30
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
Anyways, all the time is being spent in the BoomerAMG algebraic multigrid setup and it is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set-up time that often doesn't matter if you have many time steps, but if you have to rebuild it each timestep it is too large.
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Barry Smith
2015-11-03 17:11:36 UTC
Permalink
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.

Barry
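(For reference, the two comparison runs requested above might be launched along these lines; this is only a sketch, and the executable name a.out and the poisson_ option prefix are assumptions carried over from earlier messages in the thread:

    mpiexec -n 8  ./a.out -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1 -log_summary > log8
    mpiexec -n 64 ./a.out -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1 -log_summary > log64

with the grid set to (158/2)x(266/2)x(150/2) for the 8-process case and 158x266x150 for the 64-process case, as requested earlier in the thread.)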
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
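(For reference, the two routines TAY asks about just above are used roughly as in the following C sketch; the matrix name A_poisson is an assumption, and note Barry's advice later in the thread that the null space is not needed here:

    MatNullSpace nullsp;
    /* pure Neumann Poisson problem: the constant vector lies in the null space */
    ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A_poisson, nullsp);CHKERRQ(ierr);  /* older PETSc versions used KSPSetNullSpace(ksp, nullsp) */
    ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);

With the null space attached, PETSc projects it out during the Krylov solve.)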
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a black box for us there is no way to determine why the scaling is poor.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
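(A sketch of the usual pattern when only the right-hand side changes from step to step: create the solver once, set the operator once, and then just call KSPSolve every time step. The variable names are assumptions about how the code might be organised:

    /* setup, done once */
    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp_poisson);CHKERRQ(ierr);
    ierr = KSPSetOptionsPrefix(ksp_poisson, "poisson_");CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp_poisson, A_poisson, A_poisson);CHKERRQ(ierr);   /* LHS never changes */
    ierr = KSPSetReusePreconditioner(ksp_poisson, PETSC_TRUE);CHKERRQ(ierr);   /* optional: keep the PC even if the operator is re-set */
    ierr = KSPSetFromOptions(ksp_poisson);CHKERRQ(ierr);

    /* every time step: only b_poisson changes */
    ierr = KSPSolve(ksp_poisson, b_poisson, x_poisson);CHKERRQ(ierr);

As long as the matrix passed to KSPSetOperators is not modified or re-set, the preconditioner built during the first solve is reused in all later time steps.)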
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
Anyways, all the time is being spent in the BoomerAMG algebraic multigrid setup and it is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set-up time that often doesn't matter if you have many time steps, but if you have to rebuild it each timestep it is too large.
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data.
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 cores / node
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how efficiently the program is accelerated by parallel processing. ‘En’ is given by the following formulae. Although their derivation processes differ between strong and weak scaling, the derived formulae are the same.
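(The proposal's exact formulae are not reproduced in this thread, but the standard definitions they are presumably based on are, for n processes with elapsed time T_n,

    E_n^{\mathrm{strong}} = \frac{T_1}{n\,T_n}, \qquad E_n^{\mathrm{weak}} = \frac{T_1}{T_n}.

Measured relative to the 48-core run rather than a serial run, the strong-scaling efficiency of the 96-core case is

    E_{96/48} = \frac{48\,T_{48}}{96\,T_{96}} = \frac{48 \times 140}{96 \times 90} \approx 0.78,

i.e. about 78% between 48 and 96 cores; the 52.7% figure quoted next comes from the proposal's own Amdahl-based formula, which is not shown here.)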
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So are my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a nearly >90% speed-up when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirements, it's impossible to get a >90% speed-up when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get a >90% speed-up when I double the cores and problem size for my current 48/96-core setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
TAY wee-beng
2015-11-05 03:30:39 UTC
Permalink
Hi,

I have attached the 2 logs.

Thank you

Yours sincerely,

TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a block box for us there is no way to determine why the poor scaling.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok I check the results for my weak scaling it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirement, it's impossible to get >90% speed when I double the cores and problem size (ie linear increase in performance), which means that I can't get >90% speed up when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
Barry Smith
2015-11-05 04:03:54 UTC
Permalink
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner; it should have rows like

VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637


Are you sure you ran with -pc_type gamg? What about running with -info: does it print anything about gamg? What about -ksp_view: does it indicate it is using the gamg preconditioner?
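(One way to confirm from inside the code which preconditioner was actually used; a sketch, with ksp_poisson an assumed name for the Poisson solver:

    PC     pc;
    PCType pctype;
    ierr = KSPGetPC(ksp_poisson, &pc);CHKERRQ(ierr);
    ierr = PCGetType(pc, &pctype);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Poisson PC type: %s\n", pctype);CHKERRQ(ierr);

On the command line, -poisson_ksp_view after the solve reports the same information. Note also that when an options prefix is used, the bare option -pc_type gamg is ignored; it has to be given as -poisson_pc_type gamg.)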
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a block box for us there is no way to determine why the poor scaling.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok I check the results for my weak scaling it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirement, it's impossible to get >90% speed when I double the cores and problem size (ie linear increase in performance), which means that I can't get >90% speed up when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
TAY wee-beng
2015-11-05 15:58:15 UTC
Permalink
Sorry, I realised that I didn't use gamg and that's why. If I use gamg, the 8-core case worked, but the 64-core case shows that p diverged.

Why is this so? Btw, I have also added nullspace in my code.

Thank you.

Yours sincerely,

TAY wee-beng
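(To pin down why p diverged on 64 cores, the converged reason can be checked after the solve; a sketch with assumed names. If I read the KSPConvergedReason values of this PETSc era correctly, the reason = -9 reported earlier in the thread is KSP_DIVERGED_NANORINF, i.e. a NaN or Inf appeared during the solve:

    KSPConvergedReason reason;
    ierr = KSPGetConvergedReason(ksp_poisson, &reason);CHKERRQ(ierr);
    if (reason < 0) {
      ierr = PetscPrintf(PETSC_COMM_WORLD, "Poisson solve diverged, reason %d\n", (int)reason);CHKERRQ(ierr);
    }

The option -poisson_ksp_converged_reason prints the same diagnosis without any code changes.)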
Post by Barry Smith
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner it should have rows like
VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637
Are you sure you ran with -pc_type gamg ? What about running with -info does it print anything about gamg? What about -ksp_view does it indicate it is using the gamg preconditioner?
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a block box for us there is no way to determine why the poor scaling.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok I check the results for my weak scaling it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirement, it's impossible to get >90% speed when I double the cores and problem size (ie linear increase in performance), which means that I can't get >90% speed up when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
Barry Smith
2015-11-05 16:06:30 UTC
Permalink
Sorry I realised that I didn't use gamg and that's why. But if I use gamg, the 8 core case worked, but the 64 core case shows p diverged.
Why is this so? Btw, I have also added nullspace in my code.
You don't need the null space and should not add it.
Thank you.
Yours sincerely,
TAY wee-beng
Post by Barry Smith
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner it should have rows like
VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637
Are you sure you ran with -pc_type gamg? If you run with -info, does it print anything about gamg? Does -ksp_view indicate it is using the gamg preconditioner?
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary conditions. Do I need to specify some null space stuff, like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
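A minimal sketch of what "specifying the null space stuff" would look like in the PETSc C API, for the constant null space of a pure-Neumann Poisson operator (the Mat A below is a placeholder, not the poster's actual code; note that the advice in this exchange was ultimately not to add it):

  #include <petscksp.h>
  /* Sketch: attach the constant null space of a pure-Neumann Poisson matrix so the
     Krylov solver projects it out of the residual and the solution. */
  static PetscErrorCode AttachConstantNullSpace(Mat A)
  {
    MatNullSpace   nsp;
    PetscErrorCode ierr;
    ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE, 0, NULL, &nsp);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A, nsp);CHKERRQ(ierr);   /* newer replacement for KSPSetNullSpace() */
    ierr = MatNullSpaceDestroy(&nsp);CHKERRQ(ierr);
    return 0;
  }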
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_converged_reason
Does your Poisson problem have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix (you shouldn't).
There may be something wrong with your Poisson discretization that was also messing up hypre.
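For context, options such as -poisson_pc_type gamg are picked up by whichever KSP was given the matching options prefix. A minimal sketch of that wiring, assuming a separate KSP is used for the pressure solve (all names here are placeholders):

  #include <petscksp.h>
  /* Sketch: a KSP carrying the "poisson_" options prefix, so -poisson_pc_type gamg,
     -poisson_ksp_monitor_true_residual, etc. apply to this solver only. */
  static PetscErrorCode SolvePoisson(Mat A, Vec b, Vec x)
  {
    KSP            ksp;
    PetscErrorCode ierr;
    ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
    ierr = KSPSetOptionsPrefix(ksp, "poisson_");CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);   /* PETSc 3.5+ two-Mat form */
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);       /* reads the -poisson_* options */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    return 0;
  }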
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
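As an aside on "reason = -9": the divergence reason can also be queried in code. A minimal sketch (the ksp variable is a placeholder for the pressure solver); in the PETSc versions of that era, -9 corresponds to KSP_DIVERGED_NANORINF, i.e. a NaN or Inf appeared during the solve, which matches the NaN residuals printed above:

  #include <petscksp.h>
  /* Sketch: report why a solve stopped; negative reasons mean divergence. */
  static PetscErrorCode ReportDivergence(KSP ksp)
  {
    KSPConvergedReason reason;
    PetscErrorCode     ierr;
    ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
    if (reason == KSP_DIVERGED_NANORINF) {        /* printed as -9 above */
      ierr = PetscPrintf(PETSC_COMM_WORLD, "NaN/Inf in the solve; check matrix and RHS assembly\n");CHKERRQ(ierr);
    } else if (reason < 0) {
      ierr = PetscPrintf(PETSC_COMM_WORLD, "Solve diverged, reason = %d\n", (int)reason);CHKERRQ(ierr);
    }
    return 0;
  }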
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine why the scaling is poor.
If you make the same two runs with -pc_type gamg, there will be a lot more information in the log summary about which routines are scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run a (158/2)x(266/2)x(150/2) grid on 8 processes and then a (158)x(266)x(150) grid on 64 processes, and send the two -log_summary results.
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
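On reusing the preconditioner when only the RHS changes: a minimal sketch, assuming the Poisson matrix is assembled once and the time loop only refills the right-hand side (the function and variable names are placeholders). As long as KSPSetOperators() is not called again with a modified matrix, the AMG setup from the first solve is kept; KSPSetReusePreconditioner() (available from PETSc 3.5) makes that reuse explicit.

  #include <petscksp.h>
  /* Sketch: build the pressure solver once, then solve with a new RHS each time step. */
  static PetscErrorCode TimeLoopPressureSolve(Mat A, Vec b, Vec x, PetscInt nsteps)
  {
    KSP            ksp;
    PetscInt       step;
    PetscErrorCode ierr;
    ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);                 /* LHS assembled once     */
    ierr = KSPSetReusePreconditioner(ksp, PETSC_TRUE);CHKERRQ(ierr); /* keep the AMG hierarchy */
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
    for (step = 0; step < nsteps; step++) {
      /* ... refill b with this time step's right-hand side ... */
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* PCSetUp happens on the first call only */
    }
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    return 0;
  }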
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your I/O.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver, then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first time step. So run both 48 and 96 processes with the same large number of time steps.
Barry
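One way to make such a study easier to read is to put the time loop in its own logging stage, so -log_summary reports the first-step AMG setup separately from the later steps. A minimal sketch (where exactly the stage is pushed is an assumption about the code structure):

  #include <petscsys.h>
  PetscLogStage loop_stage;
  PetscLogStageRegister("TimeLoop", &loop_stage);
  /* ... assembly and the first solve (which includes the AMG setup) happen here ... */
  PetscLogStagePush(loop_stage);
  /* ... remaining time steps: solves with an updated RHS only ... */
  PetscLogStagePop();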
Hi,
Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processes, since you can get very different, inconsistent results.
Anyway, all the time is being spent in the BoomerAMG algebraic multigrid setup, and it is scaling badly. When you double the problem size and the number of processes, it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it at each timestep it may be too large.
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
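For reference, the command-line switch between the two preconditioners corresponds roughly to the following in the C API (a sketch; ksp is the solver's KSP object, setting -pc_type on the command line is enough, and the hypre branch assumes PETSc was built with hypre support):

  #include <petscksp.h>
  PC pc;
  KSPGetPC(ksp, &pc);
  /* PETSc's native algebraic multigrid: */
  PCSetType(pc, PCGAMG);
  /* or hypre's BoomerAMG instead:
     PCSetType(pc, PCHYPRE);  PCHYPRESetType(pc, "boomeramg"); */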
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
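As a rough illustration of that point, fitting Amdahl's law T(n) = T(1)(s + (1-s)/n) to the two runs quoted in this thread (140 min on 48 cores, 90 min on 96 cores), i.e. solving 140/90 = (s + (1-s)/48)/(s + (1-s)/96), gives a serial fraction s of roughly 0.8%. The corresponding efficiency E(n) = T(1)/(n T(n)) = 1/(s n + 1 - s) is then about 72% at 48 cores, 56% at 96 cores, and well under 1% at 17,640 cores, broadly in line with the low figure quoted above. That fitted s describes only this problem size, which is exactly why a strong-scaling fit on the small data set cannot be carried over to the large one.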
Ok, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve nearly >90% speed-up when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that, due to memory requirements, it's impossible to get >90% speed-up when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get >90% speed-up when I double the cores and problem size for my current 48/96-core setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
Barry Smith
2015-11-05 16:07:46 UTC
Permalink
Sorry, I realised that I didn't use gamg, and that's why. But if I use gamg, the 8-core case worked, while the 64-core case shows that p diverged.
Where is the log file for the 8-core case? And where is all the output from where it fails with 64 cores? Include -ksp_monitor_true_residual and -ksp_converged_reason.
Barry
Why is this so? Btw, I have also added nullspace in my code.
Thank you.
Yours sincerely,
TAY wee-beng
TAY wee-beng
2015-11-06 02:47:39 UTC
Permalink
Hi,

I have removed the nullspace and attached the new logs.

Thank you

Yours sincerely,

TAY wee-beng
Barry Smith
2015-11-06 04:08:31 UTC
Permalink
Ok, the 64-core case not converging makes no sense.

Run it with -ksp_monitor and -ksp_converged_reason for the pressure solve turned on, and -info.

You need to figure out why it is not converging.

Barry
Post by TAY wee-beng
Hi,
I have removed the nullspace and attached the new logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Sorry I realised that I didn't use gamg and that's why. But if I use gamg, the 8 core case worked, but the 64 core case shows p diverged.
Where is the log file for the 8 core case? And where is all the output from where it fails with 64 cores? Include -ksp_monitor_true_residual and -ksp_converged_reason
Barry
Why is this so? Btw, I have also added nullspace in my code.
Thank you.
Yours sincerely,
TAY wee-beng
Post by Barry Smith
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner it should have rows like
VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637
Are you sure you ran with -pc_type gamg ? What about running with -info does it print anything about gamg? What about -ksp_view does it indicate it is using the gamg preconditioner?
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a block box for us there is no way to determine why the poor scaling.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok, I checked the results for my weak scaling, and the expected parallel efficiency is even worse. From the formula used, it's obviously doing some sort of exponential decrease in the extrapolation. So unless I can achieve nearly >90% speedup when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, the FAQ mentions that due to memory requirements it's impossible to get >90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get >90% speedup when I double the cores and problem size for my current 48/96-core setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
<log8_100_3.txt><log64_100_3.txt>
TAY wee-beng
2015-11-06 05:16:52 UTC
Permalink
Post by Barry Smith
Ok the 64 case not converging makes no sense.
Run it with -ksp_monitor and -ksp_converged_reason turned on for the pressure solve, and with -info.
You need to figure out why it is not converging.
Barry
Hi,
I found out the reason: my partitioning is only in the z direction, and if I use 64 cores to partition 150 cells in the z direction, some partitions become too small, leading to an error.
So how can I test now? The original problem has 158x266x300 with 96
cores. How should I reduce it to test for scaling?
Thanks.
Post by Barry Smith
Post by TAY wee-beng
Hi,
I have removed the nullspace and attached the new logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Sorry I realised that I didn't use gamg and that's why. But if I use gamg, the 8 core case worked, but the 64 core case shows p diverged.
Where is the log file for the 8 core case? And where is all the output from where it fails with 64 cores? Include -ksp_monitor_true_residual and -ksp_converged_reason
Barry
Why is this so? Btw, I have also added nullspace in my code.
Thank you.
Yours sincerely,
TAY wee-beng
Post by Barry Smith
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner it should have rows like
VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637
Are you sure you ran with -pc_type gamg ? What about running with -info does it print anything about gamg? What about -ksp_view does it indicate it is using the gamg preconditioner?
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify some null space, e.g. with KSPSetNullSpace or MatNullSpaceCreate?
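For reference, a minimal sketch of attaching the constant null space that a pure-Neumann Poisson matrix has (A stands for your assembled Poisson matrix; MatSetNullSpace is the newer interface, older PETSc versions also offered KSPSetNullSpace):
    MatNullSpace nullsp;
    /* The null space of the Neumann Poisson operator is the constant vector. */
    ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);
    ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);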
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your Poisson problem have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix (you shouldn't)?
There may be something wrong with your Poisson discretization that was also messing up hypre.
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
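One way to check programmatically is to query the convergence reason after the solve; a minimal sketch, where ksp stands for the momentum solver's KSP:
    KSPConvergedReason reason;
    ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
    if (reason < 0) {
      /* Negative values mean divergence; in PETSc releases of this era, -9 is
         KSP_DIVERGED_NANORINF, i.e. a NaN or Inf appeared, matching the NaN printed above. */
      ierr = PetscPrintf(PETSC_COMM_WORLD, "Solve diverged, reason = %d\n", (int)reason);CHKERRQ(ierr);
    }
The command-line options -ksp_converged_reason and -ksp_monitor_true_residual mentioned in this thread print the same information without any code changes.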
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine why the scaling is poor.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
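(Both runs then have roughly the same load per process: (158/2)x(266/2)x(150/2) ≈ 0.79 million cells on 8 processes and 158x266x150 ≈ 6.3 million cells on 64 processes are each about 98,500 cells per process, so this pair of runs is a weak-scaling comparison.)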
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve, but there is no significant amount of time in the other events of the code, which is just not possible. I hope it is due to your I/O.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok I check the results for my weak scaling it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirement, it's impossible to get >90% speed when I double the cores and problem size (ie linear increase in performance), which means that I can't get >90% speed up when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
<log8_100_3.txt><log64_100_3.txt>
Barry Smith
2015-11-06 05:26:14 UTC
Permalink
Post by Barry Smith
Ok the 64 case not converging makes no sense.
Run it with ksp_monitor and ksp_converged_reason for the pressure solve turned on and -info
You need to figure out why it is not converging.
Barry
Hi,
I found out the reason. Because my partitioning is only in the z direction, and if using 64cores to partition 150 cells in the z direction, some partitions will be too small, leading to error.
Oh, well this is your fundamental problem and why you don't get scaling! You need to have partitioning in all three directions or you will never get good scaling. This is fundamental; just fix your code to partition in all dimensions.
Barry
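If the grid is managed through a DMDA, one way to get a partition in all three directions is to let PETSc choose the process grid; a sketch for the 158x266x300 case (the boundary types, stencil choice, dof and stencil width here are assumptions and must match your discretization; depending on the PETSc version you may also need DMSetFromOptions/DMSetUp afterwards):
    DM da;
    /* PETSC_DECIDE for the m, n, p arguments lets PETSc split the grid in x, y and z
       instead of slicing only along z. */
    ierr = DMDACreate3d(PETSC_COMM_WORLD,
                        DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                        DMDA_STENCIL_STAR,
                        158, 266, 300,                            /* global grid size */
                        PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, /* processes in x, y, z */
                        1, 1,                                     /* dof, stencil width */
                        NULL, NULL, NULL, &da);CHKERRQ(ierr);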
So how can I test now? The original problem has 158x266x300 with 96 cores. How should I reduce it to test for scaling?
Thanks.
Post by Barry Smith
Post by TAY wee-beng
Hi,
I have removed the nullspace and attached the new logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Sorry I realised that I didn't use gamg and that's why. But if I use gamg, the 8 core case worked, but the 64 core case shows p diverged.
Where is the log file for the 8 core case? And where is all the output from where it fails with 64 cores? Include -ksp_monitor_true_residual and -ksp_converged_reason
Barry
Why is this so? Btw, I have also added nullspace in my code.
Thank you.
Yours sincerely,
TAY wee-beng
Post by Barry Smith
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner it should have rows like
VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637
Are you sure you ran with -pc_type gamg ? What about running with -info does it print anything about gamg? What about -ksp_view does it indicate it is using the gamg preconditioner?
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a block box for us there is no way to determine why the poor scaling.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok I check the results for my weak scaling it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirement, it's impossible to get >90% speed when I double the cores and problem size (ie linear increase in performance), which means that I can't get >90% speed up when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
<log8_100_3.txt><log64_100_3.txt>
TAY wee-beng
2015-11-06 05:59:11 UTC
Permalink
Post by Barry Smith
Post by Barry Smith
Ok the 64 case not converging makes no sense.
Run it with ksp_monitor and ksp_converged_reason for the pressure solve turned on and -info
You need to figure out why it is not converging.
Barry
Hi,
I found out the reason. Because my partitioning is only in the z direction, and if using 64cores to partition 150 cells in the z direction, some partitions will be too small, leading to error.
Oh, well this is your fundamental problem and why you don't get scaling! You need to have partitioning in all three directions or you will never get good scaling! This is fundamental, just fix your code to have partitioning in all dimensions
Barry
Hi,
Ok, I'll make the change and compare again.
Thanks
Post by Barry Smith
So how can I test now? The original problem has 158x266x300 with 96 cores. How should I reduce it to test for scaling?
Thanks.
Post by Barry Smith
Post by TAY wee-beng
Hi,
I have removed the nullspace and attached the new logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Sorry I realised that I didn't use gamg and that's why. But if I use gamg, the 8 core case worked, but the 64 core case shows p diverged.
Where is the log file for the 8 core case? And where is all the output from where it fails with 64 cores? Include -ksp_monitor_true_residual and -ksp_converged_reason
Barry
Why is this so? Btw, I have also added nullspace in my code.
Thank you.
Yours sincerely,
TAY wee-beng
Post by Barry Smith
There is a problem here. The -log_summary doesn't show all the events associated with the -pc_type gamg preconditioner it should have rows like
VecDot 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1613
VecMDot 134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3025
VecNorm 154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1578
VecScale 148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1039
VecCopy 106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1742
VecAYPX 384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 860
VecAXPBYCZ 192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2085
VecWAXPY 2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 636
VecMAXPY 148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2399
VecPointwiseMult 66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 604
VecScatterBegin 45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1525
VecReduceComm 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1177
MatMult 424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 7 37 0 0 0 2343
MatMultAdd 48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 2069
MatMultTranspose 48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1069
MatSolve 16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 460
MatSOR 354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 9 31 0 0 0 9 31 0 0 0 1631
MatLUFactorSym 2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 307
MatScale 18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 874
MatResidual 48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2212
MatAssemblyBegin 57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRow 21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCoarsen 6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatFDColorCreate 1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorSetUp 1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatFDColorApply 2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 1826
MatFDColorFunc 42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 2956
MatMatMult 6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 4 2 0 0 0 4 2 0 0 0 241
MatMatMultSym 6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMatMultNum 6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 2 2 0 0 0 679
MatPtAP 6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
MatPtAPSymbolic 6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0
MatPtAPNumeric 6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 537
MatTrnMatMult 2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 75
MatTrnMatMultSym 2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 352
MatGetSymTrans 8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 2491
KSPSetUp 24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95 0 0 0 94 95 0 0 0 471
PCGAMGGraph_AGG 6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
PCGAMGCoarse_AGG 6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 49
PCGAMGProl_AGG 6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
PCGAMGPOpt_AGG 6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: createProl 6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11 0 0 0 55 11 0 0 0 92
Graph 12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 2
MIS/Agg 6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: col data 6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SA: frmProl0 6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34 0 0 0 0 34 0 0 0 0 0
SA: smooth 6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 9 11 0 0 0 9 11 0 0 0 534
GAMG: partLevel 6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 283
PCSetUp 4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22 0 0 0 74 22 0 0 0 137
PCSetUpOnBlocks 16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 42
PCApply 16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70 0 0 0 20 70 0 0 0 1637
Are you sure you ran with -pc_type gamg ? What about running with -info does it print anything about gamg? What about -ksp_view does it indicate it is using the gamg preconditioner?
Post by TAY wee-beng
Hi,
I have attached the 2 logs.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Ok, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary and not -ksp_monitor to see how it scales.
Barry
Post by TAY wee-beng
Hi,
I tried and have attached the log.
Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify some null space stuff? Like KSPSetNullSpace or MatNullSpaceCreate?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Post by TAY wee-beng
Hi,
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
There may be something wrong with your poisson discretization that was also messing up hypre
Post by TAY wee-beng
1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
M Diverged but why?, time = 2
reason = -9
How can I check what's wrong?
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
hypre is just not scaling well here. I do not know why. Since hypre is a block box for us there is no way to determine why the poor scaling.
If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in what routines it is scaling well or poorly.
Barry
Post by TAY wee-beng
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results
Barry
Post by TAY wee-beng
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results
You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time meaning that it is reusing the preconditioner and not rebuilding it each time.
Barry
Something makes no sense with the output: it gives
KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
90% of the time is in the solve but there is no significant amount of time in other events of the code which is just not possible. I hope it is due to your IO.
Post by TAY wee-beng
Hi,
I have attached the new run with 100 time steps for 48 and 96 cores.
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps since the setup time of AMG only takes place in the first stimestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
Hi,
Sorry I forgot and use the old a.out. I have attached the new log for 48cores (log48), together with the 96cores log (log96).
Why does the number of processes increase so much? Is there something wrong with my coding?
Only the Poisson eqn 's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
Also, what about momentum eqn? Is it working well?
I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
Post by Barry Smith
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors since you can get very different inconsistent results
Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup and it is is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
Now is the Poisson problem changing at each timestep or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large set up time that you often doesn't matter if you have many time steps but if you have to rebuild it each timestep it is too large?
You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
Barry
Post by TAY wee-beng
Post by Barry Smith
Hi,
I understand that as mentioned in the faq, due to the limitations in memory, the scaling is not linear. So, I am trying to write a proposal to use a supercomputer.
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus) Interconnect
Each cabinet contains 96 computing nodes,
One of the requirement is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data
1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed
problem.
2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a
fixed problem size per processor.
I ran my cases with 48 and 96 cores with my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
CPU: AMD 6234 2.4GHz
8 cores / processor (CPU)
6 CPU / node
So 48 Cores / CPU
Not sure abt the memory / node
The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is
efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their
derivation processes are different depending on strong and weak scaling, derived formulae are the
same.
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%.
So is my results acceptable?
For the large data set, if using 2205 nodes (2205X8cores), my expected parallel efficiency is only 0.5%. The proposal recommends value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function
of problem size, so you cannot take the strong scaling from one problem and apply it to another without a
model of this dependence.
Weak scaling does model changes with problem size, so I would measure weak scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok I check the results for my weak scaling it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
However, it's mentioned in the FAQ that due to memory requirement, it's impossible to get >90% speed when I double the cores and problem size (ie linear increase in performance), which means that I can't get >90% speed up when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
Barry
Hi,
I have attached the output
48 cores: log48
96 cores: log96
There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
Problem size doubled from 158x266x150 to 158x266x300.
Post by Barry Smith
So is it fair to say that the main problem does not lie in my programming skills, but rather the way the linear equations are solved?
Thanks.
Thanks,
Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205X8) cores?
Btw, I do not have access to the system.
Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
<log8_100_3.txt><log64_100_3.txt>