FD4 Library Hands-on Training: Developing an Advection Simulation

10 November 2014, Developer School for HPC applications in Earth Sciences, ICTP, Trieste, Italy

Matthias Lieber, ZIH, TU Dresden, Germany, matthias.lieber@tu-dresden.de

FD4 Documents

Hands-on Downloads

Task 1: Download and Build FD4

Building FD4 should be as easy as:

% wget http://wwwpub.zih.tu-dresden.de/~mlieber/fd4/fd4-2014-11-05.tar.gz
% tar xzf fd4-2014-11-05.tar.gz
% cd fd4-2014-11-05
% make -j

Task 2: FD4 Domain and FD4 Vartab Example

Download the task package and unpack it in the fd4-2014-11-05 directory. If you unpack it anywhere else, the path to the FD4 library needs to be modified in the Makefiles.

% wget http://wwwpub.zih.tu-dresden.de/~mlieber/fd4/handson-ictp-2014.tar.gz
% tar xzf handson-ictp-2014.tar.gz
% cd handson/02_domain
% make
% mpirun -np 4 ./simulation

Now do the following modifications:

Task 3: NetCDF Output

Declare an FD4 NetCDF Communicator and, after allocating the blocks of the domain, open a file "out.nc" and write all 3 variables to the file. Do not forget to close the file after that.

Then, after running with multiple MPI ranks, use ncview to take a look into the data:

% mpirun -np 4 ./simulation
% ncview out.nc
The pictures below show variable "blocks" for 4 and 6 MPI ranks:
4 MPI ranks
 
6 MPI ranks

Task 4: Block Iterator and Initialization

Declare an FD4 Block Iterator and implement a loop over all blocks. Within that loop, loop over all grid cells (x and y dimension) of the block and initialize the three variables using the functions provided by the Fortran module advection:

They take the x and y grid position and the grid's dimension as input and return the initial value for each of the variables. Look in the advection.F90 source code file for details.

Hints:

Run the code again and open the new NetCDF file with ncview. The pictures below show two of the variables:
Concentration
 
uWind

Task 5: Ghost Communication and Time Stepping Loop

Implement the time stepping loop over 2000 time steps with ghost communication. The body of the time stepping loop should contain (in this order):

Hints:

Run the code again and open the new NetCDF file with ncview. Observe how the concentration changes over time. The pictures below show the concentration at two different time steps:
Concentration after 500 time steps
 
Concentration after 2000 time steps
The picture to the right shows that there is still a problem with the boundary conditions, which we will fix in the next task.

Task 6: Boundary Conditions

First, compile and execute the program for this task. A check for mass conservation of the concentration variable has been added. The output shows, that the total mass is not constant.

step   100   mass:   56.5475
[...]
step  2000   mass:   56.5523
This is due to problems with boundary conditions in the grid: periodic boundary conditions are used (for simplicity), but the wind variables are not consistent at the domain boundaries, e.g. the uWind at the leftmost face has not the same value as the rightmost face (similar for vWind).

To solve the problem, zero-gradient boundary conditions should be used. This is straightforward with FD4: Before computing the next time step on a block, call fd4_boundary_zerograd_block to set the boundaries for all input variables for that time step. This routine will do nothing if the block is not at the domain boundary. Of course, you also need to disable periodic boundary conditions when calling fd4_domain_create.

Run the program again to ensure that the mass in constant. In ncview, the concentration at the end of the simulation now looks like this:
Concentration after 2000 time steps

Task 7: Dynamic Load Balancing

In this program the load of the advection computation is artificially increased and small concentrations are eliminated. This leads to mass loss (which we accept for the sake of this artificial example) and strong load imbalances between regions with concentration larger than zero and regions were the concentration is zero.

Insert calls to measure the computation time of a block and set the block weight. Call the FD4 load balancing routine fd4_balance_readjust at the end of each time step. Printout the last measured load balance (from the opt_stats argument of fd4_balance_readjust) at the output time steps. Run the program again and compare the run time to the original program without load balancing. Also take a look at the NetCDF output: you can see how blocks are migrated between processes in the "blocks" variable and the workload per block in the "weight" variable:
Weight after 100 time steps
 
Blocks after 100 time steps

Now explore some load balancing parameters. Use fd4_balance_params (before allocating all blocks) to test the following:

For each of the 3 subtasks check the printout (measured balance) and the partitioning ("blocks" variable in NetCDF output).
Blocks after 100 time steps (Morton SFC)
 
Blocks after 100 time steps (RCB)

Another thing to try out is to increase the number of blocks, i.e. from 64 8x8 blocks to 256 4x4 blocks. This makes load balancing more efficient for higher number of ranks (because of the finer granularity).

Blocks after 1400 time steps (with 4x4 blocks)

Task 8: Coupling

This program uses FD4's model coupling techniques to initialize the Concentration and to update the wind variables each time step. The wind variables are "computed" by a toy model that has a different spatial decomposition than the one FD4 uses. This artificial set-up is just to demonstrate model coupling with as few as possible source code lines. These pictures show the partitioning in FD4 and for the "wind model" using 4 MPI ranks:
Initial FD4 partitioning
 
Static "wind model" partitioning

The sequence within a time step now looks like this:

Take a look at the code. It contains a bug: only the windU (varU) is added to the FD4 couple context, but not windV (varV). Your task is to add windV such that the computations are correct. In ncview, vWind and Concentration look like this after 1200 steps:
vWind after 1200 steps
 
Concentration after 1200 steps

Task 9: Adaptive Block Allocation

This is just a demo of the adaptive block mode of FD4. The pictures show that FD4 dynamically allocates blocks only where required (i.e. variable Concentration is larger than a specified threshold):
Concentration
 
Block allocation and partitioning

The following things are changed in comparison to solution of Task 8:

The pictures above were created with an increased number of blocks (16x16), 8 ranks and RCB partitioning.