FD4 Library Hands-on Training: Developing an Advection Simulation

10 November 2014, Developer School for HPC applications in Earth Sciences, ICTP, Trieste, Italy

Matthias Lieber, ZIH, TU Dresden, Germany, matthias.lieber@tu-dresden.de

Task 1: Download and Build FD4

Building FD4 should be as easy as:

% wget http://wwwpub.zih.tu-dresden.de/~mlieber/fd4/fd4-2014-11-05.tar.gz
% tar xzf fd4-2014-11-05.tar.gz
% cd fd4-2014-11-05
% make -j

Task 2: FD4 Domain and FD4 Vartab Example

Download the task package and unpack it in the fd4-2014-11-05 directory. If you unpack it anywhere else, the path to the FD4 library needs to be modified in the Makefiles.

% wget http://wwwpub.zih.tu-dresden.de/~mlieber/fd4/handson-ictp-2014.tar.gz
% tar xzf handson-ictp-2014.tar.gz
% cd handson/02_domain
% make
% mpirun -np 4 ./simulation

Now do the following modifications:

The current example has only two variables: concentration (of some exemplary quantity such as water vapor in the air) and horizontal (x dimension) wind. Add a vertical (y dimension) wind "vWind" to the variable table. It should be similar to "uWind", but a face variable in the y dimension.
Change the block size to 8x8, i.e. increase bnum.
Enable FD4 internal statistics with opt_stats=.true. in fd4_domain_delete.

Task 3: NetCDF Output

Declare an FD4 NetCDF Communicator and, after allocating the blocks of the domain, open a file "out.nc" and write all 3 variables to the file. Do not forget to close the file after that.

Then, after running with multiple MPI ranks, use ncview to take a look into the data:

% mpirun -np 4 ./simulation
% ncview out.nc

Click on the three variable names and observe that the variables are all zero. FD4 initializes all variables with zero unless declared otherwise in the variable table.
Display the variable "blocks": it shows the mapping of blocks to MPI ranks, i.e. the partitioning of the grid. FD4 uses a Hilbert space-filling curve as the default partitioning method. Task 7 addresses other methods.
Variable "weight" will be addressed in task 7.
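To get a feeling for what a space-filling curve partitioning does, the following minimal sketch maps block coordinates to a position on a Hilbert curve and cuts the curve into contiguous pieces of near equal length, one per rank. It is not FD4's implementation: the module and function names are made up for this illustration, the block grid is assumed to be square with a power-of-two side length, and FD4's actual curve orientation and cutting strategy may differ.

```fortran
module hilbert_partition_sketch
  implicit none
contains

  ! Position of block (x, y) along a Hilbert curve covering an n x n block
  ! grid. n must be a power of two; x, y and the result are zero-based.
  pure function hilbert_index(n, x, y) result(d)
    integer, intent(in) :: n, x, y
    integer :: d, s, rx, ry, cx, cy, t
    cx = x
    cy = y
    d = 0
    s = n / 2
    do while (s > 0)
      rx = merge(1, 0, iand(cx, s) > 0)
      ry = merge(1, 0, iand(cy, s) > 0)
      d = d + s * s * ieor(3 * rx, ry)
      ! rotate/flip the quadrant so that the sub-curve is oriented correctly
      if (ry == 0) then
        if (rx == 1) then
          cx = n - 1 - cx
          cy = n - 1 - cy
        end if
        t = cx
        cx = cy
        cy = t
      end if
      s = s / 2
    end do
  end function hilbert_index

  ! Zero-based owner rank of block (x, y) when the curve is cut into
  ! nranks contiguous pieces (all blocks are assumed to have equal weight).
  pure function block_owner(n, x, y, nranks) result(rank)
    integer, intent(in) :: n, x, y, nranks
    integer :: rank
    rank = hilbert_index(n, x, y) * nranks / (n * n)
  end function block_owner

end module hilbert_partition_sketch
```

For an 8x8 block grid and 4 ranks, block_owner(8, x, y, 4) assigns 16 blocks to each rank. Because consecutive positions on a Hilbert curve are always neighboring blocks, each rank's piece is a connected region.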

The pictures below show variable "blocks" for 4 and 6 MPI ranks:

4 MPI ranks

6 MPI ranks

Task 4: Block Iterator and Initialization

Declare an FD4 Block Iterator and implement a loop over all blocks. Within that loop, loop over all grid cells (x and y dimension) of the block and initialize the three variables using the functions provided by the Fortran module advection:

advection_init_c(x, y, grid)
advection_init_u(x, y, grid)
advection_init_v(x, y, grid)

They take the x and y grid position and the grid's dimension as input and return the initial value for each of the variables. Look in the advection.F90 source code file for details.

Hints:

The wind variables are face variables, i.e. the grid cell loop's upper bound must be increased by one for x (uWind) or y (vWind).
It is sufficient to initialize the time level 1 of the concentration variable.
Use fd4_iter_offset(iter, offset) to get the offset from local block coordinates (starting at 1 for each block) to global coordinates.
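The loop bounds and the local-to-global conversion from the hints can be sketched without any FD4 calls. In the sketch below, the module name, init_block and toy_value are hypothetical; toy_value merely stands in for the advection_init_* functions, and the offset is assumed to be defined such that global = local + offset.

```fortran
module block_init_sketch
  implicit none
contains

  ! Hypothetical stand-in for advection_init_c / advection_init_u:
  ! encodes the global position so that the result is easy to check.
  pure function toy_value(gx, gy) result(v)
    integer, intent(in) :: gx, gy
    real :: v
    v = real(100 * gx + gy)
  end function toy_value

  ! Initialize one block of bx x by grid cells.
  ! offset(1:2) converts local block coordinates (starting at 1)
  ! to global coordinates: global = local + offset (assumption).
  subroutine init_block(bx, by, offset, c, u)
    integer, intent(in) :: bx, by
    integer, intent(in) :: offset(2)
    real, intent(out) :: c(bx, by)       ! cell-centered variable
    real, intent(out) :: u(bx + 1, by)   ! x-face variable: one more face than cells in x
    integer :: x, y
    do y = 1, by
      do x = 1, bx
        c(x, y) = toy_value(x + offset(1), y + offset(2))
      end do
      ! upper bound increased by one in the face dimension
      do x = 1, bx + 1
        u(x, y) = toy_value(x + offset(1), y + offset(2))
      end do
    end do
  end subroutine init_block

end module block_init_sketch
```

A y-face variable such as vWind would be handled the same way, with the y loop running to by + 1 instead.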

Run the code again and open the new NetCDF file with ncview. The pictures below show two of the variables:

Concentration

uWind

Task 5: Ghost Communication and Time Stepping Loop

Implement the time stepping loop over 2000 time steps with ghost communication. The body of the time stepping loop should contain (in this order):

Ghost communication.
Block loop containing an fd4_iter_get_ghost for the concentration and loops over all grid cells to compute the updated value of the concentration with a time step dt=0.2.
Writing NetCDF output to file and the current time step to the console every 100 time steps.
Swapping of the time level indicators (see below).

Hints:

All new variables required to solve the task are already declared in the source file.

!! Simulation setup
real(rkind), parameter :: dt = 0.2         ! time step size
integer, parameter :: nsteps = 2000        ! number of time steps to compute
integer, parameter :: output_steps = 100   ! number of steps between output
[...]
! buffer array for iter_get_ghost
integer :: bext(0:3)
real(rkind), allocatable :: buf(:,:,:,:)
! FD4 ghost communication
type(fd4_ghostcomm) :: ghostcomm(2)
! time step loop
integer :: now, new               ! time step indicators for varC
integer :: step                   ! current time step

Use the function advection_compute to update the concentration at a grid cell; see advection.F90 for details on the parameters.
In each time step, read from one time level (e.g. now=1) of the concentration variable and write to the other one (e.g. new=2). At the end of the time stepping loop, swap now and new so that the updated values become the current state in the next time step.
Declare an array of two ghost communicators, one for each time level of the concentration.
You need to allocate a buffer array for the concentration. It must be large enough to hold the largest block including ghost cells. Use fd4_domain_max_bext(domain, bext(1:3), .true.) to get the maximum block extent in bext = (/max_x, max_y, max_z/).
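The interplay of ghost exchange, reading from one time level, writing to the other and swapping the indicators can be tried out on a much smaller problem. The following sketch is a 1D first-order upwind advection on a single periodic array. It does not use FD4 or advection_compute: the module name, run_steps and the courant parameter are assumptions for this illustration, and the periodic copy into the ghost cells only plays the role that the ghost communication has in the real program.

```fortran
module time_level_sketch
  implicit none
contains

  ! Advance a 1D periodic upwind advection by nsteps time steps.
  ! c holds two time levels with one ghost cell on each side.
  ! On entry, now is the time level with the initial state;
  ! on exit, now is the time level with the final state.
  subroutine run_steps(c, nx, nsteps, courant, now)
    integer, intent(in) :: nx, nsteps
    real, intent(inout) :: c(0:nx + 1, 2)
    real, intent(in) :: courant          ! u*dt/dx, assumed 0 <= courant <= 1
    integer, intent(inout) :: now
    integer :: new, step, i, tmp
    new = 3 - now
    do step = 1, nsteps
      ! "ghost communication": fill the ghost cells of the input time level
      c(0, now) = c(nx, now)
      c(nx + 1, now) = c(1, now)
      ! read from time level now, write to time level new
      do i = 1, nx
        c(i, new) = c(i, now) - courant * (c(i, now) - c(i - 1, now))
      end do
      ! swap the time level indicators
      tmp = now
      now = new
      new = tmp
    end do
  end subroutine run_steps

end module time_level_sketch
```

Note that only the time level that is read in a given step needs valid ghost cells, which is why the task uses one ghost communicator per time level.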

Run the code again and open the new NetCDF file with ncview. Observe how the concentration changes over time. The pictures below show the concentration at two different time steps:

Concentration after 500 time steps

Concentration after 2000 time steps

The picture after 2000 time steps shows that there is still a problem with the boundary conditions, which we will fix in the next task.

Task 6: Boundary Conditions

First, compile and execute the program for this task. A check for mass conservation of the concentration variable has been added. The output shows that the total mass is not constant:

step   100   mass:   56.5475
[...]
step  2000   mass:   56.5523

This is due to problems with the boundary conditions of the grid: periodic boundary conditions are used (for simplicity), but the wind variables are not consistent at the domain boundaries, e.g. uWind at the leftmost face does not have the same value as at the rightmost face (similarly for vWind).

To solve the problem, zero-gradient boundary conditions should be used. This is straightforward with FD4: Before computing the next time step on a block, call fd4_boundary_zerograd_block to set the boundaries for all input variables for that time step. This routine will do nothing if the block is not at the domain boundary. Of course, you also need to disable periodic boundary conditions when calling fd4_domain_create.
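As an illustration of what a zero-gradient boundary means, the sketch below copies the outermost interior values of a 2D array into its ghost cells. It is not the FD4 routine: the module and subroutine names are hypothetical, a ghost width of one cell is assumed, and the whole array is treated as the domain, whereas fd4_boundary_zerograd_block only acts on block faces that lie on the domain boundary.

```fortran
module zerograd_sketch
  implicit none
contains

  ! Fill the ghost cells of a (interior 1:nx x 1:ny, ghost width 1)
  ! with the value of the nearest interior cell, so that the gradient
  ! across the domain boundary is zero.
  subroutine zerograd_fill(a, nx, ny)
    integer, intent(in) :: nx, ny
    real, intent(inout) :: a(0:nx + 1, 0:ny + 1)
    ! left and right boundary
    a(0, 1:ny) = a(1, 1:ny)
    a(nx + 1, 1:ny) = a(nx, 1:ny)
    ! bottom and top boundary; copying full rows also fills the corners
    a(:, 0) = a(:, 1)
    a(:, ny + 1) = a(:, ny)
  end subroutine zerograd_fill

end module zerograd_sketch
```

With ghost cells filled this way, a stencil that reads across the domain boundary sees the same value on both sides, instead of the value from the opposite side of the domain as with periodic boundaries.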

Run the program again to ensure that the mass is constant. In ncview, the concentration at the end of the simulation now looks like this:

Concentration after 2000 time steps

Task 7: Dynamic Load Balancing

In this program, the load of the advection computation is artificially increased and small concentrations are eliminated. This leads to mass loss (which we accept for the sake of this artificial example) and strong load imbalances between regions with a concentration larger than zero and regions where the concentration is zero.

Insert calls to measure the computation time of a block and set the block weight. Call the FD4 load balancing routine fd4_balance_readjust at the end of each time step. Print out the last measured load balance (from the opt_stats argument of fd4_balance_readjust) at the output time steps. Run the program again and compare the run time to the original program without load balancing. Also take a look at the NetCDF output: you can see how blocks are migrated between processes in the "blocks" variable and the workload per block in the "weight" variable:

Weight after 100 time steps

Blocks after 100 time steps
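To interpret a load balance number, it helps to compute one by hand. A common definition is the average load per rank divided by the maximum load per rank, which is 1 for perfect balance and approaches 0 when a single rank does nearly all the work. The sketch below uses this definition; it is an assumption that the value reported by FD4 is defined the same way, and the module and function names are made up for this illustration.

```fortran
module balance_sketch
  implicit none
contains

  ! Load balance of a set of per-rank loads (e.g. summed block weights):
  ! average load divided by maximum load, in the range (0, 1].
  pure function load_balance(load) result(b)
    real, intent(in) :: load(:)
    real :: b
    b = 1.0                      ! no load at all counts as balanced
    if (size(load) > 0) then
      if (maxval(load) > 0.0) then
        b = (sum(load) / real(size(load))) / maxval(load)
      end if
    end if
  end function load_balance

end module balance_sketch
```

For example, per-rank loads of (/ 2.0, 1.0, 1.0, 0.0 /) give a balance of 0.5: the slowest rank needs twice the average time, so about half of the aggregate compute time is spent waiting.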

Now explore some load balancing parameters. Use fd4_balance_params (before allocating all blocks) to test the following:

Reduce the load balancing threshold, i.e., run load balancing only if the balance is very poor: opt_lbtol=0.3.
Try out the Morton space-filling curve (in contrast to Hilbert): opt_sfctype=FD4_PART_SFC_MORTON.
Try out the recursive bisection method (works only if the number of ranks is a power of two): method=FD4_BALANCE_RCB.

For each of the 3 subtasks check the printout (measured balance) and the partitioning ("blocks" variable in NetCDF output).

Blocks after 100 time steps (Morton SFC)

Blocks after 100 time steps (RCB)
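The Morton (Z-order) curve is simple to compute: the position of a block along the curve is obtained by interleaving the bits of its x and y coordinates. The following sketch shows this for zero-based block coordinates. The module and function names are hypothetical, and whether FD4 puts x or y into the lower bit is not known here, so the bit order is an assumption.

```fortran
module morton_sketch
  implicit none
contains

  ! Position of block (x, y) along a Morton (Z-order) curve.
  ! x and y are zero-based and must be smaller than 2**nbits.
  ! Bit b of x goes to bit 2*b of the result, bit b of y to bit 2*b+1.
  pure function morton_index(x, y, nbits) result(d)
    integer, intent(in) :: x, y, nbits
    integer :: d, b
    d = 0
    do b = 0, nbits - 1
      if (btest(x, b)) d = ibset(d, 2 * b)
      if (btest(y, b)) d = ibset(d, 2 * b + 1)
    end do
  end function morton_index

end module morton_sketch
```

In this sketch, positions 1 and 2 on the curve belong to blocks (1,0) and (0,1), which are not face neighbors. Such jumps do not occur on a Hilbert curve, which is one reason why Morton partitions can look more fragmented in the "blocks" variable.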

Another thing to try out is to increase the number of blocks, e.g. from 64 blocks of 8x8 cells to 256 blocks of 4x4 cells. This makes load balancing more effective for higher numbers of ranks (because of the finer granularity).

Blocks after 1400 time steps (with 4x4 blocks)
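The effect of granularity can be estimated with a small calculation. If all blocks had the same weight (a simplification that does not hold in this task), the best possible partitioning would still give some rank ceiling(nblocks/nranks) blocks, which limits the achievable balance. The sketch below computes this bound as average load divided by maximum load; the module and function names are made up for this illustration.

```fortran
module granularity_sketch
  implicit none
contains

  ! Best achievable balance (average load / maximum load) when nblocks
  ! blocks of equal weight are distributed over nranks ranks.
  pure function best_balance(nblocks, nranks) result(b)
    integer, intent(in) :: nblocks, nranks
    real :: b
    integer :: maxblocks
    ! the most loaded rank gets ceiling(nblocks / nranks) blocks
    maxblocks = (nblocks + nranks - 1) / nranks
    b = (real(nblocks) / real(nranks)) / real(maxblocks)
  end function best_balance

end module granularity_sketch
```

For 48 ranks, for example, 64 blocks allow a balance of at most about 0.67 (some ranks hold 2 blocks, the average is 1.33), whereas 256 blocks allow about 0.89 (at most 6 blocks per rank, the average is 5.33).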

Task 8: Coupling

This program uses FD4's model coupling techniques to initialize the concentration and to update the wind variables in each time step. The wind variables are "computed" by a toy model that has a different spatial decomposition than the one FD4 uses. This artificial set-up merely demonstrates model coupling with as few lines of source code as possible. These pictures show the partitioning in FD4 and for the "wind model" using 4 MPI ranks:

Initial FD4 partitioning

Static "wind model" partitioning

The sequence within a time step now looks like this:

The "wind model" updates uWind and vWind on its own data structures.
FD4 coupling is used to transfer the updated wind variables to the FD4 blocks.
The advection is computed on the FD4 data structures.
FD4 load balancer is called, which may change the partitioning.

Take a look at the code. It contains a bug: only uWind (varU) is added to the FD4 couple context, but not vWind (varV). Your task is to add vWind so that the computations are correct. In ncview, vWind and Concentration look like this after 1200 steps:

vWind after 1200 steps

Concentration after 1200 steps

Task 9: Adaptive Block Allocation

This is just a demo of the adaptive block mode of FD4. The pictures show that FD4 dynamically allocates blocks only where required (i.e. variable Concentration is larger than a specified threshold):

Concentration

Block allocation and partitioning

The following things have changed compared to the solution of Task 8:

Variable table: Threshold added for variable varC.
Call to fd4_util_allocate_all_blocks is removed (we do not want to allocate all blocks).
Before calling fd4_couple_put to put the initial data for varC, we call fd4_couple_mark_blocks to mark all blocks with varC greater than the threshold, and then fd4_balance_readjust to create a partitioning for these blocks and allocate them. (This is also what fd4_util_allocate_all_blocks does, except that it marks all blocks.)
In the block loop, after the computations: The call to fd4_iter_empty checks if neighbor blocks need to be allocated or if this block may be removed.
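The idea behind adaptive block allocation can be sketched on the level of whole blocks. In the sketch below, a block is kept allocated if its own maximum concentration exceeds the threshold or if this is true for one of its four face neighbors, so that the quantity can spread into an adjacent block in the next time step. This rule is an assumption made for illustration and not necessarily the exact criterion FD4 applies; the module and function names are hypothetical.

```fortran
module adaptive_sketch
  implicit none
contains

  ! cmax(i, j) is the maximum concentration found in block (i, j).
  ! The result is .true. for every block that should be allocated:
  ! blocks above the threshold and their face neighbors (assumed rule).
  pure function blocks_needed(cmax, threshold) result(need)
    real, intent(in) :: cmax(:, :)
    real, intent(in) :: threshold
    logical :: need(size(cmax, 1), size(cmax, 2))
    ! padded copy so that neighbor lookups at the domain edge are valid
    logical :: active(0:size(cmax, 1) + 1, 0:size(cmax, 2) + 1)
    integer :: nx, ny
    nx = size(cmax, 1)
    ny = size(cmax, 2)
    active = .false.
    active(1:nx, 1:ny) = cmax > threshold
    need = active(1:nx, 1:ny) &
      .or. active(0:nx - 1, 1:ny) .or. active(2:nx + 1, 1:ny) &
      .or. active(1:nx, 0:ny - 1) .or. active(1:nx, 2:ny + 1)
  end function blocks_needed

end module adaptive_sketch
```

On a 4x4 block grid where only block (2,2) exceeds the threshold, this marks five blocks: (2,2) and its four face neighbors. All other blocks would stay unallocated.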

The pictures above were created with an increased number of blocks (16x16), 8 ranks and RCB partitioning.