.. _sarch:

Software Architecture
*********************

This chapter focuses on describing the **WRAPPER** environment within
which both the core numerics and the pluggable packages operate. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details on
working with the WRAPPER. The tutorial examples in this manual (see
:numref:`chap_modelExamples`) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects, simply starting
from an example code and adapting it to suit a particular situation will
be all that is required. The first part of this chapter discusses the
MITgcm architecture at an abstract level. In the second part of the
chapter we describe practical details of the MITgcm implementation and
the current tools and operating system features that are employed.

Overall architectural goals
===========================

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

-  To be able to study a very broad range of interesting and
   challenging rotating fluids problems;

-  The model code should be readily targeted to a wide range of
   platforms; and

-  On any given platform, performance should be comparable to an
   implementation developed and specialized specifically for that
   platform.

These points are summarized in :numref:`mitgcm_goals`, which conveys the
goals of the MITgcm design. The goals lead to a software architecture
which at the broadest level can be viewed as consisting of:

#. A core set of numerical and support code. This is discussed in detail
   in :numref:`discret_algorithm`.

#. A scheme for supporting optional “pluggable” **packages** (containing
   for example mixed-layer schemes, biogeochemical schemes, atmospheric
   physics). These packages are used both to overlay alternate dynamics
   and to introduce specialized physical content onto the core numerical
   code. An overview of the package scheme is given at the start of
   :numref:`packagesI`.

#. A support framework called WRAPPER (Wrappable Application Parallel
   Programming Environment Resource), within which the core numerics and
   pluggable packages operate.

.. figure:: figs/mitgcm_goals.*
   :width: 70%
   :align: center
   :alt: span of mitgcm goals
   :name: mitgcm_goals

   The MITgcm architecture is designed to allow simulation of a wide range of physical problems on a wide range of hardware. The computational resource requirements of the applications targeted range from around 10\ :sup:`7` bytes (:math:`\approx` 10 megabytes) of memory to 10\ :sup:`11` bytes (:math:`\approx` 100 gigabytes). Arithmetic operation counts for the applications of interest range from 10\ :sup:`9` floating point operations to more than 10\ :sup:`17` floating point operations.

This chapter focuses on describing the WRAPPER environment under which
both the core numerics and the pluggable packages function. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details on
working with the WRAPPER. The “Getting Started” chapter of this manual
(:numref:`chap_getting_started`) contains more succinct, step-by-step
instructions on running basic numerical experiments both sequentially
and in parallel. For many projects simply starting from an example code
and adapting it to suit a particular situation will be all that is
required.

.. _wrapper:

WRAPPER
=======

A significant element of the software architecture utilized in MITgcm is
a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to “fit”
within the WRAPPER infrastructure. Writing code to fit within the
WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
:numref:`specify_decomp`).

The approach taken by the WRAPPER is illustrated in
:numref:`fit_in_wrapper`, which shows how the WRAPPER serves to insulate
code that fits within it from architectural differences between hardware
platforms and operating systems. This allows numerical code to be easily
retargeted.

.. figure:: figs/fit_in_wrapper.png
   :width: 70%
   :align: center
   :alt: schematic of a wrapper
   :name: fit_in_wrapper

   Numerical code is written to fit within a software support infrastructure called WRAPPER. The WRAPPER is portable and can be specialized for a wide range of specific target hardware and programming environments, without impacting numerical code that fits within the WRAPPER. Codes that fit within the WRAPPER can generally be made to run as fast on a particular platform as codes specially optimized for that platform.

.. _target_hardware:

Target hardware
---------------

The WRAPPER is designed to target as broad a range of computer systems
as possible. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system’s proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multiprocessor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs. Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems. The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and IBM
SP systems). In all cases, numerical code, operating within the WRAPPER,
performs and scales very competitively with equivalent numerical code
that has been modified to contain native optimizations for a particular
system (see Hoe et al. 1999 :cite:`hoe:99`).

Supporting hardware neutrality
------------------------------

The different systems mentioned in :numref:`target_hardware` can be
categorized in many different ways. For example, one common distinction
is between shared-memory parallel systems (SMP and PVP) and distributed
memory parallel systems (for example x86 clusters and large MPP
systems). This is one example of a difference between compute platforms
that can impact an application. Another common distinction is between
vector processing systems with highly specialized CPUs and memory
subsystems and commodity microprocessor based systems. There are
numerous other differences, especially in relation to how parallel
execution is supported. To capture the essential differences between
different platforms the WRAPPER uses a *machine model*.

WRAPPER machine model
---------------------

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
*machine model*. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing
community.

Machine model parallelism
-------------------------

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently. Computational work is divided among the logical processors
by allocating “ownership” to each processor of a certain set (or sets)
of calculations. Each set of calculations owned by a particular
processor is associated with a specific region of the physical space
that is being simulated, and only one processor will be associated with
each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors. It is perfectly possible
to execute a configuration decomposed for multiple logical processors on
a single physical processor. This helps ensure that numerical code that
is written to fit within the WRAPPER will parallelize with no additional
effort. It is also useful for debugging purposes. Generally, however,
the computational domain will be subdivided over multiple logical
processors in order to then bind those logical processors to physical
processor resources that can compute in parallel.

.. _tile_description:

Tiles
~~~~~

Computationally, the data structures (e.g., arrays, scalar variables,
etc.) that hold the simulated state are associated with each region of
physical space and are allocated to a particular logical processor. We
refer to these data structures as being **owned** by the processor to
which their associated region of physical space has been allocated.
Individual regions that are allocated to processors are called
**tiles**. A processor can own more than one tile.
:numref:`domain_decomp` shows a physical domain being mapped to a set of
logical processors, with each processor owning a single region of the
domain (a single tile). Except for periods of communication and
coordination, each processor computes autonomously, working only with
data from the tile that the processor owns. If instead multiple tiles
were allotted to a single processor, each of these tiles would be
computed independently of the other allotted tiles, in a sequential
fashion.

.. figure:: figs/domain_decomp.png
   :width: 70%
   :align: center
   :alt: domain decomposition
   :name: domain_decomp

   The WRAPPER provides support for one and two dimensional decompositions of grid-point domains. The figure shows a hypothetical domain of total size :math:`N_{x}N_{y}N_{z}`. This hypothetical domain is decomposed in two dimensions along the :math:`N_{x}` and :math:`N_{y}` directions. The resulting tiles are owned by different processors. The owning processors perform the arithmetic operations associated with a tile. Although not illustrated here, a single processor can own several tiles. Whenever a processor wishes to transfer data between tiles or communicate with other processors it calls a WRAPPER supplied function.

Tile layout
~~~~~~~~~~~

Tiles consist of an interior region and an overlap region. The overlap
region of a tile corresponds to the interior region of an adjacent tile.
In :numref:`tiled-world` each tile would own the region within the black
square and hold duplicate information for overlap regions extending into
the tiles to the north, south, east and west. During computational
phases a processor will reference data in an overlap region whenever it
requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data
between tiles, in order to keep the overlap regions up to date (see
:numref:`comm_primitives`). The WRAPPER functions can use a variety of
different mechanisms to communicate data between tiles.

.. figure:: figs/tiled-world.png
   :width: 70%
   :align: center
   :alt: global earth subdivided into tiles
   :name: tiled-world

   A global grid subdivided into tiles. Tiles contain an interior region and an overlap region. Overlap regions are periodically updated from neighboring tiles.
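
To make the role of the overlap region concrete, the schematic fragment
below updates a field on a single tile using a five-point average. The
tile-size and overlap parameters (``sNx``, ``sNy``, ``OLx``, ``OLy``)
are those described in :numref:`specify_decomp`; the fragment ignores
the per-tile indices and exchange interface introduced later in this
chapter, so it is purely illustrative and not an excerpt of actual
MITgcm code.

.. code-block:: fortran

   C     Illustrative single-tile fragment: sNx, sNy give the tile
   C     interior size and OLx, OLy the overlap widths (cf. SIZE.h).
         REAL*8 phi   (1-OLx:sNx+OLx, 1-OLy:sNy+OLy)
         REAL*8 phiNew(1-OLx:sNx+OLx, 1-OLy:sNy+OLy)
         INTEGER I, J

   C     Update interior points only. At the tile edges (I=1, I=sNx,
   C     J=1 or J=sNy) the stencil reads values held in the overlap
   C     region, i.e., copies of a neighboring tile's interior points.
         DO J=1,sNy
          DO I=1,sNx
           phiNew(I,J) = 0.25*( phi(I-1,J) + phi(I+1,J)
        &                     + phi(I,J-1) + phi(I,J+1) )
          ENDDO
         ENDDO

   C     ... a WRAPPER exchange call would be made here to bring the
   C     overlap regions of phiNew up to date before the next
   C     computational phase (see the communication primitives below) ...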

Communication mechanisms
------------------------

Logical processors are assumed to be able to exchange information
between tiles (and between each other) using at least one of two
possible mechanisms, shared memory or distributed memory communication.
The WRAPPER assumes that communication will use one of these two styles.
The underlying hardware and operating system support for the style used
is not specified and can vary from system to system.

.. _shared_mem_comm:

Shared memory communication
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under this mode of communication, data transfers are assumed to be
possible using direct addressing of regions of memory. In the WRAPPER
shared memory communication model, simple writes to an array can be made
to be visible to other CPUs at the application code level. So, as shown
below, if one CPU (CPU1) writes the value 8 to element 3 of array a,
then other CPUs (here, CPU2) will be able to see the value 8 when they
read from a(3). This provides a very low latency and high bandwidth
communication mechanism. Thus, in this way one CPU can communicate
information to another CPU by assigning a particular value to a
particular memory location.

::

         CPU1                    |        CPU2
         ====                    |        ====
                                 |
        a(3) = 8                 |        WHILE ( a(3) .NE. 8 )
                                 |         WAIT
                                 |        END WHILE
                                 |

Under shared communication independent CPUs are operating on the exact
same global address space at the application level. This is the model of
memory access that is supported at the basic system design level in
“shared-memory” systems such as PVP systems, SMP systems, and on
distributed shared memory systems (e.g., SGI Origin, SGI Altix, and some
AMD Opteron systems). On such systems the WRAPPER will generally use
simple read and write statements to access directly application data
structures when communicating between CPUs.

In a system where assignment statements map directly to hardware
instructions that transport data between CPU and memory banks, this can
be a very efficient mechanism for communication. In such cases multiple
CPUs can communicate simply by reading and writing to agreed locations
and following a few basic rules. The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPUs communicating in this way can be close to the bandwidth of
the system’s main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.

Memory consistency
##################

When using shared memory communication between multiple processors, the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors. In particular, one issue the WRAPPER layer must deal
with is the system’s memory model. In general the order of reads and
writes expressed by the textual order of an application code may not be
the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will
always operate so that, for the application instructions the processor
is executing, any reordering is not apparent. However, machines are
often designed so that reordering of instructions is not hidden from
other processors. This means that, in general, even on a shared memory
system two processors can observe inconsistent memory values.

The issue of memory consistency between multiple processors is discussed
at length in many computer science papers. From a practical point of
view, in order to deal with this issue, shared memory machines all
provide some mechanism to enforce memory consistency when it is needed.
The exact mechanism employed will vary between systems. For
communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.
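
To illustrate why such a mechanism is needed, consider a variant of the
earlier example in which CPU1 writes a data value and then sets a flag
to signal that the data is ready. On a system that does not guarantee
sequential consistency, CPU2 may observe the flag update before the data
write becomes visible, and so read a stale value. The sketch below is
schematic only: in MITgcm the consistency operation is invoked inside
the WRAPPER communication routines, via the
:filelink:`memsync.F <eesupp/src/memsync.F>` primitive described in
:numref:`controlling_comm`, not in application code.

::

         CPU1                    |        CPU2
         ====                    |        ====
                                 |
        a(1) = <data value>      |        WHILE ( a(3) .NE. 8 )
        <enforce consistency>    |         WAIT
        a(3) = 8                 |        END WHILE
                                 |        ... now safe to read a(1) ...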

Cache effects and false sharing
###############################

Shared-memory machines often have local-to-processor memory caches which
contain mirrored copies of main memory. Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors. These cache-coherence protocols typically enforce
consistency between regions of memory with large granularity (typically
128 or 256 byte chunks). The coherency protocols employed can be
expensive relative to other memory accesses and so care is taken in the
WRAPPER (by padding synchronization structures appropriately) to avoid
unnecessary coherence traffic.

Operating system support for shared memory
###########################################

Applications running under multiple threads within a single process can
use shared memory communication. In this case *all* the memory locations
in an application are potentially visible to all the compute threads.
Multiple threads operating within a single process is the standard
mechanism for supporting shared memory that the WRAPPER utilizes.
Configuring and launching code to run in multi-threaded mode on specific
platforms is discussed in :numref:`multi-thread_exe`. However, on many
systems, potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works by
making a limited region of memory shared between processes. The MMAP and
IPC facilities in UNIX systems provide this capability as do vendor
specific tools like LAPI and IMC. Extensions exist for the WRAPPER that
allow these mechanisms to be used for shared memory communication.
However, these mechanisms are not distributed with the default WRAPPER
sources, because of their proprietary nature.

.. _distributed_mem_comm:

Distributed memory communication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under this mode of communication there is no mechanism, at the
application code level, for directly addressing regions of memory owned
by and visible to another CPU. Instead a communication library must be
used, as illustrated below. If one CPU (here, CPU1) writes the value 8
to element 3 of array a, then at least one of CPU1 and/or CPU2 will need
to call a function in the API of the communication library to
communicate data from a tile that it owns to a tile that another CPU
owns. By default the WRAPPER binds to the MPI communication library for
this style of communication (see https://computing.llnl.gov/tutorials/mpi/
for more information about the MPI Standard).

::

         CPU1                    |        CPU2
         ====                    |        ====
                                 |
        a(3) = 8                 |        WHILE ( a(3) .NE. 8 )
        CALL SEND( CPU2,a(3) )   |         CALL RECV( CPU1, a(3) )
                                 |        END WHILE
                                 |

Many parallel systems are not constructed in a way where it is possible
or practical for an application to use shared memory for communication.
For cluster systems consisting of individual computers connected by a
fast network, there is no notion of shared memory at the system level.
For this sort of system the WRAPPER provides support for communication
based on a bespoke communication library. The default communication
library used is MPI. It is relatively straightforward to implement
bindings to optimized platform specific communication libraries. For
example the work described in Hoe et al. (1999) :cite:`hoe:99`
substituted standard MPI communication for a highly optimized library.
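
The ``SEND`` and ``RECV`` calls in the sketch above are generic. With
the default MPI binding, the corresponding transfer would conceptually
look like the fragment below, which uses plain point-to-point MPI with
hypothetical buffer, rank and tag values. It is an illustration of the
idea only, not an excerpt of the WRAPPER's actual MPI exchange code,
which manages overlap buffers, tags and communicators internally, and it
assumes MPI has already been initialized.

.. code-block:: fortran

   C     Conceptual MPI equivalent of the SEND/RECV sketch above.
   C     Rank 0 plays the role of CPU1, rank 1 the role of CPU2.
         INCLUDE 'mpif.h'
         REAL*8  a(8)
         INTEGER rank, ierr
         INTEGER status(MPI_STATUS_SIZE)

         CALL MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
         IF ( rank .EQ. 0 ) THEN
           a(3) = 8.
   C       send one REAL*8 value to rank 1, using message tag 99
           CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, 99,
        &                 MPI_COMM_WORLD, ierr )
         ELSEIF ( rank .EQ. 1 ) THEN
   C       receive the value from rank 0 into the local copy of a(3)
           CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, 99,
        &                 MPI_COMM_WORLD, status, ierr )
         ENDIF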

.. _comm_primitives:

Communication primitives
------------------------

Optimized communication support is assumed to be potentially available
for a small number of communication operations. It is also assumed that
communication performance optimizations can be achieved by optimizing a
small number of communication primitives. Three optimizable primitives
are provided by the WRAPPER; a schematic example of how they appear in
application code is given after the list below.

.. figure:: figs/comm-prim.png
   :width: 70%
   :align: center
   :alt: global sum and exchange comm primitives
   :name: comm-prim

   Three performance critical parallel primitives are provided by the WRAPPER. These primitives are always used to communicate data between tiles. The figure shows four tiles. The curved arrows indicate exchange primitives which transfer data between the overlap regions at tile edges and interior regions for nearest-neighbor tiles. The straight arrows symbolize global sum operations which connect all tiles. The global sum operation provides both a key arithmetic primitive and can serve as a synchronization primitive. A third barrier primitive is also provided, which behaves much like the global sum primitive.

-  **EXCHANGE** This operation is used to transfer data between interior
   and overlap regions of neighboring tiles. A number of different forms
   of this operation are supported. These different forms handle:

   -  Data type differences. Sixty-four bit and thirty-two bit fields
      may be handled separately.

   -  Bindings to different communication methods. Exchange primitives
      select between using shared memory or distributed memory
      communication.

   -  Transformation operations required when transporting data between
      different grid regions. Transferring data between faces of a
      cube-sphere grid, for example, involves a rotation of vector
      components.

   -  Forward and reverse mode computations. Derivative calculations
      require tangent linear and adjoint forms of the exchange
      primitives.

-  **GLOBAL SUM** The global sum operation is a central arithmetic
   operation for the pressure inversion phase of the MITgcm algorithm.
   For certain configurations, scaling can be highly sensitive to the
   performance of the global sum primitive. This operation is a
   collective operation involving all tiles of the simulated domain.
   Different forms of the global sum primitive exist for handling:

   -  Data type differences. Sixty-four bit and thirty-two bit fields
      may be handled separately.

   -  Bindings to different communication methods. Global sum primitives
      select between using shared memory or distributed memory
      communication.

   -  Forward and reverse mode computations. Derivative calculations
      require tangent linear and adjoint forms of the global sum
      primitives.

-  **BARRIER** The WRAPPER provides a global synchronization function
   called barrier. This is used to synchronize computations over all
   tiles. The **BARRIER** and **GLOBAL SUM** primitives have much in
   common and in some cases use the same underlying code.
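
The sketch below indicates how these three primitives typically appear
in numerical code: an exchange to refresh overlap regions, a
computational phase over the tiles a thread owns, a global sum to form a
domain-wide scalar, and a barrier to synchronize threads. It uses the
loop structure described in :numref:`specify_decomp` and the
``GLOBAL_SUM_R8``, ``_EXCH_XY_R8`` and ``_BARRIER`` entry points
discussed in :numref:`controlling_comm`; the fragment is schematic
rather than complete, and the exact argument lists of the exchange and
barrier macros are defined in
:filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.

.. code-block:: fortran

   C     EXCHANGE: bring the overlap regions of phi up to date
         _EXCH_XY_R8( phi, myThid )

   C     Computational phase: loop over the tiles this thread owns
         sumPhi = 0.
         DO bj=myByLo(myThid),myByHi(myThid)
          DO bi=myBxLo(myThid),myBxHi(myThid)
           DO J=1,sNy
            DO I=1,sNx
             sumPhi = sumPhi + phi(I,J,bi,bj)*phi(I,J,bi,bj)
            ENDDO
           ENDDO
          ENDDO
         ENDDO

   C     GLOBAL SUM: combine the partial sums from all tiles, threads
   C     and processes into a single domain-wide value
         CALL GLOBAL_SUM_R8( sumPhi, myThid )

   C     BARRIER: wait here until every thread has reached this point
         _BARRIER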

Memory architecture
-------------------

The WRAPPER machine model is designed to target efficiently both systems
with highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in :numref:`tiling_detail`. Within a
CPU, computations are carried out sequentially on each tile in turn. By
reshaping tiles according to the target platform it is possible to
automatically tune code to improve memory performance. On a vector
machine a given domain might be subdivided into a few long, thin
regions. On a commodity microprocessor based system, however, the same
region could be simulated using many more, smaller sub-domains.

.. figure:: figs/tiling_detail.png
   :width: 70%
   :align: center
   :alt: tiling strategy in WRAPPER
   :name: tiling_detail

   The tiling strategy that the WRAPPER supports allows tiles to be shaped to suit the underlying system memory architecture. Compact tiles that lead to greater memory reuse can be used on cache based systems (upper half of figure) with deep memory hierarchies, whereas long tiles with large inner loops can be used to exploit vector systems having highly pipelined memory systems.

Summary
-------

Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:

-  The machine consists of one or more logical processors.

-  Each processor operates on tiles that it owns.

-  A processor may own more than one tile.

-  Processors may compute concurrently.

-  Exchange of information between tiles is handled by the machine
   (WRAPPER) not by the application.

Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which:

-  Processors may be able to communicate very efficiently with each
   other using shared memory.

-  An alternative communication mechanism based on a relatively simple
   interprocess communication API may be required.

-  Shared memory may not necessarily obey sequential consistency,
   however some mechanism will exist for enforcing memory consistency.

-  Memory consistency that is enforced at the hardware level may be
   expensive. Unnecessary triggering of consistency protocols should be
   avoided.

-  Memory access patterns may need to be either repetitive or highly
   pipelined for optimum hardware performance.

This generic model, summarized in :numref:`tiles_and_wrapper`, captures
the essential hardware ingredients of almost all successful scientific
computer systems designed in the last 50 years.

.. figure:: figs/tiles_and_wrapper.png
   :width: 85%
   :align: center
   :alt: summary figure tiles and wrapper
   :name: tiles_and_wrapper

   Summary of the WRAPPER machine model.

.. _using_wrapper:

Using the WRAPPER
=================

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are:

#. specifying how a domain will be decomposed

#. starting a code in either sequential or parallel modes of operation

#. controlling communication between tiles and between concurrently
   computing CPUs.

This section describes the details of each of these operations.
:numref:`specify_decomp` explains how the decomposition (or composition)
of a domain is expressed. :numref:`starting_code` describes practical
details of running codes in various different parallel modes on
contemporary computer systems. :numref:`controlling_comm` explains the
internal information that the WRAPPER uses to control how information is
communicated between tiles.

.. _specify_decomp:

Specifying a domain decomposition
---------------------------------

At its heart, much of the WRAPPER works only in terms of a collection of
tiles which are interconnected to each other.
This is also true of
application code operating within the WRAPPER. Application code is
written as a series of compute operations, each of which operates on a
single tile. If application code needs to perform operations involving
data associated with another tile, it uses a WRAPPER function to obtain
that data. The specification of how a global domain is constructed from
tiles, or alternatively how a global domain is decomposed into tiles, is
made in the file :filelink:`SIZE.h <model/inc/SIZE.h>`. This file
defines the following parameters:

.. admonition:: File: :filelink:`model/inc/SIZE.h`
   :class: note

   | Parameter: :varlink:`sNx`, :varlink:`sNy`
   | Parameter: :varlink:`OLx`, :varlink:`OLy`
   | Parameter: :varlink:`nSx`, :varlink:`nSy`
   | Parameter: :varlink:`nPx`, :varlink:`nPy`

Together these parameters define a tiling decomposition of the style
shown in :numref:`size_h`. The parameters ``sNx`` and ``sNy`` define the
size of an individual tile. The parameters ``OLx`` and ``OLy`` define
the maximum size of the overlap extent. This must be set to the maximum
width of the computation stencil that the numerical code
finite-difference operations require between overlap region updates. The
maximum overlap required by any of the operations in the MITgcm code
distributed at this time is four grid points (some of the higher-order
advection schemes require a large overlap region). Code modifications
and enhancements that involve adding wide finite-difference stencils may
require increasing ``OLx`` and ``OLy``. Setting ``OLx`` and ``OLy`` to
too large a value will decrease code performance (because redundant
computations will be performed), however it will not cause any other
problems.

.. figure:: figs/size_h.png
   :width: 80%
   :align: center
   :alt: explanation of SIZE.h domain decomposition
   :name: size_h

   The three level domain decomposition hierarchy employed by the WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated to a single process. Multiple processes can exist, each with multiple tiles. Tiles within a process can be spread over multiple compute threads.

The parameters ``nSx`` and ``nSy`` specify the number of tiles that will
be created within a single process. Each of these tiles will have
internal dimensions of ``sNx`` and ``sNy``. If, when the code is
executed, these tiles are allocated to different threads of a process
that are then bound to different physical processors (see the
multi-threaded execution discussion in :numref:`starting_code`), then
computation will be performed concurrently on each tile. However, it is
also possible to run the same decomposition within a process running a
single thread on a single processor. In this case the tiles will be
computed over sequentially. If the decomposition is run in a single
process running multiple threads but attached to a single physical
processor, then, in general, the computation for different tiles will be
interleaved by system level software. This too is a valid mode of
operation.

The parameters ``sNx``, ``sNy``, ``OLx``, ``OLy``, ``nSx`` and ``nSy``
are used extensively by numerical code.
The settings of ``sNx``, ``sNy``, ``OLx``, and ``OLy`` are used to form
the loop ranges for many numerical calculations and to provide
dimensions for arrays holding numerical state. The parameters ``nSx``
and ``nSy`` are used in conjunction with the thread number parameter
``myThid``. Much of the numerical code operating within the WRAPPER
takes the form:

.. code-block:: fortran

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          a block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          another block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

The variables ``myBxLo(myThid)``, ``myBxHi(myThid)``, ``myByLo(myThid)``
and ``myByHi(myThid)`` set the bounds of the loops in ``bi`` and ``bj``
in this schematic. These variables specify the subset of the tiles in
the range ``1, nSx`` and ``1, nSy`` that the logical processor bound to
thread number ``myThid`` owns. The thread number variable ``myThid``
ranges from 1 to the total number of threads requested at execution
time. For each value of ``myThid`` the loop scheme above will step
sequentially through the tiles owned by that thread. However, different
threads will have different ranges of tiles assigned to them, so that
separate threads can compute iterations of the ``bi``, ``bj`` loop
concurrently. Within a ``bi``, ``bj`` loop, computation is performed
concurrently over as many processes and threads as there are physical
processors available to compute.

An exception to the use of ``bi`` and ``bj`` in loops arises in the
exchange routines used when the
:ref:`exch2 package <sub_phys_pkg_exch2>` is used with the cubed sphere.
In this case ``bj`` is generally set to 1 and the loop runs from
``1, bi``. Within the loop ``bi`` is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded in a single loop over
``bi`` and ``bj`` varies for different parts of the MITgcm algorithm.
Consider a code extract from the 2-D implicit elliptic solver:

.. code-block:: fortran

      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
         :
         :
        other computations
         :
         :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err + cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

This portion of the code computes the :math:`L_2`\ -Norm of a vector
whose elements are held in the array ``cg2d_r``, writing the final
result to the scalar variable ``err``. Notice that under the WRAPPER,
arrays such as ``cg2d_r`` have two extra trailing dimensions. These
right-most indices are tile indexes. Different threads within a single
process operate on different ranges of tile index, as controlled by the
settings of ``myByLo(myThid)``, ``myByHi(myThid)``, ``myBxLo(myThid)``
and ``myBxHi(myThid)``.
Because the :math:`L_2`\ -Norm
requires a global reduction, the ``bi``, ``bj`` loop above only contains
one statement. This computation phase is then followed by a
communication phase in which all threads and processes must participate.
However, in other areas of the MITgcm code, entire subsections of code
are within a single ``bi``, ``bj`` loop. For example the evaluation of
all the momentum equation prognostic terms (see
:filelink:`dynamics.F <model/src/dynamics.F>`) is within a single
``bi``, ``bj`` loop.

The final decomposition parameters are ``nPx`` and ``nPy``. These
parameters are used to indicate to the WRAPPER level how many processes
(each with ``nSx``\ :math:`\times`\ ``nSy`` tiles) will be used for this
simulation. This information is needed during initialization and during
I/O phases. However, unlike the variables ``sNx``, ``sNy``, ``OLx``,
``OLy``, ``nSx`` and ``nSy``, the values of ``nPx`` and ``nPy`` are
absent from the core numerical and support code.

Examples of :filelink:`SIZE.h <model/inc/SIZE.h>` specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following different :filelink:`SIZE.h <model/inc/SIZE.h>` parameter
settings illustrate how to interpret the values of ``sNx``, ``sNy``,
``OLx``, ``OLy``, ``nSx``, ``nSy``, ``nPx`` and ``nPy``.

#. ::

       PARAMETER (
      &           sNx =  90,
      &           sNy =  40,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   1,
      &           nSy =   1,
      &           nPx =   1,
      &           nPy =   1)

   This sets up a single tile with *x*-dimension of ninety grid points,
   *y*-dimension of forty grid points, and *x* and *y* overlaps of three
   grid points each.

#. ::

       PARAMETER (
      &           sNx =  45,
      &           sNy =  20,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   1,
      &           nSy =   1,
      &           nPx =   2,
      &           nPy =   2)

   This sets up tiles with *x*-dimension of forty-five grid points,
   *y*-dimension of twenty grid points, and *x* and *y* overlaps of
   three grid points each. There are four tiles allocated to four
   separate processes (``nPx=2, nPy=2``) and arranged so that the global
   domain size is again ninety grid points in *x* and forty grid points
   in *y*. In general the formula for global grid size (held in model
   variables ``Nx`` and ``Ny``) is

   ::

       Nx = sNx*nSx*nPx
       Ny = sNy*nSy*nPy

#. ::

       PARAMETER (
      &           sNx =  90,
      &           sNy =  10,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   1,
      &           nSy =   2,
      &           nPx =   1,
      &           nPy =   2)

   This sets up tiles with *x*-dimension of ninety grid points,
   *y*-dimension of ten grid points, and *x* and *y* overlaps of three
   grid points each. There are four tiles allocated to two separate
   processes (``nPy=2``) each of which has two separate sub-domains
   (``nSy=2``). The global domain size is again ninety grid points in
   *x* and forty grid points in *y*. The two sub-domains in each process
   will be computed sequentially if they are given to a single thread
   within a single process. Alternatively if the code is invoked with
   multiple threads per process the two domains in *y* may be computed
   concurrently.

#. ::

       PARAMETER (
      &           sNx =  32,
      &           sNy =  32,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   6,
      &           nSy =   1,
      &           nPx =   1,
      &           nPy =   1)

   This sets up tiles with *x*-dimension of thirty-two grid points,
   *y*-dimension of thirty-two grid points, and *x* and *y* overlaps of
   three grid points each. There are six tiles allocated to six separate
   logical processors (``nSx=6``). This set of values can be used for a
   cube sphere calculation. Each tile of size :math:`32 \times 32`
   represents a face of the cube. Initializing the tile connectivity
   correctly (see :numref:`cubed_sphere_comm`) allows the rotations
   associated with moving between the six cube faces to be embedded
   within the tile-tile communication code.

.. _starting_code:

Starting the code
-----------------

When code is started under the WRAPPER, execution begins in a main
routine :filelink:`eesupp/src/main.F` that is owned by the WRAPPER.
Control is transferred to the application through a routine called
:filelink:`model/src/the_model_main.F` once the WRAPPER has initialized
correctly and has created the necessary variables to support subsequent
calls to communication routines by the application code. The main stages
of the WRAPPER startup calling sequence are as follows:

::

    MAIN
    |
    |--EEBOOT :: WRAPPER initialization
    |  |
    |  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
    |  |                     allow basic I/O.
    |  |-- EEINTRO_MSG :: Write startup greeting.
    |  |
    |  |-- EESET_PARMS :: Set WRAPPER parameters
    |  |
    |  |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
    |  |
    |  |-- INI_PROCS :: Associate processes with grid regions.
    |  |
    |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
    |  |
    |  |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
    |                                :: communication data structures
    |
    |
    |--CHECK_THREADS :: Validate multiple thread start up.
    |
    |--THE_MODEL_MAIN :: Numerical code top-level driver routine

The steps above precede the transfer of control to application code,
which occurs in the procedure
:filelink:`the_model_main.F <model/src/the_model_main.F>`.

.. _multi-thread_exe:

Multi-threaded execution
~~~~~~~~~~~~~~~~~~~~~~~~

Prior to transferring control to the procedure
:filelink:`the_model_main.F <model/src/the_model_main.F>` the WRAPPER
may cause several coarse grain threads to be initialized. The routine
:filelink:`the_model_main.F <model/src/the_model_main.F>` is called once
for each thread and is passed a single stack argument which is the
thread number, stored in the variable :varlink:`myThid`. In addition to
specifying a decomposition with multiple tiles per process (see
:numref:`specify_decomp`), configuring and starting a code to run using
multiple threads requires the following steps.

Compilation
###########

First the code must be compiled with appropriate multi-threading
directives active in the file :filelink:`eesupp/src/main.F` and with
appropriate compiler flags to request multi-threading support. The
header files :filelink:`eesupp/inc/MAIN_PDIRECTIVES1.h` and
:filelink:`eesupp/inc/MAIN_PDIRECTIVES2.h` contain directives compatible
with compilers for Sun, Compaq, SGI, Hewlett-Packard SMP systems and
CRAY PVP systems.
These directives can be activated by using compile time directives
``-DTARGET_SUN``, ``-DTARGET_DEC``, ``-DTARGET_SGI``, ``-DTARGET_HP`` or
``-DTARGET_CRAY_VECTOR`` respectively. Compiler options for invoking
multi-threaded compilation vary from system to system and from compiler
to compiler. The options will be described in the individual compiler
documentation. For the Fortran compiler from Sun the following options
are needed to correctly compile multi-threaded code

::

    -stackvar -explicitpar -vpara -noautopar

These options are specific to the Sun compiler. Other compilers will use
different syntax that will be described in their documentation. The
effect of these options is as follows:

#. **-stackvar** Causes all local variables to be allocated in stack
   storage. This is necessary for local variables to ensure that they
   are private to their thread. Note, when using this option it may be
   necessary to override the default limit on stack-size that the
   operating system assigns to a process. This can normally be done by
   changing the settings of the command shell’s ``stack-size`` limit.
   However, on some systems changing this limit will require privileged
   administrator access to modify system parameters.

#. **-explicitpar** Requests that multiple threads be spawned in
   response to explicit directives in the application code. These
   directives are inserted with syntax appropriate to the particular
   target platform when, for example, the ``-DTARGET_SUN`` flag is
   selected.

#. **-vpara** This causes the compiler to describe the multi-threaded
   configuration it is creating. This is not required but it can be
   useful when troubleshooting.

#. **-noautopar** This inhibits any automatic multi-threaded
   parallelization the compiler may otherwise generate.

An example of valid settings for the ``eedata`` file for a domain with
two subdomains in *y* and running with two threads is shown below

::

    nTx=1,nTy=2

This set of values will cause computations to stay within a single
thread when moving across the ``nSx`` sub-domains. In the *y*-direction,
however, sub-domains will be split equally between two threads.
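
For this ``eedata`` setting to be valid the corresponding
:filelink:`SIZE.h <model/inc/SIZE.h>` must declare at least two
sub-domains in *y*, since ``nTy`` must subdivide ``nSy`` exactly (see
the runtime input parameters discussion below). A decomposition
consistent with this example would be, for instance (the tile sizes here
are arbitrary, illustrative values):

::

     PARAMETER (
    &           sNx =  45,
    &           sNy =  20,
    &           OLx =   3,
    &           OLy =   3,
    &           nSx =   1,
    &           nSy =   2,
    &           nPx =   1,
    &           nPy =   1)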

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution (described in
:numref:`multi-process_exe`). One major reason for this is that many
system libraries are still not “thread-safe”. This means that, for
example, on some systems it is not safe to call system routines to
perform I/O when running in multi-threaded mode (except, perhaps, in a
limited set of circumstances). Another reason is that support for
multi-threaded programming models varies between systems.

.. _multi-process_exe:

Multi-process execution
~~~~~~~~~~~~~~~~~~~~~~~

Multi-process execution is more ubiquitous than multi-threaded
execution. In order to run code in a multi-process configuration, a
decomposition specification (see :numref:`specify_decomp`) is given (in
which at least one of the parameters ``nPx`` or ``nPy`` will be greater
than one). Then, as for multi-threaded operation, appropriate compile
time and run time steps must be taken.

Compilation
###########

Multi-process execution under the WRAPPER assumes that portable MPI
libraries are available for controlling the start-up of multiple
processes. The MPI libraries are not required, although they are usually
used, for performance critical communication. However, in order to
simplify the task of controlling and coordinating the start up of a
large number (hundreds and possibly even thousands) of copies of the
same program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by specifying the appropriate options file with the ``-of`` flag
when running the :filelink:`genmake2 <tools/genmake2>` script, which
generates the Makefile for compiling and linking MITgcm. (Previously
this was done by setting the ``ALLOW_USE_MPI`` and ``ALWAYS_USE_MPI``
flags in the :filelink:`CPP_EEOPTIONS.h <eesupp/inc/CPP_EEOPTIONS.h>`
file.) More detailed information about the use of
:filelink:`genmake2 <tools/genmake2>` for specifying local compiler
flags is located in :numref:`genmake2_desc`.

Execution
#########

The mechanics of starting a program in multi-process mode under MPI are
not standardized. Documentation associated with the distribution of MPI
installed on a system will describe how to start a program using that
distribution. For the open-source `MPICH <https://www.mpich.org/>`_
system, the MITgcm program can be started using a command such as

::

    mpirun -np 64 -machinefile mf ./mitgcmuv

In this example the text ``-np 64`` specifies the number of processes
that will be created. The numeric value 64 must be equal to (or greater
than) the product of the processor grid settings of ``nPx`` and ``nPy``
in the file :filelink:`SIZE.h <model/inc/SIZE.h>`. The option
``-machinefile mf`` specifies that a text file called ``mf`` will be
read to get a list of processor names on which the sixty-four processes
will execute. The syntax of this file is specified by the MPI
distribution.
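
Tying this to the decompositions shown in :numref:`specify_decomp`, the
second :filelink:`SIZE.h <model/inc/SIZE.h>` example there uses
``nPx=2, nPy=2``, so the corresponding MPICH start-up command would
request (at least) four processes, for example

::

    mpirun -np 4 -machinefile mf ./mitgcmuv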

Environment variables
~~~~~~~~~~~~~~~~~~~~~

On some systems multi-threaded execution also requires the setting of a
special environment variable. On many machines this variable is called
``PARALLEL`` and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with the
multi-threaded compiler on a machine will explain how to set the
required environment variables.

Runtime input parameters
~~~~~~~~~~~~~~~~~~~~~~~~

Finally the file ``eedata`` needs to be configured to indicate the
number of threads to be used in the *x* and *y* directions:

::

     # Example "eedata" file
     # Lines beginning "#" are comments
     # nTx - No. threads per process in X
     # nTy - No. threads per process in Y
     &EEPARMS
     nTx=1,
     nTy=1,
     &

The product of ``nTx`` and ``nTy`` must be equal to the number of
threads spawned, i.e., the setting of the environment variable
``PARALLEL``. The value of ``nTx`` must subdivide the number of
sub-domains in *x* (``nSx``) exactly. The value of ``nTy`` must
subdivide the number of sub-domains in *y* (``nSy``) exactly. The
multi-process startup of the MITgcm executable ``mitgcmuv`` is
controlled by the routines
:filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>` and
:filelink:`ini_procs.F <eesupp/src/ini_procs.F>`. The first routine
performs basic steps required to make sure each process is started and
has a textual output stream associated with it. By default two output
files are opened for each process with names ``STDOUT.NNNN`` and
``STDERR.NNNN``. The *NNNN* part of the name is filled in with the
process number so that process number 0 will create output files
``STDOUT.0000`` and ``STDERR.0000``, process number 1 will create output
files ``STDOUT.0001`` and ``STDERR.0001``, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process-by-process basis. The
:filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>` procedure
also sets the variables :varlink:`myProcId` and
:varlink:`MPI_COMM_MODEL`. These variables are related to processor
identification and are used later in the routine
:filelink:`ini_procs.F <eesupp/src/ini_procs.F>` to allocate tiles to
processes.

Allocation of processes to tiles is controlled by the routine
:filelink:`ini_procs.F <eesupp/src/ini_procs.F>`. For each process this
routine sets the variables :varlink:`myXGlobalLo` and
:varlink:`myYGlobalLo`. These variables specify, in index space, the
coordinates of the southernmost and westernmost corner of the
southernmost and westernmost tile owned by this process. The variables
:varlink:`pidW`, :varlink:`pidE`, :varlink:`pidS` and :varlink:`pidN`
are also set in this routine. These are used to identify processes
holding tiles to the west, east, south and north of a given process.
These values are stored in global storage in the header file
:filelink:`EESUPPORT.h <eesupp/inc/EESUPPORT.h>` for use by
communication routines. The above does not hold when the
:ref:`exch2 package <sub_phys_pkg_exch2>` is used. The
:ref:`exch2 package <sub_phys_pkg_exch2>` sets its own parameters to
specify the global indices of tiles and their relationships to each
other. See the documentation on the
:ref:`exch2 package <sub_phys_pkg_exch2>` for details.

.. _controlling_comm:

Controlling communication
-------------------------

The WRAPPER maintains internal information that is used for
communication operations and can be customized for different platforms.
This section describes the information that is held and used.

1. **Tile-tile connectivity information** For each tile the WRAPPER sets
   a flag that records the tile number to the north, south, east and
   west of that tile. This number is unique over all tiles in a
   configuration. Except when using the cubed sphere and the
   :ref:`exch2 package <sub_phys_pkg_exch2>`, the number is held in the
   variables :varlink:`tileNo` (this holds the tile’s own number),
   :varlink:`tileNoN`, :varlink:`tileNoS`, :varlink:`tileNoE` and
   :varlink:`tileNoW`. A parameter is also stored with each tile that
   specifies the type of communication that is used between tiles. This
   information is held in the variables :varlink:`tileCommModeN`,
   :varlink:`tileCommModeS`, :varlink:`tileCommModeE` and
   :varlink:`tileCommModeW`. This latter set of variables can take one
   of the following values: ``COMM_NONE``, ``COMM_MSG``, ``COMM_PUT``
   and ``COMM_GET``. A value of ``COMM_NONE`` is used to indicate that a
   tile has no neighbor to communicate with on a particular face. A
   value of ``COMM_MSG`` is used to indicate that some form of
   distributed memory communication is required to communicate between
   these tile faces (see :numref:`distributed_mem_comm`). A value of
   ``COMM_PUT`` or ``COMM_GET`` is used to indicate forms of shared
   memory communication (see :numref:`shared_mem_comm`). The
   ``COMM_PUT`` value indicates that a CPU should communicate by writing
   to data structures owned by another CPU. A ``COMM_GET`` value
   indicates that a CPU should communicate by reading from data
   structures owned by another CPU. These flags affect the behavior of
   the WRAPPER exchange primitive (see :numref:`comm-prim`). The routine
   :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`
   is responsible for setting the communication mode values for each
   tile.

   When using the cubed sphere configuration with the
   :ref:`exch2 package <sub_phys_pkg_exch2>`, the relationships between
   tiles and their communication methods are set by the
   :ref:`exch2 package <sub_phys_pkg_exch2>` and stored in different
   variables. See the :ref:`exch2 package <sub_phys_pkg_exch2>`
   documentation for details.

   |

2. **MP directives** The WRAPPER transfers control to numerical
   application code through the routine
   :filelink:`the_model_main.F <model/src/the_model_main.F>`. This
   routine is called in a way that allows for it to be invoked by
   several threads. Support for this is based on either multi-processing
   (MP) compiler directives or specific calls to multi-threading
   libraries (e.g., POSIX threads). Most commercially available Fortran
   compilers support the generation of code to spawn multiple threads
   through some form of compiler directives. Compiler directives are
   generally more convenient than writing code to explicitly spawn
   threads. On some systems, compiler directives may be the only method
   available. The WRAPPER is distributed with template MP directives for
   a number of systems.

   These directives are inserted into the code just before and after the
   transfer of control to numerical algorithm code through the routine
   :filelink:`the_model_main.F <model/src/the_model_main.F>`.
   An example of the code that performs this process for a Silicon
   Graphics system is as follows:

   ::

       C--
       C--  Parallel directives for MIPS Pro Fortran compiler
       C--
       C      Parallel compiler directives for SGI with IRIX
       C$PAR  PARALLEL DO
       C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
       C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
       C
             DO I=1,nThreads
               myThid = I

       C--    Invoke nThreads instances of the numerical model
              CALL THE_MODEL_MAIN(myThid)

             ENDDO

   Prior to transferring control to the procedure
   :filelink:`the_model_main.F <model/src/the_model_main.F>` the WRAPPER
   may use MP directives to spawn multiple threads. This code is
   extracted from the files :filelink:`main.F <eesupp/src/main.F>` and
   :filelink:`eesupp/inc/MAIN_PDIRECTIVES1.h`. The variable
   :varlink:`nThreads` specifies how many instances of the routine
   :filelink:`the_model_main.F <model/src/the_model_main.F>` will be
   created. The value of :varlink:`nThreads` is set in the routine
   :filelink:`ini_threading_environment.F <eesupp/src/ini_threading_environment.F>`.
   The value is set equal to the product of the parameters
   :varlink:`nTx` and :varlink:`nTy` that are read from the file
   ``eedata``. If the value of :varlink:`nThreads` is inconsistent with
   the number of threads requested from the operating system (for
   example by using an environment variable as described in
   :numref:`multi-thread_exe`) then usually an error will be reported by
   the routine :filelink:`check_threads.F <eesupp/src/check_threads.F>`.

   |

3. **memsync flags** As discussed in :numref:`shared_mem_comm`, a
   low-level system function may be needed to force memory consistency
   on some shared memory systems. The routine
   :filelink:`memsync.F <eesupp/src/memsync.F>` is used for this
   purpose. This routine should not need modifying and the information
   below is only provided for completeness. A logical parameter
   :varlink:`exchNeedsMemSync` set in the routine
   :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`
   controls whether the :filelink:`memsync.F <eesupp/src/memsync.F>`
   primitive is called. In general this routine is only used for
   multi-threaded execution. The code that goes into the
   :filelink:`memsync.F <eesupp/src/memsync.F>` routine is specific to
   the compiler and processor used. In some cases, it must be written
   using a short code snippet of assembly language. For an Ultra Sparc
   system the following code snippet is used

   ::

       asm("membar #LoadStore|#StoreStore");

   For an Alpha based system the equivalent code reads

   ::

       asm("mb");

   while on an x86 system the following code is required

   ::

       asm("lock; addl $0,0(%%esp)": : :"memory")

#. **Cache line size** As discussed in :numref:`shared_mem_comm`,
   multi-threaded codes explicitly avoid penalties associated with
   excessive coherence traffic on an SMP system. To do this the shared
   memory data structures used by the
   :filelink:`global_sum.F <eesupp/src/global_sum.F>`,
   :filelink:`global_max.F <eesupp/src/global_max.F>` and
   :filelink:`barrier.F <eesupp/src/barrier.F>` routines are padded. The
   variables that control the padding are set in the header file
   :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`. These variables are
   called :varlink:`cacheLineSize`, :varlink:`lShare1`,
   :varlink:`lShare4` and :varlink:`lShare8`. The default values should
   not normally need changing.
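
   The idea behind the padding is sketched below with hypothetical
   names: spacing the per-thread entries of a shared reduction buffer
   one cache line apart (:varlink:`lShare8` being the cache line length
   counted in 8-byte words) ensures that updates made by different
   threads never touch the same cache line, so no "false sharing"
   coherence traffic is generated. This is an illustration of the
   technique only, not the actual declarations used in
   :filelink:`global_sum.F <eesupp/src/global_sum.F>`.

   ::

       C     sumBuf(1,I) is the only element thread I ever writes, and
       C     successive threads' slots are lShare8 REAL*8 words apart,
       C     i.e., on different cache lines.
             REAL*8  sumBuf( lShare8, MAX_NO_THREADS )
             COMMON /SHARED_SUM_BUFFER/ sumBuf

             sumBuf(1,myThid) = myLocalSum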
   |

3. **memsync flags** As discussed in :numref:`shared_mem_comm`, a
   low-level system function may be needed to force memory consistency on
   some shared memory systems. The routine
   :filelink:`memsync.F <eesupp/src/memsync.F>` is used for this purpose.
   This routine should not need modifying and the information below is
   only provided for completeness. A logical parameter
   :varlink:`exchNeedsMemSync`, set in the routine
   :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`,
   controls whether the :filelink:`memsync.F <eesupp/src/memsync.F>`
   primitive is called. In general this routine is only used for
   multi-threaded execution. The code that goes into the
   :filelink:`memsync.F <eesupp/src/memsync.F>` routine is specific to
   the compiler and processor used. In some cases, it must be written
   using a short code snippet of assembly language. For an Ultra Sparc
   system the following code snippet is used:

   ::

        asm("membar #LoadStore|#StoreStore");

   For an Alpha based system the equivalent code reads:

   ::

        asm("mb");

   while on an x86 system the following code is required:

   ::

        asm("lock; addl $0,0(%%esp)": : :"memory");

#. **Cache line size** As discussed in :numref:`shared_mem_comm`,
   multi-threaded codes must explicitly avoid the penalties associated
   with excessive coherence traffic on an SMP system. To do this the
   shared memory data structures used by the
   :filelink:`global_sum.F <eesupp/src/global_sum.F>`,
   :filelink:`global_max.F <eesupp/src/global_max.F>` and
   :filelink:`barrier.F <eesupp/src/barrier.F>` routines are padded. The
   variables that control the padding are set in the header file
   :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`. These variables are
   called :varlink:`cacheLineSize`, :varlink:`lShare1`, :varlink:`lShare4`
   and :varlink:`lShare8`. The default values should not normally need
   changing.

   |

#. **\_BARRIER** This is a CPP macro that is expanded to a call to a
   routine which synchronizes all the logical processors running under
   the WRAPPER. Using a macro here preserves flexibility to insert a
   specialized call in-line into application code. By default this
   resolves to calling the procedure
   :filelink:`barrier.F <eesupp/src/barrier.F>`. The default setting for
   the ``_BARRIER`` macro is given in the file
   :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.

   |

#. **\_GSUM** This is a CPP macro that is expanded to a call to a routine
   which sums a floating point number over all the logical processors
   running under the WRAPPER. Using a macro here provides extra
   flexibility to insert a specialized call in-line into application
   code. By default this resolves to calling the procedure
   ``GLOBAL_SUM_R8()`` for 64-bit floating point operands or
   ``GLOBAL_SUM_R4()`` for 32-bit floating point operands (located in
   file :filelink:`global_sum.F <eesupp/src/global_sum.F>`). The default
   setting for the ``_GSUM`` macro is given in the file
   :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`. The ``_GSUM``
   macro is a performance-critical operation, especially for large
   processor count, small tile size configurations. The custom
   communication example discussed in :numref:`jam_example` shows how the
   macro is used to invoke a custom global sum routine for a specific set
   of hardware.

   |

#. **\_EXCH** The ``_EXCH`` CPP macro is used to update tile overlap
   regions. It is qualified by a suffix indicating whether overlap
   updates are for two-dimensional (``_EXCH_XY``) or three-dimensional
   (``_EXCH_XYZ``) physical fields and whether fields are 32-bit floating
   point (``_EXCH_XY_R4``, ``_EXCH_XYZ_R4``) or 64-bit floating point
   (``_EXCH_XY_R8``, ``_EXCH_XYZ_R8``). The macro mappings are defined in
   the header file
   :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`. As with
   ``_GSUM``, the ``_EXCH`` operation plays a crucial role in scaling to
   small tile, large logical and physical processor count configurations.
   The example in :numref:`jam_example` discusses defining an optimized
   and specialized form of the ``_EXCH`` operation.

   The ``_EXCH`` operation is also central to supporting grids such as
   the cube-sphere grid. In this class of grid a rotation may be required
   between tiles. Aligning the coordinate requiring rotation with the
   tile decomposition allows the coordinate transformation to be embedded
   within a custom form of the ``_EXCH`` primitive. In these cases
   ``_EXCH`` is mapped to exch2 routines, as detailed in the
   :ref:`exch2 package <sub_phys_pkg_exch2>` documentation.
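
   As an illustration of how these macros appear in numerical code, the
   schematic fragment below fills the overlap regions of a 64-bit,
   three-dimensional field and then forms a global sum of a tile-local
   partial result. The argument lists shown are assumptions made for
   illustration; the authoritative macro definitions are those in
   :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`:

   ::

        C     Fill the overlap (halo) regions of the 3-D field phi
        C     from the neighboring tiles
              _EXCH_XYZ_R8( phi, myThid )

        C     Combine each thread/process partial sum into a global value
              sumPhi = myTileSum
              _GSUM( sumPhi, myThid )
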
   |

#. **Reverse Mode** The communication primitives ``_EXCH`` and ``_GSUM``
   both employ hand-written adjoint (or reverse mode) forms. These
   reverse mode forms can be found in the source code directory
   :filelink:`pkg/autodiff`. For the global sum primitive the reverse
   mode form calls are to ``GLOBAL_ADSUM_R4()`` and ``GLOBAL_ADSUM_R8()``
   (located in file
   :filelink:`global_sum_ad.F <pkg/autodiff/global_sum_ad.F>`). The
   reverse mode forms of the exchange primitives are found in routines
   prefixed ``ADEXCH``. The exchange routines make calls to the same
   low-level communication primitives as the forward mode operations.
   However, the routine argument :varlink:`theSimulationMode` is set to
   the value ``REVERSE_SIMULATION``. This signifies to the low-level
   routines that the adjoint forms of the appropriate communication
   operation should be performed.

   |

#. **MAX_NO_THREADS** The variable :varlink:`MAX_NO_THREADS` is used to
   indicate the maximum number of OS threads that a code will use. This
   value defaults to thirty-two and is set in the file
   :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`. For single-threaded
   execution it can be reduced to one if required. The value is largely
   private to the WRAPPER and application code will not normally
   reference the value, except in the following scenario.

   For certain physical parameterization schemes it is necessary to have
   a substantial number of work arrays. Where these arrays are held in
   statically allocated storage (for example COMMON blocks),
   multi-threaded execution will require multiple instances of the COMMON
   block data. This can be achieved using a Fortran 90 module construct.
   However, if this mechanism is unavailable then the work arrays can be
   extended with dimensions using the tile dimensioning scheme of
   :varlink:`nSx` and :varlink:`nSy` (as described in
   :numref:`specify_decomp`). However, if the configuration being
   specified involves many more tiles than OS threads then it can save
   memory resources to reduce the variable :varlink:`MAX_NO_THREADS` to
   be equal to the actual number of threads that will be used and to
   declare the physical parameterization work arrays with a single
   :varlink:`MAX_NO_THREADS` extra dimension. An example of this is given
   in the verification experiment :filelink:`verification/aim.5l_cs`.
   Here the default setting of :varlink:`MAX_NO_THREADS` is altered to

   ::

        INTEGER MAX_NO_THREADS
        PARAMETER ( MAX_NO_THREADS = 6 )

   and several work arrays for storing intermediate calculations are
   created with declarations of the form:

   ::

        common /FORCIN/ sst1(ngp,MAX_NO_THREADS)

   This declaration scheme is not used widely, because most global data
   is used for permanent, not temporary, storage of state information. In
   the case of permanent state information this approach cannot be used
   because there has to be enough storage allocated for all tiles.
   However, the technique can sometimes be a useful scheme for reducing
   memory requirements in complex physical parameterizations.
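
   Each thread then addresses only its own slice of such a work array,
   using its thread number :varlink:`myThid` as the last index. The
   fragment below is illustrative and is not taken from
   :filelink:`verification/aim.5l_cs`:

   ::

        C     Each thread writes only to its own slice of the work array,
        C     so no two threads ever touch the same memory locations.
              DO J=1,ngp
                sst1(J,myThid) = 0.
              ENDDO
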
Specializing the Communication Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The isolation of performance-critical communication primitives and the
subdivision of the simulation domain into tiles are powerful tools.
Here we show how they can be used to improve application performance and
to adapt to new gridding approaches.

.. _jam_example:

JAM example
~~~~~~~~~~~

On some platforms a big performance boost can be obtained by binding the
communication routines ``_EXCH`` and ``_GSUM`` to specialized native
libraries (for example, the shmem library on CRAY T3E systems). The
``LETS_MAKE_JAM`` CPP flag is used as an illustration of a specialized
communication configuration that substitutes for standard, portable forms
of ``_EXCH`` and ``_GSUM``. It affects three source files:
:filelink:`eeboot.F <eesupp/src/eeboot.F>`,
:filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>` and
:filelink:`cg2d.F </model/src/cg2d.F>`. When the flag is defined, it has
the following effects:

- An extra phase is included at boot time to initialize the custom
  communications library (see ini_jam.F).

- The ``_GSUM`` and ``_EXCH`` macro definitions are replaced with calls
  to custom routines (see gsum_jam.F and exch_jam.F).

- A highly specialized form of the exchange operator (optimized for
  overlap regions of width one) is substituted into the elliptic solver
  routine :filelink:`cg2d.F </model/src/cg2d.F>`.

Developing specialized code for other libraries follows a similar
pattern.

.. _cubed_sphere_comm:

Cube sphere communication
~~~~~~~~~~~~~~~~~~~~~~~~~

Actual ``_EXCH`` routine code is generated automatically from a series of
template files, for example
:filelink:`exch2_rx1_cube.template </pkg/exch2/exch2_rx1_cube.template>`.
This is done to allow a large number of variations of the exchange
process to be maintained. One set of variations supports the cube sphere
grid. Support for a cube sphere grid in MITgcm is based on having each
face of the cube as a separate tile or tiles. The exchange routines are
then able to absorb much of the detailed rotation and reorientation
required when moving around the cube grid. The set of ``_EXCH`` routines
that contain the word cube in their name perform these transformations.
They are invoked when the run-time logical parameter
:varlink:`useCubedSphereExchange` is set to ``.TRUE.``. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for scalar and vector quantities and for grid-centered,
grid-face and grid-corner quantities. Three sets of exchange routines are
defined. Routines with names of the form ``exch2_rx`` are used to
exchange cell centered scalar quantities. Routines with names of the form
``exch2_uv_rx`` are used to exchange vector quantities located at the
C-grid velocity points. The vector quantities exchanged by the
``exch2_uv_rx`` routines can either be signed (for example velocity
components) or un-signed (for example grid-cell separations). Routines
with names of the form ``exch2_z_rx`` are used to exchange quantities at
the C-grid vorticity point locations.
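
For example, a pair of C-grid velocity components would have their
overlap regions updated by a vector exchange of this family. The routine
name and argument list below are schematic (with ``rx`` realized as
``RL``); the concrete routine names and signatures are those generated
from the exch2 templates:

::

     C     Schematic vector exchange on the cube-sphere grid: update the
     C     overlap regions of the velocity pair (uFld,vFld), applying the
     C     rotation needed where adjacent cube faces meet.  The .TRUE.
     C     argument marks the pair as signed velocity components.
           CALL EXCH2_UV_RL( uFld, vFld, .TRUE., myThid )
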
MITgcm execution under WRAPPER
==============================

Fitting together the WRAPPER elements, package elements and MITgcm core
equation elements of the source code produces the calling sequence shown
below.

Annotated call tree for MITgcm and WRAPPER
------------------------------------------

WRAPPER layer.

::


     MAIN
     |
     |--EEBOOT               :: WRAPPER initialization
     |  |
     |  |-- EEBOOT_MINIMAL   :: Minimal startup. Just enough to
     |  |                       allow basic I/O.
     |  |-- EEINTRO_MSG      :: Write startup greeting.
     |  |
     |  |-- EESET_PARMS      :: Set WRAPPER parameters
     |  |
     |  |-- EEWRITE_EEENV    :: Print WRAPPER parameter settings
     |  |
     |  |-- INI_PROCS        :: Associate processes with grid regions.
     |  |
     |  |-- INI_THREADING_ENVIRONMENT   :: Associate threads with grid regions.
     |  |
     |  |-- INI_COMMUNICATION_PATTERNS  :: Initialize between-tile
     |                                  :: communication data structures
     |
     |
     |--CHECK_THREADS        :: Validate multiple thread start up.
     |
     |--THE_MODEL_MAIN       :: Numerical code top-level driver routine

Core equations plus packages.

.. _model_main_call_tree:

.. literalinclude:: ../../model/src/the_model_main.F
    :start-at: C Invocation from WRAPPER level...
    :end-at: C     | :: events.


Measuring and Characterizing Performance
----------------------------------------

TO BE DONE (CNH)

Estimating Resource Requirements
--------------------------------

TO BE DONE (CNH)

Atlantic 1/6 degree example
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dry Run testing
~~~~~~~~~~~~~~~

Adjoint Resource Requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

State Estimation Environment Resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~