Parallel and Distributed Simulation

Simulus supports parallel and distributed simulation via concurrent execution of multiple simulators. These simulators can be created and run simultaneously on different processors or cores, on the same machine or on different machines in a cluster. To distinguish the two cases, we use the term “parallel simulation” to refer to running simulation on shared-memory multiprocessors, and the term “distributed simulation” to refer to running simulation on distributed-memory machines in a cluster. The use of these terms has not been standardized in the community; for example, people have used the term “parallel simulation” to refer to both cases.

To understand how simulus handles parallel and distributed simulation, we first introduce the concept of a synchronized group of simulators, where time advances synchronously among the simulators in the group. In addition, when a simulator belongs to a synchronized group, it can also communicate with the other simulators in the group. A synchronized group may consist of simulators running on multiprocessors or on parallel machines in a cluster.

Synchronized Group

A synchronized group is a group of simulators whose simulation clocks advance synchronously. That is, although each simulator still processes the events on its own event list in timestamp order, the simulators in a synchronized group advance their simulation clocks in a coordinated fashion, such that no single simulator gets too far ahead of the rest of the group in simulation time.

The synchronization, or coordination, guarantees that an event generated by one simulator and destined for another simulator in the group never arrives in the receiving simulator's simulated past. This could otherwise happen if the receiving simulator's clock has advanced too far into the simulated future: messages generated by other simulators and destined for it may then carry timestamps earlier than its current simulation time. In parallel discrete-event simulation terminology, this is called a causality error.

Causality errors should never happen in simulation as long as we are modeling a world without time travel. In parallel simulation, however, one has to contend with the possibility of causality errors, since the simulators may execute on separate processors or cores, or even on separate machines. To solve the problem, one can either constrain the time advancement of the simulators so that causality errors can never happen, or roll back the simulators to an earlier state to correct causality errors after they occur.

The former is called “conservative simulation”. The latter is called “optimistic simulation”. Simulus implements the synchronized group using a conservative simulation approach. In this case, a synchronized group is created with a “lookahead”, which dictates how far a simulator can advance its simulation clock ahead of the rest of the simulators in the group.

Let’s look at an example. First, let’s create a couple of simulators:

[1]:
import random, simulus
random.seed(13579)

def p(sim, mean_intv):
    while True:
        sim.sleep(random.expovariate(1/mean_intv))
        print("'%s' gets to %g" % (sim.name, sim.now))

sim1 = simulus.simulator('sim1')
sim1.process(p, sim1, 1)

sim2 = simulus.simulator('sim2')
sim2.process(p, sim2, 0.5)

sim1.run(5)
sim2.run(2)
'sim1' gets to 0.315117
'sim1' gets to 0.498901
'sim1' gets to 1.78212
'sim1' gets to 2.46515
'sim1' gets to 4.82285
'sim2' gets to 0.414588
'sim2' gets to 0.69689
'sim2' gets to 1.67856
'sim2' gets to 1.72484

The two simulators, sim1 and sim2, each runs one process starting from the same function p(), which takes two arguments: the simulator instance and the mean sleep interval. The process simply sleeps for some random time, which is exponentially distributed with the given mean, prints out a message, and then repeats.

We run both simulators separately, one to time 5 and the other to time 2. As expected, both simulators advance their simulation time separately. Their simulation time is not synchronized or coordinated.

[2]:
sim1.now, sim2.now
[2]:
(5, 2)

We now create a synchronized group to include both simulators above. This can be done by calling sync() and passing as the parameter a list containing either the names of the simulators or the simulator instances themselves. sync() first brings all simulators in the group into synchrony by independently advancing each simulator's simulation time to the maximum simulation time among them (that's 5, in our example).

[3]:
# this cell can be run only once, because each simulator
# can belong to only one synchronized group and the
# membership is immutable; one has to restart the
# notebook's kernel in order to run this cell again
g = simulus.sync([sim1, 'sim2'], lookahead=2)
'sim2' gets to 3.4852
'sim2' gets to 3.672
'sim2' gets to 4.11777

We see that sim2 gets to advance its simulation from time 2 to 5. Both simulators' current simulation time should be 5 by now. To find out the current simulation time, we can either inspect each simulator's now variable, or use the now variable provided by the synchronized group.

[4]:
sim1.now, sim2.now, g.now
[4]:
(5, 5, 5)

From now on, both simulators are bound to the synchronized group. That is, their simulation time will be advanced synchronously. In the above example, when we called sync(), we also specified the lookahead to be 2. That is, the simulation clock of one simulator will never get ahead of the other by more than 2. A more technical definition is that the lookahead is the maximum difference in simulation time between the next events of the simulators. (Later we will see that we don't need to specify the lookahead at all when creating the synchronized group; simulus calculates the lookahead automatically.)

We can now run all simulators together using the run() method of the synchronized group. The method is similar to the simulator's run() method: the user can specify either 'offset' or 'until' (but not both). Each simulator processes the events on its own event list in timestamp order up to the given time, yet no simulator is allowed to get ahead of the others in the group by more than the lookahead. Simulus handles the synchronization, so that when messages are sent between the simulators (described in the next section), the messages will not produce causality errors.

[5]:
# 5 is the offset (as a positional argument)
g.run(5)
'sim1' gets to 5.20161
'sim1' gets to 5.2917
'sim1' gets to 5.77787
'sim1' gets to 5.79523
'sim2' gets to 5.84559
'sim2' gets to 5.90297
'sim2' gets to 6.00801
'sim2' gets to 6.09771
'sim2' gets to 6.7076
'sim2' gets to 6.7555
'sim1' gets to 7.84578
'sim1' gets to 8.35717
'sim1' gets to 8.55197
'sim1' gets to 8.76079
'sim2' gets to 7.53865
'sim2' gets to 8.68613
'sim2' gets to 9.35494

We see that both simulators get to run from time 5 to time 10. Since the lookahead is 2, neither simulator gets too far ahead of the other, and eventually both reach time 10.

[6]:
sim1.now, sim2.now, g.now
[6]:
(10, 10, 10)

Communication among Simulators

There is no obvious advantage to having multiple simulators in a synchronized group unless we want them to participate in a larger model, with each simulator modeling a component of the system. For example, when simulating a computer system, we could have one or more simulators model the processor cores, one model the memory, one for each I/O device, and so on. To be part of a larger model, the simulators need to communicate by sending timestamped messages to each other.

Simulus facilitates communication between simulators through mailboxes. Recall that a mailbox in simulus is a facility for message passing between processes or functions. A mailbox consists of one or more compartments or partitions. A sender can send a message to one of the partitions of a mailbox with a specified delay. Messages are delivered to the designated mailboxes at the expected times and stored in the designated partitions until a receiver retrieves and removes them from the mailbox.

Simulators in a synchronized group can send messages to named mailboxes that belong to other simulators. A message in simulus takes a broader meaning: a message could be any Python object as long as it’s pickle-able. This is because, as we will see, the simulators can potentially run in different processes (with different Python interpreters) and also on separate machines. Simulus depends on the ‘pickle’ module to serialize and deserialize the Python objects.

In the following example, we create two simulators each having a mailbox. A ping-pong message will be sent back and forth between the two simulators.

[7]:
# %load "../examples/advanced/pingpong.py"
import simulus

def p(sim, mbox, mbname):
    while True:
        msg = mbox.recv(isall=False)
        print("%g: '%s' rcvd msg '%s'" % (sim.now, sim.name, msg))
        sim.sync().send(sim, mbname, 'pong' if msg=='ping' else 'ping')

sim1 = simulus.simulator('sim1')
mb1 = sim1.mailbox('mb1', 1)
sim1.process(p, sim1, mb1, 'mb2')

sim2 = simulus.simulator('sim2')
mb2 = sim2.mailbox('mb2', 1)
sim2.process(p, sim2, mb2, 'mb1')

mb1.send('ping') # send initial message to start ping-ponging

g = simulus.sync([sim1, sim2])
g.run(10)

1: 'sim1' rcvd msg 'ping'
2: 'sim2' rcvd msg 'pong'
3: 'sim1' rcvd msg 'ping'
4: 'sim2' rcvd msg 'pong'
5: 'sim1' rcvd msg 'ping'
6: 'sim2' rcvd msg 'pong'
7: 'sim1' rcvd msg 'ping'
8: 'sim2' rcvd msg 'pong'
9: 'sim1' rcvd msg 'ping'

In the above example, each simulator creates a mailbox with a distinct name. Each mailbox also has a 'min_delay', which, as the name suggests, is the minimum delay with which messages can be delivered to the mailbox. Simulus uses the min_delay of all the named mailboxes in the synchronized group to calculate the lookahead.

The min_delay of all named mailboxes for the simulators in the synchronized group is required to be strictly positive. The overhead of parallel simulation is directly related to the size of the lookahead: a larger min_delay for the mailboxes means a larger lookahead, and a larger lookahead entails less synchronization overhead and therefore better parallel performance. In general, a larger min_delay is preferable.

The simulators communicate by sending messages using the send() method of the synchronized group to which they belong. (A simulator can find its synchronized group using its sync() method.) The send() method of the synchronized group takes at least three arguments: 'sim' is the simulator from which the message is sent; 'name' is the name of the mailbox to which the message is to be delivered (the mailbox has to belong to one of the simulators in the group); and 'msg' is the message itself, which can be any Python object as long as it's pickle-able (a message cannot be None). Optionally, one can specify the 'delay' of the message. If it is omitted, the delay is set to the min_delay of the target mailbox; if it is given, it must not be smaller than the min_delay of the mailbox. One can also specify the parameter 'part', which is the partition number of the mailbox to which the message will be delivered; the default is zero.
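
For instance, a process running in sim1 might forward a message to the mailbox named 'mb2' as follows (a hypothetical call based on the parameters described above; the message content and the delay value are made up for illustration):

grp = sim1.sync()                  # the synchronized group that sim1 belongs to
grp.send(sim1, 'mb2', {'job': 42}, # the message can be any pickle-able object (not None)
         delay=1.5,                # optional; must not be smaller than the mailbox's min_delay
         part=0)                   # optional partition number; the default is zero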

In the example, the process at each simulator directly uses the recv() method of the mailbox to receive messages. This method must be called within a process context (in the starting function of a process, or in a function called directly or indirectly from the starting function). If the mailbox partition is empty when the call is made, the process is put on hold until a message arrives. When the method returns, at least one message has been retrieved from the mailbox. If 'isall' is True (the default), it returns a list containing all the messages currently stored in the mailbox partition; if 'isall' is False, it returns only the first arrived message (not wrapped in a list).
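
For example, inside a process one may receive messages in either form (a sketch using the mailbox from the example above):

msgs = mbox.recv()            # isall defaults to True: returns a list of all stored messages
msg = mbox.recv(isall=False)  # returns only the first arrived message (not wrapped in a list)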

Parallel Simulation on Shared-Memory Multiprocessors

A synchronized group of simulators can run sequentially (as shown in the previous section) or in parallel. In the latter case, they can run on shared-memory multiprocessors, or on distributed-memory machines in a cluster, or a combination of both.

Most computers today are shared-memory multiprocessors. A computer with multiple CPUs or cores can support parallel execution of multiple processes. For example, my laptop has one CPU with two cores. With hyper-threading enabled, the machine in theory has four “processors” for parallel execution.

To take advantage of parallel simulation, we need to set enable_smp to True when we create the synchronized group. In that case, simulus automatically forks separate processes to run the simulators in the group. By default, simulus uses as many processes as there are "processors" (four on my laptop) to run the simulators. If there are more simulators in the synchronized group than there are processors, simulus tries to divide the simulators evenly among the processors, in which case each processor may run multiple simulators. If the number of simulators is less than the number of processors, each simulator runs as a separate process and each process runs on a different processor.

In the following example, we run the ping-pong example on shared-memory multiprocessors. The main difference from the previous example is that we set enable_smp to True when calling sync(). We also set show_runtime_report to True so that we can view the performance report of the simulation run.

[8]:
# %load "../examples/advanced/pingpong-smp.py"
import simulus

def p(sim, mbox, mbname):
    while True:
        msg = mbox.recv(isall=False)
        print("%g: '%s' rcvd msg '%s'" % (sim.now, sim.name, msg))
        sim.sync().send(sim, mbname, 'pong' if msg=='ping' else 'ping')

sim1 = simulus.simulator('sim1')
mb1 = sim1.mailbox('mb1', 1)
sim1.process(p, sim1, mb1, 'mb2')

sim2 = simulus.simulator('sim2')
mb2 = sim2.mailbox('mb2', 1)
sim2.process(p, sim2, mb2, 'mb1')

mb1.send('ping') # send initial message to start ping-ponging

g = simulus.sync([sim1, sim2], enable_smp=True)
g.run(10, show_runtime_report=True)

2: 'sim2' rcvd msg 'pong'
4: 'sim2' rcvd msg 'pong'
6: 'sim2' rcvd msg 'pong'
8: 'sim2' rcvd msg 'pong'
1: 'sim1' rcvd msg 'ping'
3: 'sim1' rcvd msg 'ping'
5: 'sim1' rcvd msg 'ping'
7: 'sim1' rcvd msg 'ping'
9: 'sim1' rcvd msg 'ping'
*********** sync group performance metrics ***********
partitioning information (simulator assignment):
  'sim1' on rank 0 proc 0
  'sim2' on rank 0 proc 1
execution time: 0.0464962
scheduled events: 12 (rate=258.086)
executed events: 11 (rate=236.579)
cancelled events: 0
created processes: 2
finished processes: 0
cancelled processes: 0
process context switches: 11

We note that the output is no longer in strict time order. The two simulators now run in separate processes; their print-outs are buffered separately by the Python runtime and mixed together in the standard output. The output from each individual process, however, is still in the proper order.

The performance report shows that the two simulators are assigned to separate processes (‘proc 0’ and ‘proc 1’). They are both running on rank 0 (this machine). In the next section, we discuss how to run simulus on parallel machines in a cluster using MPI.

In the following example, we show a situation where there are more than two simulators. More specifically, we create 15 simulators and organize them in a ring. A simulator receives a message (‘hello’) from the previous simulator in the ring and forwards the message to the next one in the ring.

[9]:
# %load "../examples/advanced/ring-smp.py"
import simulus

from functools import partial
print = partial(print, flush=True)

def p(sim, idx, mbox):
    while True:
        msg = mbox.recv(isall=False)
        print("%g: '%s' rcvd msg '%s'" % (sim.now, sim.name, msg))
        sim.sync().send(sim, 'mb%d'% ((idx+1)%nnodes), msg)

nnodes = 15
sims = []
for i in range(nnodes):
    sim = simulus.simulator('sim%d'%i)
    sims.append(sim)
    mbox = sim.mailbox('mb%d'%i, 1)
    sim.process(p, sim, i, mbox)

g = simulus.sync(sims, enable_smp=True)
g.send(sims[0], 'mb0', 'hello') # send initial message to start circulation
g.run(10)
g.run(5, show_runtime_report=True)

1: 'sim0' rcvd msg 'hello'
2: 'sim1' rcvd msg 'hello'
3: 'sim2' rcvd msg 'hello'
4: 'sim3' rcvd msg 'hello'
5: 'sim4' rcvd msg 'hello'
6: 'sim5' rcvd msg 'hello'
7: 'sim6' rcvd msg 'hello'
8: 'sim7' rcvd msg 'hello'
9: 'sim8' rcvd msg 'hello'
10: 'sim9' rcvd msg 'hello'
11: 'sim10' rcvd msg 'hello'
12: 'sim11' rcvd msg 'hello'
13: 'sim12' rcvd msg 'hello'
14: 'sim13' rcvd msg 'hello'
*********** sync group performance metrics ***********
partitioning information (simulator assignment):
  'sim0' on rank 0 proc 0
  'sim1' on rank 0 proc 0
  'sim2' on rank 0 proc 0
  'sim3' on rank 0 proc 0
  'sim4' on rank 0 proc 1
  'sim5' on rank 0 proc 1
  'sim6' on rank 0 proc 1
  'sim7' on rank 0 proc 1
  'sim8' on rank 0 proc 2
  'sim9' on rank 0 proc 2
  'sim10' on rank 0 proc 2
  'sim11' on rank 0 proc 2
  'sim12' on rank 0 proc 3
  'sim13' on rank 0 proc 3
  'sim14' on rank 0 proc 3
execution time: 0.100925
scheduled events: 30 (rate=297.251)
executed events: 29 (rate=287.342)
cancelled events: 0
created processes: 15
finished processes: 0
cancelled processes: 0
process context switches: 29

In this example, in order to keep the print-outs ordered by the time they are generated, we flush the output of the print function by redefining it with functools.partial. Just for demonstration, we also call run() twice, once with 'offset' (as the default positional argument) being 10 and the other time with 'offset' being 5. That is, cumulatively, we run the simulation to time 15.

The final performance report shows that, on my laptop, the simulators are evenly partitioned into four groups: 'sim0' to 'sim3' on 'proc 0', 'sim4' to 'sim7' on 'proc 1', and so on. The assignment is done automatically by simulus using a block decomposition scheme (where neighboring simulators are assigned to the same processors if possible).

We can control the number of SMP processes used for parallel simulation by setting the smp_ways parameter when creating the synchronized group. In the following, we use the same example except that each simulator now runs in its own process. If the number of simulators is larger than the number of physical processors on the machine, the operating system multiplexes these processes onto the physical processors.

[10]:
import simulus

from functools import partial
print = partial(print, flush=True)

def p(sim, idx, mbox):
    while True:
        msg = mbox.recv(isall=False)
        print("%g: '%s' rcvd msg '%s'" % (sim.now, sim.name, msg))
        sim.sync().send(sim, 'mb%d'% ((idx+1)%nnodes), msg)

nnodes = 15
sims = []
for i in range(nnodes):
    sim = simulus.simulator('sim%d'%i)
    sims.append(sim)
    mbox = sim.mailbox('mb%d'%i, 1)
    sim.process(p, sim, i, mbox)

g = simulus.sync(sims, enable_smp=True, smp_ways=len(sims))
g.send(sims[0], 'mb0', 'hello') # send initial message to start circulation
g.run(15, show_runtime_report=True)

1: 'sim0' rcvd msg 'hello'
2: 'sim1' rcvd msg 'hello'
3: 'sim2' rcvd msg 'hello'
4: 'sim3' rcvd msg 'hello'
5: 'sim4' rcvd msg 'hello'
6: 'sim5' rcvd msg 'hello'
7: 'sim6' rcvd msg 'hello'
8: 'sim7' rcvd msg 'hello'
9: 'sim8' rcvd msg 'hello'
10: 'sim9' rcvd msg 'hello'
11: 'sim10' rcvd msg 'hello'
12: 'sim11' rcvd msg 'hello'
13: 'sim12' rcvd msg 'hello'
14: 'sim13' rcvd msg 'hello'
*********** sync group performance metrics ***********
partitioning information (simulator assignment):
  'sim0' on rank 0 proc 0
  'sim1' on rank 0 proc 1
  'sim2' on rank 0 proc 2
  'sim3' on rank 0 proc 3
  'sim4' on rank 0 proc 4
  'sim5' on rank 0 proc 5
  'sim6' on rank 0 proc 6
  'sim7' on rank 0 proc 7
  'sim8' on rank 0 proc 8
  'sim9' on rank 0 proc 9
  'sim10' on rank 0 proc 10
  'sim11' on rank 0 proc 11
  'sim12' on rank 0 proc 12
  'sim13' on rank 0 proc 13
  'sim14' on rank 0 proc 14
execution time: 0.308509
scheduled events: 30 (rate=97.2418)
executed events: 29 (rate=94.0004)
cancelled events: 0
created processes: 15
finished processes: 0
cancelled processes: 0
process context switches: 29

Distributed Simulation Using MPI

Running parallel simulation on a single machine has obvious limitations. First, the number of processors available on one machine is small, which severely limits the level of parallelism and therefore the achievable speedup: even with perfect parallelism, a parallel simulation on 16 processors cannot achieve a speedup beyond 16. Second, the amount of memory available on one machine is limited; large-scale simulations commonly consume an enormous amount of memory and therefore may not fit on a single machine. Last, the dominant modern parallel architecture is the high-end cluster, consisting of a large number of distributed-memory computing servers connected via a high-speed interconnect.

Simulus supports distributed simulation by allowing a synchronized group of simulators to be instantiated and run on a parallel cluster. When running on parallel machines, the simulus instances communicate using the Message Passing Interface (MPI) under the hood.

MPI for Python (MPI4Py)

The Message Passing Interface (MPI) is the de facto standard library for communication on parallel computers. The MPI standard defines common message-passing functions and data types, which allow users to write portable scientific code in C/C++ or Fortran.

MPI4Py (or MPI for Python) provides a Python interface for MPI. It translates the syntax and semantics of the standard MPI-2 C++ bindings to Python. Simulus uses MPI4Py as the communication substrate to support message passing among the simulator instances running on parallel computers. It is important to note that simulus users do not use MPI or MPI4Py directly; MPI operates under the hood. However, one does need to install MPI and MPI4Py so that simulus can take advantage of the MPI functions.

There are several popular MPI implementations, such as MPICH and Open MPI. Refer to the documentation of the specific implementation for installation instructions. Many parallel computers already have MPI installed.

MPI4Py requires a working MPI implementation before it can be installed. If that’s already done, MPI4Py can be easily installed using pip. For example:

python -m pip install --user mpi4py

For the rest of this section, we assume that you have already installed MPI4Py on your machine.

Single Program Multiple Data (SPMD)

Single Program Multiple Data (SPMD) is the most common parallel programming paradigm. In SPMD, multiple machines simultaneously execute the same program, each operating on a different input or a different portion of the same data. Each machine is identified with a rank, which is an integer that ranges from 0 to the total number of machines minus one. Note that we use the word ‘machine’ here to generally refer to an MPI process. An MPI process is actually just a regular process that uses MPI routines for communication. One can create as many MPI processes as needed, which can be mapped to the parallel computers on which applications are run. MPI provides methods to create the mapping: one can map each MPI process to a separate computer; one can also map multiple MPI processes onto one computer (say, one corresponding to each processor or core).

Simulus offers an SPMD programming style for distributed simulation on parallel computers, which can also be combined with shared-memory multiprocessing as described in the previous section. Each simulus instance runs as an MPI process. One can create as many simulus instances as needed. Each simulus instance is identified by a rank, which is an integer that ranges from 0 to the total number of simulus instances minus one. The simulus instances are mapped by MPI onto the parallel computers for large-scale models, say one on each physical machine. Each simulus instance can also create multiple processes, say one for each processor core.

The following example shows how the simulus instances are identified when running the same Python program. One can get the rank and the total number of simulus instances using the static methods comm_rank() and comm_size(), both defined in sync.

[11]:
# %load '../examples/advanced/whoami.py'
import simulus

rank = simulus.sync.comm_rank()
psize = simulus.sync.comm_size()

print('rank=%d, psize=%d' % (rank, psize))

rank=0, psize=1

If we simply run this script (shift-enter in a Jupyter notebook), we get rank=0 and psize=1. By default, simulus runs as a standalone program, meaning there is no distributed simulation and there is only one simulus instance.

If we opt to use distributed simulation, we need to run the script using either the mpiexec or the mpirun command. The -np 4 argument means that we are running four MPI processes (i.e., four simulus instances). They are ranked from 0 to 3.

[12]:
!mpiexec -np 4 python ../examples/advanced/whoami.py -x
rank=1, psize=4
rank=2, psize=4
rank=3, psize=4
rank=0, psize=4

The ‘-x’ option is important here. It tells simulus that we are running the script using MPI. Without it, simulus would simply run as standalone (the default) and it would be similar to making four separate simulation runs, as opposed to running one simulation with four simulus instances operating in parallel.

[13]:
!mpiexec -np 4 python ../examples/advanced/whoami.py
rank=0, psize=1
rank=0, psize=1
rank=0, psize=1
rank=0, psize=1

In the next example, we extend the ring example from the previous section and create an SPMD version for distributed simulation. Recall that the model creates 15 simulators organized as a ring and passes a “hello” message around the ring.

The same program is run by all simulus instances. Each instance first finds out its rank and the total number of simulus instances. The 15 simulators are divided among the simulus instances. For example, if we run the program on four MPI processes, each MPI process (which corresponds to one simulus instance) needs to create and run 3 or 4 simulators.

In this example, rank 0 is responsible for simulators 0, 4, 8, and 12; rank 1 for simulators 1, 5, 9, and 13; rank 2 for simulators 2, 6, 10, and 14; and finally rank 3 for simulators 3, 7, and 11. This is called "cyclic data decomposition": the simulators are divided among the simulus instances in a round-robin fashion, as shown in the snippet below. (Later we will use another method called "block data decomposition".)
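
A minimal sketch of the cyclic decomposition, assuming 'rank', 'psize', and 'nnodes' hold the instance rank, the number of simulus instances, and the number of simulators, respectively:

# round-robin assignment: rank r is responsible for simulators r, r+psize, r+2*psize, ...
my_indices = list(range(rank, nnodes, psize))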

We create a synchronized group as before; this time, however, each simulus instance passes only the list of simulators it created, and we set enable_spmd to True.

If the rank of the simulus instance is greater than zero, it is what we call a ‘worker instance’. In this case, one needs to run the simulation using run() without passing any arguments. If the rank of the simulus instance is zero, it is what we call a ‘manager/worker instance’. In this case, one can call run() and provide as an argument the end simulation time using ‘offset’ (it’s the default positional argument) or ‘until’. One can also choose to view the performance of the entire simulation run by setting show_runtime_report to be True, like before.
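
The following is a minimal sketch of how such an SPMD version of the ring model can be put together; it illustrates the steps just described and is not a verbatim copy of ring-spmd.py:

import simulus

nnodes = 15

def p(sim, idx, mbox):
    while True:
        msg = mbox.recv(isall=False)
        print("%g: '%s' rcvd msg '%s'" % (sim.now, sim.name, msg))
        sim.sync().send(sim, 'mb%d' % ((idx+1)%nnodes), msg)

rank = simulus.sync.comm_rank()   # rank of this simulus instance
psize = simulus.sync.comm_size()  # total number of simulus instances

# create only the simulators assigned to this rank (cyclic decomposition)
sims = []
for i in range(rank, nnodes, psize):
    sim = simulus.simulator('sim%d' % i)
    sims.append(sim)
    mbox = sim.mailbox('mb%d' % i, 1)
    sim.process(p, sim, i, mbox)

g = simulus.sync(sims, enable_spmd=True)
if rank > 0:
    g.run()  # worker instance: run without arguments
else:
    g.send(sims[0], 'mb0', 'hello')  # rank 0 owns sim0; start the circulation
    g.run(15, show_runtime_report=True)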

[14]:
!mpiexec -np 4 python ../examples/advanced/ring-spmd.py -x
1: 'sim0' rcvd msg 'hello'
2: 'sim1' rcvd msg 'hello'
3: 'sim2' rcvd msg 'hello'
4: 'sim3' rcvd msg 'hello'
5: 'sim4' rcvd msg 'hello'
6: 'sim5' rcvd msg 'hello'
7: 'sim6' rcvd msg 'hello'
8: 'sim7' rcvd msg 'hello'
9: 'sim8' rcvd msg 'hello'
10: 'sim9' rcvd msg 'hello'
11: 'sim10' rcvd msg 'hello'
12: 'sim11' rcvd msg 'hello'
13: 'sim12' rcvd msg 'hello'
14: 'sim13' rcvd msg 'hello'
*********** sync group performance metrics ***********
partitioning information (simulator assignment):
  'sim0' on rank 0 proc 0
  'sim4' on rank 0 proc 0
  'sim8' on rank 0 proc 0
  'sim12' on rank 0 proc 0
  'sim1' on rank 1 proc 0
  'sim5' on rank 1 proc 0
  'sim9' on rank 1 proc 0
  'sim13' on rank 1 proc 0
  'sim2' on rank 2 proc 0
  'sim6' on rank 2 proc 0
  'sim10' on rank 2 proc 0
  'sim14' on rank 2 proc 0
  'sim3' on rank 3 proc 0
  'sim7' on rank 3 proc 0
  'sim11' on rank 3 proc 0
execution time: 0.0184951
scheduled events: 30 (rate=1622.05)
executed events: 29 (rate=1567.98)
cancelled events: 0
created processes: 15
finished processes: 0
cancelled processes: 0
process context switches: 29

Again, the output of the simulation may be mixed together, since the simulus instances are run concurrently. The performance report shows the assignment of the simulators, as expected.

We can combine SPMD and SMP simulation so that the simulation runs on parallel clusters consisting of shared-memory multiprocessors. It requires only a single-line change, where we create the synchronized group.
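
For the SPMD ring example above, the sync() call might look like the following sketch (only the flags change; the rest of the program stays the same):

g = simulus.sync(sims, enable_spmd=True, enable_smp=True)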

When we run the simulation, the performance result shows the assignment of the simulators, including both the instance rank and the process id for each simulator. In the following, we see that the simulators are assigned to four simulus instances, rank 0 to rank 3. On each rank, the simulators are assigned to different processes.

[15]:
!mpiexec -np 4 python ../examples/advanced/ring-spmd-smp.py -x
1: 'sim0' rcvd msg 'hello'
2: 'sim1' rcvd msg 'hello'
3: 'sim2' rcvd msg 'hello'
4: 'sim3' rcvd msg 'hello'
5: 'sim4' rcvd msg 'hello'
6: 'sim5' rcvd msg 'hello'
7: 'sim6' rcvd msg 'hello'
8: 'sim7' rcvd msg 'hello'
9: 'sim8' rcvd msg 'hello'
10: 'sim9' rcvd msg 'hello'
11: 'sim10' rcvd msg 'hello'
12: 'sim11' rcvd msg 'hello'
13: 'sim12' rcvd msg 'hello'
14: 'sim13' rcvd msg 'hello'
*********** sync group performance metrics ***********
partitioning information (simulator assignment):
  'sim0' on rank 0 proc 0
  'sim4' on rank 0 proc 1
  'sim8' on rank 0 proc 2
  'sim12' on rank 0 proc 3
  'sim1' on rank 1 proc 0
  'sim5' on rank 1 proc 1
  'sim9' on rank 1 proc 2
  'sim13' on rank 1 proc 3
  'sim2' on rank 2 proc 0
  'sim6' on rank 2 proc 1
  'sim10' on rank 2 proc 2
  'sim14' on rank 2 proc 3
  'sim3' on rank 3 proc 0
  'sim7' on rank 3 proc 1
  'sim11' on rank 3 proc 2
execution time: 0.101676
scheduled events: 30 (rate=295.055)
executed events: 29 (rate=285.22)
cancelled events: 0
created processes: 15
finished processes: 0
cancelled processes: 0
process context switches: 29

The PHOLD Example Revisited

Earlier, when we introduced the mailbox facility provided by simulus, we discussed an example called PHOLD. In this section, we extend the model so that it can be run with parallel and distributed simulation.

The PHOLD model consists of a number of nodes that pass messages to one another. First, we define a class for the nodes. Each node runs in a simulator ('sim'). It has an index number ('node_idx') that ranges from zero to the total number of nodes ('total_nodes') minus one. Each node contains a mailbox ('mbox') with a distinct name derived from the node's index, and the mailbox has a minimum delay ('lookahead'). Each node also creates a process with a starting function called 'forward_jobs()'. During simulation, the process repeatedly waits for jobs (from this and other nodes) and then forwards them to other randomly selected nodes.
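
The following sketch illustrates what such a node class might look like. The class name, the mailbox naming scheme, and the delay calculation are assumptions for illustration; in particular, the actual model draws its random numbers from per-simulator generators (as noted at the end of this section), while this sketch simply uses Python's global random module.

import random

class PholdNode(object):
    def __init__(self, sim, node_idx, total_nodes, lookahead):
        self.sim = sim
        self.node_idx = node_idx
        self.total_nodes = total_nodes
        self.lookahead = lookahead
        # the mailbox name is derived from the node's index; its min_delay is the lookahead
        self.mbox = sim.mailbox('n%d' % node_idx, lookahead)
        sim.process(self.forward_jobs)

    def forward_jobs(self):
        while True:
            # wait for the next job to arrive at this node's mailbox
            job = self.mbox.recv(isall=False)
            # forward the job to a randomly selected node with a random delay
            target = random.randrange(self.total_nodes)
            delay = self.lookahead + random.expovariate(1)
            self.sim.sync().send(self.sim, 'n%d' % target, job, delay=delay)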

Second, we parse the command-line arguments. Simulus itself takes several optional command-line arguments (e.g., we used '-x' earlier). These arguments are filtered out automatically from the list of command-line arguments (sys.argv) when the simulus package is imported. In the following, we show how the PHOLD model uses the standard argparse module in Python to parse the command line.

The model takes three positional (mandatory) arguments: 'NNODES' for the total number of nodes, 'NJOBS' for the total number of initial jobs, and 'ENDTIME' for the simulation end time. The PHOLD model also takes optional arguments, including '-m' or '--nsims' for the total number of simulators (by default, each node runs on a separate simulator), '-l' or '--lookahead' for the minimum delay of the mailboxes (the default is 1.0), and '-c' or '--choice' for the type of the synchronized group.

One can choose to run the simulation for one of four types of synchronized group: 1 for sequential simulation (which is the default), 2 for parallel simulation on shared-memory multiprocessors (with SMP enabled), 3 for distributed simulation using MPI (with SPMD enabled), and 4 for parallel simulation using both MPI and multiprocessing (with both SMP and SPMD enabled).

We can use -h or --help to list all the command-line options for the PHOLD model. By including 'simulus.argparse' in the 'parents' list when creating the 'argparse.ArgumentParser', the help message also includes all the optional command-line options that simulus itself accepts.
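
A sketch of the argument parsing, with option names taken from the help message below; the destination names ('total_nodes', 'nsims', 'endtime', 'lookahead', 'choice') follow how the arguments are referenced later in this section, and other details (such as the name used for NJOBS) are assumptions:

import argparse, simulus

parser = argparse.ArgumentParser(
    description='The PHOLD model.',
    parents=[simulus.argparse])  # pull in simulus's own command-line options
parser.add_argument('total_nodes', metavar='NNODES', type=int,
                    help='total number of nodes')
parser.add_argument('njobs', metavar='NJOBS', type=int,
                    help='total number of initial jobs')
parser.add_argument('endtime', metavar='ENDTIME', type=float,
                    help='simulation end time')
parser.add_argument('-m', '--nsims', type=int,
                    help='total number of simulators')
parser.add_argument('-c', '--choice', type=int, default=1,
                    help='choose simulation method (see below)')
parser.add_argument('-l', '--lookahead', type=float, default=1.0,
                    help='min delay of mailboxes')
args = parser.parse_args()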

[16]:
!python ../examples/advanced/phold.py -h
usage: phold.py [-h] [-s SEED] [-v] [-vv] [-x] [-m NSIMS] [-c CHOICE]
                [-l LOOKAHEAD]
                NNODES NJOBS ENDTIME

The PHOLD model.

positional arguments:
  NNODES                total number of nodes
  NJOBS                 total number of initial jobs
  ENDTIME               simulation end time

optional arguments:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  set global random seed
  -v, --verbose         enable verbose information
  -vv, --debug          enable debug information
  -x, --mpi             enable mpi for parallel simulation
  -m NSIMS, --nsims NSIMS
                        total number of simulators
  -c CHOICE, --choice CHOICE
                        choose simulation method (see below)
  -l LOOKAHEAD, --lookahead LOOKAHEAD
                        min delay of mailboxes

CHOICE (-c or --choice) can be any of the following:
  1: sequential simulation (default)
  2: parallel simulation on shared-memory multiprocessors
  3: parallel simulation using mpi
  4: parallel simulation with mpi and multiprocessing

Third, we create the nodes and the simulators. But before that, we define four utility functions to determine the starting and ending indices of the block of an array of 'n' elements that belongs to process 'id', when the array is divided among 'p' processes using the "block decomposition method". The block decomposition method assigns adjacent elements of the array to the same process. These utility functions are described in the book “Parallel Programming in C with MPI and OpenMP” by Michael J. Quinn, McGraw-Hill, 2003.
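
The text above refers to four such functions; the standard set from Quinn's book is BLOCK_LOW, BLOCK_HIGH, BLOCK_SIZE, and BLOCK_OWNER, which in Python can be written as:

def BLOCK_LOW(id, p, n):
    # index of the first array element assigned to process 'id'
    return id*n//p

def BLOCK_HIGH(id, p, n):
    # index of the last array element assigned to process 'id'
    return BLOCK_LOW(id+1, p, n)-1

def BLOCK_SIZE(id, p, n):
    # number of array elements assigned to process 'id'
    return BLOCK_LOW(id+1, p, n)-BLOCK_LOW(id, p, n)

def BLOCK_OWNER(index, p, n):
    # the process that owns the array element with the given index
    return (p*(index+1)-1)//n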

We now create the simulators that belong to the current simulus instance. Their indices can be determined using the utility function BLOCK_LOW() defined above. There are 'args.nsims' simulators to be divided among 'psize' simulus instances. The index of the first simulator belonging to the current rank is given by BLOCK_LOW(rank, psize, args.nsims); the index of the last simulator belonging to the current rank is given by BLOCK_LOW(rank+1, psize, args.nsims)-1. All the simulators belonging to the current rank are stored in the list 'sims'.

We then create the nodes that belong to the simulator with index 's'. There are 'args.total_nodes' nodes to be divided among 'args.nsims' simulators. The index of the first node belonging to simulator 's' is given by BLOCK_LOW(s, args.nsims, args.total_nodes), and the index of the last node belonging to simulator 's' is given by BLOCK_LOW(s+1, args.nsims, args.total_nodes)-1.
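
Putting the two levels of decomposition together, the creation loops might look like the following sketch (using BLOCK_LOW() from above and the hypothetical PholdNode class sketched earlier):

sims = []
for s in range(BLOCK_LOW(rank, psize, args.nsims),
               BLOCK_LOW(rank+1, psize, args.nsims)):
    # simulator 's' belongs to the current simulus instance
    sim = simulus.simulator('sim%d' % s)
    sims.append(sim)
    for n in range(BLOCK_LOW(s, args.nsims, args.total_nodes),
                   BLOCK_LOW(s+1, args.nsims, args.total_nodes)):
        # node 'n' belongs to simulator 's'
        PholdNode(sim, n, args.total_nodes, args.lookahead)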

Now we are ready to create the synchronized group and run the simulation. We create the synchronized group according to the user's choice from the command line. If the choice is 1, we run sequential simulation; we create a synchronized group with both SMP and SPMD disabled (the default). If the choice is 2, we run parallel simulation on shared-memory multiprocessors; we create the synchronized group with enable_smp set to True. If the choice is 3, we run parallel simulation using MPI; we create the synchronized group with enable_spmd set to True. If the choice is 4, we run parallel simulation using both MPI and multiprocessing; we create the synchronized group with both enable_smp and enable_spmd set to True.

Regardless of the choice, if the rank is greater than zero, it’s a worker instance, in which case we simply call run() (without parameters). If the rank is zero, it’s a manager/worker instance, in which case we call run() and pass as an argument the simulation end time (‘args.endtime’). On rank zero, before running the simulation, we also need to send the initial jobs to the randomly selected mailboxes.
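
A sketch of the group creation and run logic just described; the mailbox names and the way the initial jobs are formed are assumptions carried over from the earlier sketches, and the real model also attaches a random delay to each initial job:

import random

if args.choice == 1:
    g = simulus.sync(sims)                                     # sequential (default)
elif args.choice == 2:
    g = simulus.sync(sims, enable_smp=True)                    # shared-memory multiprocessing
elif args.choice == 3:
    g = simulus.sync(sims, enable_spmd=True)                   # MPI
else:
    g = simulus.sync(sims, enable_smp=True, enable_spmd=True)  # both

if rank > 0:
    g.run()  # worker instance: run without arguments
else:
    # manager/worker instance: send the initial jobs, then run to the end time
    for j in range(args.njobs):
        target = random.randrange(args.total_nodes)
        g.send(sims[0], 'n%d' % target, 'j%d' % j)
    g.run(args.endtime)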

In this example, we call show_runtime_report() separately after run(), with the argument 'prefix'. In this case, all print-out lines of the report are prefixed by the given string. This is helpful when one wants to find the report among a large amount of output from a parallel simulation run.
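
For example, assuming the report is requested on the synchronized group on rank zero, with the prefix string matching the runs below:

g.show_runtime_report(prefix='>')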

Let’s first run the simulation sequentially (choice #1). We use a small model with 10 nodes (each on a separate simulator) and 5 initial jobs, and we run the simulation for 5 seconds.

[17]:
!python ../examples/advanced/phold.py 10 5 5 -c 1
> MODEL PARAMETERS:
>> TOTAL NODES: 10
>> TOTAL INIT JOBS: 5
>> LOOKAHEAD: 1.0
>> CHOICE: 1
>> TOTAL SIMS:  10
>> TOTAL SPMD PROCESSES: 1
-init- sent j0 to n6 with d=2.31933
-init- sent j1 to n4 with d=2.93198
-init- sent j2 to n3 with d=1.31505
-init- sent j3 to n6 with d=1.17636
-init- sent j4 to n1 with d=3.07286
1.31505: n3 recv j2, sent to n3 with d=2.8346 (4.14964)
1.17636: n6 recv j3, sent to n6 with d=1.19807 (2.37443)
3.07286: n1 recv j4, sent to n4 with d=2.01853 (5.09139)
2.93198: n4 recv j1, sent to n0 with d=1.179 (4.11098)
2.31933: n6 recv j0, sent to n1 with d=1.21847 (3.5378)
2.37443: n6 recv j3, sent to n0 with d=1.97344 (4.34788)
4.11098: n0 recv j1, sent to n8 with d=1.20257 (5.31355)
4.34788: n0 recv j3, sent to n1 with d=1.2599 (5.60777)
3.5378: n1 recv j0, sent to n7 with d=1.74854 (5.28634)
4.14964: n3 recv j2, sent to n8 with d=1.04599 (5.19563)
>*********** sync group performance metrics ***********
>partitioning information (simulator assignment):
>  'sim0' on rank 0 proc 0
>  'sim1' on rank 0 proc 0
>  'sim2' on rank 0 proc 0
>  'sim3' on rank 0 proc 0
>  'sim4' on rank 0 proc 0
>  'sim5' on rank 0 proc 0
>  'sim6' on rank 0 proc 0
>  'sim7' on rank 0 proc 0
>  'sim8' on rank 0 proc 0
>  'sim9' on rank 0 proc 0
>execution time: 0.00151014
>scheduled events: 25 (rate=16554.7)
>executed events: 20 (rate=13243.8)
>cancelled events: 0
>created processes: 10
>finished processes: 0
>cancelled processes: 0
>process context switches: 20

Next, we run the same simulation model on shared-memory multiprocessors (choice #2). On my laptop, the simulation is run on four processors.

[18]:
!python ../examples/advanced/phold.py 10 5 5 -c 2
> MODEL PARAMETERS:
>> TOTAL NODES: 10
>> TOTAL INIT JOBS: 5
>> LOOKAHEAD: 1.0
>> CHOICE: 2
>> TOTAL SIMS:  10
>> TOTAL SPMD PROCESSES: 1
-init- sent j0 to n6 with d=2.31933
-init- sent j1 to n4 with d=2.93198
-init- sent j2 to n3 with d=1.31505
-init- sent j3 to n6 with d=1.17636
-init- sent j4 to n1 with d=3.07286
1.31505: n3 recv j2, sent to n3 with d=2.8346 (4.14964)
1.17636: n6 recv j3, sent to n6 with d=1.19807 (2.37443)
3.07286: n1 recv j4, sent to n4 with d=2.01853 (5.09139)
2.31933: n6 recv j0, sent to n1 with d=1.21847 (3.5378)
2.37443: n6 recv j3, sent to n0 with d=1.97344 (4.34788)
2.93198: n4 recv j1, sent to n0 with d=1.179 (4.11098)
4.11098: n0 recv j1, sent to n8 with d=1.20257 (5.31355)
4.34788: n0 recv j3, sent to n1 with d=1.2599 (5.60777)
3.5378: n1 recv j0, sent to n7 with d=1.74854 (5.28634)
4.14964: n3 recv j2, sent to n8 with d=1.04599 (5.19563)
>*********** sync group performance metrics ***********
>partitioning information (simulator assignment):
>  'sim0' on rank 0 proc 0
>  'sim1' on rank 0 proc 0
>  'sim2' on rank 0 proc 0
>  'sim3' on rank 0 proc 1
>  'sim4' on rank 0 proc 1
>  'sim5' on rank 0 proc 1
>  'sim6' on rank 0 proc 2
>  'sim7' on rank 0 proc 2
>  'sim8' on rank 0 proc 3
>  'sim9' on rank 0 proc 3
>execution time: 0.0224149
>scheduled events: 25 (rate=1115.33)
>executed events: 20 (rate=892.263)
>cancelled events: 0
>created processes: 10
>finished processes: 0
>cancelled processes: 0
>process context switches: 20

Now we run the simulation using MPI (choice #3).

[19]:
!mpiexec -np 4 python ../examples/advanced/phold.py 10 5 5 -x -c 3
> MODEL PARAMETERS:
>> TOTAL NODES: 10
>> TOTAL INIT JOBS: 5
>> LOOKAHEAD: 1.0
>> CHOICE: 3
>> TOTAL SIMS:  10
>> TOTAL SPMD PROCESSES: 4
-init- sent j0 to n6 with d=2.31933
-init- sent j1 to n4 with d=2.93198
-init- sent j2 to n3 with d=1.31505
-init- sent j3 to n6 with d=1.17636
-init- sent j4 to n1 with d=3.07286
3.07286: n1 recv j4, sent to n4 with d=2.01853 (5.09139)
4.11098: n0 recv j1, sent to n8 with d=1.20257 (5.31355)
4.34788: n0 recv j3, sent to n1 with d=1.2599 (5.60777)
3.5378: n1 recv j0, sent to n7 with d=1.74854 (5.28634)
>*********** sync group performance metrics ***********
>partitioning information (simulator assignment):
>  'sim0' on rank 0 proc 0
>  'sim1' on rank 0 proc 0
>  'sim2' on rank 1 proc 0
>  'sim3' on rank 1 proc 0
>  'sim4' on rank 1 proc 0
>  'sim5' on rank 2 proc 0
>  'sim6' on rank 2 proc 0
>  'sim7' on rank 3 proc 0
>  'sim8' on rank 3 proc 0
>  'sim9' on rank 3 proc 0
>execution time: 0.0157859
>scheduled events: 25 (rate=1583.69)
>executed events: 20 (rate=1266.95)
>cancelled events: 0
>created processes: 10
>finished processes: 0
>cancelled processes: 0
>process context switches: 20
1.17636: n6 recv j3, sent to n6 with d=1.19807 (2.37443)
2.31933: n6 recv j0, sent to n1 with d=1.21847 (3.5378)
2.37443: n6 recv j3, sent to n0 with d=1.97344 (4.34788)
1.31505: n3 recv j2, sent to n3 with d=2.8346 (4.14964)
2.93198: n4 recv j1, sent to n0 with d=1.179 (4.11098)
4.14964: n3 recv j2, sent to n8 with d=1.04599 (5.19563)

Finally, we run the simulation with both MPI and multiprocessing (choice #4).

[20]:
!mpiexec -np 4 python ../examples/advanced/phold.py 10 5 5 -x -c 4
> MODEL PARAMETERS:
>> TOTAL NODES: 10
>> TOTAL INIT JOBS: 5
>> LOOKAHEAD: 1.0
>> CHOICE: 4
>> TOTAL SIMS:  10
>> TOTAL SPMD PROCESSES: 4
-init- sent j0 to n6 with d=2.31933
-init- sent j1 to n4 with d=2.93198
-init- sent j2 to n3 with d=1.31505
-init- sent j3 to n6 with d=1.17636
-init- sent j4 to n1 with d=3.07286
4.11098: n0 recv j1, sent to n8 with d=1.20257 (5.31355)
4.34788: n0 recv j3, sent to n1 with d=1.2599 (5.60777)
>*********** sync group performance metrics ***********
>partitioning information (simulator assignment):
>  'sim0' on rank 0 proc 0
>  'sim1' on rank 0 proc 1
>  'sim2' on rank 1 proc 0
>  'sim3' on rank 1 proc 1
>  'sim4' on rank 1 proc 2
>  'sim5' on rank 2 proc 0
>  'sim6' on rank 2 proc 1
>  'sim7' on rank 3 proc 0
>  'sim8' on rank 3 proc 1
>  'sim9' on rank 3 proc 2
>execution time: 0.0632179
>scheduled events: 25 (rate=395.458)
>executed events: 20 (rate=316.366)
>cancelled events: 0
>created processes: 10
>finished processes: 0
>cancelled processes: 0
>process context switches: 20
3.07286: n1 recv j4, sent to n4 with d=2.01853 (5.09139)
3.5378: n1 recv j0, sent to n7 with d=1.74854 (5.28634)
1.31505: n3 recv j2, sent to n3 with d=2.8346 (4.14964)
4.14964: n3 recv j2, sent to n8 with d=1.04599 (5.19563)
2.93198: n4 recv j1, sent to n0 with d=1.179 (4.11098)
1.17636: n6 recv j3, sent to n6 with d=1.19807 (2.37443)
2.31933: n6 recv j0, sent to n1 with d=1.21847 (3.5378)
2.37443: n6 recv j3, sent to n0 with d=1.97344 (4.34788)

We can observe that, regardless of the choice, as long as the number of simulators is the same, they all produce the same results. This is because the random sequence generated by each simulator is the same regardless of where the simulator runs. As we mentioned previously, the random number generator of a simulator is seeded using the global random seed (12345) and the name of the simulator, both of which remain the same across the different configurations.
