prolog-mpi: a distributed prolog environment

Benchmarks (0.2.x)

These benchmarks crudely illustrate the overhead of prolog-mpi's distribution functionality (see the note below). The benchmark and profiling utilities are distributed with the system, and the figures may be regenerated on any host.

 

Plots

[Figure: benchmark plots]

  • Plot upper-left: distribution time as a multiple of the compute node's processing (wallclock) time.
  • Plot upper-right: raw distribution time plotted against the compute node's processing (wallclock) time.
  • Plot lower-left: line graph of average distribution time (computed over a sub-domain) plotted against the compute node's processing (wallclock) time.
  • Plot lower-right: line graph of median distribution time (computed over a sub-domain) plotted against the compute node's processing (wallclock) time.

 

Algorithm

The time-stamping algorithm is roughly as follows (a sketch in C appears after the list):

  1. control: mark-time(t1)
  2. control: send job » compute node
  3. compute: recv job « control node
  4. compute: mark-time(t2)
  5. compute: process(job)
  6. compute: mark-time(t3)
  7. compute: send delta(t2, t3) » control node
  8. control: recv delta(t2, t3) « compute node
  9. control: mark-time(t4)
  10. control: report delta(t2, t3), delta(t1, t4)
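
For illustration only, this sequence can be sketched in C against the standard MPI interface rather than prolog-mpi itself; the job payload, message tags, and the process() body below are placeholder assumptions, not part of the system.

/* Illustrative sketch of the time-stamping sequence above.  The job
 * payload, tags, and process() body are placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define TAG_JOB   1
#define TAG_DELTA 2

/* Stand-in for the compute node's real work. */
static void process(int job) { usleep(job * 1000); }

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* control node */
        int    job = 500;
        double t1, t4, d23;

        t1 = MPI_Wtime();                                     /* 1: mark t1 */
        MPI_Send(&job, 1, MPI_INT, 1, TAG_JOB,                /* 2: send job */
                 MPI_COMM_WORLD);
        MPI_Recv(&d23, 1, MPI_DOUBLE, 1, TAG_DELTA,           /* 8: recv delta(t2,t3) */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t4 = MPI_Wtime();                                     /* 9: mark t4 */
        printf("compute %.6fs, total %.6fs\n", d23, t4 - t1); /* 10: report */
    } else if (rank == 1) {               /* compute node */
        int    job;
        double t2, t3, d23;

        MPI_Recv(&job, 1, MPI_INT, 0, TAG_JOB,                /* 3: recv job */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t2 = MPI_Wtime();                                     /* 4: mark t2 */
        process(job);                                         /* 5: process(job) */
        t3 = MPI_Wtime();                                     /* 6: mark t3 */
        d23 = t3 - t2;
        MPI_Send(&d23, 1, MPI_DOUBLE, 0, TAG_DELTA,           /* 7: send delta(t2,t3) */
                 MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}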

 

Generating

These benchmarks were produced by generating traces from the test routines and then processing them:

$ mpirun -np 10 pl-mpi-test -tur 500
$ pl-mpi-gdtime -qf output.png dump-*

Note: these are not appropriate measurements for estimating run-time. The testing environment is a symmetric multiprocessor; all internode communication occurs over the loop-back interface.

 

Analysis

Because distribution, compute completion, and acknowledgement all occur at an extremely fast (sub-second) rate, these values represent an upper bound on the rate of distribution, a fact somewhat mitigated by running on a local symmetric multiprocessor (the control node must accept bursts of responses).

At this time, data transfer between nodes is a rather heavyweight operation; future versions of prolog-mpi will have optimised transfer sequences. The primary bottleneck is the control node's event-response mechanism, which is sequential: once a completion event is received from a compute node, the control node blocks until all of that node's data has been received. This bottleneck may be removed by maintaining a set of state machines describing each compute node's response sequence. The removal will come at the cost of algorithm simplicity; analysis is underway as to whether it provides a significant benefit to prolog-mpi or merely complicates the scheduling mechanisms.
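
As an illustration of the non-blocking arrangement under consideration, the fragment below sketches a control loop in C against the standard MPI interface, not the prolog-mpi internals; NNODES, RESULT_LEN, TAG_RESULT, and the hand-off to the scheduler are placeholder assumptions.

/* Sketch of a per-node, non-blocking control loop: one outstanding
 * receive per compute node, so the control node never blocks waiting
 * on a single node's data transfer. */
#include <mpi.h>

#define NNODES     9      /* compute nodes at ranks 1..NNODES (assumed) */
#define RESULT_LEN 1024   /* assumed fixed-size result payload */
#define TAG_RESULT 3

void control_loop(int jobs_outstanding)
{
    char        buf[NNODES][RESULT_LEN];
    MPI_Request req[NNODES];
    int         i, which;

    /* Post one receive per compute node up front. */
    for (i = 0; i < NNODES; i++)
        MPI_Irecv(buf[i], RESULT_LEN, MPI_CHAR, i + 1, TAG_RESULT,
                  MPI_COMM_WORLD, &req[i]);

    while (jobs_outstanding > 0) {
        /* Progress whichever node's transfer finishes first. */
        MPI_Waitany(NNODES, req, &which, MPI_STATUS_IGNORE);
        /* ... hand buf[which] to the scheduler, then re-post ... */
        MPI_Irecv(buf[which], RESULT_LEN, MPI_CHAR, which + 1, TAG_RESULT,
                  MPI_COMM_WORLD, &req[which]);
        jobs_outstanding--;
    }
}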

 

$Id: bench.html,v 1.5 2007-01-16 15:51:02 kristaps Exp $