Experiment 3: Systolic-Array Implementation of Matrix-By-Matrix Multiplication

Review the help notes for this experiment.

Objectives

Matrix multiplication is a very common operation in engineering and scientific problems. A sequential implementation is time consuming for large matrices: the brute-force algorithm takes O(n³) time for n x n matrices. For this reason, several parallel algorithms have been developed to solve the problem more efficiently. This experiment presents a simple parallel algorithm; the objective is a "hardwired" (in fact, systolic-array) implementation of that algorithm.

Introduction

Two-dimensional, mesh-connected parallel computers are often used in a systolic-array configuration for the multiplication of matrices. For the sake of simplicity, we assume input matrices of size 4 x 4 containing one-bit integer elements. Figure 3.1 shows the operations to be performed. The symbols ● and + denote integer multiplication and addition, respectively.

Matrix — c₁₁ = a₁₁ ● b₁₁ + a₁₂ ● b₂₁ + a₁₃ ● b₃₁ + a₁₄ ● b₄₁c₁₂ = a₁₁ ● b₁₂ + a₁₂ ● b₂₂ + a₁₃ ● b₃₂ + a₁₄ ● b₄₂c₁₃ = a₁₁ ● b₁₃ + a₁₂ ● b₂₃ + a₁₃ ● b₃₃ + a₁₄ ● b₄₃c₁₄ = a₁₁ ● b₁₄ + a₁₂ ● b₂₄ + a₁₃ ● b₃₄ + a₁₄ ● b₄₄c₂₁ = a₂₁ ● b₁₁ + a₂₂ ● b₂₁ + a₂₃ ● b₃₁ + a₂₄ ● b₄₁c₂₂ = a₂₁ ● b₁₂ + a₂₂ ● b₂₂ + a₂₃ ● b₃₂ + a₂₄ ● b₄₂c₂₃ = a₂₁ ● b₁₃ + a₂₂ ● b₂₃ + a₂₃ ● b₃₃ + a₂₄ ● b₄₃c₂₄ = a₂₁ ● b₁₄ + a₂₂ ● b₂₄ + a₂₃ ● b₃₄ + a₂₄ ● b₄₄c₃₁ = a₃₁ ● b₁₁ + a₃₂ ● b₂₁ + a₃₃ ● b₃₁ + a₃₄ ● b₄₁c₃₂ = a₃₁ ● b₁₂ + a₃₂ ● b₂₂ + a₃₃ ● b₃₂ + a₃₄ ● b₄₂c₃₃ = a₃₁ ● b₁₃ + a₃₂ ● b₂₃ + a₃₃ ● b₃₃ + a₃₄ ● b₄₃c₃₄ = a₃₁ ● b₁₄ + a₃₂ ● b₂₄ + a₃₃ ● b₃₄ + a₃₄ ● b₄₄c₄₁ = a₄₁ ● b₁₁ + a₄₂ ● b₂₁ + a₄₃ ● b₃₁ + a₄₄ ● b₄₁c₄₂ = a₄₁ ● b₁₂ + a₄₂ ● b₂₂ + a₄₃ ● b₃₂ + a₄₄ ● b₄₂c₄₃ = a₄₁ ● b₁₃ + a₄₂ ● b₂₃ + a₄₃ ● b₃₃ + a₄₄ ● b₄₃c₄₄ = a₄₁ ● b₁₄ + a₄₂ ● b₂₄ + a₄₃ ● b₃₄ + a₄₄ ● b₄₄

Figure 3.1: Multiplication of matrices of size 4 x 4.
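The sixteen sums in Figure 3.1 all follow the same pattern, c_ij = Σ_r a_ir ● b_rj. As a reference against which a hardware design can be checked, the brute-force sequential version might be sketched in Python as follows. The function name matmul_reference and the sample matrices A and B are illustrative assumptions, not part of the experiment.

```python
def matmul_reference(a, b):
    """Brute-force product of two n x n matrices given as lists of rows."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):          # row of A
        for j in range(n):      # column of B
            for r in range(n):  # running index of the inner product
                c[i][j] += a[i][r] * b[r][j]
    return c


# Sample 4 x 4 inputs with one-bit elements (arbitrary illustrative values).
A = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 0]]
B = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1],
     [1, 0, 0, 1]]

C = matmul_reference(A, B)
for row in C:
    print(row)
```

The three nested loops make the O(n³) cost of the sequential approach explicit.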

The two matrices A and B are shifted into the boundary processors in column 1 and row 1, respectively, as shown in Figure 3.2. The leading and trailing 0s in the rows and columns ensure that elements a_ir and b_rj arrive at processor P_ij simultaneously, so that the operation a_ir ● b_rj can be performed. c_ij is initialized to 0 in P_ij, for all i, j = 1, 2, 3, 4. At the end, processor P_ij contains c_ij, for 1 ≤ i, j ≤ 4.
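One way to picture the zero padding is as a skew: row i of A is delayed by i - 1 steps before entering from the west, and column j of B is delayed by j - 1 steps before entering from the north. The sketch below assumes this common arrangement (the exact padding in Figure 3.2 may differ) and uses 0-based indices; skewed_streams is a hypothetical helper name, and the 2 x 2 inputs use distinct values only so that the ordering is easy to follow.

```python
def skewed_streams(a, b):
    """Return the time-ordered input streams for the west and north edges.

    west[i][t]  is the value entering row i from the west at step t.
    north[j][t] is the value entering column j from the north at step t.
    Assumption: each row/column is delayed by its own (0-based) index.
    """
    n = len(a)
    west = [[0] * i + list(a[i]) + [0] * (n - 1 - i) for i in range(n)]
    north = [[0] * j + [b[r][j] for r in range(n)] + [0] * (n - 1 - j)
             for j in range(n)]
    return west, north


west, north = skewed_streams([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(west)   # streams for rows 0 and 1
print(north)  # streams for columns 0 and 1
```

Each stream has length 2n - 1: n matrix elements plus n - 1 padding zeros split between the front and the back.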

Whenever a processor P_ij receives two inputs b and a from the north and the west, respectively, it performs the following set of operations, in this order:

1. it calculates a ● b;
2. it adds the result to the previous value of c_ij, and stores the sum in c_ij;
3. it sends a to P_{i, j + 1}, unless j = 4; and
4. it sends b to P_{i + 1, j}, unless i = 4.
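The behaviour of a single processor can be modelled in a few lines. This is only a software sketch of the four operations above; the class name ProcessingElement and its step method are assumptions made for illustration, and the boundary cases (j = 4, i = 4) are left to the caller, which simply discards the forwarded value.

```python
class ProcessingElement:
    """Software model of one processor P_ij of the mesh."""

    def __init__(self):
        self.c = 0  # c_ij is initialized to 0

    def step(self, a, b):
        """Consume a (from the west) and b (from the north) for one step."""
        product = a * b      # calculate a * b
        self.c += product    # accumulate into c_ij
        return a, b          # a goes east, b goes south


pe = ProcessingElement()
for a, b in [(1, 1), (1, 0), (0, 1), (1, 1)]:
    east, south = pe.step(a, b)
print(pe.c)
```

For one-bit operands the product a ● b reduces to a logical AND, which is worth keeping in mind when sizing the hardware.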

This algorithm takes O(n) time for n x n matrices, using n² processors.
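A cycle-by-cycle software model makes the linear running time concrete. The sketch below assumes that row i of A and column j of B are each delayed by their (0-based) index, which is the usual skew but may not match Figure 3.2 exactly; under that assumption the last product reaches the bottom-right processor at step 3n - 2. The function name systolic_matmul is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an n x n systolic mesh; return (C, number_of_steps)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # a value each processor received last step
    b_reg = [[0] * n for _ in range(n)]  # b value each processor received last step
    steps = 3 * n - 2                    # assumed skew: one step per row/column
    for t in range(steps):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    r = t - i  # skewed feed from the west boundary
                    a_in = a[i][r] if 0 <= r < n else 0
                else:
                    a_in = a_reg[i][j - 1]  # forwarded by the west neighbour
                if i == 0:
                    r = t - j  # skewed feed from the north boundary
                    b_in = b[r][j] if 0 <= r < n else 0
                else:
                    b_in = b_reg[i - 1][j]  # forwarded by the north neighbour
                c[i][j] += a_in * b_in
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return c, steps


ones = [[1] * 4 for _ in range(4)]
result, steps = systolic_matmul(ones, ones)
print(result, steps)
```

All n² processors work in the same step, so the three nested loops of the sequential version collapse into a single loop over time in hardware.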

Figure 3.2: A 4 x 4 mesh (systolic array) of processors for matrix multiplication.

Experiment

Implement this parallel algorithm directly in hardware using the Altera UP 1 Education Board. Optimize your design with respect to the size of the operands. Use the onboard LEDs and/or BCD displays to display intermediate and final results.
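When optimizing for operand size, it helps to know how wide each c_ij register must be. A quick estimate, assuming unsigned operands, is sketched below; accumulator_bits is a hypothetical helper, not part of any Altera tool.

```python
def accumulator_bits(n, operand_bits=1):
    """Bits needed to hold the largest possible c_ij for unsigned operands."""
    max_operand = (1 << operand_bits) - 1
    max_sum = n * max_operand * max_operand  # n products, each at its maximum
    return max(1, max_sum.bit_length())


print(accumulator_bits(4))      # 4 x 4 matrices, one-bit elements
print(accumulator_bits(4, 2))   # same size, two-bit elements
```

For the 4 x 4, one-bit case the largest possible sum is 4, so a 3-bit accumulator per processor suffices.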

The proper operation of the entire design is to be simulated in MAX+PLUS II before the UP 1 board is programmed. The waveforms from these simulations should be included in the lab report.