# Experiment 3: Systolic-Array Implementation of Matrix-By-Matrix Multiplication

Review the help notes for this experiment.

## Objectives

The multiplication of matrices is a very common operation in engineering
and scientific problems. The sequential implementation of this operation is
very time consuming for large matrices; the brute-force solution results in
computation time **O(n ^{3})**, for

**n x n**matrices. For this reason, several parallel algorithms have been developed to solve this problem more efficiently. Here, a simple parallel algorithm is presented for this problem and a "

**hardwired**" (actually, systolic-array) implementation of the algorithm becomes our objective.

## Introduction

2-dimensional, mesh-connected parallel computers are often used in systolic-array
configuration for the multiplication of matrices. For the sake of simplicity,
we assume input matrices of size 4 x 4 containing one-bit integer elements.
Figure 3.1 shows the operations to be performed. The ● and **+** represent the integer operations multiplication and addition, respectively.

The two matrices ** A** and

**are shifted into the boundary processors in column 1 and row 1, respectively, as shown in Figure 3.2. The leading and trailing 0s in rows and columns are employed so that elements**

*B***and**

*a*_{ir}*arrive at processor*

**b**_{rj}**simultaneously for the operation**

*P*_{ij}**to be performed.**

*a*_{ir}● b_{rj}**is initialized to 0 in**

*c*_{ij}**, for all**

*P*_{ij}**= 1, 2, 3, 4. At the end, processor**

*i, j***will contain**

*P*_{ij}**, for**

*c*_{ij}

**1 ≤ i, j ≤ 4**Whenever a processor * P_{ij}* receives two
inputs

*and*

**b****from the north and the west, respectively, it performs the following set of operations, in this order:**

*a*- it calculates
;**a ● b** - it adds the result to the previous value
, and stores the result in**c**_{ij};**c**_{ij} - it sends
**a**to**P**, unless_{i,j+1}; and*j = 4* - it sends
**b**to**P**, unless_{i + 1, j}.**i = 4**

This algorithm takes time **O(n)**, for **n x
n** matrices.

## Experiment

Implement this parallel algorithm directly in hardware using the Altera UP 1 Education Board. Optimize your design with respect to the size of operands. Use onboard LEDs and/ or BCD displays to display intermediate and final results.

The proper operation of the entire design is to be simulated in MAX+PLUS before UP 1 is programmed. The waveforms from these simulations should be included in the lab report.