ECE Undergraduate Laboratories
ECE 459 - Advanced Computer System Design Laboratory

Experiment 4: Shared-Memory Multiprocessor for Matrix Multiplication
HELP NOTES

NOTES: Simulation of the algorithm will be done in Quartus II.
Hardware implementation of the algorithm will be done on the DE2 Altera board.


Top-Level Implementation

The top-level implementation should contain the following modules:
- Two processors, each with a private Instruction Local Memory (ILM). Both processors use a Harvard architecture with separate instruction and data buses.
- A Global Arbiter and Shared RAM Controller.
- A Shared Memory (Shared RAM).
[Figure: Top-level implementation]

Processor

[Figure: Processor]

- Each processor's Instruction Local Memory (ILM) should be created using the lpm_ram_dq module from the library of parameterized modules (LPM). Use the LPM_FILE parameter to initialize the memories. An example is presented in the appendix.

The processor should have at least the following components:

- An ALU. You may implement it using the LPM modules lpm_add_sub and lpm_mult (see [2]).
- A Register File containing at least 4 registers (8 is preferable).
- An interface to the Shared Memory.
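As an illustration of the Register File component listed above, here is a minimal sketch and not a required design. It assumes 8 registers of 16 bits each (the width matches LPM_WIDTH in the appendix, but that choice is an assumption), one synchronous write port and two asynchronous read ports. The entity name reg_file and all port names are hypothetical.

```vhdl
-- Hypothetical register file sketch: 8 registers x 16 bits,
-- one synchronous write port, two asynchronous read ports.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reg_file is
  port (
    clk       : in  std_logic;
    we        : in  std_logic;                      -- write enable
    wr_addr   : in  std_logic_vector(2 downto 0);   -- destination register
    wr_data   : in  std_logic_vector(15 downto 0);
    rd_addr_a : in  std_logic_vector(2 downto 0);   -- first source register
    rd_addr_b : in  std_logic_vector(2 downto 0);   -- second source register
    rd_data_a : out std_logic_vector(15 downto 0);
    rd_data_b : out std_logic_vector(15 downto 0)
  );
end reg_file;

architecture rtl of reg_file is
  type reg_array_t is array (0 to 7) of std_logic_vector(15 downto 0);
  signal regs : reg_array_t := (others => (others => '0'));
begin
  -- Write the selected register on the rising clock edge.
  write_proc : process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        regs(to_integer(unsigned(wr_addr))) <= wr_data;
      end if;
    end if;
  end process;

  -- Both read ports are combinational.
  rd_data_a <= regs(to_integer(unsigned(rd_addr_a)));
  rd_data_b <= regs(to_integer(unsigned(rd_addr_b)));
end rtl;
```

Two read ports let the ALU receive both operands in the same cycle. A single-port file would also work with a microcoded control unit, at the cost of extra cycles per instruction.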

The processor should have at least the following instructions: LOAD, STORE, ADD, SUB, MUL. You may also implement JUMP (JUMPZ and/or JUMPNZ), since these instructions simplify the software development.
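The arithmetic instructions above could be served by an ALU along the following lines. This is a behavioral sketch using ieee.numeric_std rather than the LPM modules. The 2-bit op encoding, the 16-bit signed data width, the entity name alu, and the zero flag (intended for JUMPZ/JUMPNZ) are all assumptions made for illustration.

```vhdl
-- Hypothetical behavioral ALU sketch for ADD, SUB and MUL.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity alu is
  port (
    op     : in  std_logic_vector(1 downto 0);   -- assumed: "00" ADD, "01" SUB, "10" MUL
    a      : in  std_logic_vector(15 downto 0);
    b      : in  std_logic_vector(15 downto 0);
    result : out std_logic_vector(15 downto 0);
    zero   : out std_logic                       -- '1' when the result is zero
  );
end alu;

architecture rtl of alu is
  signal res : signed(15 downto 0);
begin
  compute : process (op, a, b)
    variable product : signed(31 downto 0);
  begin
    product := signed(a) * signed(b);            -- full 32-bit product
    case op is
      when "00"   => res <= signed(a) + signed(b);
      when "01"   => res <= signed(a) - signed(b);
      when "10"   => res <= product(15 downto 0); -- keep the low 16 bits only
      when others => res <= signed(a);            -- pass operand a through
    end case;
  end process;

  result <= std_logic_vector(res);
  zero   <= '1' when res = 0 else '0';
end rtl;
```

The multiplication keeps only the low 16 bits of the product, so the matrix entries must be small enough that the products and their sums fit in 16 bits.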

You may use your experience and/or VHDL code from ECE495 to build a microcoded microprocessor. A simple microcoded CPU design is also presented in [8].

Global Arbiter and RAM Controller

The Global Arbiter (see [3]) and RAM Controller should be able to resolve simultaneous requests from both processors and then assert an Ack to the chosen processor for memory access. They will also pass the proper signals to the shared memory interface. Use the lpm_ram_dq module to instantiate the shared memory and the LPM_FILE parameter to initialize it.
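One possible arbitration scheme is sketched below. It is a minimal two-requester arbiter in which a granted processor keeps the memory until it drops its request, and a tie is resolved in favor of the processor that was not served most recently. The entity name arbiter2, the signal names, and the alternating tie-break policy are assumptions; the handout only requires that simultaneous requests be resolved and an Ack asserted.

```vhdl
-- Hypothetical two-processor arbiter sketch with alternating tie-break.
library ieee;
use ieee.std_logic_1164.all;

entity arbiter2 is
  port (
    clk   : in  std_logic;
    reset : in  std_logic;   -- asynchronous, active high
    req0  : in  std_logic;   -- request from processor 0
    req1  : in  std_logic;   -- request from processor 1
    ack0  : out std_logic;   -- grant to processor 0
    ack1  : out std_logic    -- grant to processor 1
  );
end arbiter2;

architecture rtl of arbiter2 is
  signal grant0, grant1 : std_logic;
  signal last : std_logic;   -- '0' if processor 0 was granted most recently
begin
  arbitrate : process (clk, reset)
  begin
    if reset = '1' then
      grant0 <= '0';
      grant1 <= '0';
      last   <= '1';         -- processor 0 wins the first tie
    elsif rising_edge(clk) then
      if grant0 = '1' and req0 = '1' then
        null;                -- processor 0 holds the memory
      elsif grant1 = '1' and req1 = '1' then
        null;                -- processor 1 holds the memory
      elsif req0 = '1' and (req1 = '0' or last = '1') then
        grant0 <= '1'; grant1 <= '0'; last <= '0';
      elsif req1 = '1' then
        grant0 <= '0'; grant1 <= '1'; last <= '1';
      else
        grant0 <= '0'; grant1 <= '0';
      end if;
    end if;
  end process;

  ack0 <= grant0;
  ack1 <= grant1;
end rtl;
```

With this scheme at most one Ack is high at any time, so the Ack signals can drive the multiplexers that select which processor's address, data and write enable reach the shared RAM. A processor that never releases its request would starve the other, so each processor should deassert its request after every access.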

[Figure: RAM controller]

The Software

Split the work between the two processors as shown below. Both processors should fetch all the elements of the B matrix.

[Figure: The software]

As a suggestion, start by writing the assembly-language code for matrix multiplication to determine which instructions you need to implement, and then design the processor.

Extra credit for:

  • Building a processor with JUMP (JUMPZ and/or JUMPNZ) instructions;
  • Building a processor that can be reused in the fourth experiment;
  • Implementing the software in an elegant way;
  • Adding a Data Local Memory (DLM) for the processor that is initially empty. It should be situated between the processor and the global arbiter;
  • Optimizing timing: the system should run at the highest possible frequency (see the timing-simulation section for Quartus II).

References

  1. Computer Systems Organization & Architecture by John Carpinelli
  2. Altera library of parameterized modules (LPM)
  3. Arbiters: Design Ideas and Coding Styles (see sections 3.0 and 4.0)
    http://www.asic-world.com/examples/verilog/arbiter.html
  4. Altera DE2 board tutorial
    ftp://ftp.altera.com/up/pub/Tutorials/DE2/Digital_Logic/tut_quartus_intro_vhdl.pdf
  5. Altera DE2 User Manual
  6. Altera DE2 board resources
  7. VHDL manual
    http://www.usna.edu/EE/ee462/MANUALS/vhdl_ref.pdf
    http://www.cse.unsw.edu.au/~cs3211/refs/vhdl1.pdf
    http://home.dei.polimi.it/sami/VHDL_reference_manual.pdf
  8. Microsequencer design
  9. lpm_add_sub DE2 design
  10. Memory Initialization File (.mif)
  11. Intel HEX editor – HxD (.hex file editor; .hex file is used for initialization of
    lpm_rom/ram modules)

Appendix – Instantiation of lpm_ram_dq

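--  ram_experiment_altera.vhd
--  lpm_ram_dq instantiation: 256 words x 16 bits, initialized from ram1.mif

library ieee;
use ieee.std_logic_1164.all;
library lpm;
use lpm.lpm_components.all;

entity ram_experiment_altera is
  port (
    -- Ascending (0 to N) ranges: element 0 is the leftmost bit and is
    -- associated with the most significant bit of the LPM port.
    TB_addr     : in  std_logic_vector(0 to 7);    -- 8-bit address
    TB_data_in  : in  std_logic_vector(0 to 15);   -- data to write
    TB_data_out : out std_logic_vector(0 to 15);   -- data read
    TB_we       : in  std_logic;                   -- write enable
    TB_clock    : in  std_logic;                   -- input clock
    TB_outclock : in  std_logic                    -- unused while LPM_OUTDATA is UNREGISTERED
  );
end ram_experiment_altera;

architecture structural of ram_experiment_altera is

begin

  lpm_ram_dq_inst : lpm_ram_dq
    generic map (
      LPM_ADDRESS_CONTROL => "REGISTERED",    -- address and we are clocked by inclock
      LPM_FILE            => "ram1.mif",      -- memory initialization file
      LPM_INDATA          => "REGISTERED",    -- input data is clocked by inclock
      LPM_NUMWORDS        => 256,
      LPM_OUTDATA         => "UNREGISTERED",  -- output is not clocked
      LPM_WIDTH           => 16,              -- data width in bits
      LPM_WIDTHAD         => 8                -- address width in bits
    )
    port map (
      data    => TB_data_in,
      address => TB_addr,
      inclock => TB_clock,
      --outclock => TB_outclock,  -- connect only if LPM_OUTDATA is "REGISTERED"
      we      => TB_we,
      q       => TB_data_out
    );

end structural;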

-- ram1.mif

DEPTH = 256; % Memory depth and width are required %
% DEPTH is the number of addresses %

WIDTH = 16; % WIDTH is the number of bits of data per word %
% DEPTH and WIDTH should be entered as decimal numbers %

ADDRESS_RADIX = HEX; % Address and value radixes are required %
DATA_RADIX = HEX; % Enter BIN, DEC, HEX, OCT, or UNS; unless %
% otherwise specified, radixes = HEX %

--  Specify values for addresses; each entry can be a single address or a range

CONTENT
BEGIN
00     : 3FFF;            % Single address %
01     : ABCD;
02     : 1234;
03     : 4567;
[4..F] : 3FFF;            % Range: every address from 4 to F = 3FFF %
10     : 000F 000E 0005;  % Addresses 10, 11, 12 %
13     : 123F;
14     : ABCE;
15     : 1234;
END;