Experiment 4: Shared-Memory Multiprocessor for Matrix Multiplication
HELP NOTES
NOTES: Simulation of the algorithm will be done in Quartus II.
Hardware implementation of the algorithm will be done on the DE2 Altera board.
Top-Level Implementation
The top-level implementation should contain the following modules:
- Two processors, each with a private Instruction Local Memory (ILM), that employ the Harvard architecture with separate instruction and data buses.
- A Global Arbiter and Shared RAM Controller.
- A Shared Memory (Shared RAM).
Processor
- A processor's Instruction Local Memory (ILM) should be created with the lpm_ram_dq module from the library of parameterized modules (LPM). Use the LPM_FILE parameter to initialize the memories. An example is given in the appendix.
The processor should have at least the following components:
- An ALU. You may implement it using the LPM modules lpm_add_sub and lpm_mult (see [2]).
- A Register File containing at least 4 registers (better to have 8).
- An interface to the Shared Memory.
The processor should have at least the following instructions: LOAD, STORE, ADD, SUB, MUL. You may also implement JUMP (JUMPZ and/or JUMPNZ), since these instructions will simplify the software development.
You may use your experience and/or VHDL code from ECE495 to build a microcoded microprocessor. A simple microcoded CPU design is also presented in [8].
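As a starting point for the ALU, the three arithmetic instructions can be sketched behaviorally before moving to the LPM modules. The block below is only an illustration: the entity name alu_sketch, the 2-bit op encoding, and the 16-bit data width (chosen to match the RAM in the appendix) are assumptions, and it uses ieee.numeric_std operators as a stand-in for lpm_add_sub and lpm_mult. MUL keeps only the low 16 bits of the product.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical behavioral ALU; names and encoding are illustrative only.
entity alu_sketch is
  port (
    op     : in  std_logic_vector(1 downto 0);   -- assumed: "00" ADD, "01" SUB, "10" MUL
    a, b   : in  std_logic_vector(15 downto 0);  -- assumed 16-bit operands
    result : out std_logic_vector(15 downto 0)
  );
end alu_sketch;

architecture behavioral of alu_sketch is
begin
  process (op, a, b)
    variable product : unsigned(31 downto 0);    -- full 16x16 product
  begin
    product := unsigned(a) * unsigned(b);
    case op is
      when "00"   => result <= std_logic_vector(unsigned(a) + unsigned(b));
      when "01"   => result <= std_logic_vector(unsigned(a) - unsigned(b));
      when "10"   => result <= std_logic_vector(product(15 downto 0)); -- low half only
      when others => result <= (others => '0');
    end case;
  end process;
end behavioral;
```

Writing the assembly program first will show whether the truncated product is wide enough for your matrix data or whether a wider accumulator is needed.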
Global Arbiter and RAM Controller
The Global Arbiter (see [3]) and RAM Controller should be able to resolve simultaneous requests from both processors and then assert an Ack to the chosen processor for memory access. They will also pass the proper signals to the shared memory interface. Use the lpm_ram_dq module to instantiate the shared memory and the LPM_FILE parameter to initialize it.
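The grant policy described above might be sketched as a two-input arbiter that alternates priority when both processors request in the same cycle. This is a minimal sketch, not a required design: the entity name arbiter2, the port names, the synchronous active-high reset, and the alternating (round-robin) tie-break are all assumptions, and the handshake timing (how long a processor holds its request after seeing Ack) still has to be matched to your processor and RAM controller.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-requester arbiter; all names are illustrative.
entity arbiter2 is
  port (
    clk   : in  std_logic;
    reset : in  std_logic;                     -- assumed synchronous, active high
    req   : in  std_logic_vector(1 downto 0);  -- req(0): processor 0, req(1): processor 1
    ack   : out std_logic_vector(1 downto 0)   -- at most one bit is high at a time
  );
end arbiter2;

architecture rtl of arbiter2 is
  signal ack_r : std_logic_vector(1 downto 0) := "00";
  signal last  : std_logic := '1';  -- processor granted last; '1' lets processor 0 win the first tie
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        ack_r <= "00";
        last  <= '1';
      else
        case req is
          when "01" =>                 -- only processor 0 requests
            ack_r <= "01";
            last  <= '0';
          when "10" =>                 -- only processor 1 requests
            ack_r <= "10";
            last  <= '1';
          when "11" =>                 -- simultaneous requests: grant the other one
            if last = '0' then
              ack_r <= "10";
              last  <= '1';
            else
              ack_r <= "01";
              last  <= '0';
            end if;
          when others =>               -- no request (or undefined inputs)
            ack_r <= "00";
        end case;
      end if;
    end if;
  end process;

  ack <= ack_r;
end rtl;
```

The ack bits can then select which processor's address, data, and write-enable signals are routed to the shared RAM. A fixed-priority scheme is simpler still, but it can starve the lower-priority processor.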
The Software
You should split the job between these two processors as shown below. All the elements of the B matrix should be fetched by both processors.
As a suggestion, start working on the assembly language code for matrix multiplication in order to figure out what instructions you need to implement. Then you should start designing the processor.
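For reference while writing the assembly code, each element of the product C = AB is a dot product of a row of A with a column of B. The dimensions m, n, p below are generic symbols, not values taken from this handout.

```latex
c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad 1 \le i \le m,\; 1 \le j \le p
```

Because every element of B is needed by both processors, one partition consistent with that requirement is to give each processor a disjoint set of rows i of A (and hence of C). This is only an illustrative assumption; the partition specified for the experiment takes precedence. Each term of the sum maps to two LOADs, a MUL, and an ADD into an accumulator register, followed by one STORE per element of C.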
Extra credit for:
- Building a processor with JUMP (JUMPZ and/or JUMPNZ) instructions;
- Building a processor that can be reused in the fourth experiment;
- Implementing the software in an elegant way;
- Adding a Data Local Memory (DLM) for the processor that is initially empty. It should be situated between the processor and the global arbiter;
- Time optimization: the system should run at the highest possible frequency (see the timing-simulation section for Quartus II).
References
[1] Computer Systems Organization & Architecture by John Carpinelli
[2] Altera library of parameterized modules (LPM)
[3] Arbiters: Design Ideas and Coding Styles (see sections 3.0 and 4.0)
    http://www.asic-world.com/examples/verilog/arbiter.html
[4] Altera DE2 board tutorial
    ftp://ftp.altera.com/up/pub/Tutorials/DE2/Digital_Logic/tut_quartus_intro_vhdl.pdf
[5] Altera DE2 User Manual
[6] Altera DE2 board resources
[7] VHDL manual
    http://www.usna.edu/EE/ee462/MANUALS/vhdl_ref.pdf
    http://www.cse.unsw.edu.au/~cs3211/refs/vhdl1.pdf
    http://home.dei.polimi.it/sami/VHDL_reference_manual.pdf
[8] Microsequencer design
[9] lpm_add_sub DE2 design
[10] Memory Initialization File (.mif)
[11] Intel HEX editor – HxD (.hex file editor; a .hex file is used for initialization of lpm_rom/ram modules)
Appendix – Instantiation of lpm_ram_dq
-- ram_experiment_altera.vhd
-- lpm_ram_dq instantiation
library ieee;
use ieee.std_logic_1164.all;
library lpm;
use lpm.lpm_components.all;
library altera_mf;
use altera_mf.altera_mf_components.all;

entity ram_experiment_altera is
  port (
    TB_addr     : in  std_logic_vector(0 to 7);
    TB_data_in  : in  std_logic_vector(0 to 15);
    TB_data_out : out std_logic_vector(0 to 15);
    TB_we       : in  std_logic;
    TB_clock    : in  std_logic;
    TB_outclock : in  std_logic
  );
end ram_experiment_altera;

architecture structural of ram_experiment_altera is
begin
  lpm_ram_dq_inst : lpm_ram_dq
    generic map (
      LPM_ADDRESS_CONTROL => "REGISTERED",
      LPM_FILE            => "ram1.mif",     -- initial memory contents
      LPM_INDATA          => "REGISTERED",
      LPM_NUMWORDS        => 256,
      LPM_OUTDATA         => "UNREGISTERED",
      LPM_WIDTH           => 16,
      LPM_WIDTHAD         => 8
    )
    port map (
      data    => TB_data_in,
      address => TB_addr,
      inclock => TB_clock,
      -- outclock => TB_outclock,  -- unused while LPM_OUTDATA is "UNREGISTERED"
      we      => TB_we,
      q       => TB_data_out
    );
end structural;
-- ram1.mif
DEPTH = 256; % Memory depth and width are required %
% DEPTH is the number of addresses %
WIDTH = 16; % WIDTH is the number of bits of data per word %
% DEPTH and WIDTH should be entered as decimal numbers %
ADDRESS_RADIX = HEX; % Address and value radixes are required %
DATA_RADIX = HEX; % Enter BIN, DEC, HEX, OCT, or UNS; unless %
% otherwise specified, radixes = HEX %
-- Specify values for addresses, which can be single address or range
CONTENT
BEGIN
00 : 3FFF; % Single Address%
01 : ABCD;
02 : 1234;
03 : 4567;
[4..F]: 3FFF; % Range--Every address from 4 to F = 3FFF %
10 : 000F 000E 0005; %Addresses 10, 11, 12%
13 : 123F;
14 : ABCE;
15 : 1234;
END;