Thursday 19 December 2013

4. Base Addressing

  • Base addressing is also known as indirect addressing.
  • The address of the operand is the sum of a 16-bit immediate and the value in a register (rs). The offset is limited to 16 bits because each MIPS instruction must fit into a 32-bit word.
  • This mode is used by the lw and sw (load word, store word) instructions.
  • The offset value is a signed number represented in two's complement format (see the sketch below).
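  A rough sketch of this calculation in plain Python (my own illustration; helper names like sign_extend16 and base_address are made up, not MIPS terminology):

      def sign_extend16(imm):
          """Sign-extend a 16-bit two's complement value to a Python int."""
          imm &= 0xFFFF
          return imm - 0x10000 if imm & 0x8000 else imm

      def base_address(rs_value, imm16):
          """lw/sw effective address: base register plus sign-extended offset."""
          return (rs_value + sign_extend16(imm16)) & 0xFFFFFFFF

      # lw $t0, -4($sp) with $sp = 0x7FFFEFFC reads from address 0x7FFFEFF8
      print(hex(base_address(0x7FFFEFFC, 0xFFFC)))  # 0x7fffeff8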


3. PC-Relative Addressing

  • A data or instruction memory location is specified as an offset relative to the incremented PC.
  • PC-relative addressing is usually used in the beq and bne (branch equal, branch not equal) instructions.
  • It supports position-independent code. A small offset is adequate because branch targets are usually nearby, as in short loops (see the sketch below).
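  A minimal sketch of the target calculation (Python, illustrative only; the offset counts words, hence the shift left by 2):

      def sign_extend16(imm):
          """Sign-extend a 16-bit two's complement value to a Python int."""
          imm &= 0xFFFF
          return imm - 0x10000 if imm & 0x8000 else imm

      def branch_target(pc, imm16):
          """beq/bne target: incremented PC plus sign-extended word offset."""
          return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

      # A branch at 0x00400010 with an offset of -2 words targets 0x0040000C
      print(hex(branch_target(0x00400010, 0xFFFE)))  # 0x40000c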

2. Immediate Addressing

  • In immediate addressing, the operand is a constant held within the instruction itself.
  • Like register addressing, immediate instructions execute quickly because they avoid the delays associated with memory access.
  • Since the destination of the jump instruction format is held in the instruction, it can also be considered an example of immediate addressing (see the sketch below).
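  A minimal Python sketch of how the 16-bit immediate field becomes an operand (illustrative only; in MIPS the zero-extension case applies to the logical immediates such as andi, ori and xori):

      def immediate_operand(imm16, zero_extend=False):
          """Turn the 16-bit immediate field into an operand value."""
          imm16 &= 0xFFFF
          if zero_extend:
              return imm16                                         # logical immediates
          return imm16 - 0x10000 if imm16 & 0x8000 else imm16      # sign extend

      print(immediate_operand(0xFFFF))                     # -1    (e.g. addi)
      print(immediate_operand(0xFFFF, zero_extend=True))   # 65535 (e.g. andi)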

    1. Register Addressing


    • Register addressing is the simplest addressing mode.
    • Both operands are in registers. Instructions execute quickly because they avoid the delays associated with memory access.
    • The number of registers is limited, since only a few bits are reserved to select a register.
    • It takes n bits to address 2^n registers; with 32 registers, each register field fits in five bits.
















    Addressing Modes

    MIPS has only a small number of ways to compute addresses in memory. The address can be the address of an instruction (for branch and jump instructions) or the address of data (for load and store instructions).

    For any given operation, such as load, add or branch, there are often many different ways to specify the address of the operand(s).
    The different ways of determining the address are called addressing modes.

    We will look at the 4 ways addresses are computed:

    1. Register Addressing
    2. Immediate Addressing
    3. PC-Relative Addressing
    4. Base Addressing





    Thursday 12 December 2013

    Associative Caches

    -Increased associativity decreases the miss rate, but with diminishing returns


    Spectrum of associativity






    Types of associativity

    1. Fully Associative
    --In a fully associative scheme, any slot can store the cache line. The hardware for finding whether the desired data is in the cache requires comparing the tag bits of the address to the tag bits of every slot (in parallel), and making sure the valid bit is set.



    2. Set associative
    --A set-associative cache scheme is a combination of fully associative and direct mapped schemes. You group slots into sets. You find the appropriate set for a given address (which is like the direct mapped scheme), and within the set you find the appropriate slot (which is like the fully associative scheme).





    Example
    --Compare 4-block caches with the block access sequence 0, 8, 0, 6, 8 (a small simulation sketch follows)
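    A small LRU simulation of this example (a Python sketch of my own, not from the notes; it counts misses for a 4-block cache at each associativity):

        from collections import OrderedDict

        def count_misses(accesses, num_blocks, ways):
            """Simulate an LRU cache of num_blocks blocks split into sets."""
            num_sets = num_blocks // ways
            sets = [OrderedDict() for _ in range(num_sets)]
            misses = 0
            for block in accesses:
                s = sets[block % num_sets]       # pick the set by block number
                if block in s:
                    s.move_to_end(block)         # hit: refresh LRU order
                else:
                    misses += 1
                    if len(s) == ways:           # set full: evict LRU block
                        s.popitem(last=False)
                    s[block] = True
            return misses

        seq = [0, 8, 0, 6, 8]
        for ways in (1, 2, 4):
            print(f"{ways}-way: {count_misses(seq, 4, ways)} misses")
        # direct mapped (1-way): 5, 2-way: 4, fully associative (4-way): 3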





    Cache Performance

    • There are 3 formulas to memorize:
    1. Memory stall cycles
    = Memory accesses x Miss rate x Miss penalty









    2. CPU time
    = (CPU execution cycles + Memory stall cycles) x Cycle time


    3. Average memory access time (AMAT)
    = Hit time + (Miss rate x Miss penalty)
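    A quick worked example of the AMAT formula (Python; the numbers are made-up inputs for illustration):

        # Assumed values: 1-cycle hit time, 5% miss rate, 100-cycle miss penalty
        hit_time, miss_rate, miss_penalty = 1, 0.05, 100
        amat = hit_time + miss_rate * miss_penalty
        print(amat)  # 6.0 cycles per memory access on average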





    More Cache Performance Formulas

    • Can split access time into instructions & data:
      Average memory access time =
      (% instruction accesses) × (instruction memory access time)
      + (% data accesses) × (data memory access time)

    • Another simple formula:
      CPU time = (CPU execution clock cycles + Memory stall clock cycles) × cycle time

    • Can break stalls into reads and writes:
      Memory stall cycles =
      (Reads × read miss rate × read miss penalty)
      + (Writes × write miss rate × write miss penalty)
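    A quick sketch of the read/write split (Python; all of the counts and rates here are hypothetical inputs, not from the notes):

        reads, writes = 1_000_000, 250_000             # memory accesses by type
        read_miss_rate, write_miss_rate = 0.04, 0.08
        read_penalty = write_penalty = 100             # cycles per miss

        stall_cycles = (reads * read_miss_rate * read_penalty
                        + writes * write_miss_rate * write_penalty)
        print(stall_cycles)  # 6000000.0 stall cycles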




    Performance Summary
    1. Improving Cache Performance



    2. The organization of a memory system affects its performance.
    --The cache size, block size, and associativity affect the miss rate.
    --We can organize the main memory to help reduce miss penalties. For example, interleaved memory supports pipelined data accesses.


    3. Can’t neglect cache behavior when evaluating system performance

    Saturday 7 December 2013

    Instruction Type Application

    a)
    Instruction type : Non-Jump, R-type
    Type             : R Type
    Example          : add rd,rs,rt
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and rt.
                       3. ALU operation : Determined by the function (fn) field.
                       4. Memory access : None.
                       5. Register write : The result from the ALU is written to rd.

    b)
    Instruction type : Jump Register, R-type
    Type             : R Type
    Example          : jalr rd,rs
    ALU Usage        : 1. PC update : PC <-- rs.
                       2. Source operand fetch : Only the rs register.
                       3. ALU operation : None.
                       4. Memory access : None.
                       5. Register write : rd <-- PC + 4.

    c)
    Instruction type : Immediate
    Type             : I Type
    Example          : addi rt,rs,imm
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and the immediate field. Arithmetic and comparison immediates (addi, addiu, slti, sltiu) are sign extended; logical immediates (andi, ori, xori) are zero extended.
                       3. ALU operation : Determined by the opcode.
                       4. Memory access : None.
                       5. Register write : The result from the ALU is written to rt.

    d)
    Instruction type : Branch
    Type             : I Type
    Example          : beq $rs,$rt, imm
    ALU Usage        : 1. PC update : If the branch condition is true, PC <-- PC + 4 + (sign-extended immediate field << 2).
                       2. Source operand fetch : Sources are rs and rt.
                       3. ALU operation : The source operands are subtracted for comparison.
                       4. Memory access : None.
                       5. Register write : None.

    e)
    Instruction type : Load
    Type             : I Type
    Example          : lw rt, imm(rs)
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and the sign-extended immediate field.
                       3. ALU operation : The two source operands are added to get the memory address.
                       4. Memory access : A memory read control signal is sent to memory. The result from the ALU is sent to memory as the address.
                       5. Register write : The data from memory is written to rt.

    f)
    Instruction type : Store
    Type             : I Type
    Example          : sw rt,imm(rs)
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and the sign-extended immediate field. The rt register is also fetched.
                       3. ALU operation : The two source operands are added to get the memory address.
                       4. Memory access : A memory write control signal is sent to memory. The ALU result is sent to memory as the address, and the contents of rt are sent as the write data.
                       5. Register write : None.

    g)
    Instruction type : Register Jump
    Type             : R Type
    Example          : jr rs / jalr rd,rs
    ALU Usage        : 1. PC update : jr and jalr : PC <-- rs.
                       2. Source operand fetch : Only the rs register.
                       3. ALU operation : None.
                       4. Memory access : None.
                       5. Register write : jr : there is no register write.
                                           jalr : rd <-- PC + 4.

                       *For jalr, the incremented PC value must be captured before the target address is placed into the PC. With edge-triggered clocking this is easy to do: an adder produces the PC + 4 value, and since the PC does not change until the start of the next cycle, the adder output does not change until then either.

    h)
    Instruction type : Non-Register Jump
    Type             : J Type
    Example          : jal target
    ALU Usage        : 1. PC update : j and jal : PC <-- target address.
                       *The target address is the concatenation of the high-order 4 bits of PC + 4, the 26-bit target field of the instruction, and two 0 bits (a sketch follows below).
                       2. Source operand fetch : None.
                       3. ALU operation : None.
                       4. Memory access : None.
                       5. Register write : j : there is no register write.
                                           jal : ra <-- PC + 4.

                       *For jal, the incremented PC value must be captured before the target address is placed into the PC. With edge-triggered clocking this is easy to do: an adder produces the PC + 4 value, and since the PC does not change until the start of the next cycle, the adder output does not change until then either.
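    A sketch of that concatenation in Python (illustrative only):

        def jump_target(pc, target26):
            """j/jal target: top 4 bits of PC+4, the 26-bit field, then '00'."""
            return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

        # A jal at 0x00400018 with target field 0x100009 jumps to 0x00400024
        print(hex(jump_target(0x00400018, 0x100009)))  # 0x400024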


    MIPS (Introduction/ Register Type/Immediate Type/Jump Type)


    MIPS

    The MIPS architecture (which originally stood for “Microprocessor without Interlocked Pipeline Stages”) is a little endian, byte addressable, three-address, fixed-length ISA. It is a load and store architecture, which means only the load and store instructions can access memory. All other instructions must use registers for operands, which implies that this ISA needs a large register set. MIPS is also limited to fixed-length operations (those that operate on data with the same number of bytes).

    MIPS instructions have up to four fields: an opcode, two operand addresses, and one result address. Essentially three instruction formats are available: the I type (immediate), the R type (register), and the J type (jump). R-type instructions have a 6-bit opcode, a 5-bit source register, a 5-bit target register, a 5-bit destination register, a 5-bit shift amount, and a 6-bit function. I-type instructions have a 6-bit opcode, a 5-bit source register, a 5-bit target register or branch condition, and a 16-bit immediate branch displacement or address displacement. J-type instructions have a 6-bit opcode and a 26-bit target address.


    Instruction format

    An instruction is normally made up of a combination of an operation code and some way of specifying an operand, most commonly by its location or address in memory.




    There are 3 types of instruction format
    • Register Type (R-type) 
    • Immediate Type (I-type) 
    • Jump Type (J-type) 


    Register Type (R-type)




    Opcode : Machine code representation of the instruction mnemonic. The opcode field is 6 bits long (bits 31 to 26).

    rs,rt,rd : The numeric representations of the source registers and the destination register. These numbers correspond to the $X representation of a register, such as $0 or $31. Each of these fields is 5 bits long (bits 25 to 21, 20 to 16, and 15 to 11, respectively).

    Shamt : Used with the shift and rotate instructions, this is the amount by which the source operand is rotated/shifted. This field is 5 bits long (bits 10 to 6).

    Funct : For instructions that share an opcode, the funct field contains the necessary control codes to differentiate the instructions. It is 6 bits long (bits 5 to 0).



    Immediate Type (I-type)



    Opcode : The 6-bit opcode of the instruction. In I-type instructions, all mnemonics have a one-to-one correspondence with the underlying opcodes, because there is no funct field to differentiate instructions sharing an opcode (bits 31 to 26).

    rs,rt : The source and destination register operands, respectively. 5 bits each (bits 25 to 21 and 20 to 16, respectively).

    Imm : The 16-bit immediate value, or branch/address displacement (bits 15 to 0).



    Jump Type (J-type)



    Opcode : The 6-bit opcode corresponding to the particular jump command (bits 31 to 26).

    Target (address) : The 26-bit address of the destination (bits 25 to 0).
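    Putting the three formats together, here is a small Python sketch (my own illustration) that slices a 32-bit instruction word into its fields by bit position; which fields are meaningful depends on the format:

        def decode(word):
            """Extract the R-, I- and J-type fields from a 32-bit MIPS word."""
            return {
                "opcode": (word >> 26) & 0x3F,
                "rs":     (word >> 21) & 0x1F,   # R and I types
                "rt":     (word >> 16) & 0x1F,   # R and I types
                "rd":     (word >> 11) & 0x1F,   # R type
                "shamt":  (word >> 6)  & 0x1F,   # R type
                "funct":  word & 0x3F,           # R type
                "imm":    word & 0xFFFF,         # I type
                "target": word & 0x3FFFFFF,      # J type
            }

        # add $t0,$t1,$t2 encodes as 0x012A4020:
        # opcode 0, rs 9, rt 10, rd 8, shamt 0, funct 0x20 (32)
        print(decode(0x012A4020))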












    Monday 2 December 2013

    Cache Misses

    A cache miss occurs when a program accesses a memory location that is not in the cache. The processor then has to wait for the data to be fetched from the next cache level or from main memory before it can continue to execute.
    The performance of an application is directly influenced by cache misses.
    It is hard to tell from the number of misses alone whether cache misses are causing performance problems in an application. The same number of misses will cause a much greater relative slowdown in a short-running application than in a long-running one.
    A more useful metric is the cache miss ratio, that is, the fraction of memory accesses that cause a cache miss. From the miss ratio you can usually tell whether cache misses may be a performance problem in an application.
    The cache miss ratio of an application depends on the size of the cache. A larger cache can hold more cache lines and is therefore expected to get fewer misses.
    The performance impact of a cache miss depends on the latency of fetching the data from the next cache level or main memory. For example, assume that you have a processor with two cache levels. A miss in the L1 cache causes data to be fetched from the L2 cache, which has a relatively low latency, so a quite high L1 miss ratio can be acceptable. A miss in the L2 cache, on the other hand, will cause a long stall while fetching data from main memory, so only a much lower L2 miss ratio is acceptable.

    Write-Through
    On a data-write hit, we could just update the block in the cache
    -but then cache and memory would be inconsistent
    Write-through also updates memory, but makes writes take longer
    -e.g. if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    -Effective CPI = 1 + 0.1 x 100 = 11 (see the sketch below)
    Solution: write buffer
    -Holds data waiting to be written to memory
    -CPU continues immediately and only stalls on a write if the write buffer is already full
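    The arithmetic from the example above, as a tiny Python sketch:

        # base CPI 1, 10% of instructions are stores, 100 cycles per memory write
        base_cpi, store_fraction, write_cycles = 1.0, 0.10, 100
        effective_cpi = base_cpi + store_fraction * write_cycles
        print(effective_cpi)  # 11.0 without a write buffer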

    Write Back

    Alternative: on a data-write hit, just update the block in the cache
    -keep track of whether each block is dirty
    When a dirty block is replaced
    -write it back to memory
    -can use a write buffer to allow the replacing block to be read first

    Accessing Cache

    Total number of bits needed for a cache
    -size of the tag field is
                    -32 - (n + m + 2)
    -the total number of bits in a direct-mapped cache is
                    -2^n x (block size + tag size + valid field size)
    Block size = 2^m words (2^(m+5) bits), 1 bit for the valid field
    Total number of bits in the cache is
    -2^n x (2^m x 32 + (32 - n - m - 2) + 1)
    = 2^n x (2^m x 32 + 31 - n - m)


    Example
    How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?
    -16 KiB is 4K (2^12) words
    -with a block size of 4 words (2^2), there are 1024 (2^10) blocks
    -each block has 4 x 32 = 128 bits of data plus a tag, which is 32 – 10 – 2 – 2 bits
    -plus a valid bit
    -total cache size = 2^10 x (4 x 32 + (32 – 10 – 2 – 2) + 1)
                      = 2^10 x 147
                      = 147 Kbits
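    The same calculation as a small Python function (my own sketch; n is the number of index bits, m the number of word-offset bits):

        def direct_mapped_bits(n, m, addr_bits=32):
            """Total bits: 2^n blocks of 2^m words, plus tag and valid bit."""
            data_bits = (2 ** m) * 32
            tag_bits = addr_bits - n - m - 2
            return (2 ** n) * (data_bits + tag_bits + 1)

        # 16 KiB of data in 4-word blocks: n = 10, m = 2
        print(direct_mapped_bits(10, 2))  # 150528 bits = 2^10 x 147 = 147 Kbits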

    Cache example

    -8-block, 1 word/block, direct mapped
    -initial state







    Direct Mapped Cache

    Let's assume, as we did for fully associative caches, that we have 128 slots and 32 bytes per slot.
    In a direct-mapped cache, we treat the 128 slots as if they were an array of slots. We can index into the array using binary numbers. For 128 slots, you need 7 bits, so the index runs from 0000000 to 1111111 in binary.
    Since we have 128 slots, we need to specify which one the cache line goes in, and this requires lg 128 = 7 bits. We can get these bits directly from the address itself.
    Bits A4-A0 are still the offset. The slot number is the next 7 bits, bits A11-A5. The remaining bits, A31-A12, are the tag.

    Finding the slot
    Suppose you have an address B31-B0 and you want to find it in the cache (see the sketch below).
    Use bits B11-B5 to find the slot.
    See if bits B31-B12 match the tag of the slot.
    If so, get the bytes at offset B4-B0.
    If not, fetch the 32 bytes from memory and place them in the slot, updating the valid bit, dirty bit, and tag as needed.
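    A sketch of the address split in Python (illustrative only, for the 128-slot, 32-bytes-per-slot cache above):

        def split_address(addr):
            """Offset = A4-A0, slot = A11-A5, tag = A31-A12."""
            offset = addr & 0x1F          # low 5 bits: byte within the slot
            slot = (addr >> 5) & 0x7F     # next 7 bits: which of 128 slots
            tag = addr >> 12              # remaining 20 bits
            return tag, slot, offset

        tag, slot, offset = split_address(0x12345678)
        print(hex(tag), hex(slot), hex(offset))  # 0x12345 0x33 0x18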

    Advantages
    If there's an advantage to the scheme, it's that it's very simple. You don't have to simultaneously match tags across all slots; you have just one slot to check. If the data isn't in that slot, it's obvious which slot gets evicted.

    Summary
    A direct-mapped cache scheme makes picking the slot easy. It treats the slots as a large array, and the index into the array is picked from bits of the address, which is why we need the number of slots to be a power of 2; otherwise we couldn't select bits from the address.
    The scheme can suffer from many addresses “colliding” into the same slot, causing the cache line to be repeatedly evicted, even though there may be empty slots that aren't being used, or are being used less frequently.

    Memory Hierarchy


    CPU speed is dominated by memory performance

    -More significant than: ISA, circuit optimization, pipelining, etc.
    Trick 1: Make slow main memory appear faster (caching)
    Trick 2: Make small main memory appear bigger (virtual memory)

    Memory Hierarchy Levels

    Block (aka line): the unit of copying
    -may be multiple words
    If accessed data is present in the upper level
    -access is satisfied by the upper level
    If accessed data is absent
    -miss --> block copied from the lower level
    -then accessed data is supplied from the upper level




    Parking Lot Analogy
    Suppose we have 1000 parking spots. However, instead of being unnumbered, each parking spot is given a 3 digit number from 000 to 999.
    Your parking spot is based on the first 3 digits of your student ID number.
    What problem can occur? If you received your social security number (student ID) in Maryland, the first three digits of your student ID are likely to be in the low 200's.
    Thus, there's a much higher chance someone will be in your parking spot. There's also a chance that parking spots numbered in the 900's might be mostly empty since few students may have student IDs with the first 3 digits in that range.
    This is basically the direct-mapped scheme. The advantages of this scheme are that it's simple to find your parking spot. It can only be in one location.

    Memory Organization

    Overview


    When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
    The cache is usually organized as a hierarchy of multiple cache levels.

    Cache memory
    Cache is a small but high-speed memory used by the central processing unit of a computer to reduce the average time to access memory. The cache stores copies of the data from frequently used main memory locations. Often the main memory will provide a wider data word to the cache than the CPU requires, in order to fill the cache more rapidly. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

    Cache Analogy
    You are writing a term paper for your history class at a table in the library
    -As you work you realize you need a book
    -You stop writing, fetch the reference, continue writing
    -You don’t immediately return the book, maybe you’ll need it again
    -Soon you have a few books at your table, and you can work smoothly without needing to fetch more books from the shelves
    -The table is a cache for the rest of the library
    Now you switch to doing your biology homework
    -You need to fetch your biology textbook from the shelf
    -If your table is full, you need to return one of the history books back to the shelf to make room for the biology book