Thursday 19 December 2013
4. Base Addressing
- Base addressing is also known as indirect addressing.
- The address of the operand is the sum of the immediate and the value in a register (rs). The offset is limited to 16 bits because each MIPS instruction must fit into a word.
- This is used in the lw and sw (load word, store word) instructions.
- The offset value is a signed number represented in two's complement format (see the sketch below).
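A minimal Python sketch (not real MIPS tooling; the register and offset values are made up) of how a base-addressed access like lw or sw computes its effective address:

```python
# Minimal sketch: effective address for a base-addressed access (lw/sw)
# on a 32-bit machine, with a sign-extended 16-bit offset.

def sign_extend_16(imm: int) -> int:
    """Sign-extend a 16-bit two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(rs_value: int, imm16: int) -> int:
    """Address = base register value + sign-extended 16-bit offset (mod 2^32)."""
    return (rs_value + sign_extend_16(imm16)) & 0xFFFFFFFF

# Hypothetical example: lw $t0, -4($sp) with $sp = 0x7FFFEFFC
print(hex(effective_address(0x7FFFEFFC, 0xFFFC)))  # 0x7fffeff8
```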
3. PC-Relative Addressing
- A data or instruction memory location is specified as an offset relative to the incremented PC.
- PC-relative addressing is usually used in the beq and bne (branch equal, branch not equal) instructions.
- It enables position-independent code. A small offset is adequate for shorter loops. A worked sketch follows.
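A minimal Python sketch (assumed PC and offset values) of how a beq/bne target is formed from the incremented PC:

```python
# Minimal sketch: PC-relative branch target for beq/bne.
# Target = PC + 4 + (sign-extended 16-bit offset << 2).

def sign_extend_16(imm: int) -> int:
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc: int, imm16: int) -> int:
    """Word offset relative to the incremented PC."""
    return (pc + 4 + (sign_extend_16(imm16) << 2)) & 0xFFFFFFFF

print(hex(branch_target(0x00400000, 0x0003)))  # forward:  0x00400010
print(hex(branch_target(0x00400000, 0xFFFF)))  # backward: 0x00400000
```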
2. Immediate Addressing
- In immediate addressing, the operand is a constant held within the instruction itself.
- Immediate addressing is similar to register addressing: instructions execute quickly because they avoid the delays associated with memory access.
- Since the destination of the jump instruction is held in the instruction itself, the jump format can also be considered an example of immediate addressing.
1. Register Addressing
- Register addressing is the simplest addressing mode.
- Both operands are in registers. Instructions execute quickly because they avoid the delays associated with memory access.
- The number of registers is limited since only a few bits are reserved to select a register.
- It takes n bits to address 2^n registers; since there are 32 registers, 5 bits suffice.
Addressing Modes
MIPS has only a small number of ways that it computes addresses in memory. The address can be the address of an instruction (for branch and jump instructions) or the address of data (for load and store instructions).
For any given operation, such as load, add or branch, there are often many different ways to specify the address of the operand(s).
The different ways of determining the address are called addressing modes.
We will look at the 4 ways addresses are computed
- Register Addressing
- Immediate Addressing
- PC-Relative Addressing
- Base Addressing
Thursday 12 December 2013
Associative Caches
Spectrum of associativity
Types of associativity
1. Fully Associative
--In a fully associative scheme, any slot can store the cache line. The hardware for finding whether the desired data is in the cache requires comparing the tag bits of the address to the tag bits of every slot (in parallel), and making sure the valid bit is set.
2. Set associative
--A set-associative cache scheme is a combination of fully associative and direct mapped schemes. You group slots into sets. You find the appropriate set for a given address (which is like the direct mapped scheme), and within the set you find the appropriate slot (which is like the fully associative scheme).
Example
--Compare 4-block caches on the block access sequence 0, 8, 0, 6, 8 (see the simulation sketch below).
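A minimal Python simulation of this example, assuming LRU replacement, comparing direct-mapped, 2-way set-associative, and fully associative organizations of a 4-block cache:

```python
# Minimal simulation (assumed LRU replacement) of the 4-block example:
# count misses for the block access sequence 0, 8, 0, 6, 8.

def simulate(num_blocks: int, ways: int, accesses):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set is an LRU list, MRU last
    misses = 0
    for block in accesses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: move to MRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the LRU block
        s.append(block)
    return misses

seq = [0, 8, 0, 6, 8]
print("direct mapped    :", simulate(4, 1, seq), "misses")  # 5
print("2-way set assoc  :", simulate(4, 2, seq), "misses")  # 4
print("fully associative:", simulate(4, 4, seq), "misses")  # 3
```

Higher associativity reduces conflict misses here: blocks 0 and 8 map to the same direct-mapped slot and repeatedly evict each other.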
Cache Performance
- There are 3 formulas to memorize (a worked sketch follows the list):
1. Memory stall cycles
= Memory accesses x Miss rate x Miss penalty
2. CPU time
= (CPU execution cycles + Memory stall cycles) x Cycle time
3. Average memory access time (AMAT)
= Hit time + (Miss rate x Miss penalty)
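A minimal Python sketch plugging assumed example numbers into the three formulas:

```python
# Minimal sketch with assumed example numbers for the three formulas.

accesses      = 1_000_000   # memory accesses
miss_rate     = 0.05        # 5% of accesses miss
miss_penalty  = 100         # cycles per miss
hit_time      = 1           # cycles on a hit
exec_cycles   = 2_000_000   # CPU execution cycles (no stalls)
cycle_time_ns = 0.5         # clock cycle time in ns

# 1. Memory stall cycles = accesses x miss rate x miss penalty
stall_cycles = accesses * miss_rate * miss_penalty

# 2. CPU time = (execution cycles + stall cycles) x cycle time
cpu_time_ns = (exec_cycles + stall_cycles) * cycle_time_ns

# 3. AMAT = hit time + miss rate x miss penalty
amat = hit_time + miss_rate * miss_penalty

print(stall_cycles, cpu_time_ns, amat)  # 5000000.0 3500000.0 6.0
```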
More Cache Performance Formula
- Can split access time into instructions & data:
Average memory access time = (% instruction accesses) × (instruction memory access time) + (% data accesses) × (data memory access time)
- Another simple formula:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × cycle time
- Can break stalls into reads and writes:
Memory stall cycles = (Reads × read miss rate × read miss penalty) + (Writes × write miss rate × write miss penalty)
A sketch with example numbers follows.
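A minimal Python sketch of the split formulas above, again with assumed example numbers:

```python
# Minimal sketch (assumed example numbers) of the split-up formulas.

# Split AMAT into instruction and data accesses
pct_instr, instr_time = 0.75, 2.0   # 75% instruction accesses, 2 cycles each
pct_data,  data_time  = 0.25, 6.0   # 25% data accesses, 6 cycles each
amat = pct_instr * instr_time + pct_data * data_time   # 3.0 cycles

# Split memory stalls into reads and writes
reads,  read_miss_rate,  read_penalty  = 800_000, 0.04, 100
writes, write_miss_rate, write_penalty = 200_000, 0.10, 100
stalls = (reads * read_miss_rate * read_penalty
          + writes * write_miss_rate * write_penalty)

print(amat, stalls)  # 3.0 5200000.0
```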
Performance Summary
1. Improving Cache Performance
2. The organization of a memory system affects its performance.
--The cache size, block size, and associativity affect the miss rate.
--We can organize the main memory to help reduce miss penalties. For example, interleaved memory supports pipelined data accesses.
3. Can't neglect cache behavior when evaluating system performance.
Saturday 7 December 2013
Instruction Type Application
a)
Instruction type : Non-Jump, R-type
Type : R Type
Example : add rd,rs,rt
ALU Usage : 1. PC update : No update beyond the normal increment.
2. Source operand fetch : Sources are rs and rt
3. ALU operation : Determined by the function ( fn ) field
4. Memory access : None
5. Register write : The result from the ALU is written to rd .
b)
Instruction type : Jump Register, R-type
Type : R Type
Example : jalr rd,rs
ALU Usage : 1. PC update : PC <-- rs
2. Source operand fetch : Only the rs register.
3. ALU operation : None.
4. Memory access : None.
5. Register write : rd <-- PC + 4
c)
Instruction type : Immediate
Type : I Type
Example : addi rt,rs,imm
ALU Usage : 1. PC update : No update beyond the normal increment.
2. Source operand fetch : Sources are rs and the immediate field. For the arithmetic and comparison instructions (including sltiu) the immediate field is sign extended; for the logical instructions (andi, ori, xori) it is zero extended.
3. ALU operation : Determined by the opcode
4. Memory access : None.
5. Register write : The result from the ALU is written to rt .
d)
Instruction type : Branch
Type : I Type
Example : beq $rs,$rt, imm
ALU Usage : 1. PC update : If the branch condition is true, PC <-- PC + 4 + (sign-extended immediate field << 2).
2. Source operand fetch : Sources are rs and rt
3. ALU operation : The source operands are subtracted for comparison .
4. Memory access : None.
5. Register write : None.
e)
Instruction type : Load
Type : I Type
Example : lw rt, imm(rs)
ALU Usage : 1. PC update : No update beyond the normal increment
2. Source operand fetch : Sources are rs and the sign-extended immediate field
3. ALU operation : The two source operands are added to get the memory address .
4. Memory access : A memory read control signal is sent to memory . The result from the ALU is sent to memory as the address .
5. Register write : The data from memory is written to rt .
f)
Instruction type : Store
Type : I Type
Example : sw rt,imm(rs)
ALU Usage : 1. PC update : No update beyond the normal increment
2. Source operand fetch : Sources are rs and the sign-extended immediate field. The rt register is also fetched.
3. ALU operation : The two source operands are added to get the memory address.
4. Memory access : A memory write control signal is sent to memory. The result from the ALU is sent to memory as the address. The contents of rt are sent to memory as the write data.
5. Register write : None.
g)
Instruction type : Register Jump
Type : R Type
Example : jr rs / jalr rd,rs
ALU Usage : 1. PC update : For jr and jalr, PC <-- rs
2. Source operand fetch : Only rs register.
3. ALU operation : None.
4. Memory access : None.
5. Register write : jr : There is no register write
jalr : rd <-- PC + 4
*For jalr, the incremented PC value must be captured before the target address is placed into the PC. With edge-triggered clocking this is easy to do: an adder produces the PC + 4 value, and the PC does not change until the start of the next cycle, so the adder output will not change until then either.
h)
Instruction type : Non-Register Jump
Type : J Type
Example : jal target
ALU Usage : 1. PC update : For j and jal, PC <-- target address
* The target address is the concatenation of the high-order 4 bits of PC + 4, the 26-bit target field of the instruction, and two 0 bits (see the sketch after this list).
2. Source operand fetch : None
3. ALU operation : None.
4. Memory access : None.
5. Register write : j : There is no register write.
jal : ra <-- PC + 4
*For jal, the incremented PC value must be captured before the target address is placed into the PC. With edge-triggered clocking this is easy to do: an adder produces the PC + 4 value, and the PC does not change until the start of the next cycle, so the adder output will not change until then either.
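A minimal Python sketch (assumed addresses, not real MIPS tooling) of how the j/jal target address is formed by concatenation:

```python
# Minimal sketch: form a j/jal target by concatenating the high-order
# 4 bits of PC + 4, the 26-bit target field, and two 0 bits.

def jump_target(pc: int, target_field: int) -> int:
    upper = (pc + 4) & 0xF0000000         # high-order 4 bits of PC + 4
    return upper | ((target_field & 0x3FFFFFF) << 2)

pc = 0x00400018                           # hypothetical PC value
print(hex(jump_target(pc, 0x0100010)))    # 0x00400040
```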
MIPS (Introduction/ Register Type/Immediate Type/Jump Type)
MIPS
The MIPS architecture (which originally stood for "Microprocessor without Interlocked Pipeline Stages") is a little endian, word addressable, three-address, fixed-length ISA. It is a load and store architecture, which means only the load and store instructions can access memory. All other instructions must use registers for operands, which implies that this ISA needs a large register set. MIPS is also limited to fixed-length operations (those that operate on data with the same number of bytes). A MIPS instruction has up to four fields: an opcode, two operand addresses, and one result address. Essentially three instruction formats are available: the I type (immediate), the R type (register), and the J type (jump). R type instructions have a 6-bit opcode, a 5-bit source register, a 5-bit target register, a 5-bit destination register, a 5-bit shift amount, and a 6-bit function. I type instructions have a 6-bit opcode, a 5-bit source register, a 5-bit target register or branch condition, and a 16-bit immediate branch displacement or address displacement. J type instructions have a 6-bit opcode and a 26-bit target address.
Instruction format
An instruction is normally made up of a combination of an operation code and some way of specifying an operand, most commonly by its location or address in memory.
There are 3 types of instruction format
- Register Type (R-type)
- Immediate Type (I-type)
- Jump Type (J-type)
Register Type (R-type)
Opcode : Machine code representation of the instruction mnemonic. The opcode field is 6 bits long (bits 26 to 31).
rs, rt, rd : The numeric representations of the source registers and the destination register. These numbers correspond to the $X representation of a register, such as $0 or $31. Each of these fields is 5 bits long (bits 25 to 21, 20 to 16, and 15 to 11, respectively).
Shamt : Used with the shift and rotate instructions, this is the amount by which the source operand rt is rotated/shifted. This field is 5 bits long (bits 6 to 10).
Funct : For instructions that share an opcode, the funct parameter contains the necessary control codes to differentiate the different instructions. 6 bits long (bits 0 to 5).
Immediate Type (I-type)
Opcode : The 6-bit opcode of the instruction. In I instructions, all mnemonics have a one-to-one correspondence with the underlying opcodes, because there is no funct parameter to differentiate instructions with an identical opcode. 6 bits (bits 26 to 31). A decoding sketch follows.
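A minimal Python sketch (a field decoder, not an assembler) that extracts the R-type and I-type fields at the bit positions described above:

```python
# Minimal sketch: pull the R-type and I-type fields out of a 32-bit
# MIPS instruction word using the bit positions given above.

def decode_r_type(word: int):
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rs":     (word >> 21) & 0x1F,   # bits 25..21
        "rt":     (word >> 16) & 0x1F,   # bits 20..16
        "rd":     (word >> 11) & 0x1F,   # bits 15..11
        "shamt":  (word >>  6) & 0x1F,   # bits 10..6
        "funct":  word         & 0x3F,   # bits 5..0
    }

def decode_i_type(word: int):
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "imm":    word & 0xFFFF,         # 16-bit immediate/displacement
    }

# add $t2, $t0, $t1  encodes to  0x01095020
print(decode_r_type(0x01095020))
# {'opcode': 0, 'rs': 8, 'rt': 9, 'rd': 10, 'shamt': 0, 'funct': 32}
```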
Monday 2 December 2013
Cache Misses
A cache miss occurs when a program accesses a memory location that is not in the cache. The processor then has to wait for the data to be fetched from the next cache level or from main memory before it can continue to execute.
The performance of the application is directly influenced by cache misses.
It is hard to tell from just the number of misses if cache misses are causing performance problems in an application. The same number of misses will cause a much greater relative slowdown in a short-running application than in a long-running one.
A more useful metric is the cache miss ratio, that is, the ratio of memory accesses that cause a cache miss. From the miss ratio you can usually tell whether cache misses may be a performance problem in an application.
The cache miss ratio of an application depends on the size of the cache. A larger cache can hold more cache lines and is therefore expected to get fewer misses.
The performance impact of a cache miss depends on the latency of fetching the data from the next cache level or main memory. For example, assume that you have a processor with two cache levels. A miss in the L1 cache then causes data to be fetched from the L2 cache which has a relatively low latency, so a quite high L1 miss ratio can be acceptable. A miss in the L2 cache on the other hand will cause a long stall while fetching data from main memory, so only a much lower L2 miss ratio is acceptable.
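A minimal Python sketch with assumed latencies and miss ratios, showing how the average access time compounds across two cache levels:

```python
# Minimal sketch (assumed values): why a high L1 miss ratio can be
# tolerable while even a modest L2 miss ratio is costly.

l1_hit, l2_hit, mem_latency = 4, 10, 75      # cycles (assumed values)
l1_miss_ratio, l2_miss_ratio = 0.10, 0.02    # assumed miss ratios

avg_access = l1_hit + l1_miss_ratio * (l2_hit + l2_miss_ratio * mem_latency)
print(avg_access)  # 4 + 0.10 * (10 + 1.5) = 5.15 cycles
```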
Write-Through
On a data write hit, we could just update the block in the cache
-but cache and memory would then be inconsistent
Write-through also updates memory, but makes writes take longer
-e.g. if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
-Effective CPI = 1 + 0.1 x 100 = 11 (see the sketch after this list)
Solution: write buffer
-Holds data waiting to be written to memory
-CPU continues immediately and only stalls on a write if the write buffer is already full
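A minimal Python sketch reproducing the effective-CPI arithmetic from the example above:

```python
# Minimal sketch: effective CPI when every store stalls for the write.

base_cpi       = 1.0
store_fraction = 0.10   # 10% of instructions are stores
write_latency  = 100    # cycles per write to memory

effective_cpi = base_cpi + store_fraction * write_latency
print(effective_cpi)    # 1 + 0.1 * 100 = 11.0
```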
Write Back
Alternative: On a data write hit, just update the block in the cache
-keep track of whether each block is dirty
When a dirty block is replaced
-write it back to memory
-can use a write buffer to allow the replacing block to be read first
Accessing Cache
Total number of bits needed for a cache
-size of the tag field is 32 - (n + m + 2)
-the total number of bits in a direct-mapped cache is 2^n x (block size + tag size + valid field size)
-block size = 2^m words (2^(m+5) bits), plus 1 bit for the valid field
-total number of bits in the cache is
2^n x (2^m x 32 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m)
Example
How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
-16 KB is 4K (2^12) words
-with a block size of 4 words (2^2), there are 1024 (2^10) blocks
-each block has 4 x 32 = 128 bits of data plus a tag, which is 32 - 10 - 2 - 2 bits
-plus a valid bit
-total cache size = 2^10 x (4 x 32 + (32 - 10 - 2 - 2) + 1)
= 2^10 x 147
= 147 Kbits (checked in the sketch below)
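A minimal Python sketch checking the worked example against the formula above:

```python
# Minimal sketch: total bits = 2^n x (2^m x 32 + (32 - n - m - 2) + 1)

def cache_total_bits(n: int, m: int) -> int:
    """n = index bits (2^n blocks), m = word-offset bits (2^m words/block)."""
    data_bits  = (2 ** m) * 32
    tag_bits   = 32 - n - m - 2
    valid_bits = 1
    return (2 ** n) * (data_bits + tag_bits + valid_bits)

# 16 KB of data, 4-word blocks: n = 10, m = 2
print(cache_total_bits(10, 2))          # 150528 bits
print(cache_total_bits(10, 2) / 1024)   # 147.0 Kbits
```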
Direct Mapped Cache
Let's assume, as we did for fully associative caches, that we have 128 slots and 32 bytes per slot.
In a direct-mapped cache, we treat the 128 slots as if they were an array of slots. We can index into the array using binary numbers. For 128 slots, you need 7 bits, so the index of the array runs from 0000000 to 1111111 in binary.
Since we have 128 slots, we need to specify which one the cache line should go in, and this requires lg 128 = 7 bits. We can get the bits directly from the address itself.
Bits A4-A0 are still the offset. The slot number is the next 7 bits, A11-A5. The remaining bits, A31-A12, are the tag.
Finding the slot
Suppose you have an address B31-B0 and you want to find it in the cache.
Use bits B11-B5 to find the slot.
See if bits B31-B12 match the tag of the slot.
If so, get the bytes at offset B4-B0.
If not, fetch the 32 bytes from memory and place them in the slot, updating the valid bit, dirty bit, and tag as needed. (A sketch of this address split follows.)
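A minimal Python sketch (assuming the 128-slot, 32-bytes-per-slot cache above) that splits a 32-bit address into tag, slot, and offset:

```python
# Minimal sketch: split a 32-bit address for a direct-mapped cache with
# 128 slots x 32 bytes: offset = bits 4..0, slot = bits 11..5, tag = 31..12.

def split_address(addr: int):
    offset = addr & 0x1F            # bits 4..0  (32 bytes per slot)
    slot   = (addr >> 5) & 0x7F     # bits 11..5 (128 slots)
    tag    = addr >> 12             # bits 31..12
    return tag, slot, offset

tag, slot, offset = split_address(0x12345678)   # hypothetical address
print(hex(tag), slot, offset)                   # 0x12345 51 24
```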
Advantages
If there's an advantage to the scheme, it's that it's very simple. You don't have to simultaneously match tags with all slots; you just have one slot to check. If the data isn't in the slot, it's obvious that this is the slot to get evicted.
Summary
A direct-mapped cache scheme makes picking the slot easy. It treats the slots as a large array, and the index of the array is picked from bits of the address, which is why we need the number of slots to be a power of 2; otherwise we couldn't select bits from the address.
The scheme can suffer from many addresses "colliding" to the same slot, thus causing the cache line to be repeatedly evicted, even though there may be empty slots that aren't being used, or are being used less frequently.
Memory Hierarchy
CPU speed is dominated by memory performance.
Trick 1: Make slow main memory appear faster (caching)
Trick 2: Make small main memory appear bigger (virtual memory)
Memory Hierarchy Levels
-Block (or line): the unit of copying between levels; may be multiple words
If accessed data is present in the upper level
-access is satisfied by the upper level
If accessed data is absent
-miss --> the block is copied from the lower level, then the accessed data is supplied from the upper level
Parking Lot Analogy
Suppose we have 1000 parking spots. However, instead of being unnumbered, each parking spot is given a 3-digit number from 000 to 999.
Your parking spot is based on the first 3 digits of your student ID number.
What problem can occur? If you received your social security number (student ID) in Maryland, the first three digits of your student ID are likely to be in the low 200's.
Thus, there's a much higher chance someone will be in your parking spot. There's also a chance that parking spots numbered in the 900's might be mostly empty, since few students may have student IDs with the first 3 digits in that range.
This is basically the direct-mapped scheme.
The advantages of this scheme are that it's simple to find your parking spot.
It can only be in one location.
Memory Organization
Overview
When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
The data cache is usually organized as a hierarchy of multiple cache levels.
Cache memory
Cache is a small but high-speed memory used by the central processing unit of a computer to reduce the average time to access memory. The cache stores copies of the data from frequently used main memory locations. Often the main memory will provide a wider data word to the cache than the CPU requires, to fill the cache more rapidly. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
Cache Analogy
You are writing a term paper for your history class at a table in the library
-As you work you realize you need a book
-You stop writing, fetch the reference, continue writing
-You don’t immediately return the book, maybe you’ll need it again
-Soon you have a few books at your table, and you can work smoothly without needing to fetch more books from the shelves
-The table is a cache for the rest of the library
Now you switch to doing your biology homework
-You need to fetch your biology textbook from the shelf
-If your table is full, you need to return one of the history books back to the shelf to make room for the biology book