Thursday 19 December 2013

4. Base Addressing

  • Base addressing is also known as indirect addressing.
  • The address of the operand is the sum of a 16-bit immediate and the value in a register (rs). The offset is limited to 16 bits because each MIPS instruction must fit into a 32-bit word.
  • This mode is used by the lw and sw (load word, store word) instructions.
  • The offset value is a signed number represented in two's complement format (see the sketch below).
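  A rough sketch of this calculation in plain Python (my own illustration; helper names like sign_extend16 and base_address are made up, not MIPS terminology):

      def sign_extend16(imm):
          """Sign-extend a 16-bit two's complement value to a Python int."""
          imm &= 0xFFFF
          return imm - 0x10000 if imm & 0x8000 else imm

      def base_address(rs_value, imm16):
          """lw/sw effective address: base register plus sign-extended offset."""
          return (rs_value + sign_extend16(imm16)) & 0xFFFFFFFF

      # lw $t0, -4($sp) with $sp = 0x7FFFEFFC reads from address 0x7FFFEFF8
      print(hex(base_address(0x7FFFEFFC, 0xFFFC)))  # 0x7fffeff8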


3. PC-Relative Addressing

  • A data or instruction memory location is specified as an offset relative to the incremented PC.
  • PC-relative addressing is usually used in the beq and bne (branch equal, branch not equal) instructions.
  • It supports position-independent code. A small offset is adequate because branch targets are usually nearby, as in short loops (see the sketch below).
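  A minimal sketch of the target calculation (Python, illustrative only; the offset counts words, hence the shift left by 2):

      def sign_extend16(imm):
          """Sign-extend a 16-bit two's complement value to a Python int."""
          imm &= 0xFFFF
          return imm - 0x10000 if imm & 0x8000 else imm

      def branch_target(pc, imm16):
          """beq/bne target: incremented PC plus sign-extended word offset."""
          return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

      # A branch at 0x00400010 with an offset of -2 words targets 0x0040000C
      print(hex(branch_target(0x00400010, 0xFFFE)))  # 0x40000c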

2. Immediate Addressing

  • In immediate addressing, the operand is a constant held within the instruction itself.
  • Like register addressing, immediate instructions execute quickly because they avoid the delays associated with memory access.
  • Since the destination of the jump instruction format is held in the instruction, it can also be considered an example of immediate addressing (see the sketch below).
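  A minimal Python sketch of how the 16-bit immediate field becomes an operand (illustrative only; in MIPS the zero-extension case applies to the logical immediates such as andi, ori and xori):

      def immediate_operand(imm16, zero_extend=False):
          """Turn the 16-bit immediate field into an operand value."""
          imm16 &= 0xFFFF
          if zero_extend:
              return imm16                                         # logical immediates
          return imm16 - 0x10000 if imm16 & 0x8000 else imm16      # sign extend

      print(immediate_operand(0xFFFF))                     # -1    (e.g. addi)
      print(immediate_operand(0xFFFF, zero_extend=True))   # 65535 (e.g. andi)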

    1. Register Addressing


    • Register addressing is the simplest addressing mode.
    • Both operands are in registers. Instructions execute quickly because they avoid the delays associated with memory access.
    • The number of registers is limited, since only a few bits are reserved to select a register.
    • It takes n bits to address 2^n registers; with 32 registers, each register field fits in five bits.
















    Addressing Modes

    MIPS has only a small number of ways to compute addresses in memory. The address can be the address of an instruction (for branch and jump instructions) or the address of data (for load and store instructions).

    For any given operation, such as load, add or branch, there are often many different ways to specify the address of the operand(s).
    The different ways of determining the address are called addressing modes.

    We will look at the 4 ways addresses are computed:

    1. Register Addressing
    2. Immediate Addressing
    3. PC-Relative Addressing
    4. Base Addressing





    Thursday 12 December 2013

    Associative Caches

    -Increased associativity decreases the miss rate, but with diminishing returns


    Spectrum of associativity






    Types of associativity

    1. Fully Associative
    --In a fully associative scheme, any slot can store the cache line. The hardware for finding whether the desired data is in the cache requires comparing the tag bits of the address to the tag bits of every slot (in parallel), and making sure the valid bit is set.



    2. Set associative
    --A set-associative cache scheme is a combination of fully associative and direct mapped schemes. You group slots into sets. You find the appropriate set for a given address (which is like the direct mapped scheme), and within the set you find the appropriate slot (which is like the fully associative scheme).





    Example
    --Compare 4-block caches with the block access sequence 0, 8, 0, 6, 8 (a small simulation sketch follows)
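    A small LRU simulation of this example (a Python sketch of my own, not from the notes; it counts misses for a 4-block cache at each associativity):

        from collections import OrderedDict

        def count_misses(accesses, num_blocks, ways):
            """Simulate an LRU cache of num_blocks blocks split into sets."""
            num_sets = num_blocks // ways
            sets = [OrderedDict() for _ in range(num_sets)]
            misses = 0
            for block in accesses:
                s = sets[block % num_sets]       # pick the set by block number
                if block in s:
                    s.move_to_end(block)         # hit: refresh LRU order
                else:
                    misses += 1
                    if len(s) == ways:           # set full: evict LRU block
                        s.popitem(last=False)
                    s[block] = True
            return misses

        seq = [0, 8, 0, 6, 8]
        for ways in (1, 2, 4):
            print(f"{ways}-way: {count_misses(seq, 4, ways)} misses")
        # direct mapped (1-way): 5, 2-way: 4, fully associative (4-way): 3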





    Cache Performance

    • There are 3 formulas to memorize:
    1. Memory stall cycles
    = Memory accesses x Miss rate x Miss penalty









    2. CPU time
    = (CPU execution cycles + Memory stall cycles) x Cycle time


    3. Average memory access time (AMAT)
    = Hit time + (Miss rate x Miss penalty)
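    A quick worked example of the AMAT formula (Python; the numbers are made-up inputs for illustration):

        # Assumed values: 1-cycle hit time, 5% miss rate, 100-cycle miss penalty
        hit_time, miss_rate, miss_penalty = 1, 0.05, 100
        amat = hit_time + miss_rate * miss_penalty
        print(amat)  # 6.0 cycles per memory access on average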





    More Cache Performance Formulas

    • Can split access time into instructions & data:
      Average memory access time =
      (% instruction accesses) × (instruction memory access time)
      + (% data accesses) × (data memory access time)

    • Another simple formula:
      CPU time = (CPU execution clock cycles + Memory stall clock cycles) × cycle time

    • Can break stalls into reads and writes:
      Memory stall cycles =
      (Reads × read miss rate × read miss penalty)
      + (Writes × write miss rate × write miss penalty)
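    A quick sketch of the read/write split (Python; all of the counts and rates here are hypothetical inputs, not from the notes):

        reads, writes = 1_000_000, 250_000             # memory accesses by type
        read_miss_rate, write_miss_rate = 0.04, 0.08
        read_penalty = write_penalty = 100             # cycles per miss

        stall_cycles = (reads * read_miss_rate * read_penalty
                        + writes * write_miss_rate * write_penalty)
        print(stall_cycles)  # 6000000.0 stall cycles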




    Performance Summary
    1. Improving Cache Performance



    2. The organization of a memory system affects its performance.
    --The cache size, block size, and associativity affect the miss rate.
    --We can organize the main memory to help reduce miss penalties. For example, interleaved memory supports pipelined data accesses.


    3. Can’t neglect cache behavior when evaluating system performance

    Saturday 7 December 2013

    Instruction Type Application

    a)
    Instruction type : Non-Jump, R-type
    Type             : R Type
    Example          : add rd,rs,rt
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and rt.
                       3. ALU operation : Determined by the function (fn) field.
                       4. Memory access : None.
                       5. Register write : The result from the ALU is written to rd.

    b)
    Instruction type : Jump Register, R-type
    Type             : R Type
    Example          : jalr rd,rs
    ALU Usage        : 1. PC update : PC <-- rs.
                       2. Source operand fetch : Only the rs register.
                       3. ALU operation : None.
                       4. Memory access : None.
                       5. Register write : rd <-- PC + 4.

    c)
    Instruction type : Immediate
    Type             : I Type
    Example          : addi rt,rs,imm
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and the immediate field. Arithmetic and comparison immediates (addi, addiu, slti, sltiu) are sign extended; logical immediates (andi, ori, xori) are zero extended.
                       3. ALU operation : Determined by the opcode.
                       4. Memory access : None.
                       5. Register write : The result from the ALU is written to rt.

    d)
    Instruction type : Branch
    Type             : I Type
    Example          : beq $rs,$rt, imm
    ALU Usage        : 1. PC update : If the branch condition is true, PC <-- PC + 4 + (sign-extended immediate field << 2).
                       2. Source operand fetch : Sources are rs and rt.
                       3. ALU operation : The source operands are subtracted for comparison.
                       4. Memory access : None.
                       5. Register write : None.

    e)
    Instruction type : Load
    Type             : I Type
    Example          : lw rt, imm(rs)
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and the sign-extended immediate field.
                       3. ALU operation : The two source operands are added to get the memory address.
                       4. Memory access : A memory read control signal is sent to memory. The result from the ALU is sent to memory as the address.
                       5. Register write : The data from memory is written to rt.

    f)
    Instruction type : Store
    Type             : I Type
    Example          : sw rt,imm(rs)
    ALU Usage        : 1. PC update : No update beyond the normal increment.
                       2. Source operand fetch : Sources are rs and the sign-extended immediate field. The rt register is also fetched.
                       3. ALU operation : The two source operands are added to get the memory address.
                       4. Memory access : A memory write control signal is sent to memory. The ALU result is sent to memory as the address, and the contents of rt are sent as the write data.
                       5. Register write : None.

    g)
    Instruction type : Register Jump
    Type             : R Type
    Example          : jr rs / jalr rd,rs
    ALU Usage        : 1. PC update : jr and jalr : PC <-- rs.
                       2. Source operand fetch : Only the rs register.
                       3. ALU operation : None.
                       4. Memory access : None.
                       5. Register write : jr : there is no register write.
                                           jalr : rd <-- PC + 4.

                       *For jalr, the incremented PC value must be captured before the target address is placed into the PC. With edge-triggered clocking this is easy to do: an adder produces the PC + 4 value, and since the PC does not change until the start of the next cycle, the adder output does not change until then either.

    h)
    Instruction type : Non-Register Jump
    Type             : J Type
    Example          : jal target
    ALU Usage        : 1. PC update : j and jal : PC <-- target address.
                       *The target address is the concatenation of the high-order 4 bits of PC + 4, the 26-bit target field of the instruction, and two 0 bits (a sketch follows below).
                       2. Source operand fetch : None.
                       3. ALU operation : None.
                       4. Memory access : None.
                       5. Register write : j : there is no register write.
                                           jal : ra <-- PC + 4.

                       *For jal, the incremented PC value must be captured before the target address is placed into the PC. With edge-triggered clocking this is easy to do: an adder produces the PC + 4 value, and since the PC does not change until the start of the next cycle, the adder output does not change until then either.
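    A sketch of that concatenation in Python (illustrative only):

        def jump_target(pc, target26):
            """j/jal target: top 4 bits of PC+4, the 26-bit field, then '00'."""
            return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

        # A jal at 0x00400018 with target field 0x100009 jumps to 0x00400024
        print(hex(jump_target(0x00400018, 0x100009)))  # 0x400024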


    MIPS (Introduction/ Register Type/Immediate Type/Jump Type)


    MIPS

    The MIPS architecture (which originally stood for “Microprocessor without Interlocked Pipeline Stages”) is a little endian, byte addressable, three-address, fixed-length ISA. It is a load and store architecture, which means only the load and store instructions can access memory. All other instructions must use registers for operands, which implies that this ISA needs a large register set. MIPS is also limited to fixed-length operations (those that operate on data with the same number of bytes).

    MIPS instructions have up to four fields: an opcode, two operand addresses, and one result address. Essentially three instruction formats are available: the I type (immediate), the R type (register), and the J type (jump). R-type instructions have a 6-bit opcode, a 5-bit source register, a 5-bit target register, a 5-bit destination register, a 5-bit shift amount, and a 6-bit function. I-type instructions have a 6-bit opcode, a 5-bit source register, a 5-bit target register or branch condition, and a 16-bit immediate branch displacement or address displacement. J-type instructions have a 6-bit opcode and a 26-bit target address.


    Instruction format

    An instruction is normally made up of a combination of an operation code and some way of specifying an operand, most commonly by its location or address in memory.




    There are 3 types of instruction format
    • Register Type (R-type) 
    • Immediate Type (I-type) 
    • Jump Type (J-type) 


    Register Type (R-type)




    Opcode : Machine code representation of the instruction mnemonic. The opcode field is 6 bits long (bits 31 to 26).

    rs,rt,rd : The numeric representations of the source registers and the destination register. These numbers correspond to the $X representation of a register, such as $0 or $31. Each of these fields is 5 bits long (bits 25 to 21, 20 to 16, and 15 to 11, respectively).

    Shamt : Used with the shift and rotate instructions, this is the amount by which the source operand is rotated/shifted. This field is 5 bits long (bits 10 to 6).

    Funct : For instructions that share an opcode, the funct field contains the necessary control codes to differentiate the instructions. It is 6 bits long (bits 5 to 0).



    Immediate Type (I-type)



    Opcode : The 6-bit opcode of the instruction. In I-type instructions, all mnemonics have a one-to-one correspondence with the underlying opcodes, because there is no funct field to differentiate instructions sharing an opcode (bits 31 to 26).

    rs,rt : The source and destination register operands, respectively. 5 bits each (bits 25 to 21 and 20 to 16, respectively).

    Imm : The 16-bit immediate value, or branch/address displacement (bits 15 to 0).



    Jump Type (J-type)



    Opcode : The 6-bit opcode corresponding to the particular jump command (bits 31 to 26).

    Target (address) : The 26-bit address of the destination (bits 25 to 0).
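    Putting the three formats together, here is a small Python sketch (my own illustration) that slices a 32-bit instruction word into its fields by bit position; which fields are meaningful depends on the format:

        def decode(word):
            """Extract the R-, I- and J-type fields from a 32-bit MIPS word."""
            return {
                "opcode": (word >> 26) & 0x3F,
                "rs":     (word >> 21) & 0x1F,   # R and I types
                "rt":     (word >> 16) & 0x1F,   # R and I types
                "rd":     (word >> 11) & 0x1F,   # R type
                "shamt":  (word >> 6)  & 0x1F,   # R type
                "funct":  word & 0x3F,           # R type
                "imm":    word & 0xFFFF,         # I type
                "target": word & 0x3FFFFFF,      # J type
            }

        # add $t0,$t1,$t2 encodes as 0x012A4020:
        # opcode 0, rs 9, rt 10, rd 8, shamt 0, funct 0x20 (32)
        print(decode(0x012A4020))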












    Monday 2 December 2013

    Cache Misses

    A cache miss occurs when a program accesses a memory location that is not in the cache. The processor then has to wait for the data to be fetched from the next cache level or from main memory before it can continue to execute.
    The performance of an application is directly influenced by cache misses.
    It is hard to tell from the number of misses alone whether cache misses are causing performance problems in an application. The same number of misses will cause a much greater relative slowdown in a short-running application than in a long-running one.
    A more useful metric is the cache miss ratio, that is, the fraction of memory accesses that cause a cache miss. From the miss ratio you can usually tell whether cache misses may be a performance problem in an application.
    The cache miss ratio of an application depends on the size of the cache. A larger cache can hold more cache lines and is therefore expected to get fewer misses.
    The performance impact of a cache miss depends on the latency of fetching the data from the next cache level or main memory. For example, assume that you have a processor with two cache levels. A miss in the L1 cache causes data to be fetched from the L2 cache, which has a relatively low latency, so a quite high L1 miss ratio can be acceptable. A miss in the L2 cache, on the other hand, will cause a long stall while fetching data from main memory, so only a much lower L2 miss ratio is acceptable.

    Write-Through
    On a data-write hit, we could just update the block in the cache
    -but then cache and memory would be inconsistent
    Write-through also updates memory, but makes writes take longer
    -e.g. if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    -Effective CPI = 1 + 0.1 x 100 = 11 (see the sketch below)
    Solution: write buffer
    -Holds data waiting to be written to memory
    -CPU continues immediately and only stalls on a write if the write buffer is already full
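    The arithmetic from the example above, as a tiny Python sketch:

        # base CPI 1, 10% of instructions are stores, 100 cycles per memory write
        base_cpi, store_fraction, write_cycles = 1.0, 0.10, 100
        effective_cpi = base_cpi + store_fraction * write_cycles
        print(effective_cpi)  # 11.0 without a write buffer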

    Write Back

    Alternative: on a data-write hit, just update the block in the cache
    -keep track of whether each block is dirty
    When a dirty block is replaced
    -write it back to memory
    -can use a write buffer to allow the replacing block to be read first

    Accessing Cache

    Total number of bits needed for a cache
    -size of the tag field is
                    -32 - (n + m + 2)
    -the total number of bits in a direct-mapped cache is
                    -2^n x (block size + tag size + valid field size)
    Block size = 2^m words (2^(m+5) bits), 1 bit for the valid field
    Total number of bits in the cache is
    -2^n x (2^m x 32 + (32 - n - m - 2) + 1)
    = 2^n x (2^m x 32 + 31 - n - m)


    Example
    How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?
    -16 KiB is 4K (2^12) words
    -with a block size of 4 words (2^2), there are 1024 (2^10) blocks
    -each block has 4 x 32 = 128 bits of data plus a tag, which is 32 – 10 – 2 – 2 bits
    -plus a valid bit
    -total cache size = 2^10 x (4 x 32 + (32 – 10 – 2 – 2) + 1)
                      = 2^10 x 147
                      = 147 Kbits
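    The same calculation as a small Python function (my own sketch; n is the number of index bits, m the number of word-offset bits):

        def direct_mapped_bits(n, m, addr_bits=32):
            """Total bits: 2^n blocks of 2^m words, plus tag and valid bit."""
            data_bits = (2 ** m) * 32
            tag_bits = addr_bits - n - m - 2
            return (2 ** n) * (data_bits + tag_bits + 1)

        # 16 KiB of data in 4-word blocks: n = 10, m = 2
        print(direct_mapped_bits(10, 2))  # 150528 bits = 2^10 x 147 = 147 Kbits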

    Cache example

    -8-block, 1 word/block, direct mapped
    -initial state







    Direct Mapped Cache

    Let's assume, as we did for fully associative caches, that we have 128 slots and 32 bytes per slot.
    In a direct-mapped cache, we treat the 128 slots as if they were an array of slots. We can index into the array using binary numbers. For 128 slots, you need 7 bits, so the index runs from 0000000 to 1111111 in binary.
    Since we have 128 slots, we need to specify which one the cache line goes in, and this requires lg 128 = 7 bits. We can get these bits directly from the address itself.
    Bits A4-A0 are still the offset. The slot number is the next 7 bits, bits A11-A5. The remaining bits, A31-A12, are the tag.

    Finding the slot
    Suppose you have an address B31-B0 and you want to find it in the cache (see the sketch below).
    Use bits B11-B5 to find the slot.
    See if bits B31-B12 match the tag of the slot.
    If so, get the bytes at offset B4-B0.
    If not, fetch the 32 bytes from memory and place them in the slot, updating the valid bit, dirty bit, and tag as needed.
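    A sketch of the address split in Python (illustrative only, for the 128-slot, 32-bytes-per-slot cache above):

        def split_address(addr):
            """Offset = A4-A0, slot = A11-A5, tag = A31-A12."""
            offset = addr & 0x1F          # low 5 bits: byte within the slot
            slot = (addr >> 5) & 0x7F     # next 7 bits: which of 128 slots
            tag = addr >> 12              # remaining 20 bits
            return tag, slot, offset

        tag, slot, offset = split_address(0x12345678)
        print(hex(tag), hex(slot), hex(offset))  # 0x12345 0x33 0x18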

    Advantages
    If there's an advantage to the scheme, it's that it's very simple. You don't have to simultaneously match tags across all slots; you have just one slot to check. If the data isn't in that slot, it's obvious which slot gets evicted.

    Summary
    A direct-mapped cache scheme makes picking the slot easy. It treats the slots as a large array, and the index into the array is picked from bits of the address, which is why we need the number of slots to be a power of 2; otherwise we couldn't select bits from the address.
    The scheme can suffer from many addresses “colliding” into the same slot, causing the cache line to be repeatedly evicted, even though there may be empty slots that aren't being used, or are being used less frequently.

    Memory Hierarchy


    CPU speed is dominated by memory performance

    -More significant than: ISA, circuit optimization, pipelining, etc.
    Trick 1: Make slow main memory appear faster (caching)
    Trick 2: Make small main memory appear bigger (virtual memory)

    Memory Hierarchy Levels

    Block (aka line): the unit of copying
    -may be multiple words
    If accessed data is present in the upper level
    -access is satisfied by the upper level
    If accessed data is absent
    -miss --> block copied from the lower level
    -then accessed data is supplied from the upper level




    Parking Lot Analogy
    Suppose we have 1000 parking spots. However, instead of being unnumbered, each parking spot is given a 3 digit number from 000 to 999.
    Your parking spot is based on the first 3 digits of your student ID number.
    What problem can occur? If you received your social security number (student ID) in Maryland, the first three digits of your student ID are likely to be in the low 200's.
    Thus, there's a much higher chance someone will be in your parking spot. There's also a chance that parking spots numbered in the 900's might be mostly empty since few students may have student IDs with the first 3 digits in that range.
    This is basically the direct-mapped scheme. The advantages of this scheme are that it's simple to find your parking spot. It can only be in one location.

    Memory Organization

    Overview


    When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
    The cache is usually organized as a hierarchy of multiple cache levels.

    Cache memory
    Cache is a small but high-speed memory used by the central processing unit of a computer to reduce the average time to access memory. The cache stores copies of the data from frequently used main memory locations. Often the main memory will provide a wider data word to the cache than the CPU requires, in order to fill the cache more rapidly. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

    Cache Analogy
    You are writing a term paper for your history class at a table in the library
    -As you work you realize you need a book
    -You stop writing, fetch the reference, continue writing
    -You don’t immediately return the book, maybe you’ll need it again
    -Soon you have a few books at your table, and you can work smoothly without needing to fetch more books from the shelves
    -The table is a cache for the rest of the library
    Now you switch to doing your biology homework
    -You need to fetch your biology textbook from the shelf
    -If your table is full, you need to return one of the history books back to the shelf to make room for the biology book