Computer Architecture Assignment (Exercises, Distributed Computing)
Type: Exercises. Year: 2017. Shared on 08/12/2021 by Mofeue.
CSCI 510: Computer Architecture
Written Assignment 3
Due November 10, 2016 23:59:59PM
1. (5 pts) Assume a hypothetical GPU with the following characteristics:
Clock rate 1.5 GHz
Contains 16 SIMD processors, each containing 32 single-precision floating point units
Has 100 GB/sec off-chip memory bandwidth
Without considering memory bandwidth, what is the peak single-precision floating-point
throughput for this GPU in GFLOP/sec, assuming that all memory latencies can be hidden? Is this
throughput sustainable given the memory bandwidth limitation?
Answer:
The peak single-precision floating-point throughput is
1.5 × 16 × 32 = 768 GFLOPS/s
However, assuming each single-precision operation requires two four-byte operands and outputs
one four-byte result, sustaining this throughput (assuming no temporal locality) would require
memory bandwidth of
12 bytes/FLOP × 768 GFLOPs/s = 9.216 TB/s
Since 9.216 TB/s >> 100 GB/s, this throughput is not sustainable, but can still be achieved in short
bursts when using on-chip cache.
2. The following code multiplies two vectors that contain single-precision complex values:
for (i = 0; i < 400; ++ i) {
u[i] = a[i] * b[i] - c[i] * d[i];
v[i] = a[i] * d[i] + c[i] * b[i];
}
Assume that the processor runs at 1GHz and has a maximum vector length of 64. The load/store unit
has a start-up overhead of 15 cycles; the multiply unit, 8 cycles; and the add/subtract unit, 5 cycles.
a. (5 pts) What is the arithmetic intensity of this piece of code? Justify your answer.
b. (10 pts) Convert this loop into VMIPS assembly code using strip mining.
c. (15 pts) Assuming chaining and a single memory pipeline, how many chimes are required? How
many clock cycles are required per result value, including start-up overhead?
d. (15 pts) Now assume chaining and three memory pipelines. If there are no bank conflicts in the
loop’s accesses, how many clock cycles are required per result, including start-up overhead?
Answer:

a. Every single-precision number takes 4 bytes. In each iteration, the code reads 4 × 4 = 16 bytes from main memory (the second access of a[i], b[i], c[i], and d[i] comes from the cache) and writes 2 × 4 = 8 bytes to main memory, 24 bytes in total. Each iteration executes six floating-point operations. Therefore, the arithmetic intensity is 6 / 24 = 1/4 = 0.25.

b.
      li      $VL,16        # perform the first 16 ops
      li      $r1,0         # initialize index
loop: lv      $v1,a+$r1     # load a
      lv      $v3,b+$r1     # load b
      mulvv.s $v5,$v1,$v3   # a * b
      lv      $v2,c+$r1     # load c
      lv      $v4,d+$r1     # load d
      mulvv.s $v6,$v2,$v4   # c * d
      subvv.s $v5,$v5,$v6   # ab - cd
      sv      $v5,u+$r1     # store u
      mulvv.s $v5,$v1,$v4   # a * d
      mulvv.s $v6,$v2,$v3   # c * b
      addvv.s $v5,$v5,$v6   # ad + cb
      sv      $v5,v+$r1     # store v
      bnez    $r1,else      # check if first iteration
      li      $VL,64        # perform 64 ops for every later iteration
      addi    $r1,$r1,#64   # first iteration, increment by 16*4 = 64
      j       loop          # guaranteed next iteration
else: addi    $r1,$r1,#256  # not first iteration, increment by 64*4 = 256
skip: blt     $r1,1600,loop # next iteration?

c.

  1. lv               # load a
  2. lv               # load b
  3. mulvv.s  lv      # a * b, load c
  4. lv       mulvv.s # load d, c * d
  5. subvv.s  sv      # ab - cd, store u
  6. mulvv.s          # a * d
  7. mulvv.s          # c * b
  8. addvv.s  sv      # a * d + c * b, store v

It takes 400 / 64 (rounded up) × 6 = 7 × 6 = 42 chimes. In the first iteration, the first two chimes take 2 × (15 + 16) = 62 cycles. The next six chimes take (8 + 15 + 16) × 2 + (5 + 15 + 16) + (8 + 16) × 2 + (5 + 15 + 16) = 198 cycles. In the other iterations, the first two chimes take 2 × (15 + 64) = 158 cycles.

The total number of cycles taken is 54 + 39 + 39 + 48 + 87 * 6 + (87 + 72 + 87 + 144) * 6 + 87 = 3129 cycles. The number of cycles per result = 3129 / 400 = 7.8225 cycles.

  3. (7 pts) Convert the loop in Problem 2 into MIPS SIMD. (Refer to the example on pages 284-285 of the textbook.)
     Answer:
        LA      R1, a          ; load base address of a
        LA      R2, b          ; load base address of b
        LA      R3, c          ; load base address of c
        LA      R4, d          ; load base address of d
        LA      R5, u          ; load base address of u
        LA      R6, v          ; load base address of v
        DADDIU  R7, R1, #1600  ; end address: 400 elements * 4 bytes
LOOP:   L.4S    F0, 0(R1)      ; load a[i]..a[i+3]
        L.4S    F4, 0(R2)      ; load b[i]..b[i+3]
        MUL.4S  F16, F4, F0    ; a * b
        L.4S    F8, 0(R3)      ; load c[i]..c[i+3]
        L.4S    F12, 0(R4)     ; load d[i]..d[i+3]
        MUL.4S  F20, F12, F8   ; c * d
        SUB.4S  F16, F16, F20  ; a * b - c * d
        S.4S    F16, 0(R5)     ; store u[i]..u[i+3]
        MUL.4S  F16, F12, F0   ; a * d
        MUL.4S  F20, F8, F4    ; c * b
        ADD.4S  F20, F20, F16  ; a * d + c * b
        S.4S    F20, 0(R6)     ; store v[i]..v[i+3]
        DADDIU  R1, R1, #16    ; advance four 4-byte elements
        DADDIU  R2, R2, #16
        DADDIU  R3, R3, #16
        DADDIU  R4, R4, #16
        DADDIU  R5, R5, #16
        DADDIU  R6, R6, #16
        DSUBU   R8, R7, R1
        BNEZ    R8, LOOP
  4. (7 pts) Convert the loop in Problem 2 into CUDA.
     Answer:
// Invoke remi() with 64 threads per Thread Block
__host__
int nblocks = (400 + 63) / 64;
remi<<<nblocks, 64>>>(400, a, b, u, c, d, v);

// remi in CUDA
__device__
void remi(int n, double *a, double *b, double *u,
          double *c, double *d, double *v)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        u[i] = a[i] * b[i] - c[i] * d[i];
        v[i] = a[i] * d[i] + c[i] * b[i];
    }
}

  5. A reduction is a special type of loop recurrence. An example is shown below:

for (dot = 0.0, i = 0; i < 64; ++ i)
    dot = dot + a[i] * b[i];

A vectorizing compiler might apply a transformation called scalar expansion, which expands dot into a vector and splits the loop such that the multiply can be performed with a vector operation, leaving the reduction as a separate scalar operation:

for (i = 0; i < 64; ++ i)
    dot[i] = a[i] * b[i];
for (i = 1; i < 64; ++ i)
    dot[0] = dot[0] + dot[i];

As mentioned in Section 4.5, if we allow the floating-point addition to be associative, there are several techniques available for parallelizing the reduction.

a. (7 pts) One technique is called recurrence doubling, which adds sequences of progressively shorter vectors (i.e., two 32-element vectors, then two 16-element vectors, and so on). Show how the C code would look for executing the second loop in this way.

b. (7 pts) In some vector processors, the individual elements within the vector registers are addressable. In this case, the operands to a vector operation may be two different parts of the same vector register. This allows another solution for the reduction, called partial sums. The idea is to reduce the vector to m sums, where m is the total latency through the vector functional unit, including the operand read and write times. Assume that the VMIPS vector registers are addressable (e.g., you can initiate a vector operation with the operand V1(16), indicating that the input operand begins with element 16). Also, assume that the total latency for adds, including the operand read and result write, is eight cycles. Write a VMIPS code sequence that reduces the contents of V1 to eight partial sums.

c. (7 pts) When performing a reduction on a GPU, one thread is associated with each element in the input vector. The first step is for each thread to write its corresponding value into shared memory. Next, each thread enters a loop that adds each pair of input values. This reduces the number of elements by half after each iteration, meaning that the number of active threads also reduces by half after each iteration. To maximize the performance of the reduction, the number of fully populated warps should be maximized throughout the course of the loop. In other words, the active threads should be contiguous. Also, each thread should index the shared
  6. In this exercise, we will examine several loops and analyze their potential for parallelization.

a. (5 pts) Does the following loop have a loop-carried dependency?

for (i = 0; i < 100; ++ i) {
    A[i] = B[2*i+4];
    B[4*i+5] = A[i];
}

b. (8 pts) In the following loop, find all the true dependences, output dependences, and antidependences. Eliminate the output dependences and antidependences by renaming.

for (i = 0; i < 100; ++ i) {
    A[i] = A[i] * B[i]; /* S1 */
    B[i] = A[i] + c;    /* S2 */
    A[i] = C[i] * c;    /* S3 */
    C[i] = D[i] * A[i]; /* S4 */
}

c. (5 pts) Consider the following loop:

for (i = 0; i < 100; ++ i) {
    A[i] = A[i] + B[i];   /* S1 */
    B[i+1] = C[i] + D[i]; /* S2 */
}

Are there dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.

Answer:

a. To check whether a loop-carried dependency exists between B[2*i+4] and B[4*i+5], we use the GCD test. A dependency may exist only if GCD(2, 4) divides 5 - 4. Since GCD(2, 4) = 2 and (5 - 4) mod 2 = 1, GCD(2, 4) does not divide 5 - 4. Therefore, no loop-carried dependency exists for B[]. The same index for A[] is used throughout the loop body, so no loop-carried dependency exists for A[] either.

b. True dependences:
   S1 and S2 through A[i]
   S3 and S4 through A[i]
Output dependences:
   S1 and S3 through A[i]
Anti-dependences:
   S1 and S2 through B[i]
   S2 and S3 through A[i]
   S3 and S4 through C[i]
Rewritten code:

/* Assume A1, B1, C1 are copies of A, B, C */
for (i = 0; i < 100; i++) {
    X[i] = A1[i] * B1[i]; /* S1 */
    B[i] = X[i] + c;      /* S2 */
    A[i] = C1[i] * c;     /* S3 */
    C[i] = D[i] * A[i];   /* S4 */
}

c. There is an anti-dependence between iteration i and i+1 for array B.
Rewritten code:

A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];