EE392C: Advanced Topics in Computer Architecture Lecture #11
Polymorphic Processors
Stanford University Handout Date ???
On-line Profiling Techniques
Lecture #11: Tuesday, 6 May 2003
Lecturer: Shivnath Babu, David Bloom, Rohit Gupta
Scribe: John Whaley, Jayanth Gummaraju
1 Introduction
On-line profiling refers to collecting run-time information about a program on the
fly in order to reduce its execution time. Information such as basic block frequencies,
branch behavior, and memory access patterns is collected during run time. A virtual
machine then uses this information to optimize the program on the fly.

The amount of hardware and software effort involved in using profile information can
vary substantially with the implementation. At one end, the profile information can
be used exclusively by a dynamic compiler to perform all the optimizations. At the
other end, a dedicated coprocessor can consume the profile information to reduce
the execution time of the program. In this report, we discuss two papers that use profile
information extensively. First, we discuss TEST [1], a Tracer for Extracting Speculative
Threads in Hydra; TEST uses substantial hardware support to exploit the profile
information. Second, we discuss Relational Profiling [2], which uses queries (assembly-like
instructions) and largely software support to optimize programs.

The rest of the report is organized as follows. Section 2 gives a brief summary of
TEST and Section 3 presents a brief summary of Relational Profiling. Finally, Section 4
covers several issues about on-line profiling that were raised during the class discussion.
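As a rough illustration of the idea, the sketch below counts basic-block executions at run time and flags "hot" blocks for a dynamic compiler to optimize. It is a hypothetical toy model: the names (`on_block_entry`, `HOT_THRESHOLD`) and the threshold value are invented, not taken from either paper.

```python
# Hypothetical sketch of on-line profiling: the runtime counts basic-block
# executions and flags "hot" blocks for the dynamic compiler. The threshold
# and function names are illustrative assumptions.
from collections import Counter

HOT_THRESHOLD = 3  # recompile a block once it has executed this many times

block_counts = Counter()
hot_blocks = set()

def on_block_entry(block_id):
    """Called by the VM each time a basic block is entered."""
    block_counts[block_id] += 1
    if block_counts[block_id] >= HOT_THRESHOLD and block_id not in hot_blocks:
        hot_blocks.add(block_id)  # hand the block off to the dynamic compiler

# Simulated execution trace: block B sits in a loop and becomes hot.
for b in ["A", "B", "B", "B", "C"]:
    on_block_entry(b)

print(sorted(hot_blocks))  # → ['B']
```

A real system would of course sample rather than count every entry, which is exactly the hardware/software trade-off the rest of this report examines.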
2 TEST: A Tracer for Extracting Speculative Threads
TEST (Tracer for Extracting Speculative Threads) provides a hardware mechanism for
analyzing sequential programs with the goal of locating regions with potential for
thread-level speculation (TLS). The paper presents TEST and shows how it can be used with
Hydra, a CMP with built-in TLS support, in Jrpm (the Java Runtime Parallelizing Machine)
to provide on-line profile data that marks candidate regions of code for dynamic
recompilation into speculative threads.

The current Jrpm system uses TEST to identify loop-level parallelism. The two main
analyses it performs are load dependency analysis and speculative state overflow analysis.
The load dependency analysis determines dependency arcs between loop iterations by
comparing timestamps on stores and loads to determine whether a given STL (speculative
thread loop) has dependencies on earlier threads. The speculative state overflow analysis
determines whether a given STL would fit in the speculation hardware elements. Using the
results of these two analyses, speculative threads are chosen based on the greatest
expected speedup and the least likelihood of overflowing the speculation hardware.

The hardware implementation of TEST consists of three main components. First, the
dynamic compiler inserts annotation instructions into the code, which allow important
events to be communicated to the hardware banks. Second, the hardware comparator banks
contain the hardware that performs the timestamp comparisons to calculate the critical
arcs, carries out the state overflow analyses, and stores the results into counters. One
comparator bank is used to trace one STL, and an array of comparator banks allows
multiple STLs to be traced concurrently. Finally, the store buffers that hold writes
during speculative execution are used during profiling to hold the timestamp values
needed for analysis.

The paper found that the actual speedup achieved with the STLs chosen by TEST closely
matched the predicted speedup. The relative speedup (rather than the absolute speedup)
is what matters most when choosing threads to execute speculatively, and TEST did a good
job of this in the benchmarks that were run on it. The accuracy and predictability of
TEST show promising results for the use of on-line profiling to extract TLS.
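The timestamp-comparison idea can be approximated in software. The sketch below is an illustrative model only — in TEST the comparisons happen in dedicated hardware comparator banks — and the function names and the trace are invented. It tags each store with the loop iteration that produced it; a load that reads an address last written by an earlier iteration creates an inter-iteration dependency arc.

```python
# Illustrative software model of TEST's load dependency analysis.
# In the real system this is done by hardware comparator banks; the
# names (trace_store, trace_load) are hypothetical.
last_store_iter = {}     # address -> iteration of the most recent store
dependency_arcs = set()  # (producer_iteration, consumer_iteration) pairs

def trace_store(addr, iteration):
    last_store_iter[addr] = iteration

def trace_load(addr, iteration):
    producer = last_store_iter.get(addr)
    if producer is not None and producer < iteration:
        # This load depends on a value produced by an earlier iteration.
        dependency_arcs.add((producer, iteration))

# A loop where each iteration reads the value stored by the previous one:
for i in range(3):
    trace_load("x", i)   # reads x (written by iteration i-1, if any)
    trace_store("x", i)  # writes x for iteration i+1

print(sorted(dependency_arcs))  # → [(0, 1), (1, 2)]
```

The resulting arcs are exactly what Jrpm would weigh when estimating whether speculating on this loop is worthwhile: tight iteration-to-iteration arcs like these limit the achievable speedup.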

3 Relational Profiling: Enabling Thread Level Parallelism in Virtual Machines

This paper discusses hardware techniques for profiling based on a co-designed virtual
machine. The authors propose a relational profiling architecture (RPA), consisting of an
assembly language and the required hardware, and a corresponding relational profiling
model (RPM).

The RPM consists of two basic kinds of queries: instruction-based queries, where all
events related to a certain instruction are recorded, and event-based queries, where all
instructions related to a certain event are recorded. In addition, they propose to
support hybrid queries. Each query, defined in the RPA assembly language, contains four
pieces of information: the records of information to be collected, the rate of
collection, the selection criteria applied to records, and the action to be taken. The
type of information collected can be either architectural (PC, thread ID, operand
values) or implementation-level (fetch/dispatch/issue rates, latency, branch outcome).
Actions communicate the information to the VM (e.g., through messages).

The hardware implementation includes a Profile Control Table, set by the underlying VM,
that stores the PC of the query instruction and the information that is to be collected.
The information collected from the processor pipeline is passed on to the Query Engine,
which is itself a 4-stage pipeline that performs the comparisons and actions specified
by the query. The limit on the number of instructions that can be
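A query in this model can be pictured as a small structure bundling those four pieces of information. The sketch below is a hypothetical software rendering of the RPM idea; the field names, the rate semantics (sample one in every N events), and the event format are assumptions, not the paper's actual encoding.

```python
# Hypothetical software model of an RPM-style query: records to collect,
# a collection rate, a selection criterion, and an action. All names and
# the event format are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProfileQuery:
    fields: list                     # which record fields to collect
    rate: int                        # sample 1 out of every `rate` events
    select: Callable[[dict], bool]   # selection criterion over a record
    action: Callable[[dict], None]   # e.g., send a message to the VM
    _seen: int = 0

    def observe(self, event: dict):
        self._seen += 1
        if self._seen % self.rate != 0:
            return                                 # not sampled this time
        record = {f: event[f] for f in self.fields}
        if self.select(record):
            self.action(record)                    # report to the VM

reported = []
# Example query: record the PC and latency of sampled long-latency loads.
q = ProfileQuery(fields=["pc", "latency"], rate=2,
                 select=lambda r: r["latency"] > 10,
                 action=reported.append)

for ev in [{"pc": 0x40, "latency": 2}, {"pc": 0x44, "latency": 50},
           {"pc": 0x48, "latency": 60}, {"pc": 0x4C, "latency": 1}]:
    q.observe(ev)

print(reported)  # only the sampled, selected record(s)
```

In the paper this filtering is done in hardware (the Query Engine pipeline), precisely so that software only sees the records that survive sampling and selection.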

  • How much hardware support is needed for profiling?

One problem is that the more hardware you have, the more profile information you collect, and managing that information can itself become a problem: there is a cost associated with getting better profile data. We need a mechanism to decide when to trigger the software based on the profile information, since the software might become a bottleneck or drop information. One solution is to move the software task entirely into hardware. Another is to keep the software but summarize the information in hardware before passing it along.

  • How can a static compiler help in on-line profiling?

[Figure 1: Hardware software interaction for profiling. Diagram with boxes for the static compiler, the dynamic compiler, and the hardware.]

A static compiler can communicate to the dynamic compiler the optimizations that it either failed to perform or performed only conservatively. The dynamic compiler, armed with the run-time profile information, can then use what the static compiler provided to perform more aggressive optimizations. This works especially well because the time spent on dynamic compilation is reduced and, in addition, the static compiler can relate its information back to the source code. Figure 1 illustrates this interaction.

  • How can events be associated with locations in the program? For example, for which region of memory was there a cache miss? At which program counter did it occur?

An inexpensive way of gathering and summarizing information about memory and branch behavior is to periodically dump the cache tables and branch prediction tables. The hardware probably already maintains a good summary of memory access patterns (stride, etc.) in its prefetch units, and we could reuse this hardware for gathering and summarizing profile statistics.
  • What are the issues in collecting profile information with respect to the changing behavior of the program?

When collecting profile data, it is important to be aware of different program phases, especially the initialization phase. It is probably a good idea to turn off profiling at the beginning of the program because there is a lot of noise during initialization. The hardware can help detect phase changes: it can identify dramatic changes in the branch predictor miss rate, cache miss rate, etc., and flag these as phase changes.
  • Can the virtual machine automatically find the bottleneck in the program? Also, how can the code be associated with the profile data?

TEST is clever about this: it not only finds a problem, it also tests whether the problem is worth solving. If there is a problem with associativity, we can switch to another associativity. In general, though, “what-if” analysis is very hard. It is relatively easy for systems like databases, but machines are complicated; it is a general optimization problem, and the model is extremely complex, with many variables. The hardware configuration space is small, so that part of the problem may be more manageable, but doing this in general for compilers is much harder.
  • What is the role of the VM during profiling?

The hardware can determine what is important. One technique is to start by having the VM say “tell me which events are frequent” and then ask “is my optimization useful or not?” Another technique is to use information from the static compiler: the static compiler can direct the hardware to validate its decisions. The hardware communicates information to the dynamic compiler, and the dynamic compiler can make changes to the code. Dynamic compilers typically introduce run-time overhead, and using information from the static compiler is one way of reducing it.

  • Is it possible to obtain more information than what can be obtained via cycle-accurate simulation?

One problem is that I/O, the OS, etc. are hard to model accurately in simulation. With on-line profiling we can instead do continuous profiling on actual applications, and the user can use the profiler easily. Another approach is to use a VMware-like solution to profile more easily (relational profiling takes this approach). We need a virtual machine for dynamic compilation anyway.
  • What are some of the issues in developing a profiling infrastructure for CMPs?

The first question is what we want to get out of profiling. We can use it to identify parallelism or effective (or ineffective) speculation. We can also use it to dynamically reconfigure the CMP to match the characteristics of the application. What can we profile? A couple of possibilities are inter-thread dependencies and memory access patterns.
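One of the ideas raised above — having the hardware flag dramatic swings in miss rates as phase changes — can be sketched in software. This is a hypothetical illustration: the sampling interval and the threshold are invented, not from the class discussion.

```python
# Hypothetical phase-change detector: sample the cache miss rate at fixed
# intervals and flag a phase change when it moves by more than a threshold
# relative to the previous interval. The threshold is an assumed value.
PHASE_DELTA = 0.10  # absolute change in miss rate that signals a new phase

def detect_phase_changes(miss_rates):
    """Return the interval indices at which a phase change is flagged."""
    changes = []
    for i in range(1, len(miss_rates)):
        if abs(miss_rates[i] - miss_rates[i - 1]) > PHASE_DELTA:
            changes.append(i)
    return changes

# Noisy initialization, then a stable loop phase, then a new phase:
rates = [0.40, 0.15, 0.14, 0.15, 0.35, 0.34]
print(detect_phase_changes(rates))  # → [1, 4]
```

A VM using this signal would, for example, discard profile data gathered before the first flagged interval (the noisy initialization phase) and re-profile after each subsequent flag.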