EE392C: Advanced Topics in Computer Architecture. Lecture #11. Polymorphic Processors. Stanford University. Handout Date ???
Lecture #11: Tuesday, 6 May 2003
Lecturer: Shivnath Babu, David Bloom, Rohit Gupta
Scribe: John Whaley, Jayanth Gummaraju
1 Introduction

On-line profiling is a technique for collecting run-time information about a program on the fly in order to decrease its execution time. Information such as basic block frequencies, branch behavior, and memory access patterns is collected during execution and used by a virtual machine to optimize the program on the fly. The amount of hardware and software effort involved in using profile information can vary substantially with the implementation. At one end of the spectrum, the profile information is used exclusively by a dynamic compiler to perform all the optimizations; at the other end, a dedicated coprocessor consumes the profile information to reduce the execution time of the program.

In this report, we discuss two papers that use profile information extensively. First, we discuss TEST [1], a Tracer for Extracting Speculative Threads in Hydra, which relies on substantial hardware support to exploit profile information. Second, we discuss Relational Profiling [2], which uses queries (assembly-like instructions) and largely software support to optimize programs.

The rest of the report is organized as follows. Section 2 gives a brief summary of TEST and Section 3 presents a brief summary of Relational Profiling. Finally, Section 4 discusses several issues about on-line profiling that were raised during class.
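To make the basic idea concrete, here is a minimal sketch of the kind of basic-block frequency profiling described above, as it might look inside a VM. The `BlockProfiler` class and its method names are hypothetical, not from either paper.

```python
from collections import Counter

class BlockProfiler:
    """Hypothetical sketch: count how often each basic block executes."""
    def __init__(self):
        self.block_counts = Counter()

    def on_block_entry(self, block_id):
        # Called by the VM each time control enters a basic block.
        self.block_counts[block_id] += 1

    def hot_blocks(self, threshold):
        # Blocks executed at least `threshold` times are candidates
        # for on-the-fly optimization by the VM.
        return [b for b, n in self.block_counts.items() if n >= threshold]

profiler = BlockProfiler()
for block in ["entry", "loop", "loop", "loop", "exit"]:
    profiler.on_block_entry(block)
print(profiler.hot_blocks(threshold=3))  # ['loop']
```

The same counting could be done in software (dynamic compiler) or offloaded to dedicated hardware, which is exactly the spectrum the two papers below explore.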
2 TEST

TEST (Tracer for Extracting Speculative Threads) provides a hardware mechanism for analyzing sequential programs with the goal of locating regions with potential for thread-level speculation (TLS). The paper presents TEST and shows how it can be used with Hydra, a CMP with built-in TLS support, in Jrpm (the Java Runtime Parallelizing Machine) to provide on-line profile data that marks candidate regions of code for dynamic recompilation into speculative threads. The current Jrpm system uses TEST to identify loop-level parallelism.

The two main analyses TEST performs are load dependency analysis and speculative state overflow analysis. The load dependency analysis determines dependency arcs between loop iterations by comparing timestamps on stores and loads to determine whether a given STL (speculative thread loop) has dependencies on earlier threads. The speculative state overflow analysis determines whether a given STL would fit in the speculation hardware elements. Using the results of these two analyses, speculative threads are chosen for the greatest expected speedup and the least likelihood of overflowing the speculation hardware.

The hardware implementation of TEST consists of three main components. First, the dynamic compiler inserts annotation instructions into the code, which communicate important events to the hardware. Second, the hardware comparator banks perform the timestamp comparisons for the critical-arc and state overflow analyses and store the results in counters. One comparator bank traces one STL, and an array of comparator banks allows multiple STLs to be traced concurrently. Finally, the store buffers that hold writes during speculative execution are reused during profiling to hold the timestamp values needed for analysis.

The paper found that the actual speedup achieved with the STLs chosen by TEST closely matched the predicted speedup. The relative speedup (rather than the absolute speedup) is what matters most when choosing threads to execute speculatively, and TEST predicted it well in the benchmarks that were run. The accuracy and predictability of TEST are promising results for the use of on-line profiling to extract TLS.
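The load dependency analysis can be illustrated with a small software model. This is not TEST's hardware design; it is a simplified sketch in which explicit iteration numbers stand in for the timestamps that TEST's comparator banks compare, and a dependency arc is reported whenever a load reads an address last written by an earlier iteration.

```python
def find_cross_iteration_deps(trace):
    """Flag loads that read values stored by an earlier loop iteration.

    `trace` is a list of (iteration, op, address) tuples -- a simplified
    stand-in for the timestamped stores and loads that TEST's comparator
    banks would examine in hardware.
    """
    last_store_iter = {}  # address -> iteration of the most recent store
    deps = []             # (producer_iter, consumer_iter, address) arcs
    for iteration, op, addr in trace:
        if op == "store":
            last_store_iter[addr] = iteration
        elif op == "load":
            producer = last_store_iter.get(addr)
            if producer is not None and producer < iteration:
                # A cross-iteration dependency: this STL would have to
                # wait for (or be squashed by) an earlier thread.
                deps.append((producer, iteration, addr))
    return deps

trace = [(0, "store", 0x10),  # iteration 0 writes 0x10
         (1, "load", 0x10),   # iteration 1 reads it -> dependency arc
         (1, "store", 0x20),
         (1, "load", 0x20)]   # same-iteration reuse -> no arc
print(find_cross_iteration_deps(trace))  # [(0, 1, 16)]
```

A loop whose trace yields few such arcs (and whose speculative state fits the buffers) is the kind of STL that TEST would rank highly for speculative execution.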
3 Relational Profiling

This paper discusses hardware techniques for profiling based on a co-designed virtual machine. The authors propose a relational profiling architecture (RPA), consisting of an assembly language and the required hardware, and a corresponding relational profiling model (RPM). The RPM supports two basic kinds of queries: instruction-based queries, where all events related to a certain instruction are recorded, and event-based queries, where all instructions related to a certain event are recorded. In addition, they propose to support hybrid queries.

Each query, defined in the RPA assembly language, contains four pieces of information: the records of information to be collected, the rate of collection, the selection criteria applied to records, and the action to be taken. The information collected can be either architectural (PC, thread ID, operand values) or implementation-specific (fetch/dispatch/issue rates, latency, branch outcome). Actions communicate the information to the VM (e.g., through messages).

The hardware implementation includes a Profile Control Table, set by the underlying VM, which stores the PC of the query instruction and the information to be collected. The information collected from the processor pipeline is passed on to the Query Engine, itself a 4-stage pipeline that performs the comparisons and actions specified by the query. The limit on the number of instructions that can be
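The four pieces of query information can be sketched as a small data structure. This is an illustrative software model, not the RPA assembly encoding; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProfileQuery:
    """Illustrative stand-in for an RPA query's four pieces of information."""
    records: list        # which fields to collect, e.g. ["pc", "latency"]
    rate: int            # sample every Nth matching event
    predicate: callable  # selection criterion applied to each record
    action: callable     # what to do with it, e.g. message the VM
    _seen: int = 0

    def observe(self, event):
        # Hardware-side filtering: sample at the given rate, then apply
        # the selection predicate before taking the action.
        self._seen += 1
        if self._seen % self.rate != 0:
            return
        record = {k: event[k] for k in self.records}
        if self.predicate(record):
            self.action(record)

messages = []  # stands in for messages delivered to the VM
q = ProfileQuery(records=["pc", "latency"], rate=1,
                 predicate=lambda r: r["latency"] > 10,
                 action=messages.append)
for ev in [{"pc": 0x400, "latency": 3}, {"pc": 0x404, "latency": 20}]:
    q.observe(ev)
print(messages)  # [{'pc': 1028, 'latency': 20}]
```

Filtering by rate and predicate before taking any action mirrors the point of the RPA design: raw pipeline events are winnowed in hardware so that only records the VM actually asked about reach software.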
4 Discussion

One problem is that the more profiling hardware you have, the more profile information you collect, and managing that information can itself become a problem: there is a cost associated with getting better profile data. We also need a mechanism to decide when to trigger the software based on the profile information; otherwise the software may become a bottleneck or drop information. One solution is to move the software task entirely into hardware. Another is to keep the software, but summarize the information in hardware before passing it up.
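The second solution, summarizing in hardware before involving software, might look like the following sketch: a saturating counter table that hands the software layer one compact summary per saturation instead of a stream of raw events. The class and callback are hypothetical.

```python
class SummarizingBuffer:
    """Hypothetical sketch: aggregate raw profile events in 'hardware'
    and invoke the software layer only when a counter saturates."""
    def __init__(self, saturate_at, flush):
        self.saturate_at = saturate_at
        self.flush = flush      # software callback, invoked rarely
        self.counters = {}

    def record(self, event_key):
        n = self.counters.get(event_key, 0) + 1
        if n >= self.saturate_at:
            self.flush(event_key, n)  # one summary instead of n events
            n = 0                     # counter resets after flushing
        self.counters[event_key] = n

summaries = []
buf = SummarizingBuffer(saturate_at=4,
                        flush=lambda k, n: summaries.append((k, n)))
for _ in range(9):  # nine raw events ...
    buf.record("branch@0x400")
print(summaries)    # ... produce only two software interactions
```

The software sees two summaries for nine events; a larger saturation threshold trades profile freshness for even less software overhead, which is exactly the cost/fidelity trade-off raised in class.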
Figure 1: Hardware-software interaction for profiling.
A static compiler can communicate to the dynamic compiler the optimizations that it either failed to perform or performed conservatively. The dynamic compiler, armed with run-time profile information, can then use the static compiler's hints to perform more aggressive optimizations. This works out especially well because not only is the time spent in dynamic compilation reduced, but the static compiler can also supply information tied back to the source code. Figure 1 illustrates this interaction.
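This interaction can be sketched as follows. The hint and profile formats here are invented for illustration; the idea is only that the dynamic compiler revisits sites the static compiler flagged, once profile data resolves the static unknowns.

```python
# Hypothetical sketch: the static compiler records optimizations it had
# to skip or apply conservatively; the dynamic compiler revisits them
# once run-time profile data is available.
static_hints = {
    "loop_17": "could not prove iterations independent",
    "call_42": "target unknown at compile time",
}

profile = {
    "loop_17": {"cross_iteration_deps": 0},       # no deps observed
    "call_42": {"dominant_target": "Foo.bar"},    # one target dominates
}

def replan(hints, profile):
    decisions = []
    for site, reason in hints.items():
        data = profile.get(site, {})
        if site.startswith("loop") and data.get("cross_iteration_deps") == 0:
            # What the static compiler could not prove, the profile suggests.
            decisions.append((site, "parallelize speculatively"))
        elif site.startswith("call") and "dominant_target" in data:
            decisions.append((site, f"inline {data['dominant_target']}"))
    return decisions

print(replan(static_hints, profile))
```

Because the dynamic compiler only examines the flagged sites rather than re-analyzing the whole program, the hints directly cut dynamic compilation time, as noted above.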
The hardware can determine what is important. One technique is for the VM to start by asking the hardware to report frequent events, and then to ask whether a given optimization was useful. Another technique is to use information from the static compiler: the static compiler can direct the hardware to validate its decisions. The hardware communicates information to the dynamic compiler, and the dynamic compiler makes changes to the code. Dynamic compilers typically add run-time overhead, and using information from the static compiler is one way of reducing it.