US20050071841A1 - Methods and apparatuses for thread management of mult-threading

Publication number
US20050071841A1
Authority
US
United States
Prior art keywords
thread
threads
resources
helper
current thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/676,581
Inventor
Gerolf Hoflehner
Shih-Wei Liao
Xinmin Tian
Hong Wang
Daniel Lavery
Perry Wang
Dongkeun Kim
Milind Girkar
John Shen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Priority to US10/676,581 (US20050071841A1)
Assigned to INTEL CORPORATION. Assignment of assignors interest (see document for details). Assignors: WANG, PERRY; GIRKAR, MILIND; KIM, DONGKEUN; TIAN, XINMIN; WANG, HONG; HOFLEHNER, GEROLF F.; LAVERY, DANIEL M.; LIAO, SHIH-WEI; SHEN, JOHN P.
Priority to US10/779,193 (US7398521B2)
Priority to CN200480027177A (CN100578453C)
Priority to JP2006527169A (JP4528300B2)
Priority to DE602004026750T (DE602004026750D1)
Priority to EP04785288A (EP1668500B1)
Priority to AT04785288T (ATE465446T1)
Priority to PCT/US2004/032075 (WO2005033936A1)
Publication of US20050071841A1
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/44: Encoding
    • G06F8/441: Register allocation; Assignment of physical memory space to logical memory space

Definitions

  • Embodiments of the invention relate to information processing systems; and more specifically, to thread management for multi-threading.
  • Simultaneous multithreading (SMT) can improve the performance of applications that are multithreaded. However, SMT does not directly improve the performance, in terms of reducing latency, of single-threaded applications. Since the majority of desktop applications in the traditional PC environment are still single-threaded, it is important to investigate if and how SMT resources can be exploited to enhance single-threaded code performance by reducing its latency. In addition, current compilers typically cannot automatically allocate resources for the threads they create.
  • FIG. 1 illustrates a computer system having multi-threading capability according to one embodiment.
  • FIG. 2 illustrates a computer system having multi-threading capability according to an alternative embodiment.
  • FIG. 3 illustrates a computer system having a compiler capable of generating a helper thread according to one embodiment.
  • FIG. 4A illustrates a typical symmetric multi-threading process.
  • FIG. 4B illustrates an asymmetric multi-thread process according to one embodiment.
  • FIG. 5 is a flow diagram illustrating an exemplary process for executing one or more helper threads according to one embodiment.
  • FIG. 6 is a block diagram illustrating exemplary software architecture of a multi-threading system according to one embodiment.
  • FIG. 7 is a flow diagram illustrating an exemplary process for generating a helper thread according to one embodiment.
  • FIG. 8 is a flow diagram illustrating an exemplary process for parallelization analysis according to one embodiment.
  • FIGS. 9A-9C show pseudo code for an application, a main thread, and a helper thread according to one embodiment.
  • FIG. 10 is a block diagram illustrating an exemplary thread configuration according to one embodiment.
  • FIG. 11 is a block diagram illustrating exemplary pseudo code for allocating resources for the threads according to one embodiment.
  • FIG. 12 is a block diagram illustrating an exemplary resource data structure containing resource information for the threads according to one embodiment.
  • FIG. 13 is a flow diagram illustrating an exemplary process for allocating resources for threads according to one embodiment.
  • FIGS. 14A-14D show results of a variety of benchmark tests using embodiments of the techniques.
  • A compiler, also referred to as AutoHelper, implements thread-based prefetching helper threads on a multi-threading system, such as, for example, the Intel Pentium™ 4 Hyper-Threading systems available from Intel Corporation.
  • the compiler automates the generation of helper threads for Hyper-Threading processors.
  • the techniques focus on identifying and generating helper threads of minimal size that can be executed to achieve timely and effective data prefetching, while incurring minimal communication overhead.
  • a runtime system is also implemented to efficiently manage the helper threads and the synchronization between threads. Consequently, helper threads are able to issue timely prefetches for the sequential pointer-intensive applications.
  • register contexts may be managed for helper threads within a compiler.
  • the register set may be statically or dynamically partitioned between the main thread and helper threads, and between multiple helper threads.
  • the live-in/live-out register copies via memory for threads may be avoided, and the threads may be destroyed at compile time, when the compiler runs out of resources, or at runtime, when infrequent cases of certain main-thread events occur.
  • Embodiments of the present invention also relate to apparatuses for performing the operations described herein.
  • An apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs) such as Dynamic RAM (DRAM), erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each of the above storage components is coupled to a computer system bus.
  • a machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
  • a machine-readable medium includes read only memory (“ROM”); random access memory (“RAM”); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.
  • FIG. 1 is a block diagram of an exemplary computer which may be used with an embodiment.
  • exemplary system 100 shown in FIG. 1 may perform the processes shown in FIGS. 5-8.
  • Exemplary system 100 may be a multi-threading system, such as an Intel Pentium™ 4 Hyper-Threading system.
  • Exemplary system 100 may be a simultaneous multithreading (SMT) or chip multiprocessing (CMP) enabled system.
  • While FIG. 1 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the present invention. It will also be appreciated that network computers, handheld computers, cell phones, and other data processing systems which have fewer components or perhaps more components may also be used with the present invention.
  • the computer system 100, which is a form of a data processing system, includes a bus 102 which is coupled to a microprocessor 103 and a ROM 107, a volatile RAM 105, and a non-volatile memory 106.
  • the microprocessor 103, which may be a Pentium processor from Intel Corporation or a PowerPC processor from Motorola, Inc., is coupled to cache memory 104 as shown in the example of FIG. 1.
  • the bus 102 interconnects these various components and also connects components 103, 107, 105, and 106 to a display controller and display device 108, as well as to input/output (I/O) devices 110, which may be mice, keyboards, modems, network interfaces, printers, and other devices which are well-known in the art.
  • the input/output devices 110 are coupled to the system through input/output controllers 109 .
  • the volatile RAM 105 is typically implemented as dynamic RAM (DRAM) which requires power continuously in order to refresh or maintain the data in the memory.
  • the non-volatile memory 106 is typically a magnetic hard drive, a magneto-optical drive, an optical drive, or a DVD RAM or other type of memory system which maintains data even after power is removed from the system.
  • Typically, the non-volatile memory will also be a random access memory, although this is not required. While FIG. 1 shows that the non-volatile memory is a local device coupled directly to the rest of the components in the data processing system, it will be appreciated that the present invention may utilize a non-volatile memory which is remote from the system, such as a network storage device which is coupled to the data processing system through a network interface such as a modem or Ethernet interface.
  • the bus 102 may include one or more buses connected to each other through various bridges, controllers, and/or adapters, as is well-known in the art.
  • the I/O controller 109 includes a USB (Universal Serial Bus) adapter for controlling USB peripherals or a PCI controller for controlling PCI devices, which may be included in I/O devices 110.
  • I/O controller 109 includes an IEEE-1394 controller for controlling IEEE-1394 devices, also known as FireWire devices.
  • processor 103 may include one or more logical hardware contexts, also referred to as logical processors, for handling multiple threads simultaneously, including a main thread, also referred to as a non-speculative thread, and one or more helper threads, also referred to as speculative threads, of an application.
  • Processor 103 may be a Hyper-Threading processor, such as a Pentium 4 or a Xeon processor from Intel Corporation, capable of performing multithreading processes.
  • the main thread and one or more helper threads are executed in parallel.
  • the helper threads are speculatively executed in association with, but somewhat independently of, the main thread to perform some precomputations, such as speculative prefetches of addresses or data, for the main thread to reduce the memory latency incurred by the main thread.
  • the code of the helper threads is generated by a compiler, such as the AutoHelper compiler available from Intel Corporation, and is loaded into a memory, such as volatile RAM 105, and executed by an operating system (OS) running on a processor, such as processor 103.
  • the operating system running within the exemplary system 100 may be a Windows operating system from Microsoft Corporation or a Mac OS from Apple Computer. Alternatively, the operating system may be a Linux or Unix operating system. Other operating systems, such as embedded real-time operating systems, may be utilized.
  • Hyper-Threading processors typically provide two hardware contexts, or logical processors. To improve the performance of a single-threaded application, Hyper-Threading technology can utilize its second context to perform prefetching for the main thread. Having a separate context allows the helper threads' execution to be decoupled from the control flow of the main thread, unlike software prefetching. By running far ahead of the main thread to perform long-range prefetches, the helper threads can trigger prefetches early, and eliminate or reduce the cache miss penalties experienced by the main thread.
  • a compiler is able to automatically generate prefetching helper threads for Hyper-Threading machines.
  • the helper threads aim at bringing the latency-hiding benefit of multithreading to sequential workloads.
  • the helper threads only prefetch for the main thread, which does not reuse the computed results from the helper threads.
  • the program correctness is still maintained by the main thread's execution, while the helper threads do not affect program correctness and are used solely for performance improvement. This attribute permits the use of more aggressive forms of optimization in generating helper threads. For example, when the main thread does not need help, certain optimizations may be performed which are not possible with the conventional throughput threading paradigm.
  • the helper may terminate and release all the resources associated with the helper to the main thread.
  • the helper may be in a pause mode, which still consumes some resources on Hyper-Threading hardware. Exponential back-off (via halting) will be invoked if the helper stays in the pause mode too long (e.g., exceeding a programmable timeout period).
  • the helper may be in a snooze mode and may relinquish the occupied processor resources to the main thread.
  • performance monitoring and on-the-fly adjustments are made possible under the helper-threading paradigm, because the helper thread does not contribute to the semantics of the main program.
  • When a main thread needs a helper, it will wake up the helper.
  • For a run-away helper or a run-behind thread, one of the processes described above may be invoked to adjust the helper thread.
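The pause mode with exponential back-off described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's runtime: the names `backoff_delays` and `wait_for_work`, the doubling schedule, the cap, and the `max_halts` limit are all hypothetical, and a `threading.Event` stands in for whatever wake-up signal the main thread would actually use.

```python
import threading
import time


def backoff_delays(base, cap):
    """Yield halt durations that double each time, never exceeding `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def wait_for_work(wake, pause_timeout, base=0.001, cap=0.1, max_halts=5,
                  sleep=time.sleep):
    """Wait until the main thread sets `wake` (hypothetical helper-side loop).

    Returns (woken, halts), where `halts` lists the halt durations used.
    """
    # Pause mode: wait briefly. On real hardware this would still hold
    # some shared processor resources.
    if wake.wait(pause_timeout):
        return True, []
    # Paused longer than the programmable timeout: back off by halting
    # for progressively longer periods.
    halts = []
    for delay in backoff_delays(base, cap):
        if len(halts) == max_halts:
            return False, halts  # caller may now terminate the helper
        sleep(delay)
        halts.append(delay)
        if wake.is_set():
            return True, halts
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays; a helper that returns `False` here would be a candidate for termination and release of its resources to the main thread.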
  • FIG. 2 is a block diagram illustrating one embodiment of a computing system 200 capable of performing the disclosed techniques.
  • the computing system 200 includes a processor 204 and a memory 202.
  • Memory 202 may store instructions 210 and data 212 for controlling the operation of the processor 204.
  • the processor 204 may include a front end 221 that supplies instruction information to an execution core 230.
  • the front end 221 may supply the instruction information to the processor core 204 in program order.
  • the front end 221 includes a fetch/decode unit 222 that includes logically independent sequencers 220 for each of a plurality of thread contexts.
  • the logically independent sequencer(s) 220 may include marking logic 280 to mark the instruction information for speculative threads as being “speculative.”
  • instruction information is meant to refer to instructions that can be understood and executed by the execution core 230.
  • Instruction information may be stored in a cache 225.
  • the cache 225 may be implemented as an execution instruction cache or an execution trace cache.
  • instruction information includes instructions that have been fetched from an instruction cache and decoded.
  • instruction information includes traces of decoded micro-operations.
  • instruction information also includes raw bytes for instructions that may be stored in an instruction cache, such as I-cache 244.
  • FIG. 3 is a block diagram illustrating an exemplary system containing a compiler to generate one or more helper threads according to one embodiment.
  • exemplary processing system 300 includes a memory system 302 and a processor 304.
  • Memory system 302 may store instructions 310 and data 312 for controlling the operation of the processor 304.
  • instructions 310 may include a compiler program 308 that, when executed, causes the processor 304 to compile a program that resides in the memory system 302.
  • Memory 302 holds the program to be compiled, intermediate forms of the program, and a resulting compiled program.
  • the compiler program 308 includes instructions to generate code for one or more helper threads with respect to a main thread.
  • Memory system 302 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM) and related circuitry.
  • Memory system 302 may store instructions 310 and/or data 312 represented by data signals that may be executed by processor 304 .
  • the instructions 310 and/or data 312 may include code for performing any or all of the techniques discussed herein.
  • compiler 308 may include a delinquent load identifier 320 that, when executed by the processor 304, identifies one or more delinquent load regions of a main thread.
  • the compiler 308 may also include a parallelization analyzer 324 that, when executed by the processor 304, performs one or more parallelization analyses for the helper threads.
  • the compiler 308 may include a slicer 322 that identifies one or more slices to be executed by a helper thread in order to perform speculative precomputation.
  • the compiler 308 may further include a code generator 328 that, when executed by the processor 304, generates the code (e.g., source and executable code) for the helper threads.
  • Executing helper threads in an SMT machine is a form of asymmetric multithreading, as shown in FIG. 4B according to one embodiment.
  • Traditional parallel programming models provide symmetric multithreading, as shown in FIG. 4A .
  • the helper threads such as helper threads 451 - 454 in FIG. 4B execute as user-level threads (fibers) with lightweight thread invocation and switching.
  • symmetric multithreading requires well-tuned data decomposition across symmetric threads, such as threads 401 - 404 in FIG. 4A .
  • the main thread runs the sequential code that operates on the entire data set, without incurring data decomposition overhead. Without decomposing the data, the compiler instead focuses on providing multiple helpers for timely prefetches for the main thread's data.
  • FIG. 5 is a flow diagram illustrating an exemplary process for executing a helper thread according to one embodiment.
  • Exemplary process 500 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • exemplary process 500 includes executing a main thread of an application in a multi-threading system, and spawning one or more helper threads from the main thread to perform one or more computations for the main thread when the main thread enters a region having one or more delinquent loads, code of the one or more helper threads being created during a compilation of the main thread.
  • the processing logic creates an internal thread pool to maintain a list of logical thread contexts which may be used by one or more helper threads.
  • a new thread team may be created before a main thread enters a delinquent load region (e.g., precomputation region) which may be identified by a compiler.
  • the new thread team initially contains only the calling thread.
  • the compiler may insert a statement, such as a start_helper statement, before the main thread enters the region to activate one or more helper threads.
  • the main thread spawns (via a function call, such as invoke_helper) one or more helper threads which are created using the resources from the thread pool to perform one or more precomputations, such as prefetching addresses and data, for the main thread.
  • the helper threads may be created and placed in a run queue for the thread team for subsequent execution.
  • the run queue may be associated with a time-out. The request to invoke a helper is simply dropped (e.g., terminated) after the time-out period expires, assuming that the prefetch will no longer be timely. This is different from the traditional task-queue model for parallel programming, where each task needs to be executed.
  • At block 504, at least a portion of the code within the region of the main thread is executed using in part the data (e.g., prefetched or precomputed) provided by the one or more helper threads.
  • the results computed by a helper thread are not integrated into the main thread.
  • the benefit of a helper thread lies in its side effects of prefetching, not in reusing its computation results. This allows the compiler to aggressively optimize the code generation for helper threads.
  • the main thread handles the correctness issue, while the helper threads target the performance of a program. This also allows the helper thread invoking statement, such as invoke_helper, to drop requests whenever deemed appropriate.
  • non-faulting instructions, such as the prefetch instructions, may be used to avoid disruptions to the main thread if exceptions are signaled in a helper thread.
  • the one or more helper threads associated with the main thread are terminated (via a function call, such as finish_helper) when the main thread is about to exit the delinquent load region and the resources, such as logical thread contexts, associated with the terminated helper threads are released back to the thread pool.
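The life cycle described above (a pool of logical thread contexts, a run queue with a time-out, and release of contexts on exit from the region) can be modeled with a small sketch. This is a toy model under assumptions: the class name `HelperRuntime`, its fields, and the injectable `clock` are hypothetical, and the method names `invoke_helper` and `finish_helper` merely echo the function names mentioned in the text rather than reproducing their real signatures.

```python
import time
from collections import deque


class HelperRuntime:
    """Toy model of a helper-thread pool with a time-limited run queue."""

    def __init__(self, num_contexts, timeout, clock=time.monotonic):
        self.free_contexts = list(range(num_contexts))  # the thread pool
        self.run_queue = deque()                        # (deadline, task) pairs
        self.timeout = timeout
        self.clock = clock
        self.active = {}                                # context id -> task

    def invoke_helper(self, task):
        """Queue a helper request that expires `timeout` seconds from now."""
        self.run_queue.append((self.clock() + self.timeout, task))

    def schedule(self):
        """Start queued helpers on free contexts, dropping expired requests."""
        started = []
        while self.run_queue and self.free_contexts:
            deadline, task = self.run_queue.popleft()
            if self.clock() > deadline:
                continue  # the prefetch would no longer be timely: drop it
            ctx = self.free_contexts.pop()
            self.active[ctx] = task
            started.append(task)
        return started

    def finish_helper(self):
        """Terminate all helpers and return their contexts to the pool."""
        self.free_contexts.extend(self.active)
        self.active.clear()
        self.run_queue.clear()
```

Unlike a conventional task queue, `schedule` silently discards stale requests; that is acceptable only because helper threads affect performance, not program correctness.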
  • Hyper-Threading technology is well suited for supporting the execution of one or more helper threads.
  • instructions from either of the logical processors can be scheduled and executed simultaneously on shared execution resources. This allows helper threads to issue timely prefetches.
  • the entire on-chip cache hierarchy is shared between the logical processors, which is useful for helper threads to effectively prefetch for the main thread at all levels of the cache hierarchy.
  • Although the physical execution resources are shared between the logical processors, the architecture state is duplicated in a Hyper-Threading processor. The execution of helper threads will not alter the architecture state in the logical processor executing the main thread.
  • the compiler removes stores to non-local variables in the helper threads.
  • FIG. 6 is a block diagram illustrating an exemplary architecture of a compiler according to one embodiment.
  • exemplary architecture 600 includes, among others, a front end module 601, profiler 602, interprocedural analysis and optimization module 603, compiler 604, global scalar optimization module 605, and backend module 606.
  • front end module 601 provides a common intermediate representation, such as the IL0 representation from Intel Corporation, for source code written in a variety of programming languages, such as C/C++ and Fortran.
  • the compiler 604, such as AutoHelper, is applicable irrespective of the source language and the target platform.
  • Profiler 602 performs a profiling run to examine the characteristics of the representation.
  • Interprocedural analysis module 603 may expose optimization opportunities across procedure call boundaries. Thereafter, the compiler 604 (e.g., AutoHelper) is invoked to generate code for one or more helper threads. Global scalar optimization module 605 applies partial redundancy elimination to minimize the number of times an expression is evaluated. Finally, backend module 606 generates binary code for the helper threads for a variety of platforms, such as the IA-32 or Itanium platforms from Intel Corporation. Other components apparent to those with ordinary skill in the art may be included.
  • the compiler can directly analyze the output from profiling results, such as those generated by Intel's VTune™ Performance Analyzer, which is enabled for Hyper-Threading technology. Because it is a middle-end pass instead of a post-pass tool, the compiler is able to utilize several product-quality analyses, such as array dependence analysis and global scalar optimization. These analyses, invoked after the compiler, perform aggressive optimizations on the helper threads' code.
  • the compiler generates one or more helper threads to precompute and prefetch the address accessed by a load that misses the cache frequently, also referred to as a delinquent load.
  • the compiler also generates one or more triggers in the main thread that spawn one or more helper threads.
  • the compiler implements the trigger as an invoking function, such as the invoke_helper function call. Once the trigger is reached, the load is expected to appear later in the instruction stream of the main thread, hence the speculatively executed helper threads can reduce the number of cache misses in the main thread.
  • FIG. 7 is a flow diagram illustrating an exemplary process performed by a compiler, such as AutoHelper, according to one embodiment.
  • Exemplary process 700 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • exemplary process 700 starts at block 701 by identifying delinquent loads using, for example, the VTune tool from Intel Corporation. It then performs parallelization analysis for helper threads (block 702), generates code for helper threads (block 703), and allocates resources, such as hardware registers or memory, for each helper thread and the main thread (block 704), which will be described in detail further below.
  • the compiler identifies the most delinquent loads in an application source code using one or more run-time profiles.
  • Traditional compilers collect the profiles in two steps: profile-instrumentation and profile-generation.
  • the profile-instrumentation pass does not permit instrumentation of cache misses for the compiler to identify delinquent loads.
  • the profiles for each level of the cache hierarchy are collected via a utility, such as the VTune™ Analyzer from Intel Corporation.
  • the application may be executed with debugging information in a separate profiling run prior to the compiler. During the profiling run, cache misses are sampled and the hardware counters are accumulated for each static load in the application.
  • the compiler identifies the candidates for thread-based prefetching.
  • the VTune™ Analyzer summarizes the cache behavior on a per-load basis. Because the binary for the profiling run is compiled with the debug information (e.g., debug symbols), it is possible to correlate the profiles back to source line numbers and the statements. Certain loads that contribute more than a predetermined threshold may be identified as delinquent loads. In a particular embodiment, the top loads that contribute to 90% of cache misses are denoted as delinquent loads.
  • In addition to identifying delinquent load instructions, the compiler generates helper threads that compute the addresses of delinquent loads accurately. In one embodiment, separate code for helper threads is generated. The separation between the main thread and the helper thread's code prevents transformations on a helper thread's code from affecting the main thread. In one embodiment, the compiler uses multi-entry threading, instead of conventional out-lining, in the Intel product compiler to generate separate code for helper threads.
  • the compiler performs multi-entry threading at the granularity of a compiler-selected code region, denoted as a precomputation region.
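The 90% rule above can be illustrated with a short sketch. It assumes the profile has already been reduced to a mapping from static load (identified here by a hypothetical "file:line" string) to sampled cache-miss count; the function name `select_delinquent_loads` and the input format are illustrative assumptions, not the output format of any profiling tool.

```python
def select_delinquent_loads(miss_counts, coverage=0.90):
    """Return the top loads that together account for `coverage` of all misses.

    `miss_counts` maps a static load (e.g. "file.c:42") to its sampled misses.
    """
    total = sum(miss_counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    # Visit loads from most to fewest misses until the target share is reached.
    ranked = sorted(miss_counts.items(), key=lambda kv: kv[1], reverse=True)
    for load, misses in ranked:
        if covered >= coverage * total:
            break
        selected.append(load)
        covered += misses
    return selected
```

For example, with miss counts of 700, 220, 50, and 30 across four loads, the first two loads already cover 92% of the misses, so only those two would be treated as delinquent.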
  • This region encompasses a set of delinquent loads and defines the scope for speculative precomputation.
  • the implementation usually targets loop regions, because loops are usually the hot spots in program execution, and the delinquent loads are the loads that were executed many times, usually in a loop.
  • FIG. 8 is a flow diagram illustrating an exemplary process for parallelization analysis according to one embodiment.
  • Exemplary process 800 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • the processing logic builds a dependence graph that captures both data and control dependences of the main thread.
  • In order to filter out unrelated code and thus reduce the size of a helper thread's code, the compiler first builds a graph that captures both data and control dependences. The effectiveness and legality of filtering rely on the compiler's ability to accurately disambiguate memory references.
  • a memory disambiguation module in the compiler is invoked to disambiguate pointers to dynamically allocated objects.
  • Because a pointer could be a global variable or a function parameter, the points-to analysis performed by the compiler is interprocedural if the compiler compiles in whole-program mode.
  • a series of array dependence tests may be performed, so that each element in an array is disambiguated in building the dependence graph, if all the array accesses are finite expressions. Otherwise, approximation is used.
  • each field in a structure may be disambiguated.
  • the processing logic performs a slicing operation on the main thread using the dependence graph.
  • the compiler first identifies the load addresses of delinquent loads as slice criteria, which specify the intermediate slicing results. After building the dependence graph, the compiler computes the program slices of the identified slice criteria.
  • the program slices of the slice criteria are defined as the set of instructions that contribute to the computation of the addresses for memory prefetches executed by the one or more helper threads. Slicing can reduce the code to only the instructions relevant to the computation of an address, thus allowing the helper threads to run quicker and ahead of the main thread. The compiler only needs to copy instructions in a slice to the helper thread's code.
  • slicing in the compiler extracts a minimal sequence of instructions to produce the addresses of delinquent loads by transitively traversing the dependence edges backwards.
  • the leaf nodes on the dependence graph of the resulting slices can be converted to prefetch instructions, because no further instructions are dependent on those leaf nodes.
  • Those prefetch instructions, executed by a processor such as the Pentium™ 4 from Intel Corporation, are both non-blocking and non-faulting. Different prefetch instructions exist for bringing data into different levels of cache in the memory hierarchy.
  • slicing operations may be performed with respect to a given code region. Traversal on the dependence graph in a given region must terminate when it reaches code outside of that region. Thus, slicing must be terminated during traversal instead of after traversal, because the graph traversal may span to the outside of a region and then back to the inside of a region. Simply collecting the slices according to regions after the traversal may lose precision.
  • the compiler slices each delinquent load instruction one by one.
  • the compiler merges slices into one helper thread if they are in the same precomputation region.
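The region-bounded backward slicing described above can be sketched as follows. This is a minimal illustration, not the compiler's actual implementation: the function names `backward_slice` and `helper_thread_slice`, the dictionary encoding of the dependence graph, and the instruction labels are all assumptions made for the example. The walk stops at the region boundary during traversal, so an instruction that is reachable only by leaving the region and re-entering it (here `q`) is not pulled into the slice.

```python
from collections import deque


def backward_slice(deps, region, criterion):
    """Collect the instructions inside `region` that `criterion` transitively depends on.

    `deps` maps an instruction to the instructions that produce its inputs.
    """
    if criterion not in region:
        return set()
    in_slice = {criterion}
    worklist = deque([criterion])
    while worklist:
        node = worklist.popleft()
        for producer in deps.get(node, ()):
            # Stop at the region boundary during the walk, not afterwards.
            if producer in region and producer not in in_slice:
                in_slice.add(producer)
                worklist.append(producer)
    return in_slice


def helper_thread_slice(deps, region, delinquent_loads):
    """Slice each delinquent load one by one and merge the slices of one region."""
    merged = set()
    for load in delinquent_loads:
        merged |= backward_slice(deps, region, load)
    return merged


# Hypothetical dependence graph: "outside" lies outside the precomputation region.
deps = {
    "load_a": ["addr_a"],
    "addr_a": ["p"],
    "load_b": ["addr_b"],
    "addr_b": ["p"],
    "p": ["outside"],
    "outside": ["q"],
    "q": [],
}
region = {"load_a", "addr_a", "load_b", "addr_b", "p", "q"}
helper_code = helper_thread_slice(deps, region, ["load_a", "load_b"])
```

Filtering by region only after a full traversal would have included `q`, which is the loss of precision the text warns about.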
  • the processing logic performs scheduling across the threads to overlap multiple prefetches.
  • Because Hyper-Threading processors support out-of-order execution with large scheduling windows, a processor can look for independent instructions beyond the currently executing instruction while it waits on a pending cache miss. This aspect of out-of-order execution can provide substantial performance gain over an in-order processor and reduces the need for chaining speculative precomputation.
  • the compiler selects basic speculative precomputation for Hyper-Threading processors. Namely, only one helper thread is scheduled at a time to save the thread spawning and communication overhead.
  • processing logic selects a communication scheme for the threads.
  • the compiler provides a module that computes liveness information for any given slice, or any subset of the program. Liveness information provides estimates of the communication cost. The information is used to select the precomputation region that provides a good trade-off between communication and computation. The liveness information may also help find triggers, that is, the points at which the backward slicing ends.
  • the compiler has to be judicious so as not to let helper threads slow down the main thread's execution, especially if the main thread already issues three micro-ops for execution per cycle.
  • the compiler makes a trade-off between re-computation and communication in choosing the loop level for performing speculative precomputation. For each loop level, starting from the innermost one, according to one embodiment, the compiler selects either the communication-based scheme or the computation-based scheme.
  • the communication-based scheme communicates the live-in values from the main thread to the helper thread in each iteration, so the helper thread does not need to re-compute the live-in values.
  • the compiler will select this scheme if there exists an inner loop encompassing most delinquent loads and if slicing for the inner loop significantly decreases the size of a helper thread. However, this scheme will be disabled if the communication cost for the inner loop level is very large. The compiler will give a smaller estimate of the communication cost if the live-in values are computed early and the number of live-ins is small.
  • The communication-based scheme creates multiple communication points between the main thread and its helper thread at runtime. The communication-based scheme is important for Hyper-Threading processors, because relying on only one communication point by re-computing the slice in the helper thread may create too much resource contention between threads. This scheme is similar to constructing a do-across loop in that the main thread initiates the next iteration after it finishes computing the live-in values for that iteration. The scheme trades communication for less computation.
  • the computation-based scheme assumes only one communication point between the two threads, to pass in the live-in values at the beginning. Afterwards, the helper thread needs to compute everything it needs to generate accurate prefetch addresses. The compiler will select this scheme if there is no inner loop, or if slicing for this loop level does not significantly increase the size of a helper thread. The computation-based scheme gives the helper thread more independence in execution once the single communication point is reached.
  • the compiler selects the outermost loop that benefits from the communication-based scheme. Hence the scheme-selection algorithm described above can terminate once it finds a loop with the communication-based scheme. If the compiler does not find any loop with the communication-based scheme, the outermost loop will be the targeted region for speculative precomputation. After the compiler selects the precomputation regions and their communication schemes, locating good trigger points in the main thread ensures timely prefetches while minimizing the communication between the main thread and the helper threads. Liveness information helps locate triggers, which are the points at which the backward slicing ends. Slicing beyond the precomputation region ends when the number of live-ins increases.
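One possible reading of the scheme selection above is sketched below. It walks the loop levels from the innermost outward and stops at the first level that qualifies for the communication-based scheme; the text also speaks of the outermost loop that benefits, so a different traversal order may be intended. The `LoopLevel` fields, the function name `select_region`, and both thresholds are assumptions introduced for illustration, since the text gives no numeric criteria.

```python
from dataclasses import dataclass


@dataclass
class LoopLevel:
    name: str
    inner_loop_covers_most_loads: bool  # an inner loop encompasses most delinquent loads
    slice_size_reduction: float         # fraction by which slicing shrinks the helper (assumed metric)
    communication_cost: float           # estimate derived from liveness information (assumed unit)


def select_region(levels, min_reduction=0.5, max_comm_cost=10.0):
    """Return (loop name, scheme) for a non-empty list of levels ordered innermost first.

    Both thresholds are hypothetical tuning knobs, not values from the text.
    """
    for level in levels:
        benefits = (level.inner_loop_covers_most_loads
                    and level.slice_size_reduction >= min_reduction)
        # The communication-based scheme is disabled when its cost is very large.
        if benefits and level.communication_cost <= max_comm_cost:
            return level.name, "communication"
    # No loop qualified: target the outermost loop with the computation-based scheme.
    return levels[-1].name, "computation"


levels = [
    LoopLevel("inner", False, 0.0, 1.0),
    LoopLevel("middle", True, 0.7, 3.0),
    LoopLevel("outer", True, 0.8, 50.0),
]
choice = select_region(levels)
```

In this example the middle loop is chosen with the communication-based scheme, because the innermost level has no enclosed loop to slice and the outer level's communication cost exceeds the assumed limit.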
  • the processing logic determines a synchronization period for the threads to synchronize with each other during the execution.
  • the synchronization period is used to express the distance between a helper thread and the main thread.
  • the helper thread performs all of its precomputation in units of synchronization period. This both minimizes communication and limits the possibility of producing run-away helpers. Because the compiler computes the value of synchronization period and generates synchronization code accordingly, special hardware support, such as an Outstanding Slice Counter, is no longer needed.
  • the compiler first computes the difference between the length of the slice and the length of program schedule in the main thread. If the difference is small, the run-ahead distance induced by the helper thread in one iteration is consequently small. Multiple iterations may be needed by the helper thread to maintain enough run-ahead distance. Hence, the compiler increases the synchronization period if the difference is small, and vice versa.
  • the compiler generates code for the main thread and the helper thread during a code generation stage.
  • the compiler builds a thread graph as the interface between the analysis phase and code generation phase.
  • Each graph node denotes a sequence of instructions, or a code region.
  • the invocation edge between the nodes denotes the thread-spawning relationship, which is important for specifying chaining helper threads.
  • Having a thread graph enables code reuse because, according to one embodiment, the compiler also allows the user to insert pragmas in the source program to specify the code for helper threads and the live-ins. Both the pragma-based approach and the automatic approach share the same graph abstraction. As a result, the helper thread code generation module may be shared.
  • the helper thread code generation leverages multi-entry threading technology in the compiler to generate helper thread code.
  • the compiler does not create a separate compilation unit (or routine) for the helper thread. Instead, the compiler generates a threaded entry and a threaded return for the helper thread code.
  • the compiler keeps all newly generated helper thread codes intact or inlined within the same user-defined routine without splitting them into independent subroutines. This method provides later compiler optimizations with more opportunities for performing optimization on the newly generated helper threads. Fewer instructions in the helper thread means less resource contention on a hyper-threaded processor. This demonstrates that using helper threads for hiding latency incurs fewer instructions and less resource contention than the traditional symmetric multithreading model, which is important especially because the hyper-threaded processor issues three micro-ops per processor cycle and has some hard-partitioned resources.
  • the generated code for helper threads will be re-ordered and optimized by later phases in the compiler, such as partial dead-store elimination (PDSE), partial redundancy elimination (PRE), and other scalar optimizations.
  • the helper thread code needs to be optimized to minimize the resource contention due to the helper thread.
  • those further optimizations may remove prefetching code as well. Therefore, the leaf delinquent loads may be converted to the volatile-assign statements in the compiler.
  • the leaf node in the dependence graph of a slice implies that no further instructions in the helper thread depend on the loaded value.
  • the destination of the volatile-assign statement is changed to a register temp in the representation to speed up the resulting code.
  • Using volatile-assign may prevent all later global compiler optimizations from removing the generated prefetches for delinquent loads.
  • the compiler aims at ensuring that the helper thread runs neither too far ahead of nor too far behind the main thread, using a self-counting mechanism.
  • value X is pre-set for run-ahead distance control.
  • X can be modified by users through a compiler switch, or based on program analysis of the length of the slice (or helper code) and the length of the main code.
  • the compiler generates mc (M-counter) with an initial value X for the main thread and hc (H-counter) with an initial value 0 for the helper thread, and the compiler generates the counters M and H for counting the sync-up periods in the main and helper code.
  • the idea is that all four counters (mc, M, hc, H) perform self-counting.
  • the helper thread does not interfere with the main thread. If the helper thread runs too far ahead of the main thread, it will issue a wait; if the helper thread runs behind the main thread, it will perform a catch-up.
  • the main thread issues a post to ensure that the helper is not waiting and can go ahead to perform a non_faulting_load.
  • the helper thread waits for the main thread after issuing a number of non_faulting_loads in chunks of the sync-up period; it then wakes up to perform further non_faulting_loads.
  • the helper thread examines whether its hc counter is greater than the main thread's mc counter and whether the hc counter is greater than a sync-up period H*X of the helper thread; if so, the helper will issue a wait and go to sleep.
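The helper-side check in the self-counting mechanism might look like the sketch below. The wait condition (hc greater than mc and greater than H*X) follows the text; the catch-up condition is an assumption, since the text does not state exactly when the helper is considered to be behind, and the function name `helper_decision` is hypothetical.

```python
def helper_decision(hc, mc, H, X):
    """Decide what the helper thread does next, given the self-maintained counters.

    hc: helper counter (starts at 0); mc: main counter (starts at X);
    H: number of sync-up periods counted by the helper; X: run-ahead distance.
    """
    if hc > mc and hc > H * X:
        return "wait"        # too far ahead: sleep until the main thread posts
    if hc + X < mc:
        # Assumption: the helper is "behind" once the main thread has advanced
        # more than X beyond the helper's own count.
        return "catch_up"
    return "prefetch"        # within range: keep issuing non-faulting loads
```

With X = 4, the initial state (hc = 0, mc = 4) yields "prefetch", a helper at hc = 9 with mc = 6 and H = 2 yields "wait", and a helper at hc = 1 with mc = 10 yields "catch_up".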
  • FIGS. 9A-9C are diagrams illustrating exemplary pseudo code of an application, a main thread, and a helper thread according to one embodiment. Referring to FIGS.
  • the compiler compiles a source code 901 of an application and generates code for a main thread 902 and a helper thread 903 using at least one of the aforementioned techniques. It will be appreciated that the code 901-903 is not limited to C/C++. Other programming languages, such as Fortran or Assembly, may be used.
  • the compiler may further allocate, statically or dynamically, resources for each helper thread and the main thread to ensure that there is no resource conflict between the main thread and the helper threads, and among the helper threads.
  • Hardware resources such as register contexts, may be managed for helper threads within the compiler.
  • the register set may be statically or dynamically partitioned between the main thread and the helper threads, and between multiple helper threads.
  • the compiler may “walk through” the helper threads in a bottom-up order and communicate the resource utilization in a data structure, such as a resource table shown in FIG. 12.
  • the parent helper thread which may be the main thread, utilizes this information and ensures that its resources don't overlap with the thread resources.
  • the compiler can kill previously created threads.
  • FIG. 10 is a block diagram illustrating an exemplary configuration of threads according to one embodiment.
  • exemplary configuration 1000 includes a main thread 1001 (e.g., a parent thread) and three helper threads (e.g., child threads) 1002-1004, which may be spawned from the main thread 1001, while thread 1003 may be spawned from thread 1002 (e.g., helper thread 1002 is a parent thread of helper thread 1003).
  • the helper threads are not limited to three helper threads; more or fewer helper threads may be included.
  • the helper threads may be spawned by a spawn instruction, and the thread execution may resume after the spawn instruction.
  • the threads are created by the compiler during a thread creation phase, such as those operations shown in FIGS. 5-8.
  • the compiler creates the threads in the thread creation phase and allocates resources for the threads in a subsequent thread resource allocation phase.
  • Dynamically and typically, a helper thread is spawned when its parent thread stalls.
  • Exemplary configuration 1000 may happen during a page fault or a level 3 (L3) cache miss.
  • when main thread 1001 needs a register, it writes a value to register R10 before it spawns helper thread 1002 and uses register R10 after the helper thread 1002 terminates.
  • Neither helper thread 1002 nor any of its children (in the example, helper thread 1003 is the only child of helper thread 1002, and helper threads 1002 and 1004 are children of the main thread 1001) can write to register R10. Otherwise they would destroy the value in the main thread 1001, which would result in incorrect program execution.
  • the compiler may partition the resources statically or dynamically.
  • the compiler allocates resources for the helper threads and the main thread in a bottom-up order.
  • FIG. 11 is a block diagram illustrating an exemplary pseudo code for allocating resources for the threads according to one embodiment. That is, in the exemplary algorithm 1100, the compiler allocates all resources for the helper threads in a bottom-up order (block 1101) and thereafter allocates resources for the main thread (block 1102) based on the resources used by the helper threads to avoid resource conflicts.
  • the resources used by the threads are assumed to be the hardware registers. However, similar concepts may be applied to other resources apparent to one with ordinary skill in the art, such as memory or interrupts.
  • the compiler partitions the registers dynamically by walking bottom up from the leaf thread of a thread chain.
  • helper thread 1003 is a leaf thread in the first thread chain including helper thread 1002 .
  • Helper thread 1004 is a leaf thread in the second thread chain.
  • the compiler records the register allocation in each helper thread in a data structure, such as a resource table similar to the exemplary resource table 1200 of FIG. 12. Then the parent thread reads the resource allocation of its child threads, does its own allocation, and reports it in its resource table.
  • FIG. 12 is a block diagram illustrating an exemplary resource data structure according to one embodiment.
  • Exemplary data structure 1200 may be implemented as a table stored in a memory and accessible by a compiler.
  • exemplary data structure 1200 may be implemented in a database.
  • exemplary data structure 1200 includes, but is not limited to, written resources 1202 and live-in resources used by the respective thread identified via thread ID 1201. Other configurations may exist.
  • for helper thread 1003 (e.g., the thread having the most bottom order in a bottom-up scheme), the live-in values are V5 and V6, and they are assumed to be assigned to registers R2 and R3 respectively.
  • V7 gets register R4 assigned and V9 gets register R5 assigned.
  • in helper thread 1002, the compiler replaces V5 with R2 and V6 with R3 during the allocation and marks registers R4 and R5 (written in helper thread 1003) as live at the spawn instruction. This prevents register usage of R4 or R5 across the spawn point of helper thread 1003 and thus prevents a resource conflict between helper thread 1002 and helper thread 1003.
  • the live-in values are V3 and V4, and they are assigned to registers R6 and R7 respectively.
  • the written registers are the live-in registers for helper thread 1003 (e.g., R2 and R3), the written registers in helper thread 1003 (e.g., R4 and R5), and the registers written in helper thread 1002 (e.g., R8 and R9).
  • the compiler allocates the registers for helper thread 1004. When the registers are allocated for all the helper threads, it allocates the registers for the main thread 1001.
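The worked example above, for the chain of helper threads 1002 and 1003, can be reproduced with a small sketch of a bottom-up walk that fills in a resource table. Everything beyond the register and value names quoted in the text is an assumption: the `Thread` class, the `allocate_bottom_up` function, the table layout, the shared pool that starts at R2, and the value names `T1` and `T2` standing in for whatever helper thread 1002 computes into R8 and R9. Handing out registers from one shared pool is a deliberately simple policy that guarantees no overlap; a real allocator would be more selective.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    tid: int
    live_in: list                     # values the thread reads on entry
    defs: list                        # values the thread itself computes
    children: list = field(default_factory=list)


def allocate_bottom_up(thread, table, pool):
    """Allocate registers for the children first, then for `thread` itself."""
    inherited = set()
    for child in thread.children:
        allocate_bottom_up(child, table, pool)
        entry = table[child.tid]
        # The parent writes the child's live-in registers before the spawn and
        # must treat the registers the child writes as live at the spawn point.
        inherited |= set(entry["live_in"].values()) | entry["written"]
    live_in = {value: next(pool) for value in thread.live_in}
    own = {value: next(pool) for value in thread.defs}
    table[thread.tid] = {
        "live_in": live_in,
        "assigned": own,
        "written": inherited | set(own.values()),
    }


t1003 = Thread(1003, ["V5", "V6"], ["V7", "V9"])
t1002 = Thread(1002, ["V3", "V4"], ["T1", "T2"], [t1003])  # T1, T2 are hypothetical names
pool = iter(f"R{i}" for i in range(2, 32))                   # assumed to start at R2
table = {}
allocate_bottom_up(t1002, table, pool)
```

After the walk, the entry for helper thread 1003 holds live-ins R2 and R3 and written registers R4 and R5, and the entry for helper thread 1002 holds live-ins R6 and R7 and written registers R2 through R5 plus R8 and R9, matching the values given in the text.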
  • when the compiler runs out of registers, it can delete one or more helper threads within the chain. This can happen, for example, when the main thread runs out of registers because the helper thread chain is too deep, or because a single helper thread needs too many registers and the main thread has to spill/fill registers.
  • the compiler can apply heuristics to either allow a certain number of spills or delete the entire helper thread chain or some threads in the thread chain.
  • An alternative to deleting a helper thread is to explicitly configure the weight of context save/restore, so that upon a context switch, the parent's live registers that could be written by the helper thread's execution can be saved automatically by the hardware. Even though this context switch is relatively expensive, such a case is potentially infrequent. Moreover, such a fine-grain context switch still has much lower overhead than a full-context switch as used in most OS-enabled thread switches or a traditional hardware-based full-context thread switch.
  • FIG. 13 is a flow diagram illustrating an exemplary process for allocating resources for threads according to one embodiment.
  • Exemplary process 1300 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • exemplary process 1300 includes selecting, during a compilation of a code having one or more threads executable in a data processing system, a current thread having a most bottom order, determining resources allocated to one or more child threads spawned from the current thread, and allocating resources for the current thread in consideration of the resources allocated to the current thread's one or more child threads to avoid resource conflicts between the current thread and its one or more child threads.
  • processing logic identifies one or more threads, including a main thread and its helper threads, and selects a thread having the most bottom order as a current thread.
  • the threads may be identified using a thread dependency graph created during the thread creation phase of the compilation.
  • the processing logic retrieves resource information of any child thread, which may be spawned from the current thread.
  • the resource information may be obtained from a data structure corresponding to the child threads, such as resource table 1200 of FIG. 12.
  • the processing logic may delete one or more threads from the chain and start over (block 1309).
  • the processing logic allocates resources for the current thread in consideration of resources used by its child threads without causing resource conflicts. Thereafter, at block 1305, the processing logic updates the resources allocated to the current thread in the associated resource table, such as resource table 1200. The above processes continue until no more helper threads (e.g., child threads of the main thread) remain (blocks 1306 and 1308). Finally, at block 1307, the processing logic allocates resources for the main thread (e.g., a parent thread for all helper threads) based on the resource information of all the helper threads without causing resource conflicts. Other operations may be included.
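The control flow of process 1300, including the delete-and-restart path, can be approximated with the sketch below. It is a simplification under stated assumptions: each thread is reduced to a count of registers it needs, the budget is a single number, and the heuristic of deleting the leaf-most helper thread first is hypothetical, since the text leaves the choice of heuristic open. The function name `allocate_with_budget` is likewise illustrative.

```python
def allocate_with_budget(chain, main_need, budget):
    """Allocate register counts bottom-up, deleting helper threads when needed.

    chain: list of (thread_id, registers_needed), ordered leaf first.
    Returns (resource_table, deleted_thread_ids).
    """
    chain = list(chain)
    deleted = []
    while True:
        used = 0
        table = {}
        fits = True
        for tid, need in chain:
            if used + need > budget:
                fits = False
                break
            table[tid] = need            # record the allocation (block 1305)
            used += need
        if fits and used + main_need <= budget:
            table["main"] = main_need    # main thread is allocated last (block 1307)
            return table, deleted
        if not chain:
            raise ValueError("main thread alone exceeds the register budget")
        # Out of registers: delete a thread and start over (block 1309).
        # Dropping the leaf-most thread first is an assumed heuristic.
        deleted.append(chain.pop(0)[0])


result = allocate_with_budget([(1003, 4), (1002, 4)], main_need=6, budget=12)
```

With a budget of 12 registers, the two helper threads and the main thread together need 14, so helper thread 1003 is deleted and the second pass succeeds with helper thread 1002 and the main thread.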
  • A processor with Hyper-Threading Technology:
      Threading: 2 logical processors
      Trace cache: 12k micro-ops; 8-way associative; 6 micro-ops per line
      L1 D cache: 8k bytes; 4-way associative; 64-byte line size; 2-cycle integer access; 4-cycle FP access
      L2 unified cache: 256k bytes; 8-way associative; 128-byte line size; 7-cycle access latency
      Load buffers: 48
      Store buffers: 24
  • the variety of benchmark tools include at least one of the following (benchmark: description; input set):
      nbody_walker: Traverses nearest bodies from any node in Nbody graph; 20k bodies
      mst: Computes Minimal Spanning Tree for data clustering; 3k nodes
      em3d: Solves electromagnetic propagation in 3D; 20k 5-degree nodes
      health: Hierarchical database modeling health care system; 5 levels
      mcf: Integer programming algorithm used for bus scheduling; Lite
  • FIG. 14A is a chart illustrating an improvement of performance by the helper thread on nbody_walker benchmark utility.
  • FIG. 14B is a chart illustrating a speedup result of nbody_walker at a given value of synchronization period.
  • FIG. 14C is a chart illustrating an automatic process versus a manual process with respect to a variety of benchmarks.
  • FIG. 14D is a chart illustrating an improvement of an automatic process over a manual process using nbody_walker at a given synchronization period.

Abstract

Methods and apparatuses for thread management for multi-threading are described herein. In one embodiment, an exemplary process includes selecting, during a compilation of code having one or more threads executable in a data processing system, a current thread having a most bottom order, determining resources allocated to one or more child threads spawned from the current thread, and allocating resources for the current thread in consideration of the resources allocated to the current thread's one or more child threads to avoid resource conflicts between the current thread and its one or more child threads. Other methods and apparatuses are also described.

Description

    FIELD
  • Embodiments of the invention relate to information processing systems, and more specifically, to thread management for multi-threading.
  • BACKGROUND
  • Memory latency has become the critical bottleneck to achieving high performance on modern processors. Many large applications today are memory intensive, because their memory access patterns are difficult to predict and their working sets are becoming quite large. Despite continued advances in cache design and new developments in prefetching techniques, the memory bottleneck problem still persists. This problem worsens when executing pointer-intensive applications, which tend to defy conventional stride-based prefetching techniques.
  • One solution is to overlap memory stalls in one program with the execution of useful instructions from another program, thus effectively improving system performance in terms of overall throughput. Improving throughput of multitasking workloads on a single processor has been the primary motivation behind the emerging simultaneous multithreading (SMT) techniques. An SMT processor can issue instructions from multiple hardware contexts, or logical processors (also referred to as hardware threads), to the functional units of a super-scalar processor in the same cycle. SMT achieves higher overall throughput by increasing overall instruction-level parallelism available to the architecture via the exploitation of the natural parallelism between independent threads during each cycle.
  • SMT can also improve the performance of applications that are multithreaded. However, SMT does not directly improve the performance, in terms of reducing latency, of single-threaded applications. Since the majority of desktop applications in the traditional PC environment are still single-threaded, it is important to investigate if and how SMT resources can be exploited to enhance single-threaded code performance by reducing its latency. In addition, the current compiler typically cannot automatically allocate resources for the threads it creates.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
  • FIG. 1 illustrates a computer system having multi-threading capability according to one embodiment.
  • FIG. 2 illustrates a computer system having multi-threading capability according to an alternative embodiment.
  • FIG. 3 illustrates a computer system having a compiler capable of generating a helper thread according to one embodiment.
  • FIG. 4A illustrates a typical symmetric multi-threading process.
  • FIG. 4B illustrates an asymmetric multi-thread process according to one embodiment.
  • FIG. 5 is a flow diagram illustrating an exemplary process for executing one or more helper threads according to one embodiment.
  • FIG. 6 is a block diagram illustrating exemplary software architecture of a multi-threading system according to one embodiment.
  • FIG. 7 is a flow diagram illustrating an exemplary process for generating a helper thread according to one embodiment.
  • FIG. 8 is a flow diagram illustrating an exemplary process for parallelization analysis according to one embodiment.
  • FIGS. 9A-9C show pseudo code for an application, a main thread, and a helper thread according to one embodiment.
  • FIG. 10 is a block diagram illustrating an exemplary thread configuration according to one embodiment.
  • FIG. 11 is a block diagram illustrating an exemplary pseudo code for allocating resources for the threads according to one embodiment.
  • FIG. 12 is a block diagram illustrating an exemplary resource data structure containing resource information for the threads according to one embodiment.
  • FIG. 13 is a flow diagram illustrating an exemplary process for allocating resources for threads according to one embodiment.
  • FIGS. 14A-14D show results of a variety of benchmark tests using embodiments of the techniques.
  • DETAILED DESCRIPTION
  • Methods and apparatuses for compiler-created helper threads for multi-threading systems are described. According to one embodiment, a compiler, also referred to as AutoHelper, implements thread-based prefetching helper threads on a multi-threading system, such as, for example, the Intel Pentium™ 4 Hyper-Threading systems, available from Intel Corporation. In one embodiment, the compiler automates the generation of helper threads for Hyper-Threading processors. The techniques focus on identifying and generating helper threads of minimal size that can be executed to achieve timely and effective data prefetching, while incurring minimal communication overhead. A runtime system is also implemented to efficiently manage the helper threads and the synchronization between threads. Consequently, helper threads are able to issue timely prefetches for the sequential pointer-intensive applications.
  • In addition, hardware resources such as register contexts may be managed for helper threads within a compiler. Specifically, the register set may be statically or dynamically partitioned between the main thread and helper threads, and between multiple helper threads. As a result, the live-in/live-out register copies via memory for threads may be avoided, and the threads may be destroyed at compile time, when the compiler runs out of resources, or at runtime, when infrequent cases of certain main thread events occur.
  • In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
  • Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
  • It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar data processing device, that manipulates and transforms data represented as physical (e.g. electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
  • Embodiments of the present invention also relate to apparatuses for performing the operations described herein. An apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs) such as Dynamic RAM (DRAM), erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each of the above storage components is coupled to a computer system bus.
  • The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the methods. The structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments of the invention as described herein.
  • A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (“ROM”); random access memory (“RAM”); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.
  • FIG. 1 is a block diagram of an exemplary computer which may be used with an embodiment. For example, exemplary system 100 shown in FIG. 1 may perform the processes shown in FIGS. 5-8. Exemplary system 100 may be a multi-threading system, such as an Intel Pentium™ 4 Hyper-Threading system. Exemplary system 100 may be a simultaneous multithreading (SMT) or chip multiprocessing (CMP) enabled system.
  • Note that while FIG. 1 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the present invention. It will also be appreciated that network computers, handheld computers, cell phones, and other data processing systems which have fewer components or perhaps more components may also be used with the present invention.
  • As shown in FIG. 1, the computer system 100, which is a form of a data processing system, includes a bus 102 which is coupled to a microprocessor 103 and a ROM 107, a volatile RAM 105, and a non-volatile memory 106. The microprocessor 103, which may be a Pentium processor from Intel Corporation or a PowerPC processor from Motorola, Inc., is coupled to cache memory 104 as shown in the example of FIG. 1. The bus 102 interconnects these various components together and also interconnects these components 103, 107, 105, and 106 to a display controller and display device 108, as well as to input/output (I/O) devices 110, which may be mice, keyboards, modems, network interfaces, printers, and other devices which are well-known in the art. Typically, the input/output devices 110 are coupled to the system through input/output controllers 109. The volatile RAM 105 is typically implemented as dynamic RAM (DRAM) which requires power continuously in order to refresh or maintain the data in the memory. The non-volatile memory 106 is typically a magnetic hard drive, a magnetic optical drive, an optical drive, or a DVD RAM or other type of memory system which maintains data even after power is removed from the system. Typically the non-volatile memory will also be a random access memory, although this is not required. While FIG. 1 shows that the non-volatile memory is a local device coupled directly to the rest of the components in the data processing system, it will be appreciated that the present invention may utilize a non-volatile memory which is remote from the system, such as a network storage device which is coupled to the data processing system through a network interface such as a modem or Ethernet interface. The bus 102 may include one or more buses connected to each other through various bridges, controllers, and/or adapters, as is well-known in the art. 
In one embodiment, the I/O controller 109 includes a USB (Universal Serial Bus) adapter for controlling USB peripherals or a PCI controller for controlling PCI devices, which may be included in I/O devices 110. In a further embodiment, I/O controller 109 includes an IEEE-1394 controller for controlling IEEE-1394 devices, also known as FireWire devices.
  • According to one embodiment, processor 103 may include one or more logical hardware contexts, also referred to as logical processors, for handling multiple threads simultaneously, including a main thread, also referred to as a non-speculative thread, and one or more helper threads, also referred to as speculative threads, of an application. Processor 103 may be a Hyper-Threading processor, such as a Pentium 4 or a Xeon processor from Intel Corporation capable of performing multithreading processes. During execution of an application, the main thread and one or more helper threads are executed in parallel. The helper threads are executed speculatively, in association with but somewhat independently of the main thread, to perform precomputations, such as speculative prefetches of addresses or data, that reduce the memory latency incurred by the main thread.
  • According to one embodiment, the code of the helper threads (e.g., the source code and the binary executable code) is generated by a compiler, such as the AutoHelper compiler available from Intel Corporation, and is loaded and executed in a memory, such as volatile RAM 105, by an operating system (OS) executed by a processor, such as processor 103. The operating system running within the exemplary system 100 may be a Windows operating system from Microsoft Corporation or a Mac OS from Apple Computer. Alternatively, the operating system may be a Linux or Unix operating system. Other operating systems, such as embedded real-time operating systems, may be utilized.
  • Current Hyper-Threading processors typically provide two hardware contexts, or logical processors. To improve the performance of a single-threaded application, Hyper-Threading technology can utilize its second context to perform prefetching for the main thread. Having a separate context allows the helper threads' execution to be decoupled from the control flow of the main thread, unlike software prefetching. By running far ahead of the main thread to perform long-range prefetches, the helper threads can trigger prefetches early, and eliminate or reduce the cache miss penalties experienced by the main thread.
  • With AutoHelper, a compiler is able to automatically generate prefetching helper threads for Hyper-Threading machines. The helper threads aim at bringing the latency-hiding benefit of multithreading to sequential workloads. Unlike threads produced by conventional parallelizing compilers, the helper threads only prefetch for the main thread, which does not reuse the computed results from the helper threads. According to one embodiment, program correctness is still maintained by the main thread's execution, while the helper threads do not affect program correctness and are used solely for performance improvement. This attribute permits the use of more aggressive forms of optimization in generating helper threads. For example, when the main thread does not need help, certain optimizations may be performed that are not possible with the conventional throughput threading paradigm.
  • In one embodiment, if it is predicted that a helper is not needed for a certain period of time, the helper may terminate and release all the resources associated with the helper to the main thread. According to another embodiment, if it is predicted that a helper may be needed shortly, the helper may be in a pause mode, which still consumes some resources on Hyper-Threading hardware. Exponential back-off (via halting) will be invoked if the helper stays in the pause mode too long (e.g., exceeding a programmable timeout period). According to a further embodiment, if the compiler cannot predict when the helper thread will be needed, the helper may be in a snooze mode and may relinquish the occupied processor resources to the main thread.
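  • By way of illustration only, the three helper states described above and the exponential back-off may be sketched as follows. The function names, the idle-time threshold, and the cycle counts are hypothetical and are not taken from any embodiment; the sketch merely models the state decision and the sequence of halt intervals that follows an expired pause timeout.

```python
from enum import Enum


class HelperAction(Enum):
    TERMINATE = "terminate"  # release all helper resources to the main thread
    PAUSE = "pause"          # stay resident; still consumes some resources
    SNOOZE = "snooze"        # relinquish processor resources until woken


def choose_action(predicted_idle_cycles, short_idle_threshold=1_000):
    """Pick a helper state from a predicted idle time.

    predicted_idle_cycles is None when no prediction is available.
    short_idle_threshold is a hypothetical tuning knob.
    """
    if predicted_idle_cycles is None:
        return HelperAction.SNOOZE
    if predicted_idle_cycles <= short_idle_threshold:
        return HelperAction.PAUSE
    return HelperAction.TERMINATE


def backoff_halts(pause_timeout, base_halt, max_halt, total_wait):
    """Return the halt intervals used while waiting total_wait cycles.

    The helper first pauses for up to pause_timeout cycles; once that
    timeout is exceeded, it halts for intervals that double each time,
    capped at max_halt (an assumed upper bound).
    """
    waited = min(pause_timeout, total_wait)
    halts = []
    halt = base_halt
    while waited < total_wait:
        halts.append(halt)
        waited += halt
        halt = min(halt * 2, max_halt)
    return halts
```

  • In this sketch, a helper that is needed again before the pause timeout expires never halts, while a helper that waits much longer halts for progressively longer intervals and therefore occupies shared resources less often.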
  • Furthermore, according to one embodiment, performance monitoring and on-the-fly adjustments are made possible under the helper-threading paradigm, because the helper thread does not contribute to the semantics of the main program. When a main thread needs a helper, it will wake up the helper thread. For example, with respect to a run-away helper or a run-behind thread, one of the processes described above may be invoked to adjust the run-away helper thread.
  • FIG. 2 is a block diagram illustrating one embodiment of a computing system 200 capable of performing the disclosed techniques. In one embodiment, the computing system 200 includes a processor 204 and a memory 202. Memory 202 may store instructions 210 and data 212 for controlling the operation of the processor 204. The processor 204 may include a front end 221 that supplies instruction information to an execution core 230. The front end 221 may supply the instruction information to the execution core 230 in program order.
  • For at least one embodiment, the front end 221 includes a fetch/decode unit 222 that includes logically independent sequencers 220 for each of a plurality of thread contexts. The logically independent sequencer(s) 220 may include marking logic 280 to mark the instruction information for speculative threads as being “speculative.” One skilled in the art will recognize that, for an embodiment implemented in a multiple processor multithreading environment, only one sequencer 220 may be included in the fetch/decode unit 222.
  • As used herein, the term “instruction information” is meant to refer to instructions that can be understood and executed by the execution core 230. Instruction information may be stored in a cache 225. The cache 225 may be implemented as an execution instruction cache or an execution trace cache. For embodiments that utilize an execution instruction cache, “instruction information” includes instructions that have been fetched from an instruction cache and decoded. For embodiments that utilize a trace cache, the term “instruction information” includes traces of decoded micro-operations. For embodiments that utilize neither an execution instruction cache nor a trace cache, “instruction information” also includes raw bytes for instructions that may be stored in an instruction cache such as I cache 244.
  • FIG. 3 is a block diagram illustrating an exemplary system containing a compiler to generate one or more helper threads according to one embodiment. Referring to FIG. 3, exemplary processing system 300 includes a memory system 302 and a processor 304. Memory system 302 may store instructions 310 and data 312 for controlling the operation of the processor 304. For example, instructions 310 may include a compiler program 308 that, when executed, causes the processor 304 to compile a program that resides in the memory system 302. Memory 302 holds the program to be compiled, intermediate forms of the program, and a resulting compiled program. For at least one embodiment, the compiler program 308 includes instructions to generate code for one or more helper threads with respect to a main thread.
  • Memory system 302 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM) and related circuitry. Memory system 302 may store instructions 310 and/or data 312 represented by data signals that may be executed by processor 304. The instructions 310 and/or data 312 may include code for performing any or all of the techniques discussed herein.
  • Specifically, compiler 308 may include a delinquent load identifier 320 that, when executed by the processor 304, identifies one or more delinquent load regions of a main thread. The compiler 308 may also include a parallelization analyzer 324 that, when executed by the processor 304, performs one or more parallelization analyses for the helper threads. Also, the compiler 308 may include a slicer 322 that identifies one or more slices to be executed by a helper thread in order to perform speculative precomputation. The compiler 308 may further include a code generator 328 that, when executed by the processor 304, generates the code (e.g., source and executable code) for the helper threads.
  • Executing helper threads in an SMT machine is a form of asymmetric multithreading, as shown in FIG. 4B according to one embodiment. Traditional parallel programming models provide symmetric multithreading, as shown in FIG. 4A. In contrast, the helper threads, such as helper threads 451-454 in FIG. 4B, execute as user-level threads (fibers) with lightweight thread invocation and switching. Furthermore, symmetric multithreading requires well-tuned data decomposition across symmetric threads, such as threads 401-404 in FIG. 4A. In the helper thread model, according to one embodiment, the main thread runs the sequential code that operates on the entire data set, without incurring data decomposition overhead. Without decomposing the data, the compiler instead focuses on providing multiple helpers for timely prefetches for the main thread's data.
  • FIG. 5 is a flow diagram illustrating an exemplary process for executing a helper thread according to one embodiment. Exemplary process 500 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, exemplary process 500 includes executing a main thread of an application in a multi-threading system, and spawning one or more helper threads from the main thread to perform one or more computations for the main thread when the main thread enters a region having one or more delinquent loads, code of the one or more helper threads being created during a compilation of the main thread.
  • Referring to FIG. 5, at block 501, the processing logic creates an internal thread pool to maintain a list of logical thread contexts which may be used by one or more helper threads. At block 502, a new thread team may be created before a main thread enters a delinquent load region (e.g., precomputation region) which may be identified by a compiler. In one embodiment, the new thread team initially contains only the calling thread. According to one embodiment, the compiler may insert a statement, such as a start_helper statement, before the main thread enters the region to activate one or more helper threads. At block 503, when the main thread enters the region, the main thread spawns (via a function call, such as invoke_helper) one or more helper threads which are created using the resources from the thread pool to perform one or more precomputations, such as prefetching addresses and data, for the main thread. According to one embodiment, if no logical processor is available for executing the spawned helper threads, the helper threads may be created and placed in a run queue for the thread team for subsequent execution. In one embodiment, the run queue may be associated with a time-out. The request to invoke a helper is simply dropped (e.g., terminated) after the time-out period expires, on the assumption that the prefetch will no longer be timely. This is different from the traditional task-queue model for parallel programming, where each task needs to be executed.
  • At block 504, at least a portion of the code within the region of the main thread is executed using in part the data (e.g., prefetched or precomputed) provided by the one or more helper threads. According to one embodiment, the results computed by a helper thread are not integrated into the main thread. The benefit of a helper thread lies in its side effects of prefetching, not in reusing its computation results. This allows the compiler to aggressively optimize the code generation for helper threads. The main thread handles the correctness issue, while the helper threads target the performance of a program. This also allows the helper thread invoking statement, such as invoke_helper, to drop requests whenever deemed appropriate. Finally, non-faulting instructions, such as the prefetch instructions, may be used to avoid disruptions to the main thread if exceptions are signaled in a helper thread.
  • At block 505, the one or more helper threads associated with the main thread are terminated (via a function call, such as finish_helper) when the main thread is about to exit the delinquent load region and the resources, such as logical thread contexts, associated with the terminated helper threads are released back to the thread pool. This enables future requests to immediately recycle the logical thread contexts from the thread pool. Other operations apparent to those with ordinary skill in the art may be included.
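  • By way of illustration only, the thread pool, the run queue with a time-out, and the release of logical thread contexts described with reference to blocks 501-505 may be modeled as follows. The class name HelperRuntime, its fields, and the use of an explicit `now` time value are assumptions made for this sketch; only the names invoke_helper and finish_helper come from the description, and their signatures here are hypothetical. Discarding still-queued requests in finish_helper is likewise an assumption.

```python
from collections import deque


class HelperRuntime:
    """Toy model of a pool of logical thread contexts (hypothetical API)."""

    def __init__(self, num_contexts, queue_timeout):
        self.free_contexts = list(range(num_contexts))  # block 501: thread pool
        self.queue_timeout = queue_timeout
        self.run_queue = deque()  # pending (enqueue_time, helper_name) requests
        self.running = {}         # context id -> helper name
        self.dropped = []         # requests dropped as no longer timely

    def invoke_helper(self, name, now):
        """Block 503: run the helper if a context is free, else queue it."""
        if self.free_contexts:
            ctx = self.free_contexts.pop()
            self.running[ctx] = name
            return ctx
        self.run_queue.append((now, name))
        return None

    def release(self, ctx, now):
        """A helper finished: recycle its context and serve the run queue."""
        del self.running[ctx]
        self.free_contexts.append(ctx)
        while self.run_queue and self.free_contexts:
            enqueued_at, name = self.run_queue.popleft()
            if now - enqueued_at > self.queue_timeout:
                # Unlike a task queue, a stale request is simply dropped.
                self.dropped.append(name)
                continue
            self.running[self.free_contexts.pop()] = name

    def finish_helper(self):
        """Block 505: terminate helpers and return contexts to the pool."""
        self.free_contexts.extend(self.running)  # iterates over context ids
        self.running.clear()
        # Assumption: requests still queued at region exit are discarded.
        self.dropped.extend(name for _, name in self.run_queue)
        self.run_queue.clear()
```

  • In this model, a request that waits in the run queue longer than the time-out is never executed, which reflects the observation that a late prefetch has no value and that dropping it cannot affect program correctness.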
  • Hyper-Threading technology is well suited for supporting the execution of one or more helper threads. According to one embodiment, in each processor cycle, instructions from either of the logical processors can be scheduled and executed simultaneously on shared execution resources. This allows helper threads to issue timely prefetches. In addition, the entire on-chip cache hierarchy is shared between the logical processors, which is useful for helper threads to effectively prefetch for the main thread at all levels of the cache hierarchy. Furthermore, although the physical execution resources are shared between the logical processors, the architecture state is duplicated in a Hyper-Threading processor. The execution of helper threads will not alter the architecture state in the logical processor executing the main thread.
  • However, on Hyper-Threading technology enabled machines, helper threads can still impact the execution of the main thread through writes to memory. Because helper threads share memory with the main thread, the execution of helper threads should be guaranteed not to write to the data structures of the main thread. In one embodiment, the compiler (e.g., AutoHelper) provides memory protection between the main thread and the helper threads. The compiler removes stores to non-local variables in the helper threads.
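  • By way of illustration only, the removal of stores to non-local variables may be sketched as a filter over a helper thread's instruction list. The tuple-based instruction format and the function name are hypothetical and do not represent the compiler's actual intermediate representation.

```python
def strip_nonlocal_stores(instructions, helper_locals):
    """Drop stores whose target is not local to the helper thread.

    instructions: list of tuples (opcode, target, *operands); this format
    is an assumption made for the sketch.
    helper_locals: set of variable names private to the helper thread.
    """
    kept = []
    for inst in instructions:
        opcode, target = inst[0], inst[1]
        if opcode == "store" and target not in helper_locals:
            continue  # would write to main-thread data, so it is removed
        kept.append(inst)
    return kept
```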
  • FIG. 6 is a block diagram illustrating an exemplary architecture of a compiler according to one embodiment. In one embodiment, exemplary architecture 600 includes, among others, a front end module 601, profiler 602, interprocedural analysis and optimization module 603, compiler 604, global scalar optimization module 605, and backend module 606. In one embodiment, front end module 601 provides a common intermediate representation, such as the IL0 representation from Intel Corporation, for source code written in a variety of programming languages, such as C/C++ and Fortran. As a result, the compiler, such as AutoHelper 604, is applicable irrespective of the source languages and of the target platforms. Profiler 602 performs a profiling run to examine the characteristics of the representation. Interprocedural analysis module 603 may expose optimization opportunities across procedure call boundaries. Thereafter, the compiler 604 (e.g., AutoHelper) is invoked to generate code for one or more helper threads. Global scalar optimization module 605 applies partial redundancy elimination to minimize the number of times an expression is evaluated. Finally, backend module 606 generates binary code for the helper threads for a variety of platforms, such as the IA-32 or Itanium platform from Intel Corporation. Other components apparent to those with ordinary skill in the art may be included.
  • Unlike a conventional approach, AutoHelper (e.g., the compiler) eliminates the profile-instrumentation pass to make the tool easier to use. According to one embodiment, the compiler can directly analyze the output from profiling results, such as those generated by Intel's VTune™ Performance Analyzer, which is enabled for Hyper-Threading technology. Because it is a middle-end pass instead of a post-pass tool, the compiler is able to utilize several product-quality analyses, such as array dependence analysis and global scalar optimization, etc. These analyses, invoked after the compiler, perform aggressive optimizations on the helper threads' code.
  • According to one embodiment, the compiler generates one or more helper threads to precompute and prefetch the address accessed by a load that misses the cache frequently, also referred to as a delinquent load. The compiler also generates one or more triggers in the main thread that spawn one or more helper threads. The compiler implements the trigger as an invoking function, such as the invoke_helper function call. Once the trigger is reached, the load is expected to appear later in the instruction stream of the main thread; hence the speculatively executed helper threads can reduce the number of cache misses in the main thread.
  • FIG. 7 is a flow diagram illustrating an exemplary process performed by a compiler, such as AutoHelper, according to one embodiment. Exemplary process 700 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, exemplary process 700 starts at block 701 by identifying delinquent loads using, for example, the VTune tool from Intel Corporation, then performs parallelization analysis for helper threads (block 702), generates code for helper threads (block 703), and allocates resources, such as hardware registers or memory, for each helper thread and the main thread (block 704), as described in detail further below.
  • According to one embodiment, the compiler identifies the most delinquent loads in an application source code using one or more run-time profiles. Traditional compilers collect the profiles in two steps: profile-instrumentation and profile-generation. However, because a cache miss is not an architectural feature that is exposed to the compiler, the profile-instrumentation pass does not permit instrumentation of cache misses for the compiler to identify delinquent loads. The profiles for each level of the cache hierarchy are collected via a utility, such as the VTune™ Analyzer from Intel Corporation. In one embodiment, the application may be executed with debugging information in a separate profiling run prior to the compiler. During the profiling run, cache misses are sampled and the hardware counters are accumulated for each static load in the application.
  • The compiler identifies the candidates for thread-based prefetching. In a particular embodiment, the VTune™ Analyzer summarizes the cache behavior on a per-load basis. Because the binary for the profiling run is compiled with the debug information (e.g., debug symbols), it is possible to correlate the profiles back to source line numbers and the statements. Certain loads that contribute more than a predetermined threshold may be identified as delinquent loads. In a particular embodiment, the top loads that contribute to 90% of cache misses are denoted as delinquent loads.
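  • By way of illustration only, selecting the top loads that together account for 90% of the sampled cache misses may be sketched as follows. The function name and the dictionary of per-load miss counts are hypothetical; the sketch does not represent the output format of any profiling tool.

```python
def find_delinquent_loads(miss_counts, coverage=0.90):
    """Return the loads that together cover `coverage` of all cache misses.

    miss_counts: dict mapping a static load (e.g., "file:line") to its
    sampled cache-miss count. Loads are taken in decreasing miss order
    until the requested share of misses is covered.
    """
    total = sum(miss_counts.values())
    if total == 0:
        return []
    selected = []
    covered = 0
    ranked = sorted(miss_counts.items(), key=lambda kv: kv[1], reverse=True)
    for load, misses in ranked:
        if covered >= coverage * total:
            break
        selected.append(load)
        covered += misses
    return selected
```

  • For a profile in which four loads miss 600, 250, 100, and 50 times, the first three loads are selected, because the first two cover only 85% of the misses.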
  • In addition to identifying delinquent load instructions, the compiler generates helper threads that compute the addresses of delinquent loads accurately. In one embodiment, separate code for helper threads is generated. The separation between the main thread and the helper thread's code prevents transformations on a helper thread's code from affecting the main thread. In one embodiment, the compiler uses multi-entry threading, instead of conventional out-lining, in the Intel product compiler to generate separate codes for helper threads.
  • Furthermore, according to one embodiment, the compiler performs multi-entry threading at the granularity of a compiler-selected code region, denoted as precomputation region. This region encompasses a set of delinquent loads and defines the scope for speculative precomputation. In one embodiment, the implementation usually targets loop regions, because loops are usually the hot spots in program execution, and the delinquent loads are the loads that were executed many times, usually in a loop.
  • FIG. 8 is a flow diagram illustrating an exemplary process for parallelization analysis according to one embodiment. Exemplary process 800 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both. Referring to FIG. 8, at block 801, the processing logic builds a dependence graph that captures both data and control dependences of the main thread. According to one embodiment, in order to filter out unrelated code and thus reduce the size of a helper thread's code, the compiler first builds a graph that captures both data and control dependences. The effectiveness and legality of filtering rely on the compiler's ability to accurately disambiguate memory references. As a result, a memory disambiguation module in the compiler is invoked to disambiguate pointers to dynamically allocated objects. Because a pointer could be a global variable or a function parameter, the points-to analysis performed by the compiler is interprocedural, if the compiler compiles in the whole-program mode. In one embodiment, in order to build the dependence graph more accurately, a series of array dependence tests may be performed, so that each element in an array is disambiguated in building the dependence graph, if all the array accesses are finite expressions. Otherwise, approximation is used. Furthermore, each field in a structure may be disambiguated.
  • Referring back to FIG. 8, at block 802, the processing logic performs a slicing operation on the main thread using the dependence graph. During slicing, according to one embodiment, the compiler first identifies the load addresses of delinquent loads as slice criteria, which specify the intermediate slicing results. After building the dependence graph, the compiler computes the program slices of the identified slice criteria. The program slices of the slice criteria are defined as the set of instructions that contribute to the computation of the addresses for memory prefetches executed by the one or more helper threads. Slicing can reduce the code to only the instructions relevant to the computation of an address, thus allowing the helper threads to run quicker and ahead of the main thread. The compiler only needs to copy instructions in a slice to the helper thread's code.
  • According to one embodiment, slicing in the compiler extracts a minimal sequence of instructions to produce the addresses of delinquent loads by transitively traversing the dependence edges backwards. The leaf nodes on the dependence graph of the resulting slices can be converted to prefetch instructions, because no further instructions are dependent on those leaf nodes. Those prefetch instructions executed by a processor, such as the Pentium™ 4 from Intel Corporation, are both non-blocking and non-faulting. Different prefetch instructions exist for bringing data into different levels of cache in the memory hierarchy.
  • According to one embodiment, slicing operations may be performed with respect to a given code region. Traversal on the dependence graph in a given region must terminate when it reaches code outside of that region. Thus, slicing must be terminated during traversal instead of after traversal, because the graph traversal may span to the outside of a region and then back to the inside of a region. Simply collecting the slices according to regions after the traversal may lose precision.
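  • By way of illustration only, the difference between terminating the traversal at the region boundary and filtering by region after an unbounded traversal may be sketched as follows. The dictionary-based dependence graph and the function names are hypothetical and do not represent the compiler's internal data structures.

```python
def backward_slice(deps, criterion, region):
    """Backward slice that stops at the region boundary during traversal.

    deps: dict mapping an instruction to the instructions it depends on.
    criterion: the instruction computing a delinquent load's address.
    region: set of instructions inside the precomputation region.
    """
    if criterion not in region:
        return set()
    in_slice = set()
    stack = [criterion]
    while stack:
        node = stack.pop()
        if node in in_slice:
            continue
        in_slice.add(node)
        for pred in deps.get(node, ()):
            if pred in region:  # never follow an edge that leaves the region
                stack.append(pred)
    return in_slice


def slice_then_filter(deps, criterion, region):
    """Contrast case: traverse everything, then keep only region members."""
    seen = set()
    stack = [criterion]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return seen & region
```

  • For a graph in which "load" depends on "x", "x" depends on "y", and "y" depends on "z", with only "load", "x", and "z" inside the region, the bounded traversal yields "load" and "x", whereas filtering after the traversal also keeps "z", which is reachable only through code outside the region.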
  • In a further embodiment, the compiler slices each delinquent load instruction one by one. To minimize the duplication of code in helper threads and reduce the overhead of thread invocation and synchronization, the compiler merges slices into one helper thread if they are in the same precomputation region.
  • Referring back to FIG. 8, at block 803, the processing logic performs scheduling across the threads to overlap multiple prefetches. In one embodiment, since Hyper-Threading processors support out-of-order execution with large scheduling windows, the processors can look for independent instructions beyond the currently executing instruction while waiting on a pending cache miss. This aspect of out-of-order execution can provide substantial performance gain over an in-order processor and reduce the need for chaining speculative precomputation. Furthermore, the compiler selects basic speculative precomputation for Hyper-Threading processors. Namely, only one helper thread is scheduled at a time to save the thread spawning and communication overhead. Another benefit from using basic speculative precomputation is that it does not inundate the memory system on our Hyper-Threading processors as fast as chaining speculative precomputation does. When the out-of-order processor looks for independent instructions for execution, those instructions can generate too many load requests and saturate the memory system. When the helper threads issue prefetching requests, a large number of outstanding misses could rapidly fill up the miss buffer and, as a result, stall the processor. Thus, the compiler needs to be judicious in spawning helper threads. Finally, to ensure timely prefetching, the compiler pins down the single helper thread and the main thread on respective logical processors.
  • Referring back to FIG. 8, at block 804, processing logic selects a communication scheme for the threads. In one embodiment, the compiler provides a module that computes liveness information for any given slice, or any subset of a program. Liveness information provides estimates of the communication cost. The information is used to select the precomputation region that provides a good trade-off between communication and computation. The liveness information may help find triggers or the points at which the backward slicing ends.
  • Because the typical Hyper-Threading processors issue three micro-ops per processor cycle and use some hard-partitioned resources, the compiler has to be judicious so as not to let helper threads slow down the main thread's execution, especially if the main thread already issues three micro-ops for execution per cycle. For the loop nest encompassing delinquent loads, the compiler makes a trade-off between re-computation and communication in choosing the loop level for performing speculative precomputation. For each loop level, starting from the innermost one, according to one embodiment, the compiler selects either the communication-based scheme or the computation-based scheme.
  • According to one embodiment, the communication-based scheme communicates the live-in values from the main thread to the helper thread in each iteration, so the helper thread does not need to re-compute the live-in values. The compiler will select this scheme if there exists an inner loop encompassing most delinquent loads and if slicing for the inner loop significantly decreases the size of a helper thread. However, this scheme will be disabled if the communication cost for the inner loop level is very large. The compiler will give a smaller estimate of the communication cost if the live-in values are computed early and the number of live-ins is small.
  • The communication-based scheme will create multiple communication points between the main thread and its helper thread at runtime. The communication-based scheme is important for Hyper-Threading processors, because relying on only one communication point by re-computing the slice in the helper thread may create too much resource contention between threads. This scheme is similar to constructing a do-across loop in that the main thread initiates the next iteration after it finishes computing the live-in values for that iteration. The scheme trades communication for less computation.
  • According to one embodiment, the computation-based scheme assumes only one communication point between two threads to pass in the live-in values in the beginning. Afterwards, the helper thread needs to compute everything it needs to generate accurate prefetch addresses. The compiler will select this scheme if there is no inner loop, or if slicing for this loop level does not significantly increase the size of a helper thread. The computation-based scheme gives the helper thread more independence in execution, once the single communication point is reached.
  • According to one embodiment, to select the loop level for speculative precomputation, the compiler selects the outermost loop that benefits from the communication-based scheme. Hence the scheme-selection algorithm described above can terminate once it finds a loop with the communication-based scheme. If the compiler does not find any loop with the communication-based scheme, the outermost loop will be the targeted region for speculative precomputation. After the compiler selects the precomputation regions and their communication schemes, locating good trigger points in the main thread would ensure timely prefetches, while minimizing the communication between the main thread and the helper threads. Liveness information helps locate triggers, which are the points at which the backward slicing ends. Slicing beyond the precomputation region ends when the number of live-ins increases.
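  • By way of illustration only, one possible reading of the loop-level selection described above may be sketched as follows. The sketch walks the loop nest from the innermost level outward, stops at the first level for which the communication-based scheme is chosen, and otherwise falls back to the outermost loop with the computation-based scheme. The LoopLevel fields, the numeric thresholds, and the function names are assumptions made for this sketch and are not part of any embodiment.

```python
from dataclasses import dataclass


@dataclass
class LoopLevel:
    name: str
    covers_most_delinquent_loads: bool
    slice_size_reduction: float  # fraction of helper code removed by slicing
    comm_cost: float             # estimated cost of passing live-ins per iteration


def pick_scheme(level, min_reduction=0.5, max_comm_cost=100.0):
    """Choose a scheme for one loop level (thresholds are hypothetical)."""
    if (level.covers_most_delinquent_loads
            and level.slice_size_reduction >= min_reduction
            and level.comm_cost <= max_comm_cost):
        return "communication"
    return "computation"


def select_region(loop_nest):
    """Select the precomputation loop level and its scheme.

    loop_nest is ordered innermost first. The walk terminates at the first
    level that gets the communication-based scheme; if none does, the
    outermost loop is targeted with the computation-based scheme.
    """
    if not loop_nest:
        raise ValueError("loop nest must contain at least one loop")
    for level in loop_nest:
        if pick_scheme(level) == "communication":
            return level.name, "communication"
    return loop_nest[-1].name, "computation"
```

  • In this reading, an innermost loop whose communication cost is too large is skipped, and the next enclosing loop that qualifies for the communication-based scheme becomes the precomputation region.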
  • Referring back to FIG. 8, at block 805, the processing logic determines a synchronization period for the threads to synchronize with each other during the execution. According to one embodiment, the synchronization period is used to express the distance between a helper thread and the main thread. Typically, the helper thread performs all of its precomputation in units of synchronization period. This both minimizes communication and limits the possibility of producing run-away helpers. Because the compiler computes the value of synchronization period and generates synchronization code accordingly, special hardware support, such as Outstanding Slice Counter, is no longer needed.
  • If the synchronization period is too large, the prefetch induced by the helper thread could not only displace temporally important data to be used by the main thread but also potentially displace earlier prefetched data that have not been used by the main thread. On the other hand, if the synchronization period is too small, the prefetch could be too late to be useful. To decide on the value of synchronization period, according to one embodiment, the compiler first computes the difference between the length of the slice and the length of program schedule in the main thread. If the difference is small, the run-ahead distance induced by the helper thread in one iteration is consequently small. Multiple iterations may be needed by the helper thread to maintain enough run-ahead distance. Hence, the compiler increases the synchronization period if the difference is small, and vice versa.
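  • The heuristic above states only the direction of the adjustment: a small difference between the slice length and the main-thread schedule length calls for a larger synchronization period. The following sketch shows one way such a rule might look. The formula, the `target_run_ahead` parameter, and the `max_period` cap are assumptions made for illustration, not values taken from the specification.

```python
import math


def sync_period(main_len, slice_len, target_run_ahead, max_period=64):
    """Pick a synchronization period (in loop iterations).

    main_len and slice_len are schedule lengths per iteration.
    The helper gains (main_len - slice_len) per iteration, so a
    small gain needs more iterations to build up the run-ahead.
    """
    gain = max(main_len - slice_len, 1)  # avoid division by zero
    period = math.ceil(target_run_ahead / gain)
    return min(max_period, max(1, period))


# Small difference: large period. Large difference: small period.
print(sync_period(100, 90, 200))  # 20
print(sync_period(100, 20, 200))  # 3
```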
  • Thereafter, the compiler generates code for the main thread and the helper thread during a code generation stage. During the code generation stage, the compiler builds a thread graph as the interface between the analysis phase and code generation phase. Each graph node denotes a sequence of instructions, or a code region. The invocation edge between the nodes denotes the thread-spawning relationship, which is important for specifying chaining helper threads. Having a thread graph enables code reuse because, according to one embodiment, the compiler also allows the user to insert pragmas in the source program to specify the code for helper threads and the live-ins. Both the pragma-based approach and the automatic approach share the same graph abstraction. As a result, the helper thread code generation module may be shared.
  • The helper thread code generation leverages multi-entry threading technology in the compiler to generate helper thread code. In contrast to conventional, well-known outlining, the compiler does not create a separate compilation unit (or routine) for the helper thread. Instead, the compiler generates a threaded entry and a threaded return in the helper thread code. The compiler keeps all newly generated helper thread code intact, inlined within the same user-defined routine, without splitting it into independent subroutines. This method provides later compiler optimizations with more opportunities for performing optimization on the newly generated helper threads. Fewer instructions in the helper thread means less resource contention on a hyper-threaded processor. This demonstrates that using helper threads for hiding latency incurs fewer instructions and less resource contention than the traditional symmetric multithreading model, which is especially important because the hyper-threaded processor issues three micro-ops per processor cycle and has some hard-partitioned resources.
  • According to one embodiment, the generated code for helper threads will be re-ordered and optimized by later phases in the compiler, such as partial dead-store elimination (PDSE), partial redundancy elimination (PRE), and other scalar optimizations. In that sense, the helper thread code needs to be optimized to minimize the resource contention due to the helper thread. However, those further optimizations may remove prefetching code as well. Therefore, the leaf delinquent loads may be converted to volatile-assign statements in the compiler. The leaf node in the dependence graph of a slice implies that no further instructions in the helper thread depend on the loaded value. Hence, the destination of the volatile-assign statement is changed to a register temp in the representation to speed up the resulting code. Using volatile-assign may prevent all later global compiler optimizations from removing generated prefetches for delinquent loads.
  • According to one embodiment, the compiler aims to ensure that the helper thread runs neither too far ahead of nor behind the main thread by using a self-counting mechanism. According to one embodiment, a value X is pre-set for run-ahead distance control. X can be modified by users through a compiler switch, or set based on program analysis of the length of the slice (or helper code) and the length of the main code. In one embodiment, the compiler generates mc (M-counter) with an initial value X for the main thread and hc (H-counter) with an initial value 0 for the helper thread, and the compiler generates the counters M and H for counting the sync-up periods in the main and helper code. The idea is that all four counters (mc, M, hc, H) perform self-counting. The helper thread does not interfere with the main thread. If the helper thread runs too far ahead of the main thread, it will issue a wait; if the helper thread runs behind the main thread, it will perform a catch-up.
  • In a particular embodiment, for every X loop iterations, the main thread issues a post to ensure that the helper is not waiting and can go ahead to perform non_faulting_load. At this point, if the helper thread is waiting for the main thread after issuing a number of non_faulting_loads in chunks of the sync-up period, it will wake up to perform non_faulting_loads. In another particular embodiment, for every X loop iterations, the helper thread examines whether its hc counter is greater than the main thread's mc counter and whether the hc counter is greater than a sync-up period H*X of the helper thread. If so, the helper will issue a wait and go to sleep. This prevents the helper thread from running too far ahead of the main thread. In a further embodiment, before iterating over another chunk of the sync-up period, the helper thread examines whether its hc counter is smaller than the main thread's mc counter. If so, the helper thread has fallen behind, and must “catch-up and jump ahead” by updating its counters hc and H and capturing all private and live-in variables from the main thread.
  • FIGS. 9A-9C are diagrams illustrating exemplary pseudo code of an application, a main thread, and a helper thread according to one embodiment. Referring to FIGS. 9A-9C, the compiler compiles a source code 901 of an application and generates code for a main thread 902 and a helper thread 903 using at least one of the aforementioned techniques. It will be appreciated that the code 901-903 is not limited to C/C++. Other programming languages, such as Fortran or Assembly, may be used.
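  • The wait and catch-up checks of the self-counting mechanism described above can be sketched as a pair of small functions. This is an illustrative reading of the text, not the pseudo code of FIGS. 9A-9C. The function names are hypothetical, and the catch-up rule (setting hc to mc and H to mc // X) is an assumption, because the specification does not state the updated values.

```python
def helper_action(hc, mc, H, X):
    """Decide what the helper thread does at a sync-up check.

    hc: helper iteration counter, mc: main-thread counter,
    H:  number of sync-up periods counted by the helper,
    X:  pre-set run-ahead distance (iterations per sync-up period).
    """
    if hc > mc and hc > H * X:
        return "wait"      # too far ahead: sleep until the main thread posts
    if hc < mc:
        return "catch_up"  # fallen behind: jump ahead
    return "run"           # within range: keep issuing prefetches


def catch_up(mc, X):
    """Assumed catch-up rule: adopt the main thread's position."""
    hc = mc
    H = mc // X
    return hc, H


print(helper_action(40, 10, 2, 10))  # wait
print(helper_action(5, 30, 0, 10))   # catch_up
print(helper_action(10, 10, 1, 10))  # run
print(catch_up(30, 10))              # (30, 3)
```

  • In a real helper thread the catch-up step would also re-read the private and live-in variables from the main thread; the sketch omits that part.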
  • After the code for the helper threads has been created, the compiler may further allocate, statically or dynamically, resources for each helper thread and the main thread to ensure that there is no resource conflict between the main thread and the helper threads, or among the helper threads. Hardware resources, such as register contexts, may be managed for helper threads within the compiler. Specifically, the register set may be statically or dynamically partitioned between the main thread and the helper threads, and between multiple helper threads. As a result, live-in/live-out register copies via memory for threads may be avoided, and the threads may be destroyed at compile time, when the compiler runs out of resources, or at runtime, when certain infrequent main thread events occur.
  • According to one embodiment, the compiler may “walk through” the helper threads in a bottom-up order and communicate the resource utilization in a data structure, such as the resource table shown in FIG. 12. The parent thread, which may be the main thread, utilizes this information and ensures that its resources do not overlap with the resources of its child threads. When the thread resources penalize the main execution thread, for example by forcing the main thread to spill/fill registers, the compiler can kill previously created threads.
  • FIG. 10 is a block diagram illustrating an exemplary configuration of threads according to one embodiment. In this embodiment, exemplary configuration 1000 includes a main thread 1001 (e.g., a parent thread) and three helper threads (e.g., child threads) 1002-1004, which may be spawned from the main thread 1001, while thread 1003 may be spawned from thread 1002 (e.g., helper thread 1002 is a parent thread of helper thread 1003). It will be appreciated that the helper threads are not limited to three helper threads; more or fewer helper threads may be included. The helper threads may be spawned by a spawn instruction, and the thread execution may resume after the spawn instruction.
  • The threads are created by the compiler during a thread creation phase, such as those operations shown in FIGS. 5-8. According to one embodiment, the compiler creates the threads in the thread creation phase and allocates resources for the threads in a subsequent thread resource allocation phase. At run time, a helper thread is typically spawned when its parent thread stalls. Exemplary configuration 1000 may happen during a page fault or a level 3 (L3) cache miss.
  • It is crucial that a thread can only share incoming registers (or resources in general) with a parent thread. For example, referring to FIG. 10, when main thread 1001 needs a register, it writes a value to register R10 before it spawns helper thread 1002 and uses register R10 after the helper thread 1002 terminates. Neither the helper thread 1002 nor any of its children (in the example, helper thread 1003 is the only child of helper thread 1002, and helper threads 1002 and 1004 are children of the main thread 1001) can write to register R10. Otherwise they would destroy the value in the main thread 1001, which would result in incorrect program execution. To avoid this resource conflict, according to one embodiment, the compiler may partition the resources statically or dynamically.
  • According to one embodiment, the compiler allocates resources for the helper threads and the main thread in a bottom-up order. FIG. 11 is a block diagram illustrating an exemplary pseudo code for allocating resources for the threads according to one embodiment. That is, in the exemplary algorithm 1100, the compiler allocates all resources for the helper threads in a bottom-up order (block 1101) and thereafter allocates resources for the main thread (block 1102) based on the resources used by the helper threads to avoid resource conflicts.
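  • The bottom-up order can be illustrated with a post-order walk over the spawn relationships of FIG. 10, in which every child thread is visited before its parent. The dictionary layout and the function name `bottom_up_order` are hypothetical choices made for this sketch.

```python
def bottom_up_order(spawns, root):
    """Return thread IDs so that each child precedes its parent."""
    order = []

    def visit(thread):
        for child in spawns.get(thread, []):
            visit(child)
        order.append(thread)  # emitted only after all of its children

    visit(root)
    return order


# Spawn relationships of exemplary configuration 1000 (FIG. 10):
# main thread 1001 spawns 1002 and 1004; helper 1002 spawns 1003.
spawns = {"1001": ["1002", "1004"], "1002": ["1003"]}
print(bottom_up_order(spawns, "1001"))
# ['1003', '1002', '1004', '1001']
```

  • The main thread 1001 appears last, which matches the statement that its resources are allocated only after all helper threads have been processed.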
  • For the purposes of illustration, the resources used by the threads are assumed to be the hardware registers. However, similar concepts may be applied to other resources apparent to one with ordinary skill in the art, such as memory or interrupts. Referring to FIG. 10, the compiler partitions the registers dynamically by walking bottom-up from the leaf thread of a thread chain. In this example, helper thread 1003 is a leaf thread in the first thread chain, which includes helper thread 1002. Helper thread 1004 is a leaf thread in the second thread chain. The compiler records the register allocation of each helper thread in a data structure, such as a resource table similar to the exemplary resource table 1200 of FIG. 12. Then the parent thread reads the resource allocation of its child threads, performs its own allocation, and reports it in its resource table.
  • FIG. 12 is a block diagram illustrating an exemplary resource data structure according to one embodiment. Exemplary data structure 1200 may be implemented as a table stored in a memory and accessible by a compiler. Alternatively, exemplary data structure 1200 may be implemented in a database. In one embodiment, exemplary data structure 1200 includes, but is not limited to, written resources 1202 and live-in resources used by the respective thread identified via thread ID 1201. Other configurations may exist.
  • Referring to FIGS. 10 and 12, according to one embodiment, at the beginning, the registers of helper thread 1003 (e.g., the thread having the most bottom order in a bottom-up scheme) are allocated. The live-in values are V5 and V6; assume they are assigned to registers R2 and R3, respectively. Also, V7 is assigned register R4 and V9 is assigned register R5. The resource table for helper thread 1003 includes live-in=((V5, R2), (V6, R3)) and register written=(R4, R5), as shown in FIG. 12. In helper thread 1002, the compiler replaces V5 with R2 and V6 with R3 during the allocation and marks registers R4 and R5 (written in helper thread 1003) as live at the spawn instruction. This prevents register usage of R4 or R5 across the spawn point of helper thread 1003 and thus prevents a resource conflict between helper thread 1002 and helper thread 1003. For helper thread 1002, the live-in values are V3 and V4, which are assigned to registers R6 and R7, respectively. When V8 and V20 are assigned to registers R8 and R9, respectively, the resource table for helper thread 1002 includes live_in=((V3, R6), (V4, R7)) and written registers=(R2, R3, R4, R5, R8, R9), as shown in FIG. 12. The written registers are the live-in registers for helper thread 1003 (e.g., R2 and R3), the written registers in helper thread 1003 (e.g., R4 and R5), and the registers written in helper thread 1002 (e.g., R8 and R9). Then the compiler allocates the registers for helper thread 1004. When the registers have been allocated for all the helper threads, the compiler allocates the registers for the main thread 1001.
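  • The worked example above for the chain of helper threads 1002 and 1003 can be reproduced with a short sketch. It is a simplified model under stated assumptions: registers are handed out in ascending order starting at R2, each thread first receives registers for its live-in values and then for its locally written values, and only a single parent-child chain is modeled (sibling threads such as 1004 are left out). The function name `allocate` and the table layout are hypothetical.

```python
import itertools


def allocate(thread, children, values, table, first_reg=2):
    """Allocate registers bottom-up and record them in `table`.

    values[thread] = (live-in value names, locally written value names).
    Registers are plain integers (2 stands for R2).
    """
    reserved = set()
    for child in children.get(thread, []):
        allocate(child, children, values, table, first_reg)
        entry = table[child]
        # The parent must stay clear of everything its child uses.
        reserved |= set(entry["live_in"].values()) | set(entry["written"])

    live_in_names, local_names = values[thread]
    free = (r for r in itertools.count(first_reg) if r not in reserved)
    live_in = {name: next(free) for name in live_in_names}
    local = [next(free) for _ in local_names]
    # Written set: child live-ins, child writes, and own writes.
    table[thread] = {"live_in": live_in,
                     "written": sorted(reserved | set(local))}
    return table


children = {"1002": ["1003"]}
values = {
    "1003": (["V5", "V6"], ["V7", "V9"]),
    "1002": (["V3", "V4"], ["V8", "V20"]),
}
table = allocate("1002", children, values, {})
print(table["1003"])  # {'live_in': {'V5': 2, 'V6': 3}, 'written': [4, 5]}
print(table["1002"])
# {'live_in': {'V3': 6, 'V4': 7}, 'written': [2, 3, 4, 5, 8, 9]}
```

  • The two printed entries correspond to the resource table contents described for helper threads 1003 and 1002.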
  • In addition, according to one embodiment, when the compiler runs out of registers, it can delete one or more helper threads within the chain. This can happen, for example, when the main thread runs out of registers because the helper thread chain is too deep, or because a single helper thread needs too many registers and the main thread has to spill/fill registers. The compiler can apply heuristics to either allow a certain number of spills or delete the entire helper thread chain or some threads in the thread chain. An alternative to deleting a helper thread is to explicitly configure the weight of context save/restore, so that upon a context switch, the parent's live registers that could be written by the helper thread's execution can be saved automatically by the hardware. Even though this context switch is relatively expensive, such cases are potentially infrequent. Moreover, such a fine-grain context switch still has much lower overhead than a full-context switch as used in most OS-enabled thread switches or a traditional hardware-based full-context thread switch.
  • Furthermore, a conflict may arise for live-in registers. For example, if helper thread 1003 overwrote a live-in register (e.g., mov v5=...) and this register is also used in helper thread 1002 after the spawn of helper thread 1003, there would be a resource conflict for the register assigned to v5 (in this example, register R2). To handle this situation, the compiler would use availability analysis and insert compensation code, such as inserting a mov v5′=v5 instruction before spawning helper thread 1003 and replacing v5 by v5′ after the spawn.
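  • The compensation step described above can be sketched on a toy instruction list. The sketch skips the availability analysis and simply assumes that the conflicting variable is known. The instruction encoding (lists of tokens), the operands v1, v3, and v8, and the function name `insert_compensation` are hypothetical and serve only to illustrate the copy-and-rename idea.

```python
def insert_compensation(parent_code, spawn_index, var):
    """Copy `var` before the spawn and rename its later uses.

    parent_code: list of instructions, each a list of tokens.
    spawn_index: position of the spawn instruction in parent_code.
    """
    saved = var + "'"  # v5 becomes v5'
    before = parent_code[:spawn_index]
    spawn = parent_code[spawn_index]
    # Rename every occurrence of `var` after the spawn point.
    after = [[saved if tok == var else tok for tok in ins]
             for ins in parent_code[spawn_index + 1:]]
    return before + [["mov", saved, var], spawn] + after


code_1002 = [
    ["mov", "v5", "v1"],
    ["spawn", "1003"],
    ["add", "v8", "v5", "v3"],
]
for ins in insert_compensation(code_1002, 1, "v5"):
    print(" ".join(ins))
# mov v5 v1
# mov v5' v5
# spawn 1003
# add v8 v5' v3
```

  • After the rewrite, helper thread 1002 reads the preserved copy, so a later write to v5 by helper thread 1003 no longer affects it.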
  • FIG. 13 is a flow diagram illustrating an exemplary process for allocating resources for threads according to one embodiment. Exemplary process 1300 may be performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, exemplary process 1300 includes selecting, during a compilation of a code having one or more threads executable in a data processing system, a current thread having a most bottom order, determining resources allocated to one or more child threads spawned from the current thread, and allocating resources for the current thread in consideration of the resources allocated to the current thread's one or more child threads to avoid resource conflicts between the current thread and its one or more child threads.
  • Referring to FIG. 13, at block 1301, processing logic identifies one or more threads, including a main thread and its helper threads, and selects a thread having the most bottom order as a current thread. The threads may be identified using a thread dependency graph created during the thread creation phase of the compilation. At block 1302, the processing logic retrieves resource information of any child thread that may be spawned from the current thread. The resource information may be obtained from a data structure corresponding to the child threads, such as resource table 1200 of FIG. 12. At block 1303, if no more resources are available, the processing logic may delete one or more threads from the chain and start over (block 1309). If more resources are available, at block 1304, the processing logic allocates resources for the current thread in consideration of resources used by its child threads without causing resource conflicts. Thereafter, at block 1305, the processing logic updates the resources allocated to the current thread in the associated resource table, such as resource table 1200. The above processes continue until no more helper threads (e.g., child threads of the main thread) remain (blocks 1306 and 1308). Finally, at block 1307, the processing logic allocates resources for the main thread (e.g., a parent thread for all helper threads) based on the resource information of all the helper threads without causing resource conflicts. Other operations may be included.
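  • The allocate, delete, and restart loop of exemplary process 1300 can be sketched as follows. The sketch assumes a single thread chain, a fixed register budget, and a heuristic that deletes the leaf-most helper thread when the budget is exceeded. The specification leaves the deletion heuristic open, so this choice, the function name `allocate_chain`, and the per-thread demand figures are assumptions.

```python
def allocate_chain(chain, demand, total_regs):
    """Allocate register ranges bottom-up, deleting helpers if needed.

    chain:  thread IDs ordered leaf first, main thread last.
    demand: number of registers each thread needs.
    Returns (resource table, surviving chain).
    """
    chain = list(chain)
    while True:
        used = 0
        table = {}
        for tid in chain:
            if used + demand[tid] > total_regs:
                break  # out of resources (block 1303)
            table[tid] = (used, used + demand[tid])  # blocks 1304-1305
            used += demand[tid]
        else:
            return table, chain  # every thread, including main, fits
        if len(chain) == 1:
            raise RuntimeError("main thread alone exceeds the budget")
        chain.pop(0)  # assumed heuristic: delete the leaf helper (block 1309)


demand = {"1003": 4, "1002": 4, "main": 6}
table, survivors = allocate_chain(["1003", "1002", "main"], demand, 12)
print(survivors)  # ['1002', 'main']
print(table)      # {'1002': (0, 4), 'main': (4, 10)}
```

  • With a budget of 12 registers the full chain needs 14, so the first pass fails at the main thread, helper thread 1003 is deleted, and the second pass succeeds.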
  • The above-described techniques have been tested against a variety of benchmark tools on a system similar to the following configuration:
    A Processor with Hyper-Threading Technology
    Threading: 2 logical processors.
    Trace cache: 12k micro-ops. 8-way associative. 6 micro-ops per line.
    L1 D cache: 8k bytes. 4-way associative. 64-byte line size. 2-cycle integer access. 4-cycle FP access.
    L2 unified cache: 256k bytes. 8-way associative. 128-byte line size. 7-cycle access latency.
    Load buffers: 48
    Store buffers: 24
  • The benchmark tools include at least one of the following:
    Benchmark | Description | Input Set
    nbody_walker | Traverses nearest bodies from any node in Nbody graph | 20k bodies
    mst | Computes Minimal Spanning Tree for data clustering | 3k nodes
    em3d | Solves electromagnetic propagation in 3D | 20k 5-degree nodes
    health | Hierarchical database modeling health care system | 5 levels
    mcf | Integer programming algorithm used for bus scheduling | Lite

    FIG. 14A is a chart illustrating an improvement of performance by the helper thread on the nbody_walker benchmark utility. FIG. 14B is a chart illustrating a speedup result of nbody_walker at a given value of the synchronization period. FIG. 14C is a chart illustrating an automatic process versus a manual process with respect to a variety of benchmarks. FIG. 14D is a chart illustrating an improvement of an automatic process over a manual process using nbody_walker at a given synchronization period.
  • Thus, methods and apparatuses for thread management for multi-threading have been described. In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

Claims (20)

1. A method, comprising:
selecting, during a compilation of code having one or more threads executable in a data processing system, a current thread having a most bottom order;
determining resources allocated to one or more child threads spawned from the current thread; and
allocating resources for the current thread in consideration of the resources allocated to the current thread's one or more child threads to avoid resource conflicts between the current thread and its one or more child threads.
2. The method of claim 1, wherein the resources include at least one of hardware registers and memory used by the respective thread.
3. The method of claim 1, wherein the resources allocated to the one or more child threads are recorded in a data structure accessible by the current thread.
4. The method of claim 1, further comprising updating resource information in a data structure regarding the resources allocated to the current thread, the data structure being accessible by a parent thread of the current thread.
5. The method of claim 1, further comprising repeating the selecting, determining, and allocating in a bottom-up order until each of the one or more threads has been processed.
6. The method of claim 5, further comprising allocating resources for a main thread that is a parent thread of the one or more threads after each of the one or more threads has been processed, the resources of the main thread are allocated in view of resources allocated to the one or more threads.
7. The method of claim 1, further comprising:
determining whether there are resources remaining in the data processing system prior to the allocating the resources for the current thread; and
deleting at least one child thread of the current thread; and
allocating the resources for the current thread using the resources associated with the at least one deleted child thread.
8. A machine-readable medium having executable code to cause a machine to perform a method, the method comprising:
selecting, during a compilation of code having one or more threads executable in a data processing system, a current thread having a most bottom order;
determining resources allocated to one or more child threads spawned from the current thread; and
allocating resources for the current thread in consideration of the resources allocated to the current thread's one or more child threads to avoid resource conflicts between the current thread and its one or more child threads.
9. The machine-readable medium of claim 8, wherein the resources include at least one of hardware registers and memory used by the respective thread.
10. The machine-readable medium of claim 8, wherein the resources allocated to the one or more child threads are recorded in a data structure accessible by the current thread.
11. The method of claim 1, further comprising updating resource information in a data structure regarding the resources allocated to the current thread, the data structure being accessible by a parent thread of the current thread.
12. The machine-readable medium of claim 8, wherein the method further comprises repeating the selecting, determining, and allocating in a bottom-up order until each of the one or more threads has been processed.
13. The machine-readable medium of claim 12, wherein the method further comprises allocating resources for a main thread that is a parent thread of the one or more threads after each of the one or more threads has been processed, the resources of the main thread are allocated in view of resources allocated to the one or more threads.
14. The machine-readable medium of claim 8, wherein the method further comprises:
determining whether there are resources remaining in the data processing system prior to the allocating the resources for the current thread; and
deleting at least one child thread of the current thread; and
allocating the resources for the current thread using the resources associated with the at least one deleted child thread.
15. A data processing system, comprising:
a processor capable of performing multi-threading operations;
a memory coupled to the processor; and
a process executed by the processor from the memory to cause the processor to
select, during a compilation of code having one or more threads executable in a data processing system, a current thread having a most bottom order,
determine resources allocated to one or more child threads spawned from the current thread, and
allocate resources for the current thread in consideration of the resources allocated to the current thread's one or more child threads to avoid resource conflicts between the current thread and its one or more child threads.
16. The data processing system of claim 15, wherein the process further causes the processor to update resource information in a data structure regarding the resources allocated to the current thread, the data structure being accessible by a parent thread of the current thread.
17. The data processing system of claim 16, wherein the process further causes the processor to repeat the selecting, determining, and allocating in a bottom-up order until each of the one or more threads has been processed.
18. The data processing system of claim 17, wherein the process further causes the processor to allocate resources for a main thread that is a parent thread of the one or more threads after each of the one or more threads has been processed, the resources of the main thread are allocated in view of resources allocated to the one or more threads.
19. The data processing system of claim 15, wherein the process further causes the processor to:
determine whether there are resources remaining in the data processing system prior to the allocating the resources for the current thread; and
delete at least one child thread of the current thread; and
allocate the resources for the current thread using the resources associated with the at least one deleted child thread.
20. The data processing system of claim 15, wherein the resources include at least one of hardware registers and memory used by the respective thread.
US10/676,581 2003-09-30 2003-09-30 Methods and apparatuses for thread management of mult-threading Abandoned US20050071841A1 (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
US10/676,581 US20050071841A1 (en) 2003-09-30 2003-09-30 Methods and apparatuses for thread management of mult-threading
US10/779,193 US7398521B2 (en) 2003-09-30 2004-02-13 Methods and apparatuses for thread management of multi-threading
CN200480027177A CN100578453C (en) 2003-09-30 2004-09-29 Methods and apparatuses for thread management of multi-threading
JP2006527169A JP4528300B2 (en) 2003-09-30 2004-09-29 Multithreading thread management method and apparatus
DE602004026750T DE602004026750D1 (en) 2003-09-30 2004-09-29 METHOD AND DEVICES FOR THREAD MANAGEMENT OF MULTIPLE-THREADS
EP04785288A EP1668500B1 (en) 2003-09-30 2004-09-29 Methods and apparatuses for thread management of multi-threading
AT04785288T ATE465446T1 (en) 2003-09-30 2004-09-29 METHOD AND DEVICE FOR THREAD MANAGEMENT OF MULTIPLE THREADS
PCT/US2004/032075 WO2005033936A1 (en) 2003-09-30 2004-09-29 Methods and apparatuses for thread management of multi-threading

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/676,581 US20050071841A1 (en) 2003-09-30 2003-09-30 Methods and apparatuses for thread management of mult-threading

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US10/779,193 Continuation-In-Part US7398521B2 (en) 2003-09-30 2004-02-13 Methods and apparatuses for thread management of multi-threading

Publications (1)

Publication Number Publication Date
US20050071841A1 true US20050071841A1 (en) 2005-03-31

Family

ID=34377426

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/676,581 Abandoned US20050071841A1 (en) 2003-09-30 2003-09-30 Methods and apparatuses for thread management of mult-threading
US10/779,193 Expired - Fee Related US7398521B2 (en) 2003-09-30 2004-02-13 Methods and apparatuses for thread management of multi-threading

Family Applications After (1)

Application Number Title Priority Date Filing Date
US10/779,193 Expired - Fee Related US7398521B2 (en) 2003-09-30 2004-02-13 Methods and apparatuses for thread management of multi-threading

Country Status (7)

Country Link
US (2) US20050071841A1 (en)
EP (1) EP1668500B1 (en)
JP (1) JP4528300B2 (en)
CN (1) CN100578453C (en)
AT (1) ATE465446T1 (en)
DE (1) DE602004026750D1 (en)
WO (1) WO2005033936A1 (en)

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040107421A1 (en) * 2002-12-03 2004-06-03 Microsoft Corporation Methods and systems for cooperative scheduling of hardware resource elements
US20040243767A1 (en) * 2003-06-02 2004-12-02 Cierniak Michal J. Method and apparatus for prefetching based upon type identifier tags
US20050081207A1 (en) * 2003-09-30 2005-04-14 Hoflehner Gerolf F. Methods and apparatuses for thread management of multi-threading
US20050138091A1 (en) * 2003-12-22 2005-06-23 Jean-Pierre Bono Prefetching and multithreading for improved file read performance
US20050154861A1 (en) * 2004-01-13 2005-07-14 International Business Machines Corporation Method and data processing system having dynamic profile-directed feedback at runtime
US20060155963A1 (en) * 2005-01-13 2006-07-13 Bohrer Patrick J Assist thread for injecting cache memory in a microprocessor
US20070118843A1 (en) * 2005-11-18 2007-05-24 Sbc Knowledge Ventures, L.P. Timeout helper framework
US20070124736A1 (en) * 2005-11-28 2007-05-31 Ron Gabor Acceleration threads on idle OS-visible thread execution units
WO2007115429A1 (en) * 2006-03-31 2007-10-18 Intel Corporation Managing and supporting multithreaded resources for native code in a heterogeneous managed runtime environment
US20070255604A1 (en) * 2006-05-01 2007-11-01 Seelig Michael J Systems and methods to automatically activate distribution channels provided by business partners
US20070261053A1 (en) * 2006-05-06 2007-11-08 Portal Player, Inc. System for multi threaded multi processor sharing of asynchronous hardware units
US20070294695A1 (en) * 2006-06-19 2007-12-20 Craig Jensen Method, system, and apparatus for scheduling computer micro-jobs to execute at non-disruptive times
WO2008040081A1 (en) * 2006-10-05 2008-04-10 Waratek Pty Limited Job scheduling amongst multiple computers
US20080086734A1 (en) * 2006-10-10 2008-04-10 Craig Jensen Resource-based scheduler
US20080086733A1 (en) * 2006-10-10 2008-04-10 Diskeeper Corporation Computer micro-jobs
US7472256B1 (en) 2005-04-12 2008-12-30 Sun Microsystems, Inc. Software value prediction using pendency records of predicted prefetch values
US20090006257A1 (en) * 2007-06-26 2009-01-01 Jeffrey Jay Scheel Thread-based software license management
US20090037906A1 (en) * 2007-08-02 2009-02-05 International Business Machines Corporation Partition adjunct for data processing system
US20090037941A1 (en) * 2007-08-02 2009-02-05 International Business Machines Corporation Multiple partition adjunct instances interfacing multiple logical partitions to a self-virtualizing input/output device
US20090064152A1 (en) * 2007-08-30 2009-03-05 International Business Machines Corporation Systems, methods and computer products for cross-thread scheduling
US20090164759A1 (en) * 2007-12-19 2009-06-25 International Business Machines Corporation Execution of Single-Threaded Programs on a Multiprocessor Managed by an Operating System
US20090164755A1 (en) * 2007-12-19 2009-06-25 International Business Machines Corporation Optimizing Execution of Single-Threaded Programs on a Multiprocessor Managed by Compilation
US20090199170A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Helper Thread for Pre-Fetching Data
US20090199181A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Use of a Helper Thread to Asynchronously Compute Incoming Data
US20100174757A1 (en) * 2009-01-02 2010-07-08 International Business Machines Corporation Creation of date window for record selection
US20100287550A1 (en) * 2009-05-05 2010-11-11 International Business Machines Corporation Runtime Dependence-Aware Scheduling Using Assist Thread
US20100293359A1 (en) * 2008-02-01 2010-11-18 Arimilli Ravi K General Purpose Register Cloning
US20100299496A1 (en) * 2008-02-01 2010-11-25 Arimilli Ravi K Thread Partitioning in a Multi-Core Environment
EP2287737A2 (en) * 2009-08-11 2011-02-23 Clarion Co., Ltd. Data processor and data processing method
US20110055484A1 (en) * 2009-09-03 2011-03-03 International Business Machines Corporation Detecting Task Complete Dependencies Using Underlying Speculative Multi-Threading Hardware
US20110067014A1 (en) * 2009-09-14 2011-03-17 Yonghong Song Pipelined parallelization with localized self-helper threading
US20110131559A1 (en) * 2008-05-12 2011-06-02 Xmos Limited Compiling and linking
US20110131558A1 (en) * 2008-05-12 2011-06-02 Xmos Limited Link-time resource allocation for a multi-threaded processor architecture
US20110219222A1 (en) * 2010-03-05 2011-09-08 International Business Machines Corporation Building Approximate Data Dependences with a Moving Window
US20120017221A1 (en) * 2005-06-13 2012-01-19 Hankins Richard A Mechanism for Monitoring Instruction Set Based Thread Execution on a Plurality of Instruction Sequencers
US8413151B1 (en) 2007-12-19 2013-04-02 Nvidia Corporation Selective thread spawning within a multi-threaded processing system
US8447933B2 (en) 2007-03-06 2013-05-21 Nec Corporation Memory access control system, memory access control method, and program thereof
US8612730B2 (en) 2010-06-08 2013-12-17 International Business Machines Corporation Hardware assist thread for dynamic performance profiling
US8615770B1 (en) 2008-08-29 2013-12-24 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US8959497B1 (en) * 2008-08-29 2015-02-17 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US20160070473A1 (en) * 2014-09-08 2016-03-10 Apple Inc. Method to enhance programming performance in multilevel nvm devices
CN106201853A (en) * 2015-04-30 2016-12-07 阿里巴巴集团控股有限公司 Method of testing and device
CN106407197A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Data traversing method and device
US20170220386A1 (en) * 2013-11-20 2017-08-03 International Business Machines Corporation Computing session workload scheduling and management of parent-child tasks
US20170372448A1 (en) * 2016-06-28 2017-12-28 Ingo Wald Reducing Memory Access Latencies During Ray Traversal
US20180101410A1 (en) * 2012-09-14 2018-04-12 International Business Machines Corporation Management of resources within a computing environment
CN111190961A (en) * 2019-12-18 2020-05-22 航天信息股份有限公司 Dynamic optimization multithreading data synchronization method and system
US10802882B2 (en) * 2018-12-13 2020-10-13 International Business Machines Corporation Accelerating memory access in a network using thread progress based arbitration
US11188593B1 (en) * 2018-12-28 2021-11-30 Pivotal Software, Inc. Reactive programming database interface
CN114090270A (en) * 2022-01-21 2022-02-25 武汉中科通达高新技术股份有限公司 Thread management method and device, electronic equipment and computer readable storage medium
US11314718B2 (en) * 2019-11-21 2022-04-26 International Business Machines Corporation Shared disk buffer pool update and modification

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7707543B2 (en) * 2004-11-24 2010-04-27 Siemens Aktiengesellschaft Architecture for a computer-based development environment with self-contained components and a threading model
CN101091177B (en) * 2004-12-31 2010-05-26 英特尔公司 Parallelization of bayesian network structure learning
US20060212450A1 (en) * 2005-03-18 2006-09-21 Microsoft Corporation Temporary master thread
US20070094213A1 (en) * 2005-07-14 2007-04-26 Chunrong Lai Data partitioning and critical section reduction for Bayesian network structure learning
US20070094214A1 (en) * 2005-07-15 2007-04-26 Li Eric Q Parallelization of bayesian network structure learning
CN100430898C (en) * 2006-12-20 2008-11-05 金魁 Application system for high grade multiple line distance management
US7584346B1 (en) * 2007-01-25 2009-09-01 Sun Microsystems, Inc. Method and apparatus for supporting different modes of multi-threaded speculative execution
US8321840B2 (en) * 2007-12-27 2012-11-27 Intel Corporation Software flow tracking using multiple threads
CN101482831B (en) * 2008-01-08 2013-05-15 国际商业机器公司 Method and equipment for concomitant scheduling of working thread and worker thread
US20100153934A1 (en) * 2008-12-12 2010-06-17 Peter Lachner Prefetch for systems with heterogeneous architectures
KR101572879B1 (en) 2009-04-29 2015-12-01 삼성전자주식회사 Dynamic parallel system and method for parallel application program
US8056080B2 (en) * 2009-08-31 2011-11-08 International Business Machines Corporation Multi-core/thread work-group computation scheduler
JP5541491B2 (en) * 2010-01-07 2014-07-09 日本電気株式会社 Multiprocessor, computer system using the same, and multiprocessor processing method
US8667253B2 (en) 2010-08-04 2014-03-04 International Business Machines Corporation Initiating assist thread upon asynchronous event for processing simultaneously with controlling thread and updating its running status in status register
US8413158B2 (en) * 2010-09-13 2013-04-02 International Business Machines Corporation Processor thread load balancing manager
US8713290B2 (en) * 2010-09-20 2014-04-29 International Business Machines Corporation Scaleable status tracking of multiple assist hardware threads
US8793474B2 (en) 2010-09-20 2014-07-29 International Business Machines Corporation Obtaining and releasing hardware threads without hypervisor involvement
US8561070B2 (en) * 2010-12-02 2013-10-15 International Business Machines Corporation Creating a thread of execution in a computer processor without operating system intervention
US8832672B2 (en) * 2011-01-28 2014-09-09 International Business Machines Corporation Ensuring register availability for dynamic binary optimization
US9575903B2 (en) 2011-08-04 2017-02-21 Elwha Llc Security perimeter
US9465657B2 (en) * 2011-07-19 2016-10-11 Elwha Llc Entitlement vector for library usage in managing resource allocation and scheduling based on usage and priority
US9460290B2 (en) 2011-07-19 2016-10-04 Elwha Llc Conditional security response using taint vector monitoring
US9298918B2 (en) 2011-11-30 2016-03-29 Elwha Llc Taint injection and tracking
US9798873B2 (en) 2011-08-04 2017-10-24 Elwha Llc Processor operable to ensure code integrity
US9443085B2 (en) 2011-07-19 2016-09-13 Elwha Llc Intrusion detection using taint accumulation
US9471373B2 (en) 2011-09-24 2016-10-18 Elwha Llc Entitlement vector for library usage in managing resource allocation and scheduling based on usage and priority
US9558034B2 (en) 2011-07-19 2017-01-31 Elwha Llc Entitlement vector for managing resource allocation
CN102629192A (en) * 2012-04-20 2012-08-08 西安电子科技大学 Instruction packet for on-chip multi-core concurrent multithreaded processor and operation method of instruction packet
CN103218453A (en) * 2013-04-28 2013-07-24 南京龙渊微电子科技有限公司 Method and device for splitting file
WO2015027403A1 (en) * 2013-08-28 2015-03-05 Hewlett-Packard Development Company, L.P. Testing multi-threaded applications
US9772867B2 (en) * 2014-03-27 2017-09-26 International Business Machines Corporation Control area for managing multiple threads in a computer
US9396044B2 (en) * 2014-04-25 2016-07-19 Sony Corporation Memory efficient thread-level speculation
US9348644B2 (en) 2014-10-08 2016-05-24 International Business Machines Corporation Application-level dispatcher control of application-level pseudo threads and operating system threads
US9529568B1 (en) * 2014-12-19 2016-12-27 Amazon Technologies, Inc. Systems and methods for low interference logging and diagnostics
US20170031724A1 (en) * 2015-07-31 2017-02-02 Futurewei Technologies, Inc. Apparatus, method, and computer program for utilizing secondary threads to assist primary threads in performing application tasks
CN106445703A (en) * 2016-09-22 2017-02-22 济南浪潮高新科技投资发展有限公司 Method for solving concurrent dirty read prevention in data transmission
CN106547612B (en) * 2016-10-18 2020-10-20 深圳怡化电脑股份有限公司 Multitasking method and device
CN109766131B (en) * 2017-11-06 2022-04-01 上海宝信软件股份有限公司 System and method for realizing intelligent automatic software upgrading based on multithreading technology
CN108345505B (en) * 2018-02-02 2022-08-30 珠海金山网络游戏科技有限公司 Multithreading resource management method and system
US10754706B1 (en) * 2018-04-16 2020-08-25 Microstrategy Incorporated Task scheduling for multiprocessor systems
CN110879748B (en) * 2018-09-06 2023-06-13 阿里巴巴集团控股有限公司 Shared resource allocation method, device and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6233599B1 (en) * 1997-07-10 2001-05-15 International Business Machines Corporation Apparatus and method for retrofitting multi-threaded operations on a computer by partitioning and overlapping registers
US20030037290A1 (en) * 2001-08-15 2003-02-20 Daniel Price Methods and apparatus for managing defunct processes
US20050081207A1 (en) * 2003-09-30 2005-04-14 Hoflehner Gerolf F. Methods and apparatuses for thread management of multi-threading
US20050165671A1 (en) * 2000-03-31 2005-07-28 Meade Stephen M. Online trading system and method supporting heirarchically-organized trading members
US7036124B1 (en) * 1998-06-10 2006-04-25 Sun Microsystems, Inc. Computer resource management for competing processes
US7313795B2 (en) * 2003-05-27 2007-12-25 Sun Microsystems, Inc. Method and system for managing resource allocation in non-uniform resource access computer systems
US7328242B1 (en) * 2001-11-09 2008-02-05 Mccarthy Software, Inc. Using multiple simultaneous threads of communication
US7415699B2 (en) * 2003-06-27 2008-08-19 Hewlett-Packard Development Company, L.P. Method and apparatus for controlling execution of a child process generated by a modified parent process

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6363410B1 (en) * 1994-12-13 2002-03-26 Microsoft Corporation Method and system for threaded resource allocation and reclamation
JPH1097435A (en) * 1996-09-20 1998-04-14 Nec Corp Resource allocation system
US6567839B1 (en) * 1997-10-23 2003-05-20 International Business Machines Corporation Thread switch control in a multithreaded processor system
JP2003015892A (en) * 2001-06-29 2003-01-17 Casio Comput Co Ltd Information terminal equipment and application management program
US7509643B2 (en) * 2003-03-24 2009-03-24 Sun Microsystems, Inc. Method and apparatus for supporting asymmetric multi-threading in a computer system

Cited By (96)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7337442B2 (en) * 2002-12-03 2008-02-26 Microsoft Corporation Methods and systems for cooperative scheduling of hardware resource elements
US20040107421A1 (en) * 2002-12-03 2004-06-03 Microsoft Corporation Methods and systems for cooperative scheduling of hardware resource elements
US20040243767A1 (en) * 2003-06-02 2004-12-02 Cierniak Michal J. Method and apparatus for prefetching based upon type identifier tags
US7398521B2 (en) * 2003-09-30 2008-07-08 Intel Corporation Methods and apparatuses for thread management of multi-threading
US20050081207A1 (en) * 2003-09-30 2005-04-14 Hoflehner Gerolf F. Methods and apparatuses for thread management of multi-threading
US7206795B2 (en) * 2003-12-22 2007-04-17 Jean-Pierre Bono Prefetching and multithreading for improved file read performance
US20050138091A1 (en) * 2003-12-22 2005-06-23 Jean-Pierre Bono Prefetching and multithreading for improved file read performance
US20050154861A1 (en) * 2004-01-13 2005-07-14 International Business Machines Corporation Method and data processing system having dynamic profile-directed feedback at runtime
US7448037B2 (en) * 2004-01-13 2008-11-04 International Business Machines Corporation Method and data processing system having dynamic profile-directed feedback at runtime
US20060155963A1 (en) * 2005-01-13 2006-07-13 Bohrer Patrick J Assist thread for injecting cache memory in a microprocessor
US8230422B2 (en) * 2005-01-13 2012-07-24 International Business Machines Corporation Assist thread for injecting cache memory in a microprocessor
US7472256B1 (en) 2005-04-12 2008-12-30 Sun Microsystems, Inc. Software value prediction using pendency records of predicted prefetch values
US8887174B2 (en) * 2005-06-13 2014-11-11 Intel Corporation Mechanism for monitoring instruction set based thread execution on a plurality of instruction sequencers
US20120017221A1 (en) * 2005-06-13 2012-01-19 Hankins Richard A Mechanism for Monitoring Instruction Set Based Thread Execution on a Plurality of Instruction Sequencers
US7774779B2 (en) * 2005-11-18 2010-08-10 At&T Intellectual Property I, L.P. Generating a timeout in a computer software application
US20070118843A1 (en) * 2005-11-18 2007-05-24 Sbc Knowledge Ventures, L.P. Timeout helper framework
US9003421B2 (en) * 2005-11-28 2015-04-07 Intel Corporation Acceleration threads on idle OS-visible thread execution units
US20070124736A1 (en) * 2005-11-28 2007-05-31 Ron Gabor Acceleration threads on idle OS-visible thread execution units
WO2007115429A1 (en) * 2006-03-31 2007-10-18 Intel Corporation Managing and supporting multithreaded resources for native code in a heterogeneous managed runtime environment
US8087018B2 (en) 2006-03-31 2011-12-27 Intel Corporation Managing and supporting multithreaded resources for native code in a heterogeneous managed runtime environment
US20080244578A1 (en) * 2006-03-31 2008-10-02 Zach Yoav Managing and Supporting Multithreaded Resources For Native Code in a Heterogeneous Managed Runtime Environment
US20070255604A1 (en) * 2006-05-01 2007-11-01 Seelig Michael J Systems and methods to automatically activate distribution channels provided by business partners
US9754265B2 (en) 2006-05-01 2017-09-05 At&T Intellectual Property I, L.P. Systems and methods to automatically activate distribution channels provided by business partners
US8726279B2 (en) * 2006-05-06 2014-05-13 Nvidia Corporation System for multi threaded multi processor sharing of asynchronous hardware units
US20070261053A1 (en) * 2006-05-06 2007-11-08 Portal Player, Inc. System for multi threaded multi processor sharing of asynchronous hardware units
US9727372B2 (en) 2006-06-19 2017-08-08 Invisitasking Llc Scheduling computer jobs for execution
US8239869B2 (en) 2006-06-19 2012-08-07 Condusiv Technologies Corporation Method, system and apparatus for scheduling computer micro-jobs to execute at non-disruptive times and modifying a minimum wait time between the utilization windows for monitoring the resources
US20070294695A1 (en) * 2006-06-19 2007-12-20 Craig Jensen Method, system, and apparatus for scheduling computer micro-jobs to execute at non-disruptive times
WO2008040081A1 (en) * 2006-10-05 2008-04-10 Waratek Pty Limited Job scheduling amongst multiple computers
US8615765B2 (en) 2006-10-10 2013-12-24 Condusiv Technologies Corporation Dividing a computer job into micro-jobs
US20080086734A1 (en) * 2006-10-10 2008-04-10 Craig Jensen Resource-based scheduler
US9588809B2 (en) 2006-10-10 2017-03-07 Invistasking LLC Resource-based scheduler
US8056083B2 (en) * 2006-10-10 2011-11-08 Diskeeper Corporation Dividing a computer job into micro-jobs for execution
US20080086733A1 (en) * 2006-10-10 2008-04-10 Diskeeper Corporation Computer micro-jobs
TWI403901B (en) * 2007-03-06 2013-08-01 Nec Corp Memory access controlling system, memory access controlling method, and program thereof
US8447933B2 (en) 2007-03-06 2013-05-21 Nec Corporation Memory access control system, memory access control method, and program thereof
US10452820B2 (en) * 2007-06-26 2019-10-22 International Business Machines Corporation Thread-based software license management
US20090006257A1 (en) * 2007-06-26 2009-01-01 Jeffrey Jay Scheel Thread-based software license management
US8219988B2 (en) 2007-08-02 2012-07-10 International Business Machines Corporation Partition adjunct for data processing system
US8219989B2 (en) 2007-08-02 2012-07-10 International Business Machines Corporation Partition adjunct with non-native device driver for facilitating access to a physical input/output device
US20090037906A1 (en) * 2007-08-02 2009-02-05 International Business Machines Corporation Partition adjunct for data processing system
US20090037941A1 (en) * 2007-08-02 2009-02-05 International Business Machines Corporation Multiple partition adjunct instances interfacing multiple logical partitions to a self-virtualizing input/output device
US20090037907A1 (en) * 2007-08-02 2009-02-05 International Business Machines Corporation Client partition scheduling and prioritization of service partition work
US20090037908A1 (en) * 2007-08-02 2009-02-05 International Business Machines Corporation Partition adjunct with non-native device driver for facilitating access to a physical input/output device
US9317453B2 (en) 2007-08-02 2016-04-19 International Business Machines Corporation Client partition scheduling and prioritization of service partition work
US8176487B2 (en) * 2007-08-02 2012-05-08 International Business Machines Corporation Client partition scheduling and prioritization of service partition work
US8645974B2 (en) 2007-08-02 2014-02-04 International Business Machines Corporation Multiple partition adjunct instances interfacing multiple logical partitions to a self-virtualizing input/output device
US8495632B2 (en) 2007-08-02 2013-07-23 International Business Machines Corporation Partition adjunct for data processing system
US20090064152A1 (en) * 2007-08-30 2009-03-05 International Business Machines Corporation Systems, methods and computer products for cross-thread scheduling
US9223580B2 (en) * 2007-08-30 2015-12-29 International Business Machines Corporation Systems, methods and computer products for cross-thread scheduling
US20090164759A1 (en) * 2007-12-19 2009-06-25 International Business Machines Corporation Execution of Single-Threaded Programs on a Multiprocessor Managed by an Operating System
US20090164755A1 (en) * 2007-12-19 2009-06-25 International Business Machines Corporation Optimizing Execution of Single-Threaded Programs on a Multiprocessor Managed by Compilation
US8312455B2 (en) * 2007-12-19 2012-11-13 International Business Machines Corporation Optimizing execution of single-threaded programs on a multiprocessor managed by compilation
US8413151B1 (en) 2007-12-19 2013-04-02 Nvidia Corporation Selective thread spawning within a multi-threaded processing system
US8544006B2 (en) * 2007-12-19 2013-09-24 International Business Machines Corporation Resolving conflicts by restarting execution of failed discretely executable subcomponent using register and memory values generated by main component after the occurrence of a conflict
US20090199170A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Helper Thread for Pre-Fetching Data
US8601241B2 (en) 2008-02-01 2013-12-03 International Business Machines Corporation General purpose register cloning
US20100293359A1 (en) * 2008-02-01 2010-11-18 Arimilli Ravi K General Purpose Register Cloning
US20090199181A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Use of a Helper Thread to Asynchronously Compute Incoming Data
US8775778B2 (en) 2008-02-01 2014-07-08 International Business Machines Corporation Use of a helper thread to asynchronously compute incoming data
US20100299496A1 (en) * 2008-02-01 2010-11-25 Arimilli Ravi K Thread Partitioning in a Multi-Core Environment
US8359589B2 (en) * 2008-02-01 2013-01-22 International Business Machines Corporation Helper thread for pre-fetching data
US8707016B2 (en) 2008-02-01 2014-04-22 International Business Machines Corporation Thread partitioning in a multi-core environment
US8826258B2 (en) * 2008-05-12 2014-09-02 Xmos Limited Compiling and linking
US20110131559A1 (en) * 2008-05-12 2011-06-02 Xmos Limited Compiling and linking
US8578354B2 (en) * 2008-05-12 2013-11-05 Xmos Limited Link-time resource allocation for a multi-threaded processor architecture
US20110131558A1 (en) * 2008-05-12 2011-06-02 Xmos Limited Link-time resource allocation for a multi-threaded processor architecture
US8959497B1 (en) * 2008-08-29 2015-02-17 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US8615770B1 (en) 2008-08-29 2013-12-24 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US20100174757A1 (en) * 2009-01-02 2010-07-08 International Business Machines Corporation Creation of date window for record selection
US8583700B2 (en) 2009-01-02 2013-11-12 International Business Machines Corporation Creation of date window for record selection
US8214831B2 (en) 2009-05-05 2012-07-03 International Business Machines Corporation Runtime dependence-aware scheduling using assist thread
US8464271B2 (en) 2009-05-05 2013-06-11 International Business Machines Corporation Runtime dependence-aware scheduling using assist thread
US20100287550A1 (en) * 2009-05-05 2010-11-11 International Business Machines Corporation Runtime Dependence-Aware Scheduling Using Assist Thread
EP2287737A2 (en) * 2009-08-11 2011-02-23 Clarion Co., Ltd. Data processor and data processing method
US20110055484A1 (en) * 2009-09-03 2011-03-03 International Business Machines Corporation Detecting Task Complete Dependencies Using Underlying Speculative Multi-Threading Hardware
US8468539B2 (en) 2009-09-03 2013-06-18 International Business Machines Corporation Tracking and detecting thread dependencies using speculative versioning cache
US20110067014A1 (en) * 2009-09-14 2011-03-17 Yonghong Song Pipelined parallelization with localized self-helper threading
US8561046B2 (en) * 2009-09-14 2013-10-15 Oracle America, Inc. Pipelined parallelization with localized self-helper threading
US20110219222A1 (en) * 2010-03-05 2011-09-08 International Business Machines Corporation Building Approximate Data Dependences with a Moving Window
US8667260B2 (en) 2010-03-05 2014-03-04 International Business Machines Corporation Building approximate data dependences with a moving window
US8612730B2 (en) 2010-06-08 2013-12-17 International Business Machines Corporation Hardware assist thread for dynamic performance profiling
US20180101410A1 (en) * 2012-09-14 2018-04-12 International Business Machines Corporation Management of resources within a computing environment
US10489209B2 (en) * 2012-09-14 2019-11-26 International Business Machines Corporation Management of resources within a computing environment
US10831551B2 (en) * 2013-11-20 2020-11-10 International Business Machines Corporation Computing session workload scheduling and management of parent-child tasks using a blocking yield API to block and unblock the parent task
US20170220386A1 (en) * 2013-11-20 2017-08-03 International Business Machines Corporation Computing session workload scheduling and management of parent-child tasks
US9423961B2 (en) * 2014-09-08 2016-08-23 Apple Inc. Method to enhance programming performance in multilevel NVM devices
US20160070473A1 (en) * 2014-09-08 2016-03-10 Apple Inc. Method to enhance programming performance in multilevel nvm devices
CN106201853A (en) * 2015-04-30 2016-12-07 阿里巴巴集团控股有限公司 Method of testing and device
CN106407197A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Data traversing method and device
US20170372448A1 (en) * 2016-06-28 2017-12-28 Ingo Wald Reducing Memory Access Latencies During Ray Traversal
US10802882B2 (en) * 2018-12-13 2020-10-13 International Business Machines Corporation Accelerating memory access in a network using thread progress based arbitration
US11188593B1 (en) * 2018-12-28 2021-11-30 Pivotal Software, Inc. Reactive programming database interface
US11314718B2 (en) * 2019-11-21 2022-04-26 International Business Machines Corporation Shared disk buffer pool update and modification
CN111190961A (en) * 2019-12-18 2020-05-22 航天信息股份有限公司 Dynamic optimization multithreading data synchronization method and system
CN114090270A (en) * 2022-01-21 2022-02-25 武汉中科通达高新技术股份有限公司 Thread management method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
DE602004026750D1 (en) 2010-06-02
ATE465446T1 (en) 2010-05-15
CN1853166A (en) 2006-10-25
JP2007506199A (en) 2007-03-15
JP4528300B2 (en) 2010-08-18
EP1668500B1 (en) 2010-04-21
US20050081207A1 (en) 2005-04-14
US7398521B2 (en) 2008-07-08
WO2005033936A1 (en) 2005-04-14
EP1668500A1 (en) 2006-06-14
CN100578453C (en) 2010-01-06

Similar Documents

Publication Publication Date Title
US8612949B2 (en) Methods and apparatuses for compiler-creating helper threads for multi-threading
EP1668500B1 (en) Methods and apparatuses for thread management of multi-threading
Chen et al. The Jrpm system for dynamically parallelizing Java programs
Tian et al. Copy or discard execution model for speculative parallelization on multicores
US9189233B2 (en) Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program to multiple parallel threads
Renau et al. Tasking with out-of-order spawn in TLS chip multiprocessors: Microarchitecture and compilation
US8909902B2 (en) Systems, methods, and apparatuses to decompose a sequential program into multiple threads, execute said threads, and reconstruct the sequential execution
US10621092B2 (en) Merging level cache and data cache units having indicator bits related to speculative execution
US20070022422A1 (en) Facilitating communication and synchronization between main and scout threads
US20070079298A1 (en) Thread-data affinity optimization using compiler
Chen et al. TEST: a tracer for extracting speculative threads
Estebanez et al. A survey on thread-level speculation techniques
Ying et al. T4: Compiling sequential code for effective speculative parallelization in hardware
Matějka et al. Combining PREM compilation and static scheduling for high-performance and predictable MPSoC execution
US20120226892A1 (en) Method and apparatus for generating efficient code for scout thread to prefetch data values for a main thread
Sarkar et al. Compiler techniques for reducing data cache miss rate on a multithreaded architecture
Luo et al. Dynamically dispatching speculative threads to improve sequential execution
Barua et al. Cost-driven thread coarsening for GPU kernels
Wang et al. Smarq: Software-managed alias register queue for dynamic optimizations
Ying Scaling sequential code with hardware-software co-design for fine-grain speculative parallelization
Ranjan et al. P-slice based efficient speculative multithreading
WO2014003974A1 (en) Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program to multiple parallel threads
Bhattacharyya Do inputs matter? Using data-dependence profiling to evaluate thread level speculation in the BlueGene/Q
Saad Extracting Parallelism from Legacy Sequential Code Using Software Transactional Memory
Saad Ibrahim Extracting Parallelism from Legacy Sequential Code Using Transactional Memory

Legal Events

Date Code Title Description
AS Assignment Owner name: INTEL CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HOFLEHNER, GEROLF F.;LIAO, SHIH-WEI;TIAN, XINMIN;AND OTHERS;REEL/FRAME:014572/0309;SIGNING DATES FROM 20030919 TO 20030924
STCB Information on status: application discontinuation. Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION