US20020083310A1 - Method and apparatus for predicting loop exit branches - Google Patents

Method and apparatus for predicting loop exit branches

Info

Publication number
US20020083310A1
Authority
US
United States
Prior art keywords
loop
branch
counter
branches
branch instructions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US09/169,866
Other versions
US6438682B1 (en)
Inventor
Dale Morris
Mircea Poplingher
Tse-Yu Yeh
Michael Paul Corwin
Wenliang Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute for Development of Emerging Architectures LLC
Original Assignee
Institute for Development of Emerging Architectures LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute for Development of Emerging Architectures LLC filed Critical Institute for Development of Emerging Architectures LLC
Priority to US09/169,866 priority Critical patent/US6438682B1/en
Assigned to INSTITUTE FOR THE DEVELOPMENT OF EMERGING ARCHITECTURES, L.L.C. THE reassignment INSTITUTE FOR THE DEVELOPMENT OF EMERGING ARCHITECTURES, L.L.C. THE ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: POPLINGHER, MITCHELL A.
Assigned to INSTITUTE FOR THE DEVELOPMENT OF EMERGING ARCHITECTURES, L.L.C., THE reassignment INSTITUTE FOR THE DEVELOPMENT OF EMERGING ARCHITECTURES, L.L.C., THE ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: YEH, TSE-YUH, MORRIS, DALE, CORWIN, MICHEAL P.
Publication of US20020083310A1 publication Critical patent/US20020083310A1/en
Application granted granted Critical
Publication of US6438682B1 publication Critical patent/US6438682B1/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32 Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter

Definitions

  • the present invention relates to the field of microprocessors, and in particular, to systems and methods for branch prediction in microprocessors.
  • Advanced processors employ pipelining techniques to execute instructions at very high speeds.
  • the overall machine is organized as a pipeline consisting of several cascaded stages of hardware. Instruction processing is divided into a sequence of operations, and each operation is performed by hardware in a corresponding pipeline stage (“pipe stage”). Independent operations from several instructions may be processed simultaneously by different pipe stages, increasing the instruction throughput of the pipeline.
  • When a pipelined processor includes multiple execution resources in each pipe stage, the throughput of the processor can exceed one instruction per clock cycle. To make full use of this instruction execution capability, the execution resources of the processor must be provided with sufficient instructions from the correct execution path.
  • Branch instructions pose major challenges to keeping the pipeline filled with instructions from the correct execution path.
  • When a branch is taken, control flow of the processor jumps to a new code sequence, and instructions from the new code sequence are transferred to the pipeline.
  • Branch execution typically occurs in the back end of the pipeline, while instructions are fetched at the front end of the pipeline. If changes in the control flow are not anticipated correctly, several pipe stages worth of instructions may be fetched from the wrong execution path by the time the branch is resolved. When this occurs, the instructions must be flushed from the pipeline, leaving idle pipe stages (bubbles) until the processor refills the pipeline with instructions from the correct execution path.
  • processors incorporate branch prediction modules at the front ends of their pipelines.
  • the branch prediction module forecasts whether the branch instruction will be taken when it is executed at the back end of the pipeline. If the branch is predicted taken, the branch prediction module indicates a target address to which control of the processor is predicted to jump.
  • a fetch module which is also located at the front end of the pipeline, fetches instructions beginning at the indicated target address.
  • Branch instructions are employed extensively in loops to execute a series of instructions (“the loop body”), repeatedly.
  • Modulo-scheduled loops are loops that are organized in a pipelined manner to improve execution efficiency.
  • a branch condition is tested following each iteration and control is returned to the first instruction of the loop body if the branch condition is met.
  • the last iteration of the loop occurs when the branch condition is not met, in which case control of the processor passes (“falls through”) to the instruction that follows the loop branch.
  • the loop branch is taken for all but the final iteration of the top loop.
  • Top loops terminate when the loop branch is not taken.
  • Another type of loop employs a branch at a location other than the end of the loop body. In this case, the loop branch is not taken for all but the final iteration of the loop. Exit loops terminate when the loop branch is taken.
  • Loops are very common programming structures, and branch prediction systems are typically designed to predict the loop branch conditions correctly for the bulk of the loop iterations.
  • the branch prediction system may be set up to automatically predict top loop branches as taken and exit loop branches as not taken. This strategy provides accurate branch predictions for all but the last iteration of each loop, when the loop condition changes.
  • mispredicting the loop branch on just the terminal iteration can have a significant impact on the overall performance of the processor. This is especially true where the loop is nested within an outer loop, when the loop count is small, or when the loop body is small.
  • In the first case, the misprediction penalty associated with the terminal iteration of the inner loop is repeated for each iteration of the outer loop. In the latter cases, the misprediction penalty may exceed the total number of cycles necessary to execute the loop.
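  • As a rough illustration of why the terminal misprediction matters, the sketch below estimates the share of cycles lost to end-of-loop mispredictions in a nested loop. The function name, the cycle counts, and the fixed 10-cycle flush penalty are hypothetical values chosen for illustration; they are not figures from this document.

```python
def misprediction_overhead(outer_iters, inner_iters, body_cycles, penalty_cycles):
    """Fraction of total cycles spent on terminal-iteration mispredictions.

    Assumes (hypothetically) exactly one misprediction each time the inner
    loop finishes, and a fixed flush penalty per misprediction.
    """
    useful = outer_iters * inner_iters * body_cycles
    wasted = outer_iters * penalty_cycles
    return wasted / (useful + wasted)


# Hypothetical numbers: a short inner loop versus a long one.
short_loop = misprediction_overhead(outer_iters=1000, inner_iters=4,
                                    body_cycles=2, penalty_cycles=10)
long_loop = misprediction_overhead(outer_iters=1000, inner_iters=400,
                                   body_cycles=2, penalty_cycles=10)
```

    Under these assumed numbers the short inner loop loses more than half of its cycles to the terminal misprediction, while the long loop loses only about one percent.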
  • the present invention addresses these and other limitations associated with available branch prediction systems.
  • the present invention provides a system and method for predicting loop branches, including the loop branch that terminates the loop.
  • a loop prediction system includes a counter module, a control module, and an end_of_loop (EOL) module.
  • the counter tracks the number of loop branches that are in process.
  • the control module determines when loop termination approaches, and switches the counter to track the number of loop branches that remain to be issued.
  • the EOL module compares the number of loop branches that remain to be issued with a threshold value and generates a resteer signal when a match is detected.
  • the counter is a dual mode counter that tracks the number of loop branches in process in a first mode and uses this number to track the number of loop branches that remain to be issued in the second mode.
  • the counter includes a first counter to track the number of loop branches in process and a second counter to track the number of loop branches that remain to be issued.
  • FIG. 1 is a block diagram of a processor pipeline including a loop branch prediction system in accordance with the present invention.
  • FIG. 2 is a block diagram of one embodiment of the loop prediction system of FIG. 1.
  • FIG. 3A is a circuit diagram of one embodiment of the loop prediction system of FIG. 2.
  • FIG. 3B is a circuit diagram of another embodiment of the loop prediction system of FIG. 2.
  • FIG. 4 is an overview of a method for predicting loop branches in accordance with the present invention.
  • FIG. 5 is a flowchart of one embodiment of the method shown in FIG. 4.
  • the present invention provides a loop branch prediction system that allows the terminal branch of a loop to be accurately predicted at the front end of the processor pipeline. This is accomplished by monitoring loop branch instructions that are in-flight (issued but not yet retired) and available loop data to determine the number of loop branches that are still to be issued. This number is updated to reflect loop branches as they issue and compared with one or more threshold values. When the number reaches a threshold value, termination of the loop is indicated.
  • a default loop branch prediction is over-ridden when the threshold value is reached, and the fetch module is resteered to the instruction that follows the loop.
  • the default branch prediction for a top loop branch is, for example, that the branch is “taken” (“TK”). This is overridden to “not taken” (“NT”) when the threshold value is reached.
  • the threshold value may correspond to zero, one, or two loop branches, depending on the type of loop involved and timing constraints for the processor pipeline.
  • FIG. 1 is a block diagram of a processor pipeline 100 that includes branch prediction system 180 suitable for use with the present invention.
  • Pipeline 100 is represented as a series of pipeline (“pipe”) stages 101-10x to indicate when different resources operate on a given instruction.
  • the last stage in FIG. 1 is labeled 10x to indicate that one or more pipe stages (not shown) may separate stage 10x from stage 104.
  • signals propagate from left to right, so that the response of circuitry in, e.g., pipe stage 101 on CLK cycle N is propagated to the circuitry of pipe stage 102 on CLK cycle N+1.
  • Staging latches 128 control the flow of signals between pipe stages 101 - 10 x .
  • Other embodiments of the present invention may employ different relative configurations of branch prediction elements and staging latches 128 .
  • the staging latches at the inputs of MUX 150 may be replaced by a single staging latch at its output.
  • the present invention is independent of which relative configuration is employed.
  • Loop branch prediction system 190 is shown as part of branch prediction system 180 , which also includes a first branch prediction structure (BPS 1 ) 120 , a second branch prediction structure (BPS 2 ) 130 , and a branch decode module 160 .
  • a branch execution unit (BRU) 170 , an instruction cache 110 , an instruction pointer (IP) MUX 150 , and an instruction register 140 are also shown in FIG. 1.
  • the disclosed embodiment of loop prediction system 190 employs signals from BPS 1 120 , decode logic 160 , and BRU 170 to anticipate the final iteration of a loop and to resteer processor pipeline 100 to the instruction that follows the loop.
  • IP MUX 150 couples a selected IP to I-cache 110 , BPS 1 120 , and BPS 2 130 .
  • I-cache 110 , BPS 1 120 and BPS 2 130 perform their respective look-up procedures to determine whether they have an entry corresponding to the received IP.
  • data at the associated entry (the instruction pointed to by the IP) is forwarded to the next stage in pipeline 100 .
  • branch prediction information is coupled back to IP MUX 150 and branch decode module 160 is notified.
  • BPS 1 120 and BPS 2 130 are two structures in a branch prediction hierarchy that is designed to provide rapid resteering of pipeline 100 .
  • BPS 1 120 accommodates branch prediction information for a limited number of loop branch instructions.
  • An embodiment of BPS 1 120 having four fully associative entries indexed by partial IPs may support single cycle (zero bubble) resteers.
  • the target addresses of selected top loop branches may be stored in BPS 1 120 to resteer pipeline 100 on the repeated iterations of the loop body.
  • An embodiment of BPS 2 130 may store predicted resolution and target address information for 64 entries in a four way set associative configuration.
  • the present invention does not require a particular branch prediction hierarchy as long as target addresses can be provided for timely pipeline resteers.
  • a single storage structure for branch prediction information may be employed in place of BPS 1 120 and BPS 2 130.
  • An advantage of the hierarchy in the disclosed embodiment is that it reduces some of the timing constraints imposed on loop branch predictions.
  • Branch decode module 160 maintains the branch prediction information in BPS 1 120 and BPS 2 130 and provides information to loop predictor 190 on the types of instructions in buffer 140. Decode module 160 may also implement checks on various branch-related information to facilitate uninterrupted processing of branch-related instructions. Branch-related instructions include various types of branch instructions as well as instructions that deliver prediction information to BPS 1 120 and BPS 2 130. Decode module 160 includes logic to decode branch-related instructions in buffer 140 and update BPS 1 120, BPS 2 130 (BR structures), and loop predictor 190 accordingly.
  • Buffer 140 provides instructions received from, e.g., I-cache 110 to resources in the back end of pipeline 100.
  • These resources include BRU 170 , which executes selected branch-related instructions and generates information to update the architectural state of the processor when and if the instruction is retired.
  • BRU 170 provides data for maintaining a loop counter (LC) and an epilog counter (EC) to track the status of loops in process.
  • LC is initialized to a value indicating the number of times the counted loop will be iterated.
  • EC is initialized to a value indicating the number of stages in the software pipeline. Initial values of EC and/or LC may be determined by the compiler and provided to the processor through loop instructions.
  • LC is decremented on each iteration of the loop, reaching zero when the last loop branch, i.e. the last loop iteration, is detected. This signals the start of the epilog.
  • EC is decremented as instructions are drained from the stages of the software pipeline on subsequent clock cycles. All instructions in the final iteration of the loop are complete when EC is zero.
  • LC and EC may thus be used to determine when a modulo-scheduled counted loop is about to terminate.
  • a threshold value of LC may be used to determine when loop termination approaches.
  • the epilog begins when a predicate associated with the loop condition becomes zero. Loop termination for “while” loops may thus be indicated by the loop predicate and/or changes in EC.
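  • The LC and EC behavior described above can be sketched as a small trace generator. This is a simplified model of the description in this section only (LC counts down first, then EC drains the software pipeline); the function name and the "kernel"/"epilog" labels are hypothetical, and the sketch is not meant to reproduce the exact semantics of any real instruction set.

```python
def counted_loop_trace(lc, ec):
    """Yield (lc, ec, phase) after each loop branch of a counted loop.

    Simplified assumption: LC is decremented once per loop branch until it
    reaches zero, after which EC is decremented until it reaches zero.
    """
    while lc > 0:
        lc -= 1
        yield lc, ec, "kernel"
    while ec > 0:
        ec -= 1
        yield lc, ec, "epilog"


# Hypothetical loop: 3 iterations, 2 software-pipeline stages to drain.
trace = list(counted_loop_trace(lc=3, ec=2))
```

    In this model the number of loop branches still to be retired at any point is simply the sum of the current LC and EC values, which is one possible way to derive N_TO_RET.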
  • Because BRU 170 is at the back end of pipeline 100 and branch prediction system 180 is at the front end of pipeline 100, it is not sufficient to monitor LC and EC to predict the termination of a loop.
  • the final loop branch instruction will retire (and LC and/or EC will be updated) multiple clock cycles after pipeline 100 should have been resteered to the instruction sequence that follows the loop.
  • a successful loop prediction scheme provides a termination prediction while loop branch instructions are still in process in pipeline 100 . The largest performance benefit is obtained when the loop termination can be predicted soon after the final loop branch has entered pipeline 100 .
  • FIG. 2 is a block diagram of one embodiment of loop predictor 190 of FIG. 1.
  • the disclosed embodiment of loop predictor 190 includes a counter 210 , an end_of_loop (EOL) module 230 , and a control module 240 .
  • Counter 210 includes circuitry to track the number of loop branch instructions that are in process (N_IN_FLT) and the number of loop branch instructions yet to issue (N_TO_ISSUE).
  • N_IN_FLT includes all loop branch instructions that have been loaded into buffer 140 but have not yet been retired. These may be tracked by incrementing N_IN_FLT when a loop branch is issued at the front end of pipeline 100 and decrementing N_IN_FLT when a loop branch is retired at the back end of pipeline 100 .
  • a signal L_BR is asserted to counter 210 when a loop branch is issued, and a signal BR_RET is asserted to counter 210 when a loop branch retires.
  • counter 210 begins tracking N_TO_ISSUE as the loop approaches its terminal iteration, signaling entry into termination mode.
  • N_TO_ISSUE may be determined by the difference between an expected number of loop branches still to be retired (N_TO_RET) and N_IN_FLT as termination mode is reached. Thereafter, N_TO_ISSUE is decremented for each additional loop branch issued, e.g. each time L_BR is asserted.
  • counter 210 may be a dual mode counter in which N_IN_FLT is tracked in a first mode and N_TO_ISSUE is tracked in a second, e.g. termination, mode (FIG. 3B).
  • counter 210 may include separate counters to track N_IN_FLT and N_TO_ISSUE (FIG. 3A).
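  • A behavioral sketch of counter 210 follows. It models only what is described above: N_IN_FLT goes up on L_BR and down on BR_RET, and on entry to termination mode N_TO_ISSUE is set to N_TO_RET minus N_IN_FLT and then decremented per issued loop branch. The class and method names are hypothetical, and the choice to freeze N_IN_FLT once termination mode is entered is an assumption.

```python
class LoopBranchCounter:
    """Hypothetical behavioral model of counter 210 (not a circuit model)."""

    def __init__(self):
        self.n_in_flt = 0       # loop branches issued but not yet retired
        self.n_to_issue = None  # undefined until termination mode is entered

    def issue(self):
        """A loop branch is issued at the front end (L_BR asserted)."""
        if self.n_to_issue is None:
            self.n_in_flt += 1
        else:
            self.n_to_issue -= 1

    def retire(self):
        """A loop branch retires at the back end (BR_RET asserted)."""
        if self.n_to_issue is None:
            self.n_in_flt -= 1

    def enter_termination(self, n_to_ret):
        """Switch modes: N_TO_ISSUE = N_TO_RET - N_IN_FLT."""
        self.n_to_issue = n_to_ret - self.n_in_flt


# Hypothetical 10-iteration loop: 4 branches issued, 1 retired so far.
counter = LoopBranchCounter()
for _ in range(4):
    counter.issue()
counter.retire()
in_flight_at_switch = counter.n_in_flt   # 3 branches still in flight
counter.enter_termination(n_to_ret=9)    # 10 total minus 1 already retired
remaining_at_switch = counter.n_to_issue
counter.issue()
remaining_after_issue = counter.n_to_issue
```

    With these numbers, 9 branches remain to be retired and 3 are in flight, so 6 remain to be issued, which matches 10 total minus the 4 already issued.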
  • a switch between counting modes is triggered when the terminal iteration of a loop is approached.
  • the point at which the switch occurs may depend on the type of loop involved.
  • the approach of the terminal iteration for a counted loop may be indicated by a value of LC below a threshold value.
  • the approach of termination for a modulo-scheduled counted loop e.g. CEXIT or CTOP, may be indicated by a value of LC and/or EC below a threshold value.
  • For modulo-scheduled while loops, e.g. WEXIT or WTOP, approach of the terminal iteration may be indicated by a value of EC below a threshold value and/or by a change in the state of the loop predicate.
  • a predicted number of loop iterations may be used to determine when the terminal iteration is being approached.
  • processor 100 may store a number of iterations for recent loops. When one of these loops is encountered again, the difference between the current number of iterations and the predicted number of iterations (based on the previous encounter) may be compared with a threshold value. In this embodiment, termination mode is indicated when the difference falls below the threshold value.
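  • The history-based variant can be sketched as follows. The class name, the dictionary keyed by loop branch IP, and the threshold value are hypothetical; the text only states that the difference between the predicted and current iteration counts is compared with a threshold.

```python
class IterationHistory:
    """Hypothetical store of iteration counts from previous loop encounters."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_count = {}  # loop branch IP -> iterations on last encounter

    def record(self, ip, iterations):
        """Remember how many iterations the loop at `ip` ran last time."""
        self.last_count[ip] = iterations

    def near_termination(self, ip, current_iteration):
        """True when the predicted remaining iterations fall below the threshold."""
        predicted = self.last_count.get(ip)
        if predicted is None:
            return False  # no history: cannot indicate termination mode
        return predicted - current_iteration < self.threshold


history = IterationHistory(threshold=3)
history.record(0x40, 10)  # hypothetical loop at IP 0x40 ran 10 iterations
```

    For example, `history.near_termination(0x40, 8)` indicates termination mode because only 2 predicted iterations remain, while iteration 5 of the same loop does not.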
  • the counter is switched to termination mode when the terminal iteration of the loop approaches.
  • termination counter 214 is activated.
  • the value of N_IN_FLT is used to initialize N_TO_ISSUE.
  • the two counter implementation of counter 210 is discussed in conjunction with FIG. 3A.
  • the dual mode implementation of counter 210 is discussed in conjunction with FIG. 3B.
  • counter 210 is initialized to N_TO_ISSUE when termination mode is entered, using the current values of N_IN_FLT and N_TO_RET.
  • N_TO_RET may be derived, for example, from LC and/or EC.
  • N_TO_ISSUE is adjusted to reflect any new loop branch instructions that enter pipeline 100.
  • the adjusted value represents the expected number of loop branches still to be issued before the termination of the loop.
  • EOL module 230 is coupled to monitor N_TO_ISSUE.
  • EOL module 230 compares N_TO_ISSUE to one or more threshold values and generates a resteer signal when a match occurs.
  • the threshold value used may depend on a number of factors, such as the type of loop being monitored and the timing necessary to resteer pipeline 100 .
  • the resteer address is just the address of the instruction that follows the loop branch in sequence. For one embodiment of the invention, resteer is accomplished by over-riding the default (branch taken) target address indicated by BPS 1 120 .
  • Control module 240 initiates tracking of N_IN_FLT and N_TO_RET, and triggers EOL module 230 as required.
  • control module 240 monitors instructions entering buffer 140 and initializes N_IN_FLT when a loop-start signal (L_INI) is asserted.
  • EC is typically initialized at the start of a modulo-scheduled loop by a MOV_TO_EC instruction.
  • LC may also be initialized at this time by a MOV_TO_LC instruction.
  • L_INI is asserted to control module 240 when a MOV_TO_EC or MOV_TO_LC instruction is detected in buffer 140 , depending on the loop type being monitored.
  • L_INI may also be asserted on the first occurrence of a loop branch following a flush of the back end stages of pipeline 100 . In this case, N_IN_FLT is reset to zero.
  • Control module 240 also receives a signal, L_TERM, which is asserted in response to the approach of a terminal iteration of a loop. For one embodiment, control module 240 deactivates in-flight counter 212 and activates EOL module 230 when L_TERM is asserted. For another embodiment, control module 240 switches counter modes (to termination mode) and activates EOL module 230 when L_TERM is asserted.
  • FIG. 3A is a schematic diagram showing one embodiment of a loop predictor pipeline 300 in accordance with the present invention.
  • Loop prediction pipeline 300 is divided into pipeline stages (“pipe stages”) 301 and 302 to indicate when various elements operate.
  • Loop predictor pipeline 300 is illustrated with exemplary embodiments of counter 210 , EOL module 230 , and control module 240 .
  • the exemplary embodiment of counter 210 includes in-flight counter 212 and termination counter 214.
  • control module 240 activates in-flight counter 212 and EOL module 230 in response to signals from various components of pipeline 100 .
  • Control module 240 includes first and second OR gates 342 , 344 , and an AND gate 348 with an inverted input.
  • OR gate 342 asserts a CNTR_ON signal to in-flight counter 212 when L_INI is asserted.
  • OR gate 344 and AND gate 348 assert a termination mode signal (T_MODE) when L_TERM is asserted and L_INI is deasserted, e.g. when a loop that is in progress approaches termination. T_MODE is deasserted when L_INI is reasserted.
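  • One plausible reading of this control logic is sketched below. The text does not list every gate input, so two things here are assumptions: CNTR_ON is modeled as following L_INI alone, and OR gate 344 is assumed to feed T_MODE back so that it stays asserted until L_INI clears it. The function name is hypothetical.

```python
def control_step(t_mode, l_ini, l_term):
    """Return (cntr_on, next_t_mode) for one clock cycle.

    Assumed structure: OR gate 344 combines L_TERM with the current T_MODE,
    and AND gate 348 masks the result with the inverted L_INI input.
    """
    cntr_on = l_ini
    next_t_mode = (l_term or t_mode) and not l_ini
    return cntr_on, next_t_mode


# Hypothetical sequence: loop start, approach of termination, hold, new loop.
states = []
t_mode = False
for l_ini, l_term in [(True, False), (False, True), (False, False), (True, False)]:
    cntr_on, t_mode = control_step(t_mode, l_ini, l_term)
    states.append((cntr_on, t_mode))
```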
  • In-flight counter 212 is initialized by CNTR_ON to track the number of loop branches that are in process.
  • in-flight counter 212 employs first and second MUXs 310, 312, respectively, and first adder 314 to track the number of valid loop branches loaded into, e.g., buffer 140.
  • MUX 310 couples zeroes to a first input of adder 314 until CNTR_ON is asserted, after which it couples the output of in-flight counter 212 (N_IN_FLT) to the first input of adder 314 .
  • the second input of adder 314 is driven by a hit signal (L_BR) from BPS 1 120 , which increments N_IN_FLT when a loop branch hits in BPS 1 120 .
  • BPS 2 130 may be used to generate L_BR to in-flight counter 212, provided it can be done within the timing constraints of pipeline 300.
  • the incremented value of N_IN_FLT is coupled to one input of MUX 312, the other input of which receives an unincremented version of N_IN_FLT (bypassed from MUX 310).
  • MUX 312 couples the incremented or unincremented value of N_IN_FLT to a second adder 316 , according to whether or not a valid loop branch is detected in pipe stage 302 . This is indicated by BR_VLD, which may be set and reset by branch decoder 160 to confirm that the hit in BPS 1 120 was generated by a valid loop branch.
  • a second adder 316 receives N_IN_FLT at its first input and a branch retirement signal (BR_RET) at its second input.
  • BR_RET is asserted each time a loop branch is retired. It may be generated, for example, by BRU 170 or associated retirement logic.
  • Second adder 316 decrements N_IN_FLT when a loop branch is retired (BR_RET asserted), while first adder 314 and MUX 312 increment N_IN_FLT when a valid loop branch is issued.
  • N_IN_FLT thus represents the number of loop branches issued but not yet retired in pipeline 100 .
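  • The datapath of in-flight counter 212 can be approximated by a single combinational step, as sketched below. The sketch collapses pipe stages 301 and 302 into one function call, which is a simplification, and the function name is hypothetical.

```python
def in_flight_step(n_in_flt, cntr_on, l_br, br_vld, br_ret):
    """One update of N_IN_FLT, following the MUX/adder chain in the text.

    Simplification: stage timing is ignored, so L_BR and BR_VLD are treated
    as referring to the same loop branch in the same call.
    """
    base = n_in_flt if cntr_on else 0            # MUX 310: zeroes until CNTR_ON
    incremented = base + (1 if l_br else 0)      # adder 314: count the BPS1 hit
    selected = incremented if br_vld else base   # MUX 312: keep only valid hits
    return selected - (1 if br_ret else 0)       # adder 316: count retirement
```

    For example, a hit that the decoder does not confirm (L_BR asserted, BR_VLD deasserted) leaves N_IN_FLT unchanged, while a confirmed hit that coincides with a retirement also leaves it unchanged.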
  • Control module 240 updates N_IN_FLT in this manner until L_TERM is asserted, causing loop predictor 190 to enter termination mode (T_MODE asserted).
  • the latest value of N_IN_FLT is provided to termination counter 214, which uses it to determine a number of loop branches yet to be issued (N_TO_ISSUE).
  • adder 314 and MUX 312 of in-flight counter 212 couple LOOP_BR unaltered to termination counter 214, where it is used to update N_TO_ISSUE.
  • termination counter 214 receives the current value of N_IN_FLT along with an indication of the number of iterations of the loop still to be retired (N_TO_RET). Termination counter 214 adjusts N_TO_RET to reflect the number of loop branches in flight (N_IN_FLT), providing a signal (N_TO_ISSUE) that represents the number of loop branches still to be issued. Thereafter, N_TO_ISSUE is decremented by counter 214 each time a valid loop branch (BR_VLD) reaches buffer 140. N_TO_ISSUE is used by EOL module 230 to detect the terminal iteration of the loop.
  • the disclosed embodiment of termination counter 214 includes a MUX 324 and an adder 328.
  • One input of adder 328 receives N_IN_FLT from in-flight counter 212 when termination mode is entered. Thereafter, it receives an indication of each valid loop branch that reaches buffer 140.
  • MUX 324 couples N_TO_RET to adder 328, which subtracts N_IN_FLT to provide N_TO_ISSUE. Thereafter (when L_TERM is deasserted), MUX 324 couples the output of termination counter 214 (N_TO_ISSUE) to adder 328, which adjusts it to reflect any additional loop branches that have reached buffer 140 in the interim.
  • EOL module 230 receives N_TO_ISSUE and compares it with one or more selected threshold values.
  • the threshold values indicate when to initiate a resteer signal in anticipation of the end of the loop.
  • threshold values of 0, 1, and 2 are compared with N_TO_ISSUE.
  • EOL module 230 generates a resteer signal (RESTEER) when N_TO_ISSUE matches one of the threshold values.
  • the disclosed embodiment of EOL module 230 includes three comparators 331-333, four AND gates 334, 335, 336, 337, and OR gate 338.
  • Comparators 331-333 compare the threshold values 0, 1, and 2, respectively, with the current value of N_TO_ISSUE. Their outputs are coupled to inputs of AND gates 334-336, respectively, which are enabled by T_MODE.
  • AND gate 336 must also be enabled by LOOP_BR, which is asserted when a loop branch is detected in pipe stage 302 . For selected loop branch configurations, AND gate 336 eliminates timing constraints that would otherwise be present when two loop branches occur in close succession.
  • OR gate 338 asserts a signal (MATCH) to AND 337 when any of the threshold values has been reached.
  • the output of AND 337 is a signal (END) that is asserted when L_BR and MATCH are asserted concurrently.
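  • The gate network of the EOL module can be written out as boolean expressions, as in the sketch below. It follows the comparators and gates named in the text; the function name is hypothetical, and all three thresholds are evaluated together even though a given loop type may rely on only one of them.

```python
def eol_end(n_to_issue, t_mode, loop_br, l_br):
    """Return the END signal for the current values of the inputs."""
    cmp0 = n_to_issue == 0   # comparator for threshold 0
    cmp1 = n_to_issue == 1   # comparator for threshold 1
    cmp2 = n_to_issue == 2   # comparator for threshold 2

    and334 = cmp0 and t_mode
    and335 = cmp1 and t_mode
    and336 = cmp2 and t_mode and loop_br   # also gated by LOOP_BR

    match = and334 or and335 or and336     # OR gate 338 -> MATCH
    return match and l_br                  # AND 337 -> END
```

    Note that the threshold of 2 only produces a match when LOOP_BR is also asserted, and that no threshold produces END outside termination mode.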
  • the effect of asserting END may depend on the type of loop being processed.
  • the branch prediction provided by BPS 1 for CLOOP, CTOP and WTOP loops is TK.
  • Asserting END may alter the predicted direction to NT, or it may trigger branch decoder 160 to ignore the predicted TK direction and resteer pipeline 100 to the fall through address.
  • a resteer module in branch decoder 160 may provide the resteer address to IP MUX 150 when END is asserted.
  • the branch prediction provided by BPS 1 is NT.
  • Asserting END may alter it to TK, or it may otherwise trigger a resteer to the branch target address.
  • FIG. 3B shows another embodiment of loop prediction pipeline 300 ′ in accordance with the present invention.
  • Loop prediction pipeline 300 ′ employs a single counter 350 having logic to enable two different counting modes.
  • the functions of in-flight counter 212 and termination counter 214 are incorporated in a counter 350 that is capable of operating in two modes, in-flight mode and termination mode.
  • Control module 240 and EOL module 230 are substantially the same as in FIG. 3A. The following discussion focuses on operation of dual mode counter 350 .
  • Dual mode counter 350 includes a MUX 354, MUX control logic 358, first and second adders 360, 364, and increment/decrement blocks 368, 370.
  • MUX control logic 358 monitors T_MODE, BR_RET, L_TERM, BR_VLD, and L_BR signals, and selects an output for MUX 354 from one of its inputs, according to the states of the monitored signals.
  • the output of MUX 354 may represent N_TO_ISSUE or N_IN_FLT, depending on the mode in which counter 350 is operating.
  • MUX 354 receives as inputs (1) logical zero, (2) a copy of its output, (3) a decremented copy of its output; (4) an incremented copy of its output, (5) an output of adder 360 , and (6) an output of adder 364 .
  • the output of adder 360 provides the difference between N_TO_RET and the current value at the output of MUX 354 , e.g. N_IN_FLT.
  • the output of adder 364 provides the difference between N_TO_RET and an incremented copy of the output of MUX 354 .
  • One of the adder output values is selected to determine N_TO_ISSUE when counter 350 transitions from its first mode to its second mode.
  • MUX control module 358 triggers MUX 354 to provide 0 at its output until CNTR_ON is asserted, at which point counter 350 enters a first mode (in-flight mode).
  • In the first mode, counter 350 tracks N_IN_FLT at its output 352 by incrementing (via block 370) or decrementing (via block 368) the value at output 352, depending on the states of signals L_BR, BR_VLD, and BR_RET. For example, when a valid branch enters register 140, L_BR and BR_VLD are asserted, and the incremented value is provided to output 352. When a branch retires, BR_RET is asserted, and the decremented value is provided to output 352.
  • counter 350 switches to a second mode (termination mode).
  • MUX control module 358 causes MUX 354 to couple the output of adder 360 or adder 364 to counter output 352 .
  • the value is the difference between N_TO_RET and N_IN_FLT or N_TO_RET and an incremented value of N_IN_FLT.
  • the first represents the number of loop branches still to be issued when there is no loop branch in pipe stage 301 .
  • the second represents the number of loop branches still to be issued when there is a loop branch in pipe stage 301.
  • The various inputs to MUX 354 and the conditions under which they are selected are summarized in Table 1.
  • C represents the value at the output of MUX 354 . This value is N_IN_FLT when counter 350 is in first mode.
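  • Table 1 is not reproduced in this text, so the sketch below shows only one plausible selection policy for the six MUX 354 inputs, inferred from the surrounding description. The function name, the `entering_term` flag, and the exact priority of the conditions are assumptions.

```python
def dual_mode_next(c, cntr_on, t_mode, entering_term, issue, retire, n_to_ret=0):
    """Select the next output of MUX 354 (assumed selection conditions).

    c      -- current MUX output (N_IN_FLT in first mode, N_TO_ISSUE in second)
    issue  -- a valid loop branch is present (L_BR and BR_VLD asserted)
    retire -- a loop branch retires (BR_RET asserted)
    """
    if not cntr_on:
        return 0                              # input 1: logical zero
    if entering_term:
        if issue:
            return n_to_ret - (c + 1)         # input 6: adder 364
        return n_to_ret - c                   # input 5: adder 360
    if t_mode:
        return c - 1 if issue else c          # termination mode: count issues down
    if issue and not retire:
        return c + 1                          # input 4: incremented copy
    if retire and not issue:
        return c - 1                          # input 3: decremented copy
    return c                                  # input 2: unchanged copy
```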
  • FIG. 4 is an overview of a method 400 for predicting loop branches in accordance with the present invention.
  • Method 400 is initiated 410 when the start of a loop is detected. This may be done, for example, by monitoring one or more counters that are used to track the status of loops and initiating method 400 when one of these counters is initialized.
  • loop branches are tracked 420 through various stages of the process pipeline. In one embodiment of the invention, loop branches that have been issued to various execution resources and loop branches that have been retired are tracked separately. The number of loop branches remaining to be issued is then determined 430 from the tracked loop branches and available loop data. The loop branches remaining to be issued are compared 440 against one or more threshold values. If the comparison generates a match, a resteer signal is generated 450 . Otherwise, method 400 continues tracking 420 loop branches.
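  • The overall flow of method 400 can be exercised with a small cycle-level simulation, sketched below. Everything about the machine model is hypothetical: one loop branch issues per cycle, each retires after a fixed latency, the number still to be retired is known exactly, termination mode starts when that number drops to `term_window`, and the resteer threshold is 1 (checked before the branch is counted as issued).

```python
from collections import deque


def simulate(total_branches, retire_latency, term_window):
    """Return the index of the loop branch on which a resteer is generated,
    or None if termination mode is entered too late to catch the last branch."""
    in_flight = deque()   # retirement cycle of each issued loop branch
    retired = 0
    issued = 0
    n_in_flt = 0
    n_to_issue = None     # None means in-flight mode
    cycle = 0
    while True:
        # Back end: retire loop branches whose latency has elapsed.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            retired += 1
            if n_to_issue is None:
                n_in_flt -= 1
        # Enter termination mode when few branches remain to be retired.
        n_to_ret = total_branches - retired
        if n_to_issue is None and n_to_ret <= term_window:
            n_to_issue = n_to_ret - n_in_flt
        # The loop has already run past its end without a resteer.
        if issued >= total_branches:
            return None
        # Front end: the next loop branch arrives; compare with the threshold.
        if n_to_issue == 1:
            return issued + 1   # resteer on this (final) loop branch
        issued += 1
        in_flight.append(cycle + retire_latency)
        if n_to_issue is None:
            n_in_flt += 1
        else:
            n_to_issue -= 1
        cycle += 1
```

    With `simulate(10, 4, 8)` the resteer lands on the tenth and final loop branch. With a retirement latency of 20 cycles and the same window, termination mode is never entered in time and the function returns `None`, which illustrates why the point at which termination mode begins has to account for the number of branches that can be in flight.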
  • FIG. 5 represents one embodiment of method 400 .
  • a first counter is initiated 520 .
  • the first counter tracks the number of loop branches that have been issued but not yet retired, e.g. N_IN_FLT. For one embodiment, this is accomplished by incrementing the first counter each time a loop branch is fetched to an instruction buffer and decrementing the counter each time a loop branch is retired.
  • a branch termination signal is checked 540 to determine whether the loop is close to its final iteration. This may be determined, for example, by monitoring the EC counter and asserting L_TERM when EC indicates that the loop pipeline is starting to empty.
  • the number of loop branches still to be issued is determined 550. For one embodiment, this is done by reducing the number of loop branches still to be retired (N_TO_RET) by the number of loop branches in process (N_IN_FLT) and thereafter updating N_TO_RET as additional loop branches are issued, e.g. L_BR is asserted.
  • the issued loop branches can be monitored in the front part of the pipeline. Consequently, the number of loop branches still to be issued is useful for predicting the end of the loop, since pipeline resteering is handled in the front end of the pipeline. In the disclosed embodiment, this is accomplished by comparing 560 the number of loop branches remaining to be issued with one or more threshold values. If a match is detected 560 , a resteer signal is generated and the predicted target address is overwritten by the resteer address. If no match is detected 560 , determining step 550 is repeated. In the disclosed embodiment, steps 550 and 560 represent termination mode.
  • the system employs a counter to track the number of in-flight loop branches and the number of loop branches that remains to be issued. The number of remaining loop branches is compared with one or more threshold numbers and a resteer signal is generated when a match is detected.
  • a control module deactivates the first counter and activates the second counter and the comparison logic when the branch nears termination.

Abstract

A loop branch prediction system is provided to predict a final iteration of a loop and resteer an associated fetch module to an appropriate target address. The loop prediction system includes a counter and an end of loop (EOL) module. In one mode, the counter tracks loop branches in process. When a termination condition is detected, the counter switches to a second mode to track the number of loop branches still to be issued. The EOL module compares the number of loop branches still to be issued with one or more threshold values and generates a resteer signal when a match is detected.

Description

    BACKGROUND OF THE INVENTION
  • 1. Field of the Invention [0001]
  • The present invention relates to the field of microprocessors, and in particular, to systems and methods for branch prediction in microprocessors. [0002]
  • 2. Background Art [0003]
  • Advanced processors employ pipelining techniques to execute instructions at very high speeds. On such processors, the overall machine is organized as a pipeline consisting of several cascaded stages of hardware. Instruction processing is divided into a sequence of operations, and each operation is performed by hardware in a corresponding pipeline stage (“pipe stage”). Independent operations from several instructions may be processed simultaneously by different pipe stages, increasing the instruction throughput of the pipeline. Where a pipelined processor includes multiple execution resources in each pipe stage, the throughput of the processor can exceed one instruction per clock cycle. To make full use of this instruction execution capability, the execution resources of the processor must be provided with sufficient instructions from the correct execution path. [0004]
  • Branch instructions pose major challenges to keeping the pipeline filled with instructions from the correct execution path. When a branch instruction is executed and the branch condition met, control flow of the processor jumps to a new code sequence, and instructions from the new code sequence are transferred to the pipeline. Branch execution typically occurs in the back end of the pipeline, while instructions are fetched at the front end of the pipeline. If changes in the control flow are not anticipated correctly, several pipe stages worth of instructions may be fetched from the wrong execution path by the time the branch is resolved. When this occurs, the instructions must be flushed from the pipeline, leaving idle pipe stages (bubbles) until the processor refills the pipeline with instructions from the correct execution path. [0005]
  • To reduce the number of pipeline bubbles, processors incorporate branch prediction modules at the front ends of their pipelines. When a branch instruction enters the front end of the pipeline, the branch prediction module forecasts whether the branch instruction will be taken when it is executed at the back end of the pipeline. If the branch is predicted taken, the branch prediction module indicates a target address to which control of the processor is predicted to jump. A fetch module, which is also located at the front end of the pipeline, fetches instructions beginning at the indicated target address. [0006]
  • Branch instructions are employed extensively in loops to execute a series of instructions (“the loop body”), repeatedly. Modulo-scheduled loops are loops that are organized in a pipelined manner to improve execution efficiency. For one type of loop (top loop), a branch condition is tested following each iteration and control is returned to the first instruction of the loop body if the branch condition is met. The last iteration of the loop occurs when the branch condition is not met, in which case control of the processor passes (“falls through”) to the instruction that follows the loop branch. Thus, the loop branch is taken for all but the final iteration of the top loop. Top loops terminate when the loop branch is not taken. Another type of loop (exit loop) employs a branch at a location other than the end of the loop body. In this case, the loop branch is not taken for all but the final iteration of the loop. Exit loops terminate when the loop branch is taken. [0007]
  • Loops are very common programming structures, and branch prediction systems are typically designed to predict the loop branch conditions correctly for the bulk of the loop iterations. For example, the branch prediction system may be set up to automatically predict top loop branches as taken and exit loop branches as not taken. This strategy provides accurate branch predictions for all but the last iteration of each loop, when the loop condition changes. [0008]
  • Given the ubiquity of loop structures, mispredicting the loop branch on just the terminal iteration can have a significant impact on the overall performance of the processor. This is especially true where the loop is nested within an outer loop, when the loop count is small, or when the loop body is small. In the first case, the misprediction penalty associated with the terminal iteration of the inner loop is repeated for each iteration of the outer loop. In the latter cases, the misprediction penalty may exceed the total number of cycles necessary to execute the loop. [0009]
  • The present invention addresses these and other limitations associated with available branch prediction systems. [0010]
  • SUMMARY OF THE INVENTION
  • The present invention provides a system and method for predicting loop branches, including the loop branch that terminates the loop. [0011]
  • In accordance with the present invention, a loop prediction system includes a counter module, a control module, and an end_of_loop (EOL) module. The counter tracks the number of loop branches that are in process. The control module determines when loop termination approaches, and switches the counter to track the number of loop branches that remain to be issued. The EOL module compares the number of loop branches that remain to be issued with a threshold value and generates a resteer signal when a match is detected. [0012]
  • For one embodiment of the invention, the counter is a dual mode counter that tracks the number of loop branches in process in a first mode and uses this number to track the number of loop branches that remain to be issued in the second mode. For another embodiment of the invention, the counter includes a first counter to track the number of loop branches in process and a second counter to track the number of loop branches that remain to be issued. [0013]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are provided to illustrate selected embodiments of the present invention and are not intended to limit the scope of the invention. [0014]
  • FIG. 1 is a block diagram of a processor pipeline including a loop branch prediction system in accordance with the present invention. [0015]
  • FIG. 2 is a block diagram of one embodiment of the loop prediction system of FIG. 1. [0016]
  • FIG. 3A is a circuit diagram of one embodiment of the loop prediction system of FIG. 2. [0017]
  • FIG. 3B is a circuit diagram of another embodiment of the loop prediction system of FIG. 2. [0018]
  • FIG. 4 is an overview of a method for predicting loop branches in accordance with the present invention. [0019]
  • FIG. 5 is a flowchart of one embodiment of the method shown in FIG. 4. [0020]
  • DETAILED DISCUSSION OF THE INVENTION
  • The following discussion sets forth numerous specific details to provide a thorough understanding of the invention. However, those of ordinary skill in the art, having the benefit of this disclosure, will appreciate that the invention may be practiced without these specific details. In addition, various well known methods, procedures, components, and circuits have not been described in detail in order to focus attention on the features of the present invention. [0021]
  • One of the difficulties of predicting the termination of a loop is that the branch instructions that control looping and update various loop status counters are resolved at the back end of the processor pipeline. Given the number of pipe stages in contemporary processors, timing constraints preclude any direct use of these loop status counters and other architectural data to anticipate a loop's termination and resteer the pipeline appropriately. To be effective, loop terminations and the consequent pipeline resteers should be predicted from information available at the front end of the pipeline, where the instruction fetch module can be resteered soon after the last loop branch enters the pipeline. [0022]
  • The present invention provides a loop branch prediction system that allows the terminal branch of a loop to be accurately predicted at the front end of the processor pipeline. This is accomplished by monitoring loop branch instructions that are in-flight (issued but not yet retired) and available loop data to determine the number of loop branches that are still to be issued. This number is updated to reflect loop branches as they issue and compared with one or more threshold values. When the number reaches a threshold value, termination of the loop is indicated. [0023]
  • For one embodiment, a default loop branch prediction is over-ridden when the threshold value is reached, and the fetch module is resteered to the instruction that follows the loop. The default branch prediction for a top loop branch is, for example, that the branch is “taken” (“TK”). This is overridden to “not taken” (“NT”) when the threshold value is reached. The threshold value may correspond to zero, one, or two loop branches, depending on the type of loop involved and timing constraints for the processor pipeline. [0024]
  • FIG. 1 is a block diagram of a processor pipeline 100 that includes a branch prediction system 180 suitable for use with the present invention. Pipeline 100 is represented as a series of pipeline (“pipe”) stages 101-10x to indicate when different resources operate on a given instruction. The last stage in FIG. 1 is labeled 10x to indicate that one or more pipe stages (not shown) may separate stage 10x from stage 104. Except as noted, signals propagate from left to right, so that the response of circuitry in, e.g., pipe stage 101 on CLK cycle N is propagated to the circuitry of pipe stage 102 on CLK cycle N+1. [0025]
  • Staging latches 128 control the flow of signals between pipe stages 101-10x. Other embodiments of the present invention may employ different relative configurations of branch prediction elements and staging latches 128. For example, the staging latches at the inputs of MUX 150 may be replaced by a single staging latch at its output. The present invention is independent of which relative configuration is employed. [0026]
  • Loop branch prediction system 190 is shown as part of branch prediction system 180, which also includes a first branch prediction structure (BPS1) 120, a second branch prediction structure (BPS2) 130, and a branch decode module 160. A branch execution unit (BRU) 170, an instruction cache 110, an instruction pointer (IP) MUX 150, and an instruction register 140 are also shown in FIG. 1. The disclosed embodiment of loop prediction system 190 employs signals from BPS1 120, decode logic 160, and BRU 170 to anticipate the final iteration of a loop and to resteer processor pipeline 100 to the instruction that follows the loop. [0027]
  • IP MUX 150 couples a selected IP to I-cache 110, BPS1 120, and BPS2 130. On receipt of the IP, I-cache 110, BPS1 120 and BPS2 130 perform their respective look-up procedures to determine whether they have an entry corresponding to the received IP. When an IP hits, e.g. matches, an entry in I-cache 110, data at the associated entry (the instruction pointed to by the IP) is forwarded to the next stage in pipeline 100. When an instruction hits in BPS1 120 or BPS2 130, branch prediction information is coupled back to IP MUX 150 and branch decode module 160 is notified. [0028]
  • In the disclosed embodiment of branch prediction system 180, BPS1 120 and BPS2 130 are two structures in a branch prediction hierarchy that is designed to provide rapid resteering of pipeline 100. For one embodiment, BPS1 120 accommodates branch prediction information for a limited number of loop branch instructions. An embodiment of BPS1 120 having four fully associative entries indexed by partial IPs may support single cycle (zero bubble) resteers. The target addresses of selected top loop branches may be stored in BPS1 120 to resteer pipeline 100 on the repeated iterations of the loop body. An embodiment of BPS2 130 may store predicted resolution and target address information for 64 entries in a four way set associative configuration. [0029]
  • The present invention does not require a particular branch prediction hierarchy as long as target addresses can be provided for timely pipeline resteers. For example, a single storage structure for branch prediction information may be employed in place of BPS1 120 and BPS2 130. An advantage of the hierarchy in the disclosed embodiment is that it reduces some of the timing constraints imposed on loop branch predictions. [0030]
  • Branch decode module 160 maintains the branch prediction information in BPS1 120 and BPS2 130 and provides information to loop predictor 190 on the types of instructions in buffer 140. Decode module 160 may also implement checks on various branch related information to facilitate uninterrupted processing of branch-related instructions. Branch-related instructions include various types of branch instructions as well as instructions that deliver prediction information to BPS1 120 and BPS2 130. Decode module 160 includes logic to decode branch-related instructions in buffer 140 and update BPS1 120, BPS2 130 (BR structures), and loop predictor 190 accordingly. [0031]
  • Buffer 140 provides instructions received from, e.g., I-cache 110 to resources in the back end of pipeline 100. These resources include BRU 170, which executes selected branch-related instructions and generates information to update the architectural state of the processor when and if the instruction is retired. For example, BRU 170 provides data for maintaining a loop counter (LC) and an epilog counter (EC) to track the status of loops in process. When a counted loop is detected, LC is initialized to a value indicating the number of times the counted loop will be iterated. For a modulo-scheduled (“software pipelined”) loop, EC is initialized to a value indicating the number of stages in the software pipeline. Initial values of EC and/or LC may be determined by the compiler and provided to the processor through loop instructions. [0032]
  • For example, in a modulo-scheduled counted loop, LC is decremented on each iteration of the loop, reaching zero when the last loop branch, i.e. the last loop iteration, is detected. This signals the start of the epilog. EC is decremented as instructions are drained from the stages of the software pipeline on subsequent clock cycles. All instructions in the final iteration of the loop are complete when EC is zero. LC and EC may thus be used to determine when a modulo-scheduled counted loop is about to terminate. For non-pipelined counted loops, a threshold value of LC may be used to determine when loop termination approaches. For modulo scheduled “while” loops, the epilog begins when a predicate associated with the loop condition becomes zero. Loop termination for “while” loops may thus be indicated by the loop predicate and/or changes in EC. [0033]
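The interplay of LC and EC described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the architecture's exact definition: the function name `simulate_counted_loop` is hypothetical, the loop branch is assumed to be taken while LC is nonzero, then taken while EC is greater than one (the epilog), and to fall through on the final pass.

```python
def simulate_counted_loop(lc, ec):
    """Simplified model of a modulo-scheduled counted top loop.

    Returns (outcomes, lc, ec), where outcomes lists each loop-branch
    resolution in order (True = taken, False = fall through).
    Assumption: LC counts down first, then EC drains the epilog stages.
    """
    outcomes = []
    while True:
        if lc > 0:
            lc -= 1                 # kernel iteration: LC counts down
            outcomes.append(True)
        elif ec > 1:
            ec -= 1                 # epilog: software pipeline drains
            outcomes.append(True)
        else:
            ec = 0                  # final pass: loop branch falls through
            outcomes.append(False)
            return outcomes, lc, ec


outcomes, lc, ec = simulate_counted_loop(lc=3, ec=2)
print(outcomes)  # only the last entry is False
```

In this model only the last loop branch resolves differently from the default prediction, which is the single branch the loop predictor aims to catch.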
  • Because BRU 170 is at the back end of pipeline 100 and branch prediction system 180 is at the front end of pipeline 100, it is not sufficient to monitor LC and EC to predict the termination of a loop. Given the multiple stages of pipeline 100, the final loop branch instruction will retire (and LC and/or EC will be updated) multiple clock cycles after pipeline 100 should have been resteered to the instruction sequence that follows the loop. A successful loop prediction scheme provides a termination prediction while loop branch instructions are still in process in pipeline 100. The largest performance benefit is obtained when the loop termination can be predicted soon after the final loop branch has entered pipeline 100. [0034]
  • FIG. 2 is a block diagram of one embodiment of loop predictor 190 of FIG. 1. The disclosed embodiment of loop predictor 190 includes a counter 210, an end_of_loop (EOL) module 230, and a control module 240. Counter 210 includes circuitry to track the number of loop branch instructions that are in process (N_IN_FLT) and the number of loop branch instructions yet to issue (N_TO_ISSUE). [0035]
  • For one embodiment of the invention, N_IN_FLT includes all loop branch instructions that have been loaded into buffer 140 but have not yet been retired. These may be tracked by incrementing N_IN_FLT when a loop branch is issued at the front end of pipeline 100 and decrementing N_IN_FLT when a loop branch is retired at the back end of pipeline 100. In the disclosed embodiment, a signal L_BR is asserted to counter 210 when a loop branch is issued, and a signal BR_RET is asserted to counter 210 when a loop branch retires. [0036]
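As a behavioral illustration only, the in-flight count can be modeled as a counter updated once per clock from the two signals named above. The class `InFlightCounter` and its `clock` method are hypothetical names; the hardware is a circuit, not software, and this sketch ignores pipeline staging.

```python
class InFlightCounter:
    """Behavioral model of N_IN_FLT: loop branches issued but not retired."""

    def __init__(self):
        self.n_in_flt = 0

    def clock(self, l_br=False, br_ret=False):
        # L_BR asserted: a loop branch was issued at the front end.
        if l_br:
            self.n_in_flt += 1
        # BR_RET asserted: a loop branch retired at the back end.
        if br_ret:
            self.n_in_flt -= 1
        return self.n_in_flt


counter = InFlightCounter()
counter.clock(l_br=True)
counter.clock(l_br=True)
counter.clock(l_br=True, br_ret=True)  # issue and retire in the same cycle
print(counter.n_in_flt)
```

Allowing both signals in one cycle reflects that the front end and back end operate concurrently on different loop branches.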
  • For one embodiment of the invention, counter 210 begins tracking N_TO_ISSUE as the loop approaches its terminal iteration, signaling entry into termination mode. For example, N_TO_ISSUE may be determined by the difference between an expected number of loop branches still to be retired (N_TO_RET) and N_IN_FLT as termination mode is reached. Thereafter, N_TO_ISSUE is decremented for each additional loop branch issued, e.g. each time L_BR is asserted. [0037]
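A short worked example may make the arithmetic concrete. The helper names `enter_termination_mode` and `on_loop_branch_issued` and the sample values are illustrative assumptions; only the relationship N_TO_ISSUE = N_TO_RET - N_IN_FLT and the per-issue decrement come from the text.

```python
def enter_termination_mode(n_to_ret, n_in_flt):
    """Initial N_TO_ISSUE: branches still to retire minus those already in flight."""
    return n_to_ret - n_in_flt


def on_loop_branch_issued(n_to_issue):
    """Each newly issued loop branch (L_BR asserted) leaves one fewer to issue."""
    return n_to_issue - 1


# Illustrative values: 5 loop branches must still retire, 3 are already in flight.
n_to_issue = enter_termination_mode(n_to_ret=5, n_in_flt=3)
trace = [n_to_issue]
while n_to_issue > 0:
    n_to_issue = on_loop_branch_issued(n_to_issue)
    trace.append(n_to_issue)
print(trace)
```

The count reaches zero once every remaining loop branch has entered the pipeline, even though several of them have not yet retired.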
  • For one embodiment of the invention, counter 210 may be a dual mode counter in which N_IN_FLT is tracked in a first mode and N_TO_ISSUE is tracked in a second, e.g. termination, mode (FIG. 3B). For another embodiment of the invention, counter 210 may include separate counters to track N_IN_FLT and N_TO_ISSUE (FIG. 3A). [0038]
  • A switch between counting modes (or between counters) is triggered when the terminal iteration of a loop is approached. As noted above, the point at which the switch occurs may depend on the type of loop involved. For example, the approach of the terminal iteration for a counted loop may be indicated by a value of LC below a threshold value. The approach of termination for a modulo-scheduled counted loop, e.g. CEXIT or CTOP, may be indicated by a value of LC and/or EC below a threshold value. For modulo-scheduled while loops, e.g. WEXIT or WTOP, approach of the terminal iteration may be indicated by a value of EC below a threshold value and/or by a change in the state of the loop predicate. [0039]
  • For another embodiment of the invention, a predicted number of loop iterations may be used to determine when the terminal iteration is being approached. For example, processor 100 may store a number of iterations for recent loops. When one of these loops is encountered again, the difference between the current number of iterations and the predicted number of iterations (based on the previous encounter) may be compared with a threshold value. In this embodiment, termination mode is indicated when the difference falls below the threshold value. [0040]
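The history-based variant might be sketched as follows. Everything here is an assumption made for illustration: the dictionary `iteration_history`, keying by loop-branch address, and the function names are hypothetical, and the text does not say how the stored counts are organized.

```python
# Hypothetical history table: loop-branch address -> iterations seen last time.
iteration_history = {}


def record_loop_exit(branch_ip, iterations):
    """Remember how many iterations the loop ran on its previous encounter."""
    iteration_history[branch_ip] = iterations


def near_termination(branch_ip, current_iteration, threshold):
    """True when the predicted remaining iterations fall below the threshold."""
    predicted = iteration_history.get(branch_ip)
    if predicted is None:
        return False  # no history for this loop: never signal termination
    return (predicted - current_iteration) < threshold


record_loop_exit(0x4000, 10)
print(near_termination(0x4000, current_iteration=5, threshold=2))
print(near_termination(0x4000, current_iteration=9, threshold=2))
```

A prediction from history can be wrong when the trip count changes between encounters, so a design along these lines would still need the normal misprediction recovery path.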
  • In the dual mode implementation of counter 210, the counter is switched to termination mode when the terminal iteration of the loop approaches. In the two counter implementation of counter 210, termination counter 214 is activated. In both cases, the value of N_IN_FLT is used to initialize N_TO_ISSUE. The two counter implementation of counter 210 is discussed in conjunction with FIG. 3A. The dual mode implementation of counter 210 is discussed in conjunction with FIG. 3B. [0041]
  • For one embodiment, counter 210 is initialized to N_TO_ISSUE when termination mode is entered, using the current values of N_IN_FLT and N_TO_RET. N_TO_RET may be derived, for example, from LC and/or EC. Thereafter, N_TO_ISSUE is adjusted to reflect any new loop branch instructions that enter pipeline 100. The adjusted value represents the expected number of loop branches still to be issued before the termination of the loop. [0042]
  • EOL module 230 is coupled to monitor N_TO_ISSUE. EOL module 230 compares N_TO_ISSUE to one or more threshold values and generates a resteer signal when a match occurs. The threshold value used may depend on a number of factors, such as the type of loop being monitored and the timing necessary to resteer pipeline 100. When the loop terminates on a fall through branch, e.g. the loop branch is NT on the final iteration, the resteer address is just the address of the instruction that follows the loop branch in sequence. For one embodiment of the invention, resteer is accomplished by over-riding the default (branch taken) target address indicated by BPS1 120. [0043]
  • Control module 240 initiates tracking of N_IN_FLT and N_TO_RET, and triggers EOL module 230 as required. In one embodiment of the invention, control module 240 monitors instructions entering buffer 140 and initializes N_IN_FLT when a loop-start signal (L_INI) is asserted. For example, EC is typically initialized at the start of a modulo-scheduled loop by a MOV_TO_EC instruction. For counted loops, LC may also be initialized at this time by a MOV_TO_LC instruction. For one embodiment of the invention, L_INI is asserted to control module 240 when a MOV_TO_EC or MOV_TO_LC instruction is detected in buffer 140, depending on the loop type being monitored. L_INI may also be asserted on the first occurrence of a loop branch following a flush of the back end stages of pipeline 100. In this case, N_IN_FLT is reset to zero. [0044]
  • Control module 240 also receives a signal, L_TERM, which is asserted in response to the approach of a terminal iteration of a loop. For one embodiment, control module 240 deactivates in-flight counter 212 and activates EOL module 230 when L_TERM is asserted. For another embodiment, control module 240 switches counter modes (to termination mode) and activates EOL module 230 when L_TERM is asserted. [0045]
  • FIG. 3A is a schematic diagram showing one embodiment of a loop predictor pipeline 300 in accordance with the present invention. Loop predictor pipeline 300 is divided into pipeline stages (“pipe stages”) 301 and 302 to indicate when various elements operate. Loop predictor pipeline 300 is illustrated with exemplary embodiments of counter 210, EOL module 230, and control module 240. The exemplary embodiment of counter 210 includes in-flight counter 212 and termination counter 214. [0046]
  • In the disclosed embodiment, control module 240 activates in-flight counter 212 and EOL module 230 in response to signals from various components of pipeline 100. Control module 240 includes first and second OR gates 342, 344, and an AND gate 348 with an inverted input. OR gate 342 asserts a CNTR_ON signal to in-flight counter 212 when L_INI is asserted. OR gate 344 and AND gate 348 assert a termination mode signal (T_MODE) when L_TERM is asserted and L_INI is deasserted, e.g. when a loop that is in progress approaches termination. T_MODE is deasserted when L_INI is reasserted. [0047]
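The gate behavior described above can be approximated by a small next-state function. This is a sketch under assumptions: the text does not list every gate input, so the feedback of the previous CNTR_ON and T_MODE values into the OR gates (which makes both signals hold once set) is assumed here, and `control_step` is a hypothetical name.

```python
def control_step(l_ini, l_term, cntr_on_prev=False, t_mode_prev=False):
    """One clock of a simplified control-module model.

    Assumed feedback: each OR gate also sees the previous value of its
    own output signal, so CNTR_ON and T_MODE stay asserted once set.
    """
    # OR gate 342 (assumed feedback): counter stays on after L_INI.
    cntr_on = l_ini or cntr_on_prev
    # OR gate 344 feeding AND gate 348, whose inverted input is L_INI:
    # T_MODE sets on L_TERM, holds, and clears when L_INI is reasserted.
    t_mode = (l_term or t_mode_prev) and not l_ini
    return cntr_on, t_mode


state = control_step(l_ini=True, l_term=False)           # loop start
state = control_step(False, True, *state)                # termination nears
state = control_step(False, False, *state)               # T_MODE holds
print(state)
state = control_step(True, False, *state)                # next loop starts
print(state)
```

Clearing T_MODE on L_INI lets the same logic be reused for the next loop without a separate reset signal.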
  • In-flight counter 212 is initialized by CNTR_ON to track the number of loop branches that are in process. In particular, in-flight counter 212 employs first and second MUXs 310, 312, respectively, and first adder 314 to track the number of valid loop branches loaded into, e.g., buffer 140. MUX 310 couples zeroes to a first input of adder 314 until CNTR_ON is asserted, after which it couples the output of in-flight counter 212 (N_IN_FLT) to the first input of adder 314. The second input of adder 314 is driven by a hit signal (L_BR) from BPS1 120, which increments N_IN_FLT when a loop branch hits in BPS1 120. In an alternative embodiment, BPS2 130 may be used to generate L_BR to in-flight counter 212, provided it can be done within the timing constraints of pipeline 300. [0048]
  • The incremented value of N_IN_FLT is coupled to one input of MUX 312, the other input of which receives an unincremented version of N_IN_FLT (bypassed from MUX 310). MUX 312 couples the incremented or unincremented value of N_IN_FLT to a second adder 316, according to whether or not a valid loop branch is detected in pipe stage 302. This is indicated by BR_VLD, which may be set and reset by branch decoder 160 to confirm that the hit in BPS1 120 was generated by a valid loop branch. [0049]
  • Second adder 316 receives N_IN_FLT at its first input and a branch retirement signal (BR_RET) at its second input. BR_RET is asserted each time a loop branch is retired. It may be generated, for example, by BRU 170 or associated retirement logic. Second adder 316 decrements N_IN_FLT when a loop branch is retired (BR_RET asserted), while first adder 314 and MUX 312 increment N_IN_FLT when a valid loop branch is issued. N_IN_FLT thus represents the number of loop branches issued but not yet retired in pipeline 100. [0050]
  • Control module 240 updates N_IN_FLT in this manner until L_TERM is asserted, causing loop predictor 190 to enter termination mode (T_MODE asserted). When termination mode is initiated, the latest value of N_IN_FLT is provided to termination counter 214, which uses it to determine a number of loop branches yet to be issued (N_TO_ISSUE). In termination mode, adder 314 and MUX 312 of in-flight counter 212 couple LOOP_BR unaltered to termination counter 214, where it is used to update N_TO_ISSUE. [0051]
  • When L_TERM is first asserted, termination counter 214 receives the current value of N_IN_FLT along with an indication of the number of iterations of the loop still to be retired (N_TO_RET). Termination counter 214 adjusts N_TO_RET to reflect the number of loop branches in flight (N_IN_FLT), providing a signal (N_TO_ISSUE) that represents the number of loop branches still to be issued. Thereafter, N_TO_ISSUE is decremented by termination counter 214 each time a valid loop branch (BR_VLD) reaches buffer 140. N_TO_ISSUE is used by EOL module 230 to detect the terminal iteration of the loop. [0052]
  • The disclosed embodiment of termination counter 214 includes a MUX 324 and an adder 328. One input of adder 328 receives N_IN_FLT from in-flight counter 212 when termination mode is entered. Thereafter, it receives an indication of each valid loop branch that reaches buffer 140. On assertion of L_TERM, MUX 324 couples N_TO_RET to adder 328, which subtracts N_IN_FLT to provide N_TO_ISSUE. Thereafter (when L_TERM is deasserted), MUX 324 couples the output of termination counter 214 (N_TO_ISSUE) to adder 328, which adjusts it to reflect any additional loop branches that have reached buffer 140 in the interim. [0053]
  • EOL module 230 receives N_TO_ISSUE and compares it with one or more selected threshold values. For one embodiment, the threshold values indicate when to initiate a resteer signal in anticipation of the end of the loop. Depending on the type of loop being predicted, threshold values of 0, 1, and 2 are compared with N_TO_ISSUE. EOL module 230 generates a resteer signal (RESTEER) when N_TO_ISSUE matches one of the threshold values. [0054]
  • The disclosed embodiment of EOL module 230 includes three comparators 331-333, four AND gates 334, 335, 336, 337, and an OR gate 338. Comparators 331-333 compare the threshold values 0, 1, and 2, respectively, with the current value of N_TO_ISSUE. Their outputs are coupled to inputs of AND gates 334-336, respectively, which are enabled by T_MODE. AND gate 336 must also be enabled by LOOP_BR, which is asserted when a loop branch is detected in pipe stage 302. For selected loop branch configurations, AND gate 336 eliminates timing constraints that would otherwise be present when two loop branches occur in close succession. [0055]
  • OR gate 338 asserts a signal (MATCH) to AND gate 337 when any of the threshold values has been reached. The output of AND gate 337 is a signal (END) that is asserted when L_BR and MATCH are asserted concurrently. The effect of asserting END may depend on the type of loop being processed. For one embodiment, the branch prediction provided by BPS1 for CLOOP, CTOP and WTOP loops is TK. Asserting END may alter the predicted direction to NT, or it may trigger branch decoder 160 to ignore the predicted TK direction and resteer pipeline 100 to the fall through address. For example, a resteer module in branch decoder 160 may provide the resteer address to IP MUX 150 when END is asserted. For the case of a CEXIT or WEXIT loop, the branch prediction provided by BPS1 is NT. Asserting END may alter it to TK, or it may otherwise trigger a resteer to the branch target address. [0056]
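The comparator and gate arrangement of the EOL module can be expressed as a pure boolean function. This is a behavioral sketch that assumes the gates behave exactly as described in the text; the function name `eol_end` is hypothetical, and any selection of thresholds by loop type is not modeled.

```python
def eol_end(n_to_issue, t_mode, l_br, loop_br):
    """Return the END signal for one cycle of a simplified EOL model."""
    # Comparators 331-333 test N_TO_ISSUE against thresholds 0, 1 and 2.
    # AND gates 334-336 pass a match only while T_MODE is asserted;
    # gate 336 additionally requires LOOP_BR (loop branch in pipe stage 302).
    match_0 = t_mode and n_to_issue == 0
    match_1 = t_mode and n_to_issue == 1
    match_2 = t_mode and loop_br and n_to_issue == 2
    # OR gate 338 forms MATCH; AND gate 337 qualifies it with L_BR.
    match = match_0 or match_1 or match_2
    return match and l_br


print(eol_end(n_to_issue=1, t_mode=True, l_br=True, loop_br=False))
print(eol_end(n_to_issue=2, t_mode=True, l_br=True, loop_br=False))
print(eol_end(n_to_issue=2, t_mode=True, l_br=True, loop_br=True))
```

Gating the threshold-2 comparison on LOOP_BR models the case where a second loop branch is already one stage ahead, so the branch now being predicted is effectively the last one.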
  • FIG. 3B shows another embodiment of loop prediction pipeline 300′ in accordance with the present invention. Loop prediction pipeline 300′ employs a single counter 350 having logic to enable two different counting modes. In this embodiment, the functions of in-flight counter 212 and termination counter 214 are incorporated in a counter 350 that is capable of operating in two modes, in-flight mode and termination mode. Control module 240 and EOL module 230 are substantially the same as in FIG. 3A. The following discussion focuses on operation of dual mode counter 350. [0057]
  • Dual mode counter 350 includes a MUX 354, MUX control logic 358, first and second adders 360, 364, and increment/decrement blocks 368, 370. MUX control logic 358 monitors T_MODE, BR_RET, L_TERM, BR_VLD, and L_BR signals, and selects an output for MUX 354 from one of its inputs, according to the states of the monitored signals. The output of MUX 354 may represent N_TO_ISSUE or N_IN_FLT, depending on the mode in which counter 350 is operating. [0058]
  • [0059] MUX 354 receives as inputs (1) logical zero, (2) a copy of its output, (3) a decremented copy of its output, (4) an incremented copy of its output, (5) an output of adder 360, and (6) an output of adder 364. The output of adder 360 provides the difference between N_TO_RET and the current value at the output of MUX 354, e.g. N_IN_FLT. The output of adder 364 provides the difference between N_TO_RET and an incremented copy of the output of MUX 354. One of the adder output values is selected to determine N_TO_ISSUE when counter 350 transitions from its first mode to its second mode.
  • [0060] In operation, MUX control module 358 triggers MUX 354 to provide 0 at its output until CNTR_ON is asserted, at which point counter 350 enters a first mode (in-flight mode). In first mode, counter 350 tracks N_IN_FLT at its output 352 by incrementing (via block 370) or decrementing (via block 368) the value at output 352 depending on the states of signals L_BR, BR_VLD, and BR_RET. For example, when a valid branch enters register 140, L_BR and BR_VLD are asserted, and the incremented value is provided to output 352. When a branch retires, BR_RET is asserted, and the decremented value is provided to output 352.
  • [0061] When T_MODE is asserted, counter 350 switches to a second mode (termination mode). When T_MODE is asserted, MUX control module 358 causes MUX 354 to couple the output of adder 360 or adder 364 to counter output 352. The value is the difference between N_TO_RET and N_IN_FLT or between N_TO_RET and an incremented value of N_IN_FLT. The first represents the number of loop branches still to be issued when there is no loop branch in pipe stage 301. The second represents the number of loop branches still to be issued when there is a loop branch in pipe stage 301. The various inputs to MUX 354 and the conditions under which they are selected are summarized in Table 1.
    TABLE 1
    MUX INPUT          | FIRST MODE                            | SECOND MODE
    0                  | MOV_TO_LC, MOV_TO_EC, Back End Flush  | MOV_TO_LC, MOV_TO_EC, Back End Flush
    C                  | Non-loop events                       | Non-loop events
    C − 1              | BR_RET Asserted                       | L_BR Asserted
    C + 1              | L_BR Asserted                         | NA
    N_TO_RET − C       | L_TERM Asserted & L_BR Not Asserted   | NA
    N_TO_RET − (C + 1) | L_TERM & L_BR Asserted                | NA
  • [0062] Here, C represents the value at the output of MUX 354. This value is N_IN_FLT when counter 350 is in first mode.
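The MUX selection rules of Table 1 can be illustrated with a small software model. This is a sketch under stated assumptions rather than the disclosed hardware: the class name `DualModeCounter` and its `step` method are hypothetical, `reset` stands in for MOV_TO_LC, MOV_TO_EC, or a back end flush and is assumed to return the counter to first mode, reset is assumed to take priority over L_TERM, and a cycle in which L_BR and BR_RET are both asserted in first mode is assumed to leave C unchanged (the table does not address that case).

```python
class DualModeCounter:
    """Toy model of a dual mode counter; c is the value at the MUX output."""

    def __init__(self) -> None:
        self.c = 0
        self.termination_mode = False  # False means first (in-flight) mode

    def step(self, *, reset: bool = False, l_br: bool = False,
             br_ret: bool = False, l_term: bool = False,
             n_to_ret: int = 0) -> int:
        """Advance one cycle and return the new counter value."""
        if reset:
            # MUX input 0; returning to first mode is an assumption.
            self.c = 0
            self.termination_mode = False
        elif self.termination_mode:
            # Second mode: c is N_TO_ISSUE; each issued loop branch selects C - 1.
            if l_br:
                self.c -= 1
        elif l_term:
            # Mode switch: N_TO_RET - C, or N_TO_RET - (C + 1) when a loop
            # branch is being issued in the same cycle.
            in_flight = self.c + 1 if l_br else self.c
            self.c = n_to_ret - in_flight
            self.termination_mode = True
        else:
            # First mode: c is N_IN_FLT; C + 1 on L_BR, C - 1 on BR_RET.
            self.c += int(l_br) - int(br_ret)
        return self.c


if __name__ == "__main__":
    ctr = DualModeCounter()
    ctr.step(l_br=True)                     # N_IN_FLT = 1
    ctr.step(l_br=True)                     # N_IN_FLT = 2
    ctr.step(br_ret=True)                   # N_IN_FLT = 1
    print(ctr.step(l_term=True, n_to_ret=4))  # N_TO_ISSUE = 4 - 1
```

A single value serves both roles here, which is the point of the dual mode design: the in-flight count accumulated in first mode seeds the to-be-issued count at the mode switch.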
  • [0063] FIG. 4 is an overview of a method 400 for predicting loop branches in accordance with the present invention. Method 400 is initiated 410 when the start of a loop is detected. This may be done, for example, by monitoring one or more counters that are used to track the status of loops and initiating method 400 when one of these counters is initialized. Following initiation, loop branches are tracked 420 through various stages of the process pipeline. In one embodiment of the invention, loop branches that have been issued to various execution resources and loop branches that have been retired are tracked separately. The number of loop branches remaining to be issued is then determined 430 from the tracked loop branches and available loop data. The loop branches remaining to be issued are compared 440 against one or more threshold values. If the comparison generates a match, a resteer signal is generated 450. Otherwise, method 400 continues tracking 420 loop branches.
  • [0064] FIG. 5 represents one embodiment of method 400. When a loop start is detected 510, a first counter is initiated 520. The first counter tracks the number of loop branches that have been issued but not yet retired, e.g. N_IN_FLT. For one embodiment, this is accomplished by incrementing the first counter each time a loop branch is fetched to an instruction buffer and decrementing the counter each time a loop branch is retired. In addition to tracking 530 in-process loop branches, a branch termination signal is checked 540 to determine whether the loop is close to its final iteration. This may be determined, for example, by monitoring the EC counter and asserting L_TERM when EC indicates that the loop pipeline is starting to empty.
  • [0065] When the loop approaches its final iteration 540, the number of loop branches still to be issued is determined 550. For one embodiment, this is done by reducing the number of loop branches still to be retired (N_TO_RET) by the number of loop branches in process (N_IN_FLT) and thereafter updating N_TO_RET as additional loop branches are issued, e.g. L_BR is asserted.
  • [0066] The issued loop branches can be monitored in the front part of the pipeline. Consequently, the number of loop branches still to be issued is useful for predicting the end of the loop, since pipeline resteering is handled in the front end of the pipeline. In the disclosed embodiment, this is accomplished by comparing 560 the number of loop branches remaining to be issued with one or more threshold values. If a match is detected 560, a resteer signal is generated and the predicted target address is overwritten by the resteer address. If no match is detected 560, determining step 550 is repeated. In the disclosed embodiment, steps 550 and 560 represent termination mode.
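The flow of FIG. 5 can be walked through on a short event trace. The sketch below is illustrative only: the function `first_resteer` and the event encoding (`("issue",)`, `("retire",)`, and `("term", n_to_ret)`) are hypothetical, and the comparison against the thresholds is assumed to occur after every event once termination mode has been entered, which simplifies the timing of the disclosed hardware.

```python
def first_resteer(events, thresholds=(0,)):
    """Return the index of the event at which a resteer is generated, or None.

    events: sequence of tuples, ("issue",), ("retire",) or ("term", n_to_ret).
    """
    n_in_flt = 0        # loop branches issued but not yet retired
    n_to_issue = None   # undefined until the termination signal is seen
    for i, event in enumerate(events):
        kind = event[0]
        if n_to_issue is None:
            # In-flight tracking before the loop nears termination.
            if kind == "issue":
                n_in_flt += 1
            elif kind == "retire":
                n_in_flt -= 1
            elif kind == "term":
                # Branches still to issue = still to retire - in flight.
                n_to_issue = event[1] - n_in_flt
        elif kind == "issue":
            # Termination mode: each issued loop branch leaves one fewer.
            n_to_issue -= 1
        if n_to_issue is not None and n_to_issue in thresholds:
            return i
    return None


if __name__ == "__main__":
    trace = [("issue",), ("issue",), ("retire",), ("issue",),
             ("term", 4), ("issue",), ("issue",), ("issue",)]
    print(first_resteer(trace))
```

In this trace two loop branches are in flight when the termination signal arrives with N_TO_RET equal to 4, so two branches remain to be issued and the count reaches the threshold of 0 on the second subsequent issue.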
  • [0067] There has thus been provided a system and method for predicting loop branches and, in particular, for predicting the termination of loop branches to eliminate a misprediction on the terminating branch. The system employs a counter to track the number of in-flight loop branches and the number of loop branches that remains to be issued. The number of remaining loop branches is compared with one or more threshold numbers and a resteer signal is generated when a match is detected. In one embodiment, a control module deactivates the first counter and activates the second counter and the comparison logic when the branch nears termination.

Claims (26)

What is claimed is:
1. A loop branch prediction system comprising:
a counter to track a number of loop branch instructions to be issued; and
an end_of_loop (EOL) module to generate a resteer signal when the number of loop branch instructions to be issued reaches a threshold value.
2. The loop branch prediction system of claim 1, wherein the counter tracks a number of loop branch instructions in-flight and determines the number of loop branch instructions to be issued from the number of loop branch instructions in-flight when a loop termination condition is detected.
3. The loop branch prediction system of claim 2, further comprising a control module to detect a loop termination condition and trigger the counter to determine the number of loop branch instructions to be issued.
4. The loop branch prediction system of claim 1, wherein the counter has a first mode for tracking a number of loop branch instructions in-flight and a second mode for tracking the number of loop branch instructions to be issued.
5. The loop branch prediction system of claim 4, further comprising a control module to detect a loop termination condition and trigger the counter to switch from the first mode to the second mode.
6. The loop branch prediction system of claim 5, wherein the counter is initialized in the second mode to a value determined from a difference between an expected number of loop branch instructions to be retired and the number of loop branch instructions in flight.
7. The loop branch prediction system of claim 1, wherein the counter includes a first counter to track a number of loop branch instructions in flight and a second counter to track the number of loop branches to be issued.
8. The loop branch prediction system of claim 7, wherein the second counter is initialized to a value representing a number of loop instructions to be issued when a termination condition is detected.
9. The loop branch prediction system of claim 8, wherein the number of loop branch instructions to be issued is a difference between the number of loop branch instructions in flight and an expected number of loop branch instructions loop when the termination condition is detected.
10. A processor comprising:
a branch execution system to execute loop branch instructions;
a counter to track a number of loop branch instructions yet to issue to the branch execution system; and
an end_of_loop detector to generate a resteer signal when the number of loop branch instructions yet to issue reaches a selected value.
11. The processor of claim 10, wherein the counter tracks a number of branch instructions in-flight in a first mode, tracks the number of branch instructions yet to issue in a second mode, and switches from the first mode to the second mode responsive to a termination signal.
12. The processor of claim 11, further including a control module to monitor various loop status indicators and trigger a mode switch in the counter according to the monitored loop status indicators.
13. The processor of claim 12, wherein the number tracked in the first mode is used to initialize the counter in the second mode, when a mode switch is triggered.
14. The processor of claim 10, further including a decoder to identify issued loop branch instructions to the counter.
15. The processor of claim 14, further including a resteer module to provide a fall through address on receipt of the resteer signal.
16. A method for predicting loop branches comprising:
counting a number of in-flight loop branches;
counting a number of retired loop branches;
determining a number of outstanding loop branches from the numbers of in-flight and retired loop branches;
asserting a loop branch resteer signal when the number of outstanding loop branches reaches a threshold value.
17. The method of claim 16, wherein counting the number of in-flight loop branches comprises counting the number of loop branches issued to an instruction queue.
18. The method of claim 16, wherein counting the number of retired loop branches comprises counting the number of loop branches retired by a branch execution unit.
19. The method of claim 16, wherein determining the number of outstanding loop branches comprises:
determining a number of remaining loop branches from the number of retired loop branches; and
subtracting the number of in-flight branches from the number of remaining branches.
20. A processor comprising:
a counter to track loop branch instructions in flight and loop branch instructions yet to issue;
a branch execution system to receive issued loop branch instructions, retire the received loop branch instructions, and track the retired loop branch instructions; and
an end-of-loop module to compare the loop branch instructions yet to issue with a threshold value and assert a resteer signal when a match is indicated.
21. The processor of claim 20, wherein the end-of-loop module compares the outstanding loop branch count to the threshold value and asserts the resteer signal when a match is detected.
22. The processor of claim 21, further comprising a termination detector to switch the counter between tracking loop branch instructions in flight and loop branch instructions yet to issue on receipt of a loop termination signal from the branch execution system.
23. A processor comprising:
means for identifying a branch instruction associated with a loop;
means for executing the identified branch instruction; and
means for predicting termination of the loop using information from the identifying and processing means.
24. The processor of claim 23, wherein the predicting means comprises:
means for tracking a difference between a number of times the branch instruction is identified and a number of times the branch instruction is retired; and
means for comparing the difference with a threshold value to indicate loop termination when the difference and threshold value match.
25. The processor of claim 24, wherein the tracking means comprises:
a first counter to track the number of times the branch instruction is identified; and
a second counter to track a number of times the branch instruction will be identified, using the number of times the branch has been identified and a total number of times the branch instruction is expected to be retired.
26. The processor of claim 25, wherein the tracking means further comprises a means for detecting loop termination to detect an end of loop condition and initialize the second counter when the end of loop condition is detected.
US09/169,866 1998-10-12 1998-10-12 Method and apparatus for predicting loop exit branches Expired - Lifetime US6438682B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/169,866 US6438682B1 (en) 1998-10-12 1998-10-12 Method and apparatus for predicting loop exit branches

Publications (2)

Publication Number Publication Date
US20020083310A1 true US20020083310A1 (en) 2002-06-27
US6438682B1 US6438682B1 (en) 2002-08-20

Family

ID=22617530

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/169,866 Expired - Lifetime US6438682B1 (en) 1998-10-12 1998-10-12 Method and apparatus for predicting loop exit branches

Country Status (1)

Country Link
US (1) US6438682B1 (en)

Also Published As

Publication number Publication date
US6438682B1 (en) 2002-08-20

Legal Events

Date Code Title Description
AS Assignment

Owner name: INSTITUTE FOR THE DEVELOPMENT OF EMERGING ARCHITEC

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:POPLINGHER, MITCHELL A.;REEL/FRAME:012529/0026

Effective date: 20011214

AS Assignment

Owner name: INSTITUTE FOR THE DEVELOPMENT OF EMERGING ARCHITEC

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MORRIS, DALE;YEH, TSE-YUH;CORWIN, MICHEAL P.;REEL/FRAME:012994/0538;SIGNING DATES FROM 20020531 TO 20020611

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

FPAY Fee payment

Year of fee payment: 12