WO2009101563A1 - Multiprocessing implementing a plurality of virtual processors - Google Patents

Multiprocessing implementing a plurality of virtual processors

Info

Publication number
WO2009101563A1
Authority
WO
WIPO (PCT)
Prior art keywords
threads
ones
vacant
task
processing cores
Prior art date
Application number
PCT/IB2009/050505
Other languages
French (fr)
Inventor
Jan Hoogerbrugge
Original Assignee
Nxp B.V.
Priority date
Filing date
Publication date
Application filed by Nxp B.V. filed Critical Nxp B.V.
Publication of WO2009101563A1 publication Critical patent/WO2009101563A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5011Pool


Abstract

Tasks are executed in a data processing circuit that comprises a plurality of physical processing cores (10). Each physical processing core (10) executes a group of threads on a time multiplex basis. Each thread defines a virtual processing core, for taking on and executing tasks sequentially. A task assigning unit (14) determines for each task which of the virtual processing cores will execute the task. To select the virtual processing core, the task assigning unit (14) looks through the virtual processing cores at the underlying physical processing core (10), determining an aggregate count of threads (20) in one or more predetermined states in the respective group of threads (20) of the physical processing core (10). The aggregate counts are used to determine a priority among the virtual processing cores. The task assigning unit (14) selects a vacant thread (20) with a highest priority according to the aggregate counts. Each aggregate count may be a count of threads (20) that are blocked because they are waiting for access to a resource such as main memory.

Description

Multiprocessing implementing a plurality of virtual processors
FIELD OF THE INVENTION
The invention relates to a multiprocessing system and a method of processing data processing tasks in a multiprocessing system.
BACKGROUND
EP 1416377 describes a multi-processing system with a task dispatcher that dispatches tasks to different processors. Dispatching a task involves a signal from the dispatcher to a processor that it must start executing the task. When a task has been dispatched from the task dispatcher, the receiving processor retrieves the instructions of the task if necessary, and starts executing the instructions. Typically, the task dispatcher selects processors that are "free", i.e. not executing a task, and dispatches new tasks to these processors.
It is also known to define virtual processors, which are implemented by software threads on different physical processors. Concurrent processing by different virtual processors may be realized by allocating a physical processor cyclically, on a time-division multiplex basis, to successive ones of a group of virtual processors. In this case, tasks have to be dispatched to different virtual processors. EP 1416377 does not discuss assignment of different tasks to threads when there is a plurality of processing cores that can each execute a plurality of threads concurrently.
SUMMARY
Among others, it is an object to make it possible to improve efficiency of task execution in a data processing circuit with a plurality of processing cores that each implements a plurality of virtual processing cores.
A data processing circuit according to claim 1 is provided. Herein a number of virtual processing cores is implemented on a smaller number of physical processing cores. Each virtual processing core is implemented using a software thread that executes successive tasks on the virtual processing core. Each physical processing core executes a group of such threads. A task assigning unit assigns a new task to a selected one of the virtual processing cores for execution.
The task assigning unit uses a dynamic property of the threads underlying the virtual processing cores to determine selection preferences, for example to define a priority order among vacant virtual processing cores so that a vacant virtual processing core with the highest priority can be selected. To determine this dynamic property, the task assigning unit "looks through" the virtual processing cores and uses an aggregate count for the physical processing core that executes the thread, obtained by counting threads in one or more predetermined states in that physical processing core. It has been found that execution efficiency of virtual processing cores can be increased by "looking through" the virtual processing cores in this way.
In an embodiment each aggregate count is a count of blocked threads in a respective physical processing core. Blocked threads are threads executing tasks that are waiting for a resource. In this embodiment the task assigning unit is configured to give selection preference to vacant threads executing on physical processing cores with a higher count of blocked threads over vacant threads executing on physical processing cores with a lower count of blocked threads. It has been found that execution efficiency is increased by using this type of property.
BRIEF DESCRIPTION OF THE DRAWING
These and other advantageous aspects will become apparent from a description of exemplary embodiments, using the following Figures:
Fig. 1 shows a data processing circuit
Fig. 2 shows a software architecture
DESCRIPTION OF EXEMPLARY EMBODIMENT
Fig. 1 shows a data processing circuit, comprising a plurality of physical processing cores 10, a resource circuit 12 and a task assigning unit 14. Processing cores 10 are coupled to resource circuit 12. Resource circuit 12 may comprise a main memory circuit, function specific computation circuits, input interface circuits, output interface circuits etc. (not shown) shared by the physical processing cores and coupled to the processing cores via one or more shared busses, one or more networks and/or dedicated connections. Task assigning unit 14 may be implemented using a programmable processor, programmed with a program that makes it perform the functions described in the following. Task assigning unit 14 is coupled to processing cores 10. Task assigning unit 14 may be coupled to resource circuit 12, for example to a memory circuit in resource circuit 12 wherein information about a collection of tasks is stored.
Fig. 2 shows a software architecture of the system, showing processing cores 10 containing threads 20, some of which have an associated task 22, 24. Furthermore, a queue 26 of tasks 28 is shown that is waiting at task assigning unit 14 to be assigned to threads 20. (Software) threads, also called threads of execution, are known per se. As used herein, a thread is a set of instances of execution of instructions that are executed by a physical processing core in a sequence that is logically defined by the instructions and their order in the program or programs of which they are part. The sequence of execution of any particular thread may comprise sequence parts whose execution is separated from each other by execution of parts of other threads, with instances of execution of instructions that define no unique sequence relative to the particular thread. Typically, each thread is defined to a processing core by a context accessible to the physical processing core 10, and by instructions for the processing core that provide for transfer of control to instructions of tasks 22, 24, reception back of control from these instructions, and various set-up functions.
The use of continuing threads 20 to execute successive tasks avoids the overhead of starting the tasks on their own, as temporary threads. The threads 20 continue to run on the processing cores 10 after completing tasks, each time taking up a next task without terminating the thread and restarting a new thread in between. The majority of tasks may be small in the sense that it is inefficient to move execution of these tasks from one processing core 10 to another or to start the task anew, because the execution time of the task is comparable to the time needed to move or start the task (for example if the execution time is less than ten times the time needed to move and/or start the task). For such small tasks the overhead is significantly reduced by using continuing threads to execute tasks.
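For illustration only, the continuing-thread model can be sketched in software as below. This is a minimal sketch, not the patent's embedded implementation; the task queue, the sentinel shutdown value and the worker count are invented for the example.

```python
import queue
import threading

# Minimal sketch of "continuing threads": each worker persists across tasks,
# taking up a next task without being terminated and restarted in between.
task_queue = queue.Queue()

def worker_loop():
    while True:
        task = task_queue.get()   # comparable to the requesting state A
        if task is None:          # invented sentinel used to shut a worker down
            break
        task()                    # tasks are assumed to be plain callables
        task_queue.task_done()    # the thread continues to exist afterwards

workers = [threading.Thread(target=worker_loop) for _ in range(4)]
for w in workers:
    w.start()
```

Because each worker persists, the per-task cost is a single queue operation rather than thread creation and teardown, which is the overhead saving described above for small tasks.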
In operation, each physical processing core 10 executes a plurality of processing threads 20 concurrently. Concurrent execution may be implemented e.g. on a time-multiplexed basis, but as far as processing cores 10 have a parallel processing capability, for example as part of pipelining, concurrency may also be implemented by means of such parallel processing. The implementation of concurrent execution of multiple threads 20 is known per se. For example, it may involve context switching, wherein execution involves different stored contexts for a plurality of threads 20. A context may include a program counter value and register contents, for example. Each thread 20 effectively defines a respective, different virtual processing core designed to take on tasks 22, 24, 28 successively. The plurality of threads 20 implements a corresponding plurality of such virtual processing cores. Each physical processing core 10 switches between executing threads 20 with different ones of the tasks 22, 24, so that each of the threads 20 runs part of the time.
Threads 20 can be in different execution states, indicated by the letters R, W, A, B in the figure. A thread 20 can be in a running state R, wherein it is actually executed by a processing core 10, or in a waiting state W, waiting for its turn to run on a processing core 10 in the time-division multiplex scheme. The thread 20 may be in a requesting state A, wherein it sends a request for a task to task assigning unit 14 to obtain a new task for execution when it has finished a previous task. Also, a thread 20 that has a task may be in a blocked state B, where it is blocked from running, for example when the task has to wait for a resource before execution can continue. Waiting for a resource may involve waiting for data from a main memory when a cache miss has occurred, waiting for a specialized circuit, such as an I/O circuit or a specialized computation circuit, to become free, or waiting for such a specialized circuit to complete an operation.
When a thread 20 has finished a task 22, 24, it continues to exist, switches to the requesting state A, and signals to task assigning unit 14 to request a next task 28. Typically, each physical processing core 10 has a predetermined number of threads 20, of which at most one at a time is in the running state R, a first number of threads 20 is in the waiting state W, a second number of threads 20 is in the blocked state B, e.g. because they are waiting for a resource, and a third number of threads 20 is in the requesting state A, waiting for a new task 28.
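For illustration, the four states R, W, A, B and the per-core aggregate counts can be modelled as follows; this is a sketch of the bookkeeping only, and the names ThreadState and PhysicalCore are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum

class ThreadState(Enum):
    RUNNING = "R"      # actually executing on the physical core
    WAITING = "W"      # waiting for its time-multiplex turn
    REQUESTING = "A"   # finished its task, requesting a new one (vacant)
    BLOCKED = "B"      # its task is waiting for a resource, e.g. a cache miss

@dataclass
class PhysicalCore:
    thread_states: list  # one ThreadState per thread in this core's group

    def block_count(self) -> int:
        return sum(1 for s in self.thread_states if s is ThreadState.BLOCKED)

    def requesting_count(self) -> int:
        return sum(1 for s in self.thread_states if s is ThreadState.REQUESTING)
```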
Task assigning unit 14 receives new tasks 28 that may be assigned to any of the virtual processing cores for execution. In an embodiment, task assigning unit 14 maintains a queue 26 of such tasks 28, but alternatively a plurality of queues or a pool of tasks without one fixed order may be used. When task assigning unit 14 receives a request for a new task 28 from a thread 20, this indicates a "free" virtual processing core. When task assigning unit 14 has a new task 28 waiting for assignment, task assigning unit 14 may send the task to the requesting free virtual processing core.
Task assigning unit 14 needs to perform a selection between virtual processing cores when more than one such thread 20 has sent a request, because they are all in the requesting state A. Task assigning unit 14 does not arbitrarily select any vacant virtual processing core. Instead, task assigning unit 14 uses dynamic properties of the virtual processing cores to give preference to certain vacant virtual processing cores, for example by defining a ranking of the different vacant threads 20 dependent on the properties of the threads 20, and selecting a thread 20 with a highest ranking.
To determine the properties that determine the preference, task assigning unit 14 "looks through" the properties of the virtual processing core and uses a property of the underlying physical processing core 10 that is shared with other virtual processing cores. In particular, an aggregate count of threads in predetermined selected states on a physical processing core 10 may be used. Thus, instead of the properties of the virtual processing core, the properties of the physical processing core 10 are used. It has been found that this may improve execution efficiency.
In an embodiment, task assigning unit 14 performs resolution by assigning the new task 28 to a requesting thread 20 on a physical processing core 10 dependent on the number of threads 20 in the physical processing core 10 that are in the blocked state B. Thus, task assigning unit 14 "looks through" the properties of the virtual processing core and uses a property of the underlying physical processing core 10 that is shared with other virtual processing cores. A virtual processing core may be selected that is implemented using a thread 20 on a processing core 10 that has a highest count of threads 20 in the blocked state B, or at least has no lower count of threads 20 in the blocked state B than any other processing core 10. This has the advantage that processing time lost due to waiting for resources may be reduced. Simulations have shown that this increases execution efficiency.
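For illustration, this selection rule can be sketched as below, reusing the PhysicalCore model from the earlier sketch: among cores that have at least one requesting (vacant) thread, one is chosen whose count of blocked threads is no lower than that of any other such core.

```python
def select_core(cores):
    # consider only cores that have a vacant (requesting) thread
    candidates = [c for c in cores if c.requesting_count() > 0]
    if not candidates:
        return None  # no vacant virtual processing core at the moment
    # prefer the core with the highest count of blocked threads
    return max(candidates, key=lambda c: c.block_count())
```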
It should be noted that it is not necessary to use a count of threads 20 in the blocked state that is up to date at the time of assignment of a new task to a thread 20. An earlier count or an averaged count may be used, wherein threads are counted that may in fact no longer be blocked at the time of selection of a vacant thread. Such a count is still predictive of future blocking. In an embodiment, the instantaneous count of threads 20 in the blocked state at the time of selection may be used for the selection of a thread 20. In an alternative embodiment, the selection may be based on a sampled count of threads 20 in the blocked state that has been sampled at an arbitrary time point in a time interval preceding the selection. As a further alternative, the selection may be based on an average count of threads 20 in the blocked state, corresponding to the instantaneous count averaged over a predetermined time interval prior to the selection. By using non-instantaneous counts, simpler circuits satisfying less stringent timing requirements may be used. Averaging may increase the accuracy of predicting future blocking.
In an embodiment, each processing core 10 is configured to send requests for new tasks 28 to task assigning unit 14 in combination with count values of threads 20 in the processing core 10 that are in the blocked state. In this embodiment each processing core 10 is configured to keep a "block count" of threads 20 in the blocked state, and optionally additional counts of threads 20 in different states, and to send the block count with the request for a new task 28. In an alternative embodiment these block counts may be made accessible to task assigning unit 14 separately from the requests, for use in resolving the assignment of a new task 28 to threads 20. In another embodiment task assigning unit 14 may itself be configured to determine the number of threads 20 in the blocked state B for each of the physical processing cores 10. Task assigning unit 14 may be configured to keep information about each of the threads 20, indicating the task 22, 24, if any, executed by the thread 20 and the state of the thread 20. For this purpose, physical processing cores 10 may be configured to indicate state changes of threads 20 to task assigning unit 14, which uses them to update its information about the tasks. In this embodiment, task assigning unit 14 may be configured to determine a count of blocked threads 20 from this information.
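For the non-instantaneous variants described above, a sampled or averaged block count might be maintained as sketched below; the use of an exponential moving average and the smoothing factor are illustrative assumptions, not taken from the text.

```python
class BlockCountEstimator:
    """Keeps a non-instantaneous estimate of a core's block count."""

    def __init__(self, alpha: float = 0.25):
        self.alpha = alpha       # illustrative smoothing factor
        self.estimate = 0.0

    def sample(self, instantaneous_block_count: int) -> None:
        # called periodically (e.g. from a timer), not at selection time,
        # so the selection logic faces less stringent timing requirements
        self.estimate = (self.alpha * instantaneous_block_count
                         + (1.0 - self.alpha) * self.estimate)
```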
In a further embodiment, the counts of blocked threads 20 are used in combination with other parameters to control the selection of a thread 20 for a new task 28. Thus, for example, task assigning unit 14 may use counts of threads 20 in the requesting state A in respective processing cores 10, in combination with the "block count". When a plurality of processing cores 10 with requesting threads 20 have equal highest block counts, task assigning unit 14 may be configured to select one of those processing cores 10 that has the most requesting threads 20, or at least no lower number of requesting threads 20 than any other of those processing cores 10.
In another embodiment, the priority of the block count and the count of requesting threads 20 may be reversed: the task assigning unit 14 may be configured to select one of the processing cores 10 with the most requesting threads 20 and, if a plurality of processing cores 10 has the same count of requesting threads 20, one of those processing cores 10 that has the highest block count, or at least no lower block count than any other of those processing cores 10.
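Both orderings amount to choosing a different sort key, as sketched below with the PhysicalCore model from before; the block_first flag is an invented convenience for showing the two embodiments side by side.

```python
def select_core_tiebreak(cores, block_first=True):
    candidates = [c for c in cores if c.requesting_count() > 0]
    if not candidates:
        return None
    if block_first:
        # highest block count first; most requesting threads breaks ties
        return max(candidates, key=lambda c: (c.block_count(), c.requesting_count()))
    # reversed priority: most requesting threads first, block count breaks ties
    return max(candidates, key=lambda c: (c.requesting_count(), c.block_count()))
```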
In an embodiment, each physical processing core 10 executes a predetermined number of threads 20 and no more. In this case, the counts of blocked threads 20 may also be obtained by counting the threads in all states other than the blocked state B and subtracting the result from the predetermined number of threads.
Instead of using explicit requests for new tasks 28 from processing cores 10, task assigning unit 14 may poll information in processing cores 10 to determine which of processing cores 10 have threads 20 in a requesting state A. In this embodiment task assigning unit 14 may select from processing cores 10 that are detected to have threads in the requesting state, dependent on counts of threads 20 in the blocked state in these processing cores 10.
In another embodiment, a sum of the block count and the count of requesting threads 20 may be used, the task assigning unit 14 being configured to select one of the processing cores 10 with the highest sum or at least no lower sum than any other of the processing cores 10 with a requesting thread 20.
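In a sketch, this sum-based embodiment differs from the previous one only in the key:

```python
def select_core_sum(cores):
    candidates = [c for c in cores if c.requesting_count() > 0]
    if not candidates:
        return None
    # combined score: block count plus requesting count, highest wins
    return max(candidates, key=lambda c: c.block_count() + c.requesting_count())
```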
Any known prioritized selection scheme may be used to give preference to threads 20 during selection. In an embodiment, task assigning unit 14 obtains aggregate counts of threads 20 in selected states, such as the blocked state, for each of the physical processing cores and searches for a physical processing core 10 with a vacant thread 20 and a highest count, or at least no lower count than any other physical processing core 10 with a vacant thread 20. Subsequently, one of the vacant threads on that physical processing core is selected to execute the new task.
However, preference can be given in many other ways. In another example of an embodiment, task assigning unit 14 performs a selection among the physical processing cores 10 on a round-robin basis, after adjusting the number of occurrences of different physical processing cores 10 in a round-robin list from which physical processing cores 10 are selected. In this embodiment task assigning unit 14 obtains aggregate counts of threads 20 in selected states, such as the blocked state, for each of the physical processing cores 10 and adjusts the number of occurrences dependent on the counts, for example increasing the number of occurrences with increasing count of blocked threads. Round-robin selection is repeated, if necessary, until a physical processing core 10 with a vacant thread is selected. After selecting the physical processing core 10 on the round-robin basis, one of the vacant threads on that physical processing core 10 is selected to execute the new task. A similar scheme may be used in combination with a weighted (pseudo-)random selection instead of round-robin selection, wherein the selection probability of different physical processing cores is adjusted according to the aggregate counts.
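For illustration, the weighted round-robin variant might look as follows; the occurrence rule (one slot per core plus one per blocked thread) and the step bound are invented for the sketch, and in practice the list would be rebuilt as the counts change.

```python
import itertools

def build_round_robin_list(cores):
    # each core appears once, plus once more per blocked thread, so heavily
    # blocked cores come up more often in the round-robin order
    slots = []
    for core in cores:
        slots.extend([core] * (1 + core.block_count()))
    return slots

def select_core_round_robin(cores, max_steps=1000):
    # cycle through the weighted list until a core with a vacant thread turns
    # up; bounded only so the sketch cannot spin forever when none is vacant
    rr = itertools.cycle(build_round_robin_list(cores))
    for core in itertools.islice(rr, max_steps):
        if core.requesting_count() > 0:
            return core
    return None
```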
Task assigning unit 14 is a circuit configured to perform the functions that have been described in the preceding. As noted, task assigning unit 14 may be implemented using a programmable processor circuit, programmed with a program that makes it perform such functions. Alternatively, a dedicated circuit may be used, designed to perform these functions. Such a dedicated circuit may comprise a buffer memory for storing a queue of new task identifiers, a demultiplexer for demultiplexing task identifiers from the buffer memory to selected processor cores 10 and a selection circuit to control the demultiplexer. The selection circuit may have inputs coupled to outputs of the physical processor cores 10 that supply signals indicating the presence of a thread 20 in a requesting state and a count of threads in one or more predetermined states, such as a count of threads 20 in the blocked state in the physical processor cores 10. In this case, the selection circuit may control the demultiplexer to send the task identifier of the top task in the buffer memory to one of the physical processor cores that supplies a signal indicating a requesting thread and a highest count among the processor cores with a requesting thread.
Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.

Claims

CLAIMS:
1. A data processing circuit for providing a plurality of virtual processing cores executing in a plurality of threads (20), the plurality of threads (20) being subdivided into groups of threads (20), the circuit comprising: a plurality of physical processing cores (10) each configured to execute the threads (20) of a respective one of the groups on a time multiplex basis, each thread (20) defining a respective virtual processing core for executing tasks sequentially; a task assigning unit (14), configured to assign a new task to a selected one of the virtual processing cores for execution, the task assigning unit (14) being configured to select the selected one of the virtual processing cores from among virtual processing cores defined by vacant ones of the threads (20) that are not used by any task, the task assigning unit (14) being configured to give selection preference to vacant ones of the threads dependent on respective aggregate counts for respective ones of the physical processing cores (10) on which the vacant ones of the threads (20) are executed, the respective aggregate counts being aggregate counts of threads (20) in one or more predetermined states (B) in the respective ones of the physical processing cores.
2. A data processing circuit according to claim 1, wherein the aggregate count is a count of blocked ones of the threads (20) in the respective group of threads of the respective one of physical processing cores (10), the blocked ones of the threads (20) being threads (20) executing tasks that are waiting for a resource, the task assigning unit (14) being configured to give selection preference to vacant ones of the threads (20) executing on physical processing cores (10) with higher count of blocked ones of the threads (20) over vacant ones of the threads (20) executing on physical processing cores with lower count of blocked ones of the threads (20).
3. A data processing circuit according to claim 2, wherein the task assigning unit (14) is configured to determine further counts, each being a count of the vacant ones of the threads (20) in a respective one of the physical processing cores (10), the task assigning unit (14) being configured to give selection preference to vacant ones of the threads (20) dependent on combinations of the further counts and the counts of blocked ones of the threads (20) for the respective ones of the physical processing cores (10).
4. A data processing circuit according to claim 1, wherein the aggregate count for each respective one of the physical processing cores (10) is a respective count of blocked ones of the threads (20) and the vacant ones of the threads (20) in the respective groups of threads (20) of the physical processing core (10), the blocked ones of the threads (20) being threads (20) executing tasks that are waiting for a resource, the task assigning unit (14) being configured to give selection preference to vacant ones of the threads (20) executing on physical processing cores (10) with higher count of blocked and vacant ones of the threads (20) over vacant ones of the threads (20) executing on physical processing cores (10) with lower count of blocked and vacant ones of the threads (20).
5. A data processing circuit according to claim 1, wherein the task assigning unit (14) is configured to determine the aggregate counts by sampling states of the threads (20) and/or counts of threads (20) in the one or more predetermined states (B), during a predetermined time interval prior to selection among the vacant ones of the threads (20).
6. A data processing circuit according to claim 1, wherein the task assigning unit is configured to determine the count by averaging a number of threads (20) in the one or more predetermined states during a predetermined time interval prior to selection among the vacant ones of the threads (20).
7. A data processing circuit according to claim 1, wherein the threads (20) are configured to signal to the task assigning unit (14) when they are vacant.
8. A data processing circuit according to claim 1, wherein the physical processing cores are configured to signal a number of threads in the one or more predetermined states to the task assigning unit during operation.
9. A data processing circuit according to claim 1, comprising a resource circuit (12) comprising at least one of a main memory, a function specific computation circuit, an input interface circuit, an output interface circuit and an input-output interface circuit, the physical processing cores (10) being configured to block threads (20) that execute tasks waiting for grant and/or completion of access to the resource circuit (12).
10. A method of executing tasks in a data processing circuit that comprises a plurality of physical processing cores (10) executing threads subdivided into groups of threads, the method comprising: executing the threads (20) of each group on a time multiplex basis in a respective one of the physical processing cores (10); executing respective tasks, each using a virtual processing core implemented by a respective one of the threads (20); determining, at least for each physical processing core whose respective group of threads (20) includes a vacant one of the threads (20) not used by a task, an aggregate count of threads (20) in one or more predetermined states in the respective group of threads (20) of the physical processing core (10); selecting one of the vacant ones of the threads (20) from among the vacant ones of the threads (20), giving selection preference to vacant ones of the threads (20) dependent on the aggregate counts; when a new task is available, assigning the new task to the selected vacant one of the threads (20) for execution of the new task.
11. A method according to claim 10, comprising: detecting blocked ones of the threads (20) that are blocked from proceeding with execution of the tasks executed by the blocked ones of the threads (20); and wherein the aggregate counts are counts of blocked ones of the threads (20) on respective ones of the physical processing cores (10), and selection preference is given to vacant ones of the threads executing on physical processing cores (10) with higher count of blocked ones of the threads (20) over vacant ones of the threads (20) executing on physical processing cores (10) with lower count of blocked ones of the threads (20).
12. A computer program product that comprises a program of instructions that, when executed by a programmable processor, causes the programmable processor to operate as a task assigning unit (14) in a data processing circuit with virtual processing cores defined by threads (20), the threads (20) being organized in groups, each group executing on a respective physical processing core (10), the program of instructions being configured to cause the programmable processor to determine, at least for each physical processing core (10) whose respective group of threads (20) includes a vacant one of the threads (20) not used by a task, an aggregate count of threads (20) in one or more predetermined states (B) in the respective group of threads (20) of the physical processing core; select one of the vacant ones of the threads (20) from among the vacant ones of the threads (20), giving selection preference to vacant ones of the threads (20) dependent on the aggregate counts; and, when a new task is available, assign the new task to the selected vacant one of the threads (20) for execution of the new task.
PCT/IB2009/050505 2008-02-11 2009-02-09 Multiprocessing implementing a plurality of virtual processors WO2009101563A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP08101490.4 2008-02-11
EP08101490 2008-02-11

Publications (1)

Publication Number Publication Date
WO2009101563A1 (en) 2009-08-20

Family

ID=40585546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2009/050505 WO2009101563A1 (en) 2008-02-11 2009-02-09 Multiprocessing implementing a plurality of virtual processors

Country Status (1)

Country Link
WO (1) WO2009101563A1 (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8209512B1 (en) * 2009-03-16 2012-06-26 Hewlett-Packard Development Company, L.P. Selecting a cell that is a preferred candidate for executing a process
WO2012135050A3 (en) * 2011-03-25 2012-11-29 Soft Machines, Inc. Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9195493B2 (en) 2014-03-27 2015-11-24 International Business Machines Corporation Dispatching multiple threads in a computer
US9213569B2 (en) 2014-03-27 2015-12-15 International Business Machines Corporation Exiting multiple threads in a computer
US9223574B2 (en) 2014-03-27 2015-12-29 International Business Machines Corporation Start virtual execution instruction for dispatching multiple threads in a computer
KR101738641B1 (en) 2010-12-17 2017-05-23 삼성전자주식회사 Apparatus and method for compilation of program on multi core system
US9766893B2 (en) 2011-03-25 2017-09-19 Intel Corporation Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
US9772867B2 (en) 2014-03-27 2017-09-26 International Business Machines Corporation Control area for managing multiple threads in a computer
US9811342B2 (en) 2013-03-15 2017-11-07 Intel Corporation Method for performing dual dispatch of blocks and half blocks
US9811377B2 (en) 2013-03-15 2017-11-07 Intel Corporation Method for executing multithreaded instructions grouped into blocks
US9823930B2 (en) 2013-03-15 2017-11-21 Intel Corporation Method for emulating a guest centralized flag architecture by using a native distributed flag architecture
US9842005B2 (en) 2011-03-25 2017-12-12 Intel Corporation Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9858080B2 (en) 2013-03-15 2018-01-02 Intel Corporation Method for implementing a reduced size register view data structure in a microprocessor
US9886279B2 (en) 2013-03-15 2018-02-06 Intel Corporation Method for populating and instruction view data structure by using register template snapshots
US9886416B2 (en) 2006-04-12 2018-02-06 Intel Corporation Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
US9891924B2 (en) 2013-03-15 2018-02-13 Intel Corporation Method for implementing a reduced size register view data structure in a microprocessor
US9898412B2 (en) 2013-03-15 2018-02-20 Intel Corporation Methods, systems and apparatus for predicting the way of a set associative cache
US9934042B2 (en) 2013-03-15 2018-04-03 Intel Corporation Method for dependency broadcasting through a block organized source view data structure
US9940134B2 (en) 2011-05-20 2018-04-10 Intel Corporation Decentralized allocation of resources and interconnect structures to support the execution of instruction sequences by a plurality of engines
US9965281B2 (en) 2006-11-14 2018-05-08 Intel Corporation Cache storing data fetched by address calculating load instruction with label used as associated name for consuming instruction to refer
US10031784B2 (en) 2011-05-20 2018-07-24 Intel Corporation Interconnect system to support the execution of instruction sequences by a plurality of partitionable engines
US10140138B2 (en) 2013-03-15 2018-11-27 Intel Corporation Methods, systems and apparatus for supporting wide and efficient front-end operation with guest-architecture emulation
US10146548B2 (en) 2013-03-15 2018-12-04 Intel Corporation Method for populating a source view data structure by using register template snapshots
US10146592B2 (en) 2015-09-18 2018-12-04 Salesforce.Com, Inc. Managing resource allocation in a stream processing framework
US10169045B2 (en) 2013-03-15 2019-01-01 Intel Corporation Method for dependency broadcasting through a source organized source view data structure
US10191746B2 (en) 2011-11-22 2019-01-29 Intel Corporation Accelerated code optimizer for a multiengine microprocessor
US10198298B2 (en) * 2015-09-16 2019-02-05 Salesforce.Com, Inc. Handling multiple task sequences in a stream processing framework
US10198266B2 (en) 2013-03-15 2019-02-05 Intel Corporation Method for populating register view data structure by using register template snapshots
US10228949B2 (en) 2010-09-17 2019-03-12 Intel Corporation Single cycle multi-branch prediction including shadow cache for early far branch prediction
CN110597639A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 CPU distribution control method, device, server and storage medium
US10521239B2 (en) 2011-11-22 2019-12-31 Intel Corporation Microprocessor accelerated code optimizer


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003007105A2 (en) * 2000-11-24 2003-01-23 Catharon Productions, Inc. Computer multi-tasking via virtual threading

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "PANAGIOTIS E. HADJIDOUKAS Home Page", INTERNET ARTICLE, XP002530791, Retrieved from the Internet <URL:http://www.cs.uoi.gr/~phadjido/> [retrieved on 20090507] *
ANONYMOUS: "Transactions on HiPEAC: Volume 3, Issue 2", INTERNET ARTICLE, XP002530792, Retrieved from the Internet <URL:http://www.hipeac.net/node/2414> [retrieved on 20090507] *
JAN HOOGERBRUGGE, ANDREI TERECHKO: "A Multithreaded Multicore System for Embedded Media Processing", TRANSACTIONS ON HIPEAC, vol. 3, no. 2, 2 June 2008 (2008-06-02), pages 168 - 187, XP002526866, Retrieved from the Internet <URL:http://www.hipeac.net/system/files/paper_2.pdf> [retrieved on 20090507] *
P.E. HADJIDOUKAS, V.V. DIMAKOPOULOS: "A Runtime Library for Lightweight Process-Scope Threads", INTERNET ARTICLE, September 2007 (2007-09-01), XP002526868, Retrieved from the Internet <URL:http://www.cs.uoi.gr/~phadjido/courses/E-85/download/software/psthreads_report.pdf> [retrieved on 20090507] *

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9886416B2 (en) 2006-04-12 2018-02-06 Intel Corporation Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
US10289605B2 (en) 2006-04-12 2019-05-14 Intel Corporation Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
US11163720B2 (en) 2006-04-12 2021-11-02 Intel Corporation Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
US10585670B2 (en) 2006-11-14 2020-03-10 Intel Corporation Cache storing data fetched by address calculating load instruction with label used as associated name for consuming instruction to refer
US9965281B2 (en) 2006-11-14 2018-05-08 Intel Corporation Cache storing data fetched by address calculating load instruction with label used as associated name for consuming instruction to refer
US8209512B1 (en) * 2009-03-16 2012-06-26 Hewlett-Packard Development Company, L.P. Selecting a cell that is a preferred candidate for executing a process
US10228949B2 (en) 2010-09-17 2019-03-12 Intel Corporation Single cycle multi-branch prediction including shadow cache for early far branch prediction
KR101738641B1 (en) 2010-12-17 2017-05-23 삼성전자주식회사 Apparatus and method for compilation of program on multi core system
US9274793B2 (en) 2011-03-25 2016-03-01 Soft Machines, Inc. Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR101826121B1 (en) * 2011-03-25 2018-02-06 인텔 코포레이션 Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9990200B2 (en) 2011-03-25 2018-06-05 Intel Corporation Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
US9934072B2 (en) 2011-03-25 2018-04-03 Intel Corporation Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9766893B2 (en) 2011-03-25 2017-09-19 Intel Corporation Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
US9842005B2 (en) 2011-03-25 2017-12-12 Intel Corporation Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US11204769B2 (en) 2011-03-25 2021-12-21 Intel Corporation Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR101966712B1 (en) 2011-03-25 2019-04-09 인텔 코포레이션 Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
CN103635875A (en) * 2011-03-25 2014-03-12 索夫特机械公司 Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US10564975B2 (en) 2011-03-25 2020-02-18 Intel Corporation Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR20180015754A (en) * 2011-03-25 2018-02-13 인텔 코포레이션 Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
WO2012135050A3 (en) * 2011-03-25 2012-11-29 Soft Machines, Inc. Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9921845B2 (en) 2011-03-25 2018-03-20 Intel Corporation Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US10031784B2 (en) 2011-05-20 2018-07-24 Intel Corporation Interconnect system to support the execution of instruction sequences by a plurality of partitionable engines
US10372454B2 (en) 2011-05-20 2019-08-06 Intel Corporation Allocation of a segmented interconnect to support the execution of instruction sequences by a plurality of engines
US9940134B2 (en) 2011-05-20 2018-04-10 Intel Corporation Decentralized allocation of resources and interconnect structures to support the execution of instruction sequences by a plurality of engines
US10191746B2 (en) 2011-11-22 2019-01-29 Intel Corporation Accelerated code optimizer for a multiengine microprocessor
US10521239B2 (en) 2011-11-22 2019-12-31 Intel Corporation Microprocessor accelerated code optimizer
US9823930B2 (en) 2013-03-15 2017-11-21 Intel Corporation Method for emulating a guest centralized flag architecture by using a native distributed flag architecture
US10275255B2 (en) 2013-03-15 2019-04-30 Intel Corporation Method for dependency broadcasting through a source organized source view data structure
US9904625B2 (en) 2013-03-15 2018-02-27 Intel Corporation Methods, systems and apparatus for predicting the way of a set associative cache
US10140138B2 (en) 2013-03-15 2018-11-27 Intel Corporation Methods, systems and apparatus for supporting wide and efficient front-end operation with guest-architecture emulation
US10146548B2 (en) 2013-03-15 2018-12-04 Intel Corporation Method for populating a source view data structure by using register template snapshots
US11656875B2 (en) 2013-03-15 2023-05-23 Intel Corporation Method and system for instruction block to execution unit grouping
US10146576B2 (en) 2013-03-15 2018-12-04 Intel Corporation Method for executing multithreaded instructions grouped into blocks
US10169045B2 (en) 2013-03-15 2019-01-01 Intel Corporation Method for dependency broadcasting through a source organized source view data structure
US9898412B2 (en) 2013-03-15 2018-02-20 Intel Corporation Methods, systems and apparatus for predicting the way of a set associative cache
US10740126B2 (en) 2013-03-15 2020-08-11 Intel Corporation Methods, systems and apparatus for supporting wide and efficient front-end operation with guest-architecture emulation
US10198266B2 (en) 2013-03-15 2019-02-05 Intel Corporation Method for populating register view data structure by using register template snapshots
US9891924B2 (en) 2013-03-15 2018-02-13 Intel Corporation Method for implementing a reduced size register view data structure in a microprocessor
US10248570B2 (en) 2013-03-15 2019-04-02 Intel Corporation Methods, systems and apparatus for predicting the way of a set associative cache
US9886279B2 (en) 2013-03-15 2018-02-06 Intel Corporation Method for populating and instruction view data structure by using register template snapshots
US10255076B2 (en) 2013-03-15 2019-04-09 Intel Corporation Method for performing dual dispatch of blocks and half blocks
US9934042B2 (en) 2013-03-15 2018-04-03 Intel Corporation Method for dependency broadcasting through a block organized source view data structure
US9858080B2 (en) 2013-03-15 2018-01-02 Intel Corporation Method for implementing a reduced size register view data structure in a microprocessor
US9811377B2 (en) 2013-03-15 2017-11-07 Intel Corporation Method for executing multithreaded instructions grouped into blocks
US10503514B2 (en) 2013-03-15 2019-12-10 Intel Corporation Method for implementing a reduced size register view data structure in a microprocessor
US9811342B2 (en) 2013-03-15 2017-11-07 Intel Corporation Method for performing dual dispatch of blocks and half blocks
US9772867B2 (en) 2014-03-27 2017-09-26 International Business Machines Corporation Control area for managing multiple threads in a computer
US9223574B2 (en) 2014-03-27 2015-12-29 International Business Machines Corporation Start virtual execution instruction for dispatching multiple threads in a computer
US9213569B2 (en) 2014-03-27 2015-12-15 International Business Machines Corporation Exiting multiple threads in a computer
US9195493B2 (en) 2014-03-27 2015-11-24 International Business Machines Corporation Dispatching multiple threads in a computer
US10198298B2 (en) * 2015-09-16 2019-02-05 Salesforce.Com, Inc. Handling multiple task sequences in a stream processing framework
US11086687B2 (en) 2015-09-18 2021-08-10 Salesforce.Com, Inc. Managing resource allocation in a stream processing framework
US11086688B2 (en) 2015-09-18 2021-08-10 Salesforce.Com, Inc. Managing resource allocation in a stream processing framework
US10146592B2 (en) 2015-09-18 2018-12-04 Salesforce.Com, Inc. Managing resource allocation in a stream processing framework
CN110597639A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 CPU distribution control method, device, server and storage medium

Similar Documents

Publication Publication Date Title
WO2009101563A1 (en) Multiprocessing implementing a plurality of virtual processors
JP3678414B2 (en) Multiprocessor system
US6748593B1 (en) Apparatus and method for starvation load balancing using a global run queue in a multiple run queue system
US6658449B1 (en) Apparatus and method for periodic load balancing in a multiple run queue system
US7065766B2 (en) Apparatus and method for load balancing of fixed priority threads in a multiple run queue environment
US8875151B2 (en) Load balancing method and apparatus in symmetric multi-processor system
US7950016B2 (en) Apparatus for switching the task to be completed in a processor by switching to the task assigned time slot
JP5770721B2 (en) Information processing system
US7487317B1 (en) Cache-aware scheduling for a chip multithreading processor
US8695004B2 (en) Method for distributing computing time in a computer system
US20030037091A1 (en) Task scheduling device
US9870228B2 (en) Prioritising of instruction fetching in microprocessor systems
CN109564528B (en) System and method for computing resource allocation in distributed computing
US7818747B1 (en) Cache-aware scheduling for a chip multithreading processor
US8627325B2 (en) Scheduling memory usage of a workload
CN106569887B (en) Fine-grained task scheduling method in cloud environment
US20090183166A1 (en) Algorithm to share physical processors to maximize processor cache usage and topologies
US20030110203A1 (en) Apparatus and method for dispatching fixed priority threads using a global run queue in a multiple run queue system
JP5397544B2 (en) Multi-core system, multi-core system scheduling method, and multi-core system scheduling program
CN111597044A (en) Task scheduling method and device, storage medium and electronic equipment
US8539491B1 (en) Thread scheduling in chip multithreading processors
Horowitz A run-time execution model for referential integrity maintenance
US11275621B2 (en) Device and method for selecting tasks and/or processor cores to execute processing jobs that run a machine
US20160267621A1 (en) Graphic processing system and method thereof
US9977751B1 (en) Method and apparatus for arbitrating access to shared resources

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 09709637
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 09709637
    Country of ref document: EP
    Kind code of ref document: A1