WO2001016761A2

WO2001016761A2 - Efficient page allocation

Info

Publication number: WO2001016761A2
Application number: PCT/US2000/024216
Authority: WO
Inventors: Karlon K. West; Chris Miller
Original assignee: Times N Systems, Inc.
Priority date: 1999-08-31
Filing date: 2000-08-31
Publication date: 2001-03-08
Also published as: WO2001016737A2; CA2382728A1; AU7110000A; AU7108300A; WO2001016750A2; WO2001016750A3; WO2001016738A8; WO2001016740A2; WO2001016741A3; AU7108500A; EP1214652A2; WO2001016761A3; WO2001016738A3; WO2001016738A2; EP1214653A2; AU7100700A; WO2001016760A1; WO2001016743A3; CA2382927A1; WO2001016737A3

Abstract

Methods, systems and devices are described for efficient page allocation. A method includes: writing a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scrambling an order of said plurality of memory pages prior to writing said bit-map to reduce contention. The methods, systems and devices provide advantages because the speed and scalability of parallel processor systems is enhanced.

Description

EFFICIENT PAGE ALLOCATION

BACKGROUND OF THE INVENTION

1. Field of the Invention The invention relates generally to the field of computer systems where one or more CPU are connected to one or more RAM subsystems, or portions thereof. More particularly, the invention relates to computer science techniques that utilize efficient page allocation.

2. Discussion of the Related Art The clustering of workstations is a well-known art. In the most common cases, the clustering involves workstations that operate almost totally independently, utilizing the network only to share such services as a printer, license-limited applications, or shared files.

In more-closely-coupled environments, some software packages (such as NQS) allow a cluster of workstations to share work. In such cases the work arrives, typically as batch jobs, at an entry point to the cluster where it is queued and dispatched to the workstations on the basis of load.

In both of these cases, and all other known cases of clustering, the operating system and cluster subsystem are built around the concept of message-passing. The term message-passing means that a given workstation operates on some portion of a job until communications (to send or receive data, typically) with another workstation is necessary. Then, the first workstation prepares and communicates with the other workstation.

Another well-known art is that of clustering processors within a machine, usually called a Massively Parallel Processor or MPP, in which the techniques are essentially identical to those of clustered workstations. Usually, the bandwidth and latency of the interconnect network of an MPP are more highly optimized, but the system operation is the same.

In the general case, the passing of a message is an extremely expensive operation; expensive in the sense that many CPU cycles in the sender and receiver are consumed by the process of sending, receiving, bracketing, verifying, and routing the message, CPU cycles that are therefore not available for other operations. A highly streamlined message-passing subsystem can typically require 10,000 to 20,000 CPU cycles or more.

There are specific cases wherein the passing of a message requires significantly less overhead. However, none of these specific cases is adaptable to a general-purpose computer system.

Message-passing parallel processor systems have been offered commercially for years but have failed to capture significant market share because of poor performance and difficulty of programming for typical parallel applications. Message-passing parallel processor systems do have some advantages. In particular, because they share no resources, message-passing parallel processor systems are easier to provide with high-availability features. What is needed is a better approach to parallel processor systems.

There are alternatives to the passing of messages for closely-coupled cluster work. One such alternative is the use of shared memory for inter- processor communication.

Shared-memory systems, have been much more successful at capturing market share than message-passing systems because of the dramatically superior performance of shared-memory systems, up to about four-processor systems. In Search of Clusters, Gregory F. Pfister 2nd ed. (January 1998) Prentice Hall Computer Books, ISBN: 0138997098 describes a computing system with multiple processing nodes in which each processing node is provided with private, local memory and also has access to a range of memory which is shared with other processing nodes. The disclosure of this publication in its entirety is hereby expressly incorporated herein by reference for the purpose of indicating the background of the invention and illustrating the state of the art.

However, providing high availability for traditional shared-memory systems has proved to be an elusive goal. The nature of these systems, which share all code and all data, including that data which controls the shared operating systems, is incompatible with the separation normally required for high availability. What is needed is an approach to shared-memory systems that improves availability. Although the use of shared memory for inter-processor communication is a well-known art, prior to the teachings of U.S. Ser. No. 09/273,430, filed March 19, 1999, entitled Shared Memory Apparatus and Method for Multiprocessing Systems, the processors shared a single copy of the operating system. The problem with such systems is that they cannot be efficiently scaled beyond four to eight way systems except in unusual circumstances. All known cases of said unusual circumstances are such that the systems are not good price-performance systems for general-purpose computing.

The entire contents of U.S. Patent Applications 09/273,430, filed March 19, 1999 and PCT USOO/01262, filed January 18, 2000 are hereby expressly incorporated by reference herein for all purposes. U.S. Ser. No. 09/273,430, improved upon the concept of shared memory by teaching the concept which will herein be referred to as a tight cluster. The concept of a tight cluster is that of individual computers, each with its own CPU(s), memory, I/O, and operating system, but for which collection of computers there is a portion of memory which is shared by all the computers and via which they can exchange information. U.S. Ser. No. 09/273,430 describes a system in which each processing node is provided with its own private copy of an operating system and in which the connection to shared memory is via a standard bus. The advantage of a tight cluster in comparison to an SMP is "scalability" which means that a much larger number of computers can be attached together via a tight cluster than an SMP with little loss of processing efficiency.

What is needed are improvements to the concept of the tight cluster. What is also needed is an expansion of the concept of the tight cluster. In a typical computing system, every CPU can access all of RAM, either directly with Load and Store instructions, or indirectly, such as with a message passing scheme. When more than one CPU can access or manage the RAM subsystem or a portion thereof, certain accesses to that RAM must be synchronized to ensure mutually exclusive access to portions of the RAM subsystem. This in turn generates contention for those portions of the RAM subsystem, herein referred to as "pages" or "memory pages" by multiple CPUs and thereby reduces overall system performance. One problem in any shared-memory system is the allocation of free pages to the processors to use on an as-needed basis. Part of this problem is caused by the need to represent (and find) which pages are free. One technique used in the past is to form a page table with a pointer per free page (plus control words). One difficulty with this solution is that the table is a huge, sparse matrix so that not only does it consume a large memory space, but also traversing it to find empty pages requires large amounts of time.

Another technique known in the art is to arrange free pages in a linked list. The consumes less, albeit significant, space but requires more time for management thereof.

SUMMARY OF THE INVENTION A goal of the invention is to simultaneously satisfy the above-discussed requirements of improving and expanding the tight cluster concept which, in the case of the prior art, are not satisfied.

One embodiment of the invention is based on a method, comprising: writing a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scrambling an order of said plurality of memory pages prior to writing said bit-map to reduce contention. Another embodiment of the invention is based on An apparatus, comprising: a shared memory node including a plurality of shared memory pages; a first processing node coupled to said shared memory node; and a second processing node coupled to said shared memory node, wherein a first portion of said shared memory pages owned by said first processing node is coupled to a first separate memory bus and a second portion of said shared memory pages owned by said second processing node is coupled to a second separate memory bus to reduce contention. Another embodiment of the invention is based on an electronic media, comprising: a computer program adapted to write a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scramble an order of said plurality of memory pages prior to writing said bit-map to reduce contention. Another embodiment of the invention is based on a computer program comprising computer program means adapted to perform the steps of writing a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scrambling an order of said plurality of memory pages prior to writing said bit-map to reduce contention when said computer program is run on a computer. Another embodiment of the invention is based on a system, comprising a multiplicity of processors, each with some private memory and the group with some shared memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged so that one particular member of said second set is a small region, encoded so that each small elemental portion of said small region represents one of the minimum assignable sub-regions of said shared memory. Another embodiment of the invention is based on a computer system in which each of one or more CPUs has access to a shared area of RAM, such that each CPU may access any area of this shared area. Another embodiment of the invention is based on a computer system, comprising a shared memory node; a first processing node coupled to said shared memory node; and a second processing node coupled to said shared memory node, wherein one or more CPUs has access to a shared area of RAM, such that each CPU may access any area of the shared area. These, and other goals and embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawing. It should be understood, however, that the following description, while indicating preferred embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the invention without departing from the spirit thereof, and the invention includes all such modifications.

BRIEF DESCRIPTION OF THE DRAWING A clear conception of the advantages and features constituting the invention, and of the components and operation of model systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawing accompanying and forming a part of this specification, wherein like reference characters (if they occur in more than one view) designate the same parts. It should be noted that the features illustrated in the drawing are not necessarily drawn to scale.

FIG. 1 illustrates a block schematic view of a system, representing an embodiment of the invention. DESCRIPTION OF PREFERRED EMBODIMENTS The invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawing and detailed in the following description of preferred embodiments. Descriptions of well known components and processing techniques are omitted so as not to unnecessarily obscure the invention in detail.

The teachings of U.S. Ser. No. 09/273,430 include a system which is a single entity; one large supercomputer. The invention is also applicable to a cluster of workstations, or even a network.

The invention is applicable to systems of the type of Pfister or the type of U.S. Ser. No. 09/273,430 in which each processing node has its own copy of an operating system. The invention is also applicable to other types of multiple processing node systems. The context of the invention can include a tight cluster as described in

U.S. Ser. No. 09/273,430. A tight cluster is defined as a cluster of workstations or an arrangement within a single, multiple-processor machine in which the processors are connected by a high-speed, low-latency interconnection, and in which some but not all memory is shared among the processors. Within the scope of a given processor, accesses to a first set of ranges of memory addresses will be to local, private memory but accesses to a second set of memory address ranges will be to shared memory. The significant advantage to a tight cluster in comparison to a message-passing cluster is that, assuming the environment has been appropriately established, the exchange of information involves a single STORE instruction by the sending processor and a subsequent single LOAD instruction by the receiving processor.

The establishment of the environment, taught by U.S. Ser. No. 09/273,430 and more fully by companion disclosures (U.S. Provisional Application Ser. No. 60/220,794, filed July 26, 2000; U.S. Provisional Application Ser. No. 60/220,748, filed July 26, 2000; WSGR 15245-711 ;

WSGR 15245-712; WSGR 15245-713; WSGR 15245-716; WSGR 15245-717; WSGR 15245-718; WSGR 15245-719; and WSGR 15245-720, the entire contents of all which are hereby expressly incorporated herein by reference for all purposes) can be performed in such a way as to require relatively little system overhead, and to be done once for many, many information exchanges. Therefore, a comparison of 10,000 instructions for message-passing to a pair of instructions for tight-clustering, is valid.

The invention can include a shared-memory cluster and means of providing highly-efficient operating system control for such a system. Among the means of controlling shared memory in such a tight cluster for improved performance is the provision of a highly efficient method of page mapping utilized by the operating extensions running on different processors within the cluster.

In the context of a computing system for which the memory (e.g., RAM) subsystem or a portion of the subsystem is connected to one or more central processing units (CPUs), the invention can include reducing subsystem contention. The invention can include methods to efficiently and correctly manage memory ownership of portions of the subsystem.

The invention can be used in the environment described in U.S. Ser. No. 09/273,430 where multiple computers are provided with means to selectively address a first set of memory address ranges which will be to private memory and a second set of memory ranges which will be to shared memory. The invention can be a free-page management scheme which is far more efficient in both memory space used and maintenance time required.

Only one single page of memory is required; a page filled with a bit-map representing free pages. In order to avoid allocating pages on an overly- structured basis, the bits in the bit-map are scrambled prior to being written. These bits can be scrambled via a conformal mapping technique. The purpose of the scrambling technique is to assure that bit n of row m is not necessarily page number nk+m, where k is the length of a row in the bit map. This technique precludes the allocation of pages in a fashion so uniform as to tend to cause hot-spots in memory. The preferred embodiment of the scrambling technique is to generate a linear bit-map of the pages, then multiply the resulting polynomial by a known, fixed polynomial chosen to have no real factors. Of course, the invention can be implemented with other scrambling algorithms. The length (k) of the rows in the bit-mapped page is selected to be an entity efficiently handled by the underlying cache and memory system. In the preferred embodiment, the length is the same as the length of a cache line. Of course, the invention can be implemented with different row lengths.

After an initial setup, a pointer is established to the first row of the hashed page. When a first processor needs free pages from the shared-memory free-page list, the kernel running thereon locks the pointer via semaphore, then reads the row pointed to, sets all the bits within that row to indicate unavailable status, and increments the pointer, then releases the semaphore. The kernel process then reverses the scrambling process, records the addresses of the pages marked free, and adds them to a pool of shared pages available to that processor. The technique involves use of two thresholds. The first, lower, of these thresholds is the lowest number of free pages in a particular subsystem's free page list below which it goes to the bit-mapped page to request additional pages. The higher of these thresholds is the number of freed pages in a particular list above which the subsystem returns pages to the common pool, and restores the bit-maps corresponding to those pages.

Restoration of pages may cause accessing of more than one row of the bit map. When restoration is done, the pointer is semaphore-locked, then each row for which free pages are to be returned is read and the particular bits representing the newly-freed pages are bit-flipped by the processor and the row written back to the bit-mapped page. After the bits are written, the pointer is returned to its value at entry and the semaphore is released.

Eventually, therefore, the number of free pages an acquiring process acquires upon reading a row will become less than the number of bits in the row, and can be substantially less. If the lower threshold is not satisfied by one such get_pages operation, the operation will be repeated until the lower threshold is satisfied. When some subsystem cannot obtain sufficient pages after a system-settable number of reads, all processors are signaled, and the upper threshold for each is reduced.

In a computing system where more than one CPU has access to the RAM subsystem, or portion thereof, some means of distributing memory pages among the multiple memory buses should be provided. In the general case, memory pages are allocated from contiguous pools, and hot spots (places in memory where lots of accesses occur by more than one CPU a high percentage of the time) tend to form.

In a computing system where each CPU can communicate with the other CPUs, a methodology can be designed where the page allocation is sufficiently distributed across the entire set of memory pages with minimal overhead, that multiple banks of RAM can be utilized and the overall system performance is thereby increased.

U.S. Ser. No. 09/273,430 described a system in which each compute node has its own, private memory, but in which there is also provided a shared global memory, accessible by all compute nodes. In this case, contention for the shared memory bus only occurs when more than one node is attempting to access the shared memory pages located on the same memory bus at the same time. Other distributed, share everything compute systems, including but not limited to cc-NUMA, as well as traditional SMP machines that contain more than one memory bus, (usually connected via cross-bar switches) can benefit from the techniques taught but this disclosure.

A computing system of the type described in U.S. Ser. No. 09/273,430 can be designed where shared memory pages can be located on physically separate memory buses such that there is no contention in accessing memory on the separate buses at the same time by different CPUs. When a CPU needs access to one or more a pages of shared memory, the CPU first determines what physical shared memory pages to use, and if the pages are not owned, then those pages may be linked into a traditional page frame database for use by that CPU. The means taught by this disclosure involve the selection process of those shared memory pages. A first, simplistic method might be to split shared memory into blocks, one block per shared memory bus, and as shared memory is used, multiple blocks will eventually become used, and bus contention will be reduced. It is the general case however, that multiple sequential pages are used at a single time by any given CPU, and bus contention is not reduced by much.

A second method would be to stripe the pages across all shared memory buses, such that sequential page accesses will always use different shared memory buses. This scenario also leads to problems in that system data structures and other highly used memory pages tend to form hot spots, and shared memory bus contention is still not reduced as much as possible.

The invention can include sufficiently randomizing shared memory pages, and tracking where those pages are by using a large polynomial hash function (including but not limited to the standard IEEE 32-bit CRC function) such that in a special shared memory page, each bit in the page represents a shared memory page available for general application use, and the location of the bits in that special page use a CRC hash function to determine which shared memory pages on which shared memory buses should be used. As pages are used, bits are set from one to zero, as they are released, bits are set from zero to one. A given CPU can read a cache-line size of bits to get a set of usable pages, to help eliminate the need for future shared memory accesses. This randomization help prevents hot spots from developing and thereby greatly reduces shared memory bus contention.

Referring to FIG. 1, the available shared pages are allocated (e.g., associated with bits of a bit-map) at block 101. A random hash is generated into the bit-map page at block 102. A cache line of bits representing available shared memory pages is read by a processing node from the bit-map page at block 103. (In this example, shared memory pages that are available are represented in the bit-map by l's; shared memory pages that are not available are represented in the bit-map by 0's.) At block 104, it is determined whether enough unused pages were read by the cache line. If insufficient unused page bits are read, the invention cycles back to block 102 where another random hash is generated onto the bit-map page. When sufficient unused page bits are read, these bits are set to 0 at block 105. At block 106, an inverse polynomial is taken to find (identify) the shared memory pages associated with the read bits. The shared pages are returned to a calling function at block 107.

While not being limited to any particular performance indicator or diagnostic identifier, preferred embodiments of the invention can be identified one at a time by testing for the substantially highest performance. The test for the substantially highest performance can be carried out without undue experimentation by the use of a simple and conventional benchmark (speed) experiment. The term substantially, as used herein, is defined as at least approaching a given state (e.g., preferably within 10% of, more preferably within 1% of, and most preferably within 0.1% of). The term coupled, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term means, as used herein, is defined as hardware, firmware and/or software for achieving a result. The term program or phrase computer program, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A program may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, and/or other sequence of instructions designed for execution on a computer system.

Practical Applications of the Invention A practical application of the invention that has value within the technological arts is an environment where there are multiple compute nodes, each with one or more CPU and each CPU with private RAM, and where there are one or more RAM, and where there are one or more RAM units which are accessible by some or all of the compute nodes, and where two or more shared RAM units can be accessed simultaneously with no memory bus contention. Another practical application of the invention that has value within the technological arts is waveform transformation. Further, the invention is useful in conjunction with data input and transformation (such as are used for the purpose of speech recognition), or in conjunction with transforming the appearance of a display (such as are used for the purpose of video games), or the like. There are virtually innumerable uses for the invention, all of which need not be detailed here.

Advantages of the Invention A system, representing an embodiment of the invention, can be cost effective and advantageous for at least the following reasons. The invention improves the speed of parallel computing systems. The invention improves the scalability of parallel computing systems.

All the disclosed embodiments of the invention described herein can be realized and practiced without undue experimentation. Although the best mode of carrying out the invention contemplated by the inventors is disclosed above, practice of the invention is not limited thereto. Accordingly, it will be appreciated by those skilled in the art that the invention may be practiced otherwise than as specifically described herein.

For example, although the efficient page allocation described herein can be a separate module, it will be manifest that the efficient page allocation may be integrated into the system with which it is associated. Furthermore, all the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive.

It will be manifest that various additions, modifications and rearrangements of the features of the invention may be made without deviating from the spirit and scope of the underlying inventive concept. It is intended that the scope of the invention as defined by the appended claims and their equivalents cover all such additions, modifications, and rearrangements.

The appended claims are not to be interpreted as including means-plus- function limitations, unless such a limitation is explicitly recited in a given claim using the phrase "means for." Expedient embodiments of the invention are differentiated by the appended subclaims.

Claims

CLAIMS What is claimed is:

1. A method, comprising: writing a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scrambling an order of said plurality of memory pages prior to writing said bit-map to reduce contention.

2. The method of claim 1, wherein writing said bit-map is effected with an operating system extension.

3. The method of claim 1, wherein scrambling said order is effected with an operating system extension.

4. The method of claim 1, wherein scrambling includes scrambling with a conformal mapping technique.

5. The method of claim 1, wherein scrambling includes scrambling with a large polynomial hash function.

6. The method of claim 1, wherein scrambling includes scrambling with a 32-bit CRC function.

7. An apparatus, comprising: a shared memory node including a plurality of shared memory pages; a first processing node coupled to said shared memory node; and a second processing node coupled to said shared memory node, wherein a first portion of said shared memory pages owned by said first processing node is coupled to a first separate memory bus and a second portion of said shared memory pages owned by said second processing node is coupled to a second separate memory bus to reduce contention.

8. The apparatus of claim 1, wherein said first portion consists of a first memory block and said second portion consists of a second memory block.

9. A computer system comprising the apparatus of claim 8.

10. An electronic media, comprising: a computer program adapted to write a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scramble an order of said plurality of memory pages prior to writing said bit-map to reduce contention.

11. A computer program comprising computer program means adapted to perform the steps of writing a bit-map representing a freedom state of a plurality of memory pages in a shared memory unit; and scrambling an order of said plurality of memory pages prior to writing said bit-map to reduce contention when said computer program is run on a computer.

12. A computer program as claimed in claim 11 , embodied on a computer- readable medium.

13. A system, comprising a multiplicity of processors, each with some private memory and the group with some shared memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged so that one particular member of said second set is a small region, encoded so that each small elemental portion of said small region represents one of the minimum assignable sub-regions of said shared memory.

14. The system of claim 13, wherein the small elemental portions are arranged in rows, a row being defined as a group of said small elemental portions, and are conveniently accessible by a processor sharing access to said small region.

15. The system of claim 14, further comprising a locking mechanism is utilized to assure that when a first processor is accessing said small region, no other processor may access said small region until said first said processor relinquishes said locking mechanism.

16. The system of claim 15, further comprising a processor may claim any or all of said minimum assignable sub-regions represented by said small elemental portion of said small region, and does so by altering the value of each said small elemental portion and placing said altered values back into said row of said small region.

17. The system of claim 13 , wherein said small region is a page of memory as defined by the operating system and machine organization of said shared- memory system.

18. The system of claim 13, wherein said small elemental portion is a binary bit.

19. The system of claim 13, wherein said minimum assignable sub-region is a page of memory as defined by the operating system and machine organization of said shared-memory system.

20. The system of claim 13, wherein said arrangement of said elemental portions is scrambled so that the position of one of said elemental portions is not linearly related to the position in memory of the minimum assignable sub- region it represents.

21. The system of claim 13, wherein said processors utilize said small region for management of said minimum assignable regions.

22. The system of claim 21 , wherein said processors each run a separate copy of a management subsystem, herein called a micro-kernel.

23. The system of claim 22, wherein each copy of said micro-kernel, running on its own said processor, maintains multiple threshold values relating to management of said minimum assignable sub-region, said threshold values being utilized by said micro-kernels for determining when to obtain additional said minimum assignable sub-regions and when to return some of said minimum assignable sub-regions to the common pool.

24. A computer system in which each of one or more CPUs has access to a shared area of RAM, such that each CPU may access any area of this shared area.

25. A computer system as described in claim 24 where shared RAM is located on one or more separate memory buses that support simultaneous access.

26. A computer system as described in claim 25 in which the data structures for tracking current page locations are maintained by a randomizing polynomial, using a bit mask page as the managing data structure.

27. A computer system, comprising a shared memory node; a first processing node coupled to said shared memory node; and a second processing node coupled to said shared memory node, wherein one or more CPUs has access to a shared area of RAM, such that each CPU may access any area of the shared area.

28. A computer system as described in claim 27, wherein shared RAM is located on one or more separate memory buses that support simultaneous access.

29. A computer system as described in claim 27, wherein data structures for tracking current page locations are maintained by a randomizing polynomial, using a bit mask page as the managing data structure.