WO2001013256A1 - A mechanism for efficient data access and communication in parallel computations on an emulated spatial lattice - Google Patents
- Publication number
- WO2001013256A1 (PCT/US2000/040633)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- bit
- data
- memory
- processing
- lattice
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8023—Two dimensional arrays, e.g. mesh, torus
Definitions
- the present invention relates to the field of massively parallel, spatially organized computation.
- the field of massively parallel, spatially organized computation encompasses computations involving large sets of data items that are naturally thought of as distributed in physical space. Such computations often exhibit some degree of spatial locality during each computational step. That is, the processing to be performed at each point in space depends upon only data residing nearby.
- lattice simulations of physical systems using techniques such as finite-difference calculations and lattice-gas molecular dynamics have such spatial organization and locality.
- Other interesting examples include lattice simulations of physics-like dynamics, such as virtual-reality models and volume rendering.
- Each processor in a SIMD machine may have several different functional units operating in a pipelined fashion. Since computer size is normally fixed while problem size is variable, it is common for an array of SIMD processors to be used to perform a calculation that corresponds naturally to a larger spatial array of processors, perhaps with more dimensions than the actual physical array. This can be achieved by having each of the processors simulate the behavior of some portion of the space.
- Several physical simulations on the ILLIAC IV computer were done in this manner, as described in R. M. Hord's book, The ILLIAC IV: The First Supercomputer, Computer Science Press (1982). Typically, the emulated space is split into equal-sized chunks, one per processor.
- Margolus uses a simpler communication scheme, in which sheets of bits move a given amount in a given direction in the emulated lattice (which has a programmable size and shape).
- This shifting bit-sheet scheme is implemented as a pipelined version of traditional SIMD mesh data movement. Because of the specialization to shifting entire sheets of bits, however, only a few parameters controlling a restricted set of repeated communication patterns (as opposed to detailed clock-by-clock SIMD control information) are broadcast to the processors.
- a uniform SIMD communication architecture (like that of CAM-8) is not appropriate in this context, since a uniform array of SIMD processing nodes on each chip would make very uneven and inefficient use of inter-chip communication resources: nodes along an edge of the array on one chip would either all need to communicate off-chip simultaneously, or all need no communication simultaneously. Furthermore, a fixed virtual machine model architecture (like that of CAM-8) gives up much of the flexibility of a more general SIMD architecture. For flexible fine-grained control, a high control bandwidth is needed.
- on-chip DRAM must be used in a constrained fashion. For example, in a given block of DRAM, once any bit in a given DRAM row is accessed, bandwidth may be wasted if all of the bits of that row are not used before moving on to another row. Similarly, if memory rows are accessed as a sequence of memory words, then all of the bits in entire words may also need to be used together. These kinds of memory granularity constraints must be efficiently dealt with. Temporarily storing data that are read before they are needed, or that can't be written back to the right block of memory yet, wastes the bandwidth of the temporary storage memories, and wastes the space taken up by these extra memories. Not having data available at the moment they are needed wastes processing and communications resources.
- the present invention features a mechanism for optimizing the use of both memory bandwidth and inter-chip communications bandwidth in a simple and flexible lattice-emulation architecture.
- the operations are performed by at least one processing node associated with the at least one sector.
- the processing node includes a memory for storing lattice site data associated with the lattice sites and the lattice sites each are associated with data in a data structure. Sets of homologous bits, one from each associated data structure at each lattice site, form bit-fields.
- a shift-invariant partition of the at least one sector into pluralities of lattice sites forms pluralities of site-aggregates, each site-aggregate being unsymmetric about every parallel to at least one edge of the at least one sector.
- a portion of each bit-field associated with each site-aggregate forms a bit-aggregate, which is stored in the memory as an addressable unit.
- the processing node shifts data for at least one of the bit-fields within the at least one sector of the emulated lattice by addressing each bit-aggregate in which each portion of the at least one of the bit-fields is stored.
- the at least one sector is partitioned in a shift-invariant manner into pluralities of lattice sites forming first site-aggregates, which are grouped to partition the lattice sites of the at least one sector in a shift-invariant manner to form a plurality of second site-aggregates, whereby a portion of each bit-field associated with each first site-aggregate forms a first bit-aggregate.
- Pluralities of the first bit-aggregates are grouped together to form second bit-aggregates of data associated with corresponding second site-aggregates, each of which is stored in the memory as an addressable unit composed of separately addressable first bit-aggregates.
- the processing node shifts data for at least one of the bit-fields within the at least one sector by addressing each second bit-aggregate in which each portion of the at least one of the bit-fields is stored, and addressing each of the constituent first bit-aggregates in the addressed second bit-aggregate.
- Embodiments of the invention may include one or more of the following features.
- the bit-field data for each of the lattice sites to be updated may be processed to transform the value of the associated data structure.
- the processing can comprise performing a symbolic operation.
- the processing can comprise performing a numerical operation.
- the processing may include reading from the memory the bit-field data for each lattice site to be updated, updating the read bit-field data and writing the updated bit-field data to the memory.
- the updating can occur after the shifting and the bit-field data read from the memory are shifted bit-field data.
- the updating can occur before the shifting and the bit-field data written to the memory are shifted bit-field data.
- the at least one sector may comprise a plurality of sectors and the operations may be performed by an array of processing nodes, each associated with a different one of the sectors in the plurality of sectors and communicating with others of the processing nodes associated with neighboring ones of the sectors in the plurality of sectors.
- the bit-field data may be shifted periodically within each sector of each associated processing node, such that the data that shifts past an edge of the sector wraps to the beginning of an opposite edge of the sector.
- the periodic shifting may be performed by memory addressing and by re-ordering bits within addressed ones of the bit-aggregates.
- the periodically shifted bit-field data can be read by the processing nodes.
- Each of the processing nodes can access data for one of the site-aggregates to be processed and communicate the wrapped data to a nearby one of the processing nodes, the communicated wrapped data being substituted for the wrapped data within the nearby one of the processing nodes to which it is communicated.
- the processing can include using a table lookup.
- Each of the processing nodes can include a plurality of processing elements for processing a parallel stream of the bit-field data and the table lookup can be shared by all of the processing elements in each processing node.
- the bit-field data can be loaded into the shared lookup table so that data from all of the lattice sites in a given one of the sectors can be used to randomly access data belonging to a fixed set of the lattice sites.
- the plurality of lattice sites aggregated within each of the site-aggregates may have a uniform spacing relative to each edge of the at least one sector, the difference for any two of the site-aggregates in the respective numbers of lattice sites lying within a given distance of an edge being at most one.
- the second bit-aggregate may aggregate first bit-aggregates which are all associated with a single sector, and which in their pattern of grouping of data associated with lattice sites, are all periodic translations of each other along a single line in a single sector.
- the aggregated first bit-aggregates can then be ordered along this line, with this ordering reflected in the memory addresses where they are stored. Shifting of the at least one bit-field then involves only a cyclic permutation in the order of each set of constituent first bit-aggregates within the corresponding second bit-aggregate.
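The cyclic-permutation property can be checked concretely. The sketch below is illustrative only (sizes and layout are assumptions, not the patent's hardware): a 64-site 1D sector is stored as sixteen 4-bit words (first bit-aggregates) under a stride-16 partition, and each row (second bit-aggregate) holds four words that are translations of one another along the line. A periodic shift then only changes which row is addressed and cyclically permutes the word order within it (intra-word bit rotation for wrapped words is handled at a lower level, as described later).

```python
# Illustrative model: 64-site 1D sector, 16 words of 4 bits, 4 rows of 4 words.
# Word k (a first bit-aggregate) holds sites {k, k+16, k+32, k+48}.
# Row m (a second bit-aggregate) stores words {m, m+4, m+8, m+12} in order;
# these words are periodic translations of each other along the line.

N, NWORDS, NROWS = 64, 16, 4

def word_sites(k):
    """Lattice sites covered by word k under the stride-16 partition."""
    return [(k + 16 * j) % N for j in range(4)]

def shifted_row_words(m, s):
    """Word addresses supplying slot t of row m after a periodic shift by s.

    The result is always a cyclic rotation of the word list of a single
    source row -- only row addressing plus a permutation of word order.
    """
    base = (m - s) % NROWS            # which source row to address
    c = ((m - s) - base) // NROWS     # how far the word order rotates
    return [base + 4 * ((t + c) % 4) for t in range(4)]
```

A shift by any amount s thus touches exactly one source row per destination row, with the constituent words read in rotated order.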
- the at least one emulated lattice can include at least two emulated lattices having unequal numbers of the bit-fields.
- the shifted bit-field data from the at least two emulated lattices may be processed together.
- the memory can include at least two memory blocks, and more than one of the at least two memory blocks can be coupled to each processing element.
- the plurality of processing elements can share a lookup table.
- Each processing element can include bit-serial arithmetic hardware.
- the memory can include at least one memory block and portions of the at least one memory block can be selected to store control information used during a period in which a row of memory words is processed.
- Each of the processing nodes can be connected by mesh I/O links to neighboring processing nodes to form a mesh array, each of the processing nodes being associated with an equal-sized sector of the emulated lattice and the performance of the operations can be divided among the processing nodes.
- the operations can be performed under the control of a host to which the processor is connected.
- the processing node can be coupled to a nonvolatile memory device for storing a program. A copy of the program is loaded into the processing node at boot time.
- the processing node can include reprogrammable logic blocks of the sort used in FPGA devices, along with reprogrammable I/O pins, for interfacing with other electronic devices.
- the processing node can control an external memory device used for storing bit-field data and for storing control information.
- the mechanism for efficient data access and communication in spatial lattice computations of the present invention offers several advantages, particularly for large 2D and 3D spatial lattice computations.
- the data access and communication mechanism relies on an arrangement of data in memory and a scheduling of memory accesses and communication events to optimize the use of both memory bandwidth and communications bandwidth. For computations (including symbolic and arithmetic) on emulated spatial lattices, all bits that are read from memory are exactly those needed next by the processing and communication hardware.
- the mechanism deals with a hierarchy of memory granularity constraints by matching data organization in memory to the most efficient memory access patterns, without having to buffer data.
- the mechanism takes advantage of memory layout and access order to produce an even demand on communication resources.
- a direct virtual processor emulation of a SIMD array on each processing node would not have this property.
- slow external memory can also be dealt with efficiently by simply treating internal memory as an additional level in the granularity hierarchy.
- the method for dealing with memory granularity and for allowing spatial shifting of lattice data by addressing is also directly applicable to lattice calculations on conventional computers.
- the mechanism further supports a virtual machine model for performing SIMD operations on selected subsets of virtual processors. For example, sublattices of the emulated space can be identified and processed in turn. Virtual processors that are not active in a given computational step are not emulated during that step. Both the spatial structure of the emulated lattice and the structure of the data associated with the lattice sites can change with time.
- the mechanism efficiently supports a variety of simple high-level spatial machine models, including a simple mesh machine, a reconfiguring crystal lattice machine and a pyramid machine.
- Each processing node can have its own copy of various programs.
- a host computer may be used to initialize and modify this program information, and to initiate synchronized execution of programs.
- the node-resident programs can be viewed as a kind of micro-code. If all nodes are programmed identically, then the hardware acts as a SIMD machine. Providing micro-coded control programs resident within each node takes advantage of high on-chip memory bandwidth to allow full generality of operations. There is no need to embed a restricted virtual machine model into each node as was done, for example, in CAM-8. Such freedom also resolves conflicts between virtualization and the use of fast hardware registers.
- Lattice sites may be updated in a "depth first" manner, with a sequence of operations applied to each row-sized site-aggregate before moving on to the next, and with each of the sequence of row-operations bringing together a different combination of bit-fields.
- Registers and temporary memory storage may be used to hold intermediate results during each such sequence, and then freed and reused for processing the next site-aggregate.
- FIG. 1 is a representation of a spatial lattice computer as a mesh array of processing nodes, each processing node corresponding to an equal-sized sector of the spatial lattice.
- FIG. 2 is a block diagram of the processing node shown in FIG. 1.
- FIG. 3 is a depiction of a two-dimensional (2D) example of a uniform lattice data movement (bit-plane shifting).
- FIG. 4 is a depiction of a partition of a one-dimensional (1D) sector into groups of sites that are updated simultaneously.
- FIG. 5 is a depiction of a 1D partition that is not shift invariant.
- FIG. 6 is a depiction of a data movement process in which periodic shifts within sectors are composed into a uniform shift along a lattice.
- FIG. 7 is a depiction of a balanced partition of a 2D sector.
- FIG. 8 is a depiction of a partition of a 2D sector of unequal dimensions.
- FIG. 9 is a block diagram of the functional units that make up the DRAM module shown in FIG. 2.
- FIG. 10 shows a sample set of DRAM rows that are processed together during a single row-period.
- FIG. 11 is a block diagram of the processing element shown in FIG. 2, illustrating hardware optimized for SIMD table-lookup processing.
- FIG. 12 is a block diagram of the shared lookup table shown in FIG. 11.
- FIG. 13 is a block diagram of the LUT data unit shown in FIG. 12.
- FIGS. 14A and 14B are illustrations of a single-bit-field row data format and a single-site-group row data format, respectively.
- a parallel computation occurring in a physical space 10 is emulated by a mesh array of processing nodes 12.
- the emulated space 10 includes an n-dimensional spatial arrangement of lattice sites each associated with a data element (or structure), which may vary from site to site. Both the structure of the lattice and the structure of the data may change with time.
- Each processing node 12 in the mesh array corresponds to an equal-sized sector of the emulated space (or spatial lattice) 10. Together, the processing nodes 12 perform parallel computations, each acting on the lattice data associated with its own sector of the emulated space 10.
- each processing node 12 is connected to and communicates with its neighboring processing nodes in the mesh array by a mesh I/O interface 14.
- the mesh I/O interface 14 provides forty-eight single-bit differential-signal links that can be apportioned among up to six directions.
- each processing node 12 includes a memory 16 connected to a plurality of processing elements (PEs) 18.
- the memory 16 includes a plurality of DRAM modules 20. Data belonging to a given sector may be stored in the memory 16 of the corresponding processing node 12 or in an external memory associated with that processing node, as will be described.
- Each processing node 12 simultaneously processes data associated with a plurality of lattice sites using the processing elements 18.
- All of the processing nodes operate and communicate in a synchronized and predetermined fashion in order to implement a spatially regular lattice computation.
- the processing node 12 is implemented as a semiconductor chip. However, it could be implemented as a discrete design as well.
- a master memory interface 22 allowing the chip to access an external slave memory via a first memory I/O bus 24
- a slave memory interface 26 allowing the processing node or chip 12 to be accessed as a memory by an external device via a second memory I/O bus 28
- a controller 30 receives control information and data from external devices via memory interfaces 26 and 22 on memory interface I/O lines 32 and 33, respectively, and over a serial I/O bus 34.
- the controller 30 also receives control information at high bandwidth from the memory 16 through the PEs 18 over input control lines 36 and distributes control information to the memory 16 and PEs 18 over a control bus 37.
- Memory 16 is also read and written through the PEs 18 over bidirectional data lines 38 and memory bus lines 39.
- the PEs 18 communicate with each other over a shared LUT bus 40, as will be described. Details of the control bus signal interconnections, as well as other control and clocking signals, have been omitted for simplification.
- the structure of the memory 16 imposes certain constraints on efficient computation. Perhaps the most prominent are certain granularity constraints that determine which groups of memory bits should be used together.
- the memory (DRAM) bits on each processing node 12 are logically organized as 2D arrays of storage elements which are read or written on a row-at-a-time basis. For each block of DRAM, it takes about the same amount of time to read or write all of the bits from a sequence of rows as it does to access just a few bits from each row. For this reason, computations performed by the processing node 12 are organized in such a way as to use all of the data from each row as that row is accessed.
- rows are divided up into smaller words, which correspond to data units that are communicated to and from the memory modules 20.
- Computations are organized to also use all of the bits of each word as that word is read.
- the processing nodes 12 handle memory granularity constraints by organizing the memory and processing hardware so that lattice data that are needed at the same time can always be stored together as a bit-aggregate in the memory (i.e., a word, a row, etc.) that is efficiently accessed as a unit. It will be appreciated that the techniques described herein are quite general and apply to other types of lattice computation architectures having a hierarchy of memory granularity constraints.
- the structure of the data item associated with each lattice site may vary from site to site and may change with time. The structure of the lattice itself may also change with time.
- the processing nodes 12 of the lattice processor array 10 use a spatial communication scheme to move data within the emulated space.
- the spatial communication scheme involves a uniform shifting of subsets of lattice data.
- Each processing node 12 performs a data movement operation separately on each bit-field to shift the bit-field uniformly in space. Every bit in each bit-field that is operated upon is shifted to a new position. Each shifted bit is displaced by the same distance in the same direction.
- Referring to FIGS. 3A-3B, an exemplary 2D square lattice 41 having two bit-fields 42a, 42b of bits 44 is depicted.
- FIG. 3A shows the bits before a shift is performed and
- FIG. 3B illustrates the same bits after a shift has been performed.
- One top bit-field bit 44a and a bottom bit-field bit 44b are shaded to highlight the effect of the shifting operation on bits in the bit-fields 42a, 42b.
- As shown in FIG. 3B, only the top bit-field 42a shifts in this example.
- the shift of the top bit-field 42a brings together the two shaded bits 44a and 44b, both of which belonged to different lattice sites before the shift. It will be noted that every similar pair of bits that were initially separated by the same displacement as the two marked bits are also brought together by this shift. If the square lattice 41 in FIGS. 3A-B represents an entire lattice, then the shifted data that moves beyond the edge or boundary of the lattice space wraps around to the opposite side of the space.
- the square lattice 41 represents only the sector of space associated with a single processing node 12
- the shifted data that crosses the edge of that sector is communicated to adjacent processing nodes (i.e., processing nodes associated with adjacent sectors), each of which is performing an identical shift. In this manner, a seamless uniform shift across the entire lattice can be achieved.
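The seamless-shift composition described above can be sketched in a few lines. This is an illustrative model, not the patent's hardware: a ring of 1D sectors, each periodically shifted on its own, with the wrapped bits then communicated to the adjacent sector, where they replace that sector's wrapped bits.

```python
# Illustrative model: each node holds one 1D sector of S sites; sectors
# form a ring.  A uniform lattice shift is realized as a periodic shift
# within each sector, followed by substitution of the wrapped bits with
# the wrapped bits communicated from the adjacent (left) sector.

S = 8  # sites per sector (assumed size)

def periodic_shift(sector, s):
    """Right shift by s within one sector, wrapping at the sector edge."""
    return [sector[(i - s) % S] for i in range(S)]

def uniform_shift(sectors, s):
    """Uniform right shift by s (0 <= s < S) across the whole lattice."""
    shifted = [periodic_shift(sec, s) for sec in sectors]
    n = len(sectors)
    for k in range(n):
        left = (k - 1) % n
        # Positions 0..s-1 of each shifted sector hold wrapped bits;
        # replace them with the bits that wrapped out of the left neighbor.
        shifted[k][:s] = periodic_shift(sectors[left], s)[:s]
    return shifted
```

Concatenating the resulting sectors reproduces exactly a periodic shift of the whole (composite) lattice, which is the seamless uniform shift.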
- After each movement of data, the processing nodes 12 separately process the data that land at each lattice site. Since each site is processed independently of every other site, processing of lattice sites can potentially be done in any order and be divided up among many or few processors. In the embodiment described herein, each processing node 12 has sixty-four (64) processing elements 18, and so updates 64 lattice sites at a time. A set of lattice sites that are updated simultaneously within a node is referred to as a site-group. All lattice sites in a site-group are processed identically.
- In FIG. 4, a one-dimensional (1D) sector 50 is partitioned into a plurality of site-groups 52a-52d as shown. For clarity, the figure depicts only 4 lattice sites to be updated at a time.
- FIG. 5 illustrates an alternative site-group partitioning 54 of the same 1D sector 50 (FIG. 4).
- a site-group has the same number of elements as a memory word. Therefore, the partition of a lattice into site-groups induces a corresponding partition of bit-fields into memory words: all of the bits of a bit-field that belong to the same site-group are stored together in the same memory word. It can be appreciated that, to process a site-group during an operation without data bit-shifting, a desired set of bit-fields may be brought together for processing by simply addressing the appropriate set of memory words.
- bit-field shifts are periodic within a single sector. That is, shifted data wraps around within the bit-field sector.
- the partition of the lattice into site-groups remains fixed as the bit-fields are shifted. If the partition of FIG. 4 is used to divide the lattice up into site-groups, and a bit-field is shifted by some amount, one observes a very useful property: each set of bit-field bits grouped together into a single site-group before the shift is still grouped into a single site-group after the shift. Since these groups of bit-field bits are stored together in memory words associated with site-groups, shifts of data simply move the contents of one memory word into another.
- a partition of lattice sites such as that of FIG. 4 that is invariant in its grouping of bits under all periodic bit-field shifts within a sector may be described as a "shift invariant" lattice partition.
- a shift-invariant partition can be characterized as a pattern of grouping of lattice sites which isn't changed if the pattern is shifted periodically.
- the partition of FIG. 5 is an example of a lattice partition that is not shift invariant.
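The distinction between the two partitions can be tested mechanically. The following sketch (sizes assumed: a 16-site sector split into four 4-site groups) checks whether a partition's grouping pattern is unchanged by every periodic shift; the strided partition of FIG. 4 passes, the contiguous-block partition of FIG. 5 fails.

```python
# Illustrative check of shift invariance for partitions of a 1D sector.
N = 16  # sites in the sector (assumed)

def is_shift_invariant(groups):
    """True if every periodic shift maps each group onto some group."""
    sets = set(frozenset(g) for g in groups)
    for s in range(1, N):
        shifted = set(frozenset((i + s) % N for i in g) for g in sets)
        if shifted != sets:
            return False
    return True

# FIG. 4 style: sites with uniform spacing (group k = {k, k+4, k+8, k+12}).
strided = [[k + j * 4 for j in range(4)] for k in range(4)]
# FIG. 5 style: contiguous blocks of four sites.
contiguous = [list(range(k * 4, k * 4 + 4)) for k in range(4)]
```

Shifting the strided partition by s simply relabels group k as group (k + s) mod 4, whereas shifting a contiguous block by one site straddles two blocks.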
- spatial shifting of bit-field data can be accomplished by memory addressing.
- the processing node brings together associated portions of a designated set of shifted bit-fields by simply addressing the corresponding set of memory words. For each bit-field, and for any shift, all of the shifted data needed for the site-group are accessed together in a single memory word, and all of the data in that memory word belong to the same site-group.
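As a minimal sketch of this addressing scheme (word and sector sizes are assumptions for illustration): each bit-field of a 16-site sector is stored as four 4-bit words under the strided site-group partition, and the shifted data for any site-group is obtained by reading a single word and rotating its bits.

```python
# Illustrative model: 16-site sector, 4 site-groups of 4 sites,
# word k of a bit-field holds sites {k, k+4, k+8, k+12}.
W = G = 4          # bits per word, number of words
N = W * G          # sites per sector

def store(field):
    """Pack a length-N bit-field into its site-group memory words."""
    return [[field[k + j * G] for j in range(W)] for k in range(G)]

def read_shifted(words, group, s):
    """Shifted bit-field data for one site-group via one word access.

    Addressing selects the word; a rotation re-orders bits within it.
    No other words of the bit-field are touched.
    """
    src = (group - s) % G              # which word to address
    rot = ((group - s) - src) // G     # intra-word bit rotation
    w = words[src]
    return [w[(j + rot) % W] for j in range(W)]
```

To process a site-group against several bit-fields with different shifts, the node simply issues one such word access per bit-field, so every bit read is a bit needed for the current site-group.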
- This bit-field shifting technique can be extended to additional levels of memory granularity.
- memory words are grouped into memory rows, and entire rows must be processed together for the sake of efficiency.
- site-groups of lattice sites are further grouped together or aggregated into a set of larger site-aggregates that also form a shift-invariant partition of the lattice.
- Considering site-group aggregation, and referring once again to FIG. 4, pluralities of site-groups may be aggregated into larger site-aggregates.
- the resulting set of larger site-aggregates also forms a shift-invariant partition of the lattice. Consequently, the same sets of bits from each bit-field are grouped together by the larger site-aggregates both before and after any bit-field shifts. If the larger site-aggregate is the size of a memory row, the processing node 12 simply addresses the set of rows that contain the shifted data. As it processes each constituent site-group in turn, the processing node addresses only words within this set of rows. This technique can be applied to further levels of memory granularity.
- In FIG. 6, the sectors 60, 62, 64 are illustrated as having their top bit-fields 60a, 62a, 64a, respectively, shifted to the right.
- the portion of the top bit-field that spills past the edge of each sector is labeled A, B and C, for the top bit-fields 60a, 62a and 64a, respectively.
- the location in which the protruding data belongs as a result of a uniform shift is indicated with an arrow. It will be recognized that the uniform shift is accomplished by simply substituting the bits of periodically shifted data that wrap around past the edge of one sector (the wrapped bits) for the wrapped bits within the next adjacent sector. Therefore, bits replace corresponding bits.
- a uniform shift transfers bits to the same relative positions within a sector as a same-sized periodic shift.
- a uniform shift merely places the wrapped bits in the appropriate sector.
- all processing nodes 12 operate synchronously, each acting on an identically structured sector of lattice data, with each processing the same site-group at the same time.
- Periodically shifted site data for a designated set of bit-fields and for a designated site-group are assembled by addressing the appropriate DRAM words and rotating each word (as needed), in the manner described above. Bits of data that wrap around the edge of a sector are communicated to an adjacent sector, where they replace the corresponding (wrapped) bits in the site-group being processed in that adjacent sector. In this manner, exactly the data that are needed for the set of corresponding site-groups being processed by the various nodes are read from DRAM.
- each lattice site is updated independently and the bit-fields that constitute each updated site-group are written to DRAM.
- the processing node 12 can alternatively (or also) perform shifts of the bit-fields after the updating (processing) operation, using the addressing of where data is written to perform the periodic portion of the shifts.
- a memory organization based on shift-invariant partitioning of lattice sectors is also effective in multi-dimensional applications.
- an exemplary square 2D sector 70 (shown as a 16x16 lattice) is partitioned into sixty-four four-element site-groups 72, of which four (a first site-group 72a, a second site-group 72b, a ninth site-group 72i and a tenth site-group 72j) are shown.
- the first site-group 72a is spread evenly along a diagonal.
- the other 63 site-groups, including the site-groups 72b, 72i and 72j are periodic translations of the diagonal pattern.
- these site-groups demonstrate the property of a shift-invariant partition.
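The 2D shift invariance of the diagonal partition of FIG. 7 can be verified directly. The sketch below is illustrative (the 16x16 size and spacing-4 diagonal follow the figure description; the code itself is not from the patent): every group is a periodic translation of the corner-to-corner diagonal, and any periodic 2D shift maps each group onto another group of the same partition.

```python
# Illustrative model of the FIG. 7 partition: a 16x16 sector split into
# 64 four-site groups, each a periodic translation of the diagonal
# {(0,0), (4,4), (8,8), (12,12)}.
L = 16

def group(a, b):
    """Diagonal site-group obtained by translating the base diagonal by (a, b)."""
    return frozenset(((a + 4 * i) % L, (b + 4 * i) % L) for i in range(4))

# Translations by (4, 4) map a group onto itself, so the 256 translations
# collapse to 64 distinct groups -- a partition of the 256 sites.
groups = set(group(a, b) for a in range(L) for b in range(L))

def shifted(g, dx, dy):
    """Periodically shift a site-group by (dx, dy) within the sector."""
    return frozenset(((x + dx) % L, (y + dy) % L) for x, y in g)
```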
- any periodically shifted data for a designated set of bit-fields and for a designated site-group can be assembled by reading the appropriate shifted data, and rotating the bits within the words which require bit re-ordering.
- Periodically shifted bit-fields within each 2D sector can be glued together into uniformly shifted bit-fields that slide seamlessly along the composite space.
- wrapped data from one sector replaces corresponding bits in an adjacent sector. This substitution is performed one dimension at a time.
- the processing node takes data that has wrapped around horizontally and moves it to the adjacent sector where it belongs.
- the now horizontally- correct data is then shifted vertically by moving data that has wrapped around vertically into the adjacent sector where it belongs.
- the net effect of this two-hop process may be to communicate data to a diagonally adjacent sector, even though the processing nodes only communicate directly with their nearest neighbors.
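The reason the two-hop, dimension-at-a-time process works is that uniform shifts along different axes commute and compose into a single diagonal shift. A minimal sketch (modeling the composite space directly; sector boundaries and the wrap/substitute steps compose the same way):

```python
# Illustrative: composing a horizontal uniform shift with a vertical one
# yields a diagonal shift, so two nearest-neighbor hops suffice to move
# data to a diagonally adjacent sector.
H = Wd = 8  # lattice height and width (assumed sizes)

def shift(lattice, dx, dy):
    """Uniform periodic shift of a 2D lattice by (dx, dy)."""
    return [[lattice[(y - dy) % H][(x - dx) % Wd] for x in range(Wd)]
            for y in range(H)]
```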
- a periodic bit-field shift can always be accomplished by addressing plus a rotation of bits within a word.
- a uniform shift can be achieved through periodic shifts and inter-node communications.
Balancing Communication Demands
- each site-group has exactly one lattice site within four positions of the edge of the sector, two within eight positions, and so forth. Consequently, a bit-field can be shifted by four positions by communicating exactly one bit to an adjacent sector for each DRAM word.
- To shift by eight positions requires communicating two bits.
- this even spacing of site-groups is an automatic byproduct of shift invariance and guarantees that, for a given shift amount, the demand for inter-node communication resources is divided as evenly as possible between the various words of a bit-field.
- each site-group has exactly one lattice site within four positions from each edge of the sector, two within eight, and so on. Consequently, the communication resources needed to implement a shift of a bit-field are as balanced as possible between the various words of the bit-field. Because not all shift-invariant partitions in 2-D have this additional balanced property, it is desirable to choose partitions which do so that communication resources are used as efficiently as possible. In 3D, the periodically shifted diagonal site- groups discussed above also have this balanced property.
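The balanced property stated above can be checked numerically for the diagonal partition of a 16x16 sector (sizes per FIG. 7; the check itself is an illustrative sketch): every site-group has exactly one site in each 4-column band from the left edge, so a shift of four wraps exactly one bit per memory word, a shift of eight exactly two, and so on.

```python
# Illustrative check of the balanced-partition property for the diagonal
# site-groups of a 16x16 sector.
L = 16

def diag_group(a, b):
    """Diagonal four-site group translated to start near (a, b)."""
    return [((a + 4 * i) % L, (b + 4 * i) % L) for i in range(4)]

def sites_within(g, d):
    """Number of sites of group g lying within d columns of the left edge."""
    return sum(1 for x, y in g if x < d)
```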
- Sector size is selected to be a power of two along each dimension.
- the horizontal dimension is twice as long as the vertical dimension.
- the elements of each site group are spread out twice as much horizontally as vertically.
- One site-group still runs "diagonally" from corner to corner, and the rest are still periodic shifts of this diagonal pattern.
- communication demands for each dimension will be as balanced as is possible.
- a vertical shift along the lattice would require about twice the communication resources per word as a horizontal shift of the same amount, since sites in each site-group are spaced twice as closely vertically. This disparity in communications is, however, unavoidable in this case.
- the sector of the bit-field has a horizontal edge that is twice as long as the vertical edge, and so overall twice as many bits "spill over the edge" for a given vertical shift as for the same horizontal shift.
- each DRAM module 20 includes the circuitry needed to read or write 64-bit words of uniformly shifting bit-field data using the scheme described above.
- the DRAM module 20 includes a DRAM block 80, which is implemented as a DRAM macro of the sort that is currently available as a predefined block of circuitry from manufacturers of chips having integrated DRAM and logic functionality.
- the DRAM block 80 is organized as 1K rows, each of which holds 2K bits, with a 128-bit data word. If all of one row is used while another row is activated, a new data word can be accessed every 6ns. To reduce wiring and to better match with logic speeds, a 64-bit data word with a 3ns clock period is used instead.
- This rate conversion is accomplished by a 128:64 multiplexer 82, which connects a 64-bit data path to a selected half of the 128-bit DRAM block data word during each clock cycle.
- the multiplexer 82 provides an added level of memory granularity, as both halves of each 128-bit word must be used for maximum efficiency. This constraint is dealt with in the data organization by adding one more level to the site grouping hierarchy described above. In a similar manner, additional levels in which the word-size is cut in half could be added if additional rate conversion was desired. Beyond the multiplexer 82, and thus for the remainder of the operations performed by the processing node 12 (FIG. 2), the basic memory word size is taken to be 64 bits.
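- As a minimal sketch of this rate conversion (illustrative Python; which half of the 128-bit word is presented first is an assumption), each 128-bit DRAM word is split into two 64-bit words delivered on consecutive 3ns clocks, matching the 6ns DRAM access rate:

```python
MASK64 = (1 << 64) - 1

def words_64(dram_row_words_128):
    """Present each 128-bit DRAM word as two 64-bit words on
    consecutive clocks (the 128:64 multiplexer's behavior)."""
    for w128 in dram_row_words_128:
        yield (w128 >> 64) & MASK64   # one half on the first clock
        yield w128 & MASK64           # the other half on the next clock
```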
- Connected to the output of the barrel shifter 84 is a mesh I/O unit 86.
- the mesh I/O unit 86 performs substitutions of bits in one processing node for corresponding bits in another processing node to turn periodic bit-field shifts within each node into uniform lattice shifts.
- each processing node has sufficient I/O resources to send and receive up to 8 bits per clock along each of the three axes of a 3D cubic lattice; however, this number could be made larger or smaller. Because of the manner in which bit-field shifts are implemented, any bit that is transferred out of the processing node by the mesh I/O unit 86 in one direction is replaced with a bit that arrives at the mesh I/O unit 86 from the opposite direction.
- the 24-bit mesh I/O bit-stream consists of a 24-bit mesh I/O unit input 88 and a 24-bit mesh I/O unit output 90.
- the bit to be replaced appears at the corresponding output 90. Otherwise, the output 90 has a constant value of zero.
- the 48 mesh-I/O signals 14 (FIG. 2) for the chip thus consist of 24 inputs which are distributed to all mesh I/O units, and 24 outputs which are constructed by taking the logical OR of all corresponding mesh-I/O unit outputs.
- Mesh communication resources are shared among all of the DRAM modules. Each DRAM module deals with only one word at a time, and all of the bits in each word belong to a single bit-field which may be shifted. There is no fixed assignment of I/O resources to particular DRAM modules. How far a given bit-field can be shifted in one step depends on competition for resources among all the modules. In the described embodiment, sufficient communications resources are provided to simultaneously shift 8 bit-fields, each by up to the distance between two elements of a site-group, with each bit-field shifting along all three orthogonal dimensions at the same time. The actual maximum size of these shifts in units of lattice positions depends upon the size of the sector, which is what determines the site-group spacing. With the same communication resources, four bit-fields can be shifted up to twice as far, two bit-fields four times as far, or one bit-field eight times as far.
- Bits that are to be replaced in one node are output through the mesh I/O unit 86 onto the mesh I/O interface 14 (FIGS. 1-2) to be received by a mesh I/O unit in another node, where the received bits are used to replace the corresponding bits that were output from that node, as earlier described.
- Mesh signals are reclocked after each traversal of a mesh link 14, and a single bit can hop along each of the three dimensions in turn as part of a single shift operation, thereby allowing the bit-field shifts to be in arbitrary directions in 3D. If the processing nodes are interconnected as a 1D or 2D array, the mesh I/O resources from the unused dimensions are reassigned to active dimensions.
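- The way per-node periodic shifts plus bit substitution combine into a uniform lattice shift can be sketched for two 1D sectors (illustrative Python; function names are not from the patent). Each node rotates its own sector; the bits that wrapped around are then substituted for the corresponding bits of the neighboring sector:

```python
def _rot(sec, s):
    # periodic (wrapping) shift of one sector by s positions
    w = len(sec)
    return [sec[(i - s) % w] for i in range(w)]

def uniform_shift_two_nodes(sec0, sec1, s):
    """Uniform shift by s (s < sector length) of a two-sector lattice."""
    rot0, rot1 = _rot(sec0, s), _rot(sec1, s)
    # mesh substitution: the first s positions of each rotated sector
    # hold data that wrapped around and belongs in the other sector
    new0 = rot1[:s] + rot0[s:]
    new1 = rot0[:s] + rot1[s:]
    return new0, new1
```

Concatenating the two resulting sectors reproduces a uniform periodic shift of the combined lattice, even though each node only rotated its own data and exchanged s bits with its neighbor.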
- the output from the DRAM module 20 on a 64-bit data bus 92 is a 64 bit word of correctly shifted bit-field data that is available as input data for the processing elements 18.
- the processing node 12 includes twenty of the DRAM modules 20. This number of modules is a practical number which can reasonably be fit onto a semiconductor chip today. Using twenty DRAM modules, the processing node can potentially process up to 20 bits of shifted bit-field data for each of 64 lattice sites at a time, as illustrated in the memory access example 94 of FIG. 10. Referring to FIG. 10, rows of words accessed simultaneously 95 (in each of twenty DRAM modules 20 of FIG. 2) are shown. The first word 96a accessed in each DRAM module 20 is shown on the left, the second word 96b is shown on the right.
- Each DRAM row consists of 32 64-bit memory words; all 32 words of each row are processed as a unit, all 32 being either read or written. For simplicity, however, only two words of each row are depicted. The order in which the various words are accessed depends upon the various shifts of the associated bit-fields that are being performed, as was described earlier.
- Some of the twenty DRAM rows 95 that are simultaneously accessed may contain non-bit-field data. For example, one of the rows may contain data which controls which set of DRAM rows will be processed next, and how they will be processed.
- Groups of twenty words are accessed simultaneously. Of these twenty words 95, those that contain bit-field data that are to be used together all are associated with the same set of 64 lattice sites: the same site- group.
- FIG. 10 illustrates that groups of corresponding bits from each simultaneous word (e.g., 97 or 98) are handled by individual PEs
- Each PE processes bit-field data from one lattice site at a time.
- a wide variety of different processing elements, with or without persistent internal state, are compatible with the memory/communications organization used in this design.
- the processing element 18 (FIG. 2) is illustrated as a PE well suited to symbolic processing of lattice data.
- a 20-bit memory-to-PE interface 100 connects each PE 18 to the twenty DRAM modules 20.
- Each PE 18 receives a bit-line from each of the DRAM modules 20 and all of the twenty bit lines in the interface 100 for a particular PE 18 correspond to the same bit position within a DRAM word. Some of the lines are used as inputs and some are used as outputs. The direction of data flow depends upon how the DRAM modules have been configured for the current set of rows that are being processed.
- the PE 18 includes a permuter 102, which attaches each of the 20 bit-lines from the memory 16 to any of 20 functional lines inside the PE.
- the permuter 102 is a reconfigurable switching device which produces a complete one-to-one mapping of bit-lines from two separate sets of lines (i.e., the memory module side and internal PE side) based on configuration information supplied by the controller 30 (FIG. 2) .
- the permuters 102 in each PE 18 are configured identically at any given time.
- In each PE 18, 9 inputs are dedicated to data coming from a set of bit-fields, 8 outputs are dedicated to data going to a different set of bit-fields, one input is dedicated to program control information that is sent to the controller 30, one input carries lookup table data to be used for later processing, and the remaining wire is involved in I/O operations to and from the memory 16.
- the permuter allows data from any DRAM module to play any role. Bit-field data flows through the processing elements. Input data arrive from one set of DRAM modules and results are deposited into a different set of DRAM modules.
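- The permuter's role can be sketched as follows (illustrative Python): it applies a complete one-to-one mapping between the 20 memory bit-lines and the 20 functional lines inside the PE, so any DRAM module can play any role.

```python
def apply_permuter(memory_bits, mapping):
    """mapping[i] names which memory bit-line feeds internal PE line i;
    the mapping must be a complete one-to-one assignment, as produced
    by the permuter 102 from controller configuration data."""
    assert sorted(mapping) == list(range(len(memory_bits)))
    return [memory_bits[mapping[i]] for i in range(len(memory_bits))]
```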
- a given choice of DRAM data directions, mesh I/O communication paths and PE permuter settings lasts at least 32 clocks (the time it takes to access all 32 64-bit words of a given 2 Kbit row).
- the amount of time required to process one set of DRAM rows is referred to as a row-period.
- the basic data-transforming operation within each PE 18 is performed by a lookup table (LUT) 104 with 8 inputs and 8 outputs. All LUTs 104 in all of the PEs use identical table data. Each LUT 104 performs independent 8-bit lookups into the shared data. Eight input bits 106 from some lattice site are transformed by the LUT 104 into 8 new output bits 108, which are deposited into a different set of bit-fields than the input bits 106 came from. A ninth input bit is used as a conditional bit 110. This ninth bit (together with global control information) determines whether or not the LUT 104 should be bypassed within the PE. When not bypassed, the 8-bit LUT output 108 becomes the 8-bit PE output 112.
- the conditional bit operates as a select for a LUT MUX 114, which receives as inputs the input bits 106 and the 8-bit LUT output 108 and, based on the state of the conditional bit 110, selects one of these inputs as the PE output 112.
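- The LUT-plus-MUX datapath can be sketched as follows (illustrative Python; the polarity of the conditional bit, and the convention that a bypassed PE passes its input bits through unchanged, are assumptions):

```python
def pe_step(lut, inputs, cond, bypass_enabled):
    """One PE data transform: 8 input bits -> 8 output bits.

    lut: a 256-entry table of 8-bit integers (the shared LUT data).
    cond: the conditional bit 110, used as the select of the LUT MUX.
    """
    idx = int("".join(map(str, inputs)), 2)   # pack 8 bits into a 0..255 index
    if bypass_enabled and cond == 0:
        return inputs[:]                      # LUT bypassed: inputs pass through
    val = lut[idx]                            # table lookup
    return [(val >> (7 - k)) & 1 for k in range(8)]  # unpack msb-first
```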
- Larger LUTs (i.e., LUTs with more inputs) may also be used.
- any calculation on a regular spatial lattice can be performed.
- PEs 18 operate in the same manner. Each works on the same set of bit-fields and sees the data for a particular lattice site. They each transform their respective data in the same manner, using the same LUT 104.
- the LUT 104 has 256 8-bit entries, specified by a total of 2 Kbits of data copied from a single DRAM row. During each row-period, one DRAM module is selected to provide a row of data for use as the LUT 104 during the next row-period. The data arrives as 32 64-bit words, with one bit of each word entering each PE through a next-LUT input 122 during each of 32 clocks.
- each PE stores 32 bits of current LUT data and 32 bits of next-LUT data.
- Each of the 64 PEs broadcasts its current 32 bits of LUT data onto a separate portion of the 2K-bit LUT bus 40, and all of the PEs share the data on the LUT bus 40, each using a multiplexer to perform 8- input/8-output lookups with these 2K shared bits.
- the composition of the LUT 104 is shown.
- the 8 bits of LUT input data 106 control a 256x8 to 8 multiplexer 130, which selects 8 bits of data from the LUT bus 40.
- the LUT 104 further includes a LUT data unit 132, which holds 64 bits of LUT data.
- the LUT data unit 132 is illustrated in more detail in FIG. 13. Referring to FIG. 13, the LUT data unit 132 includes a 32-bit shift register 140 for loading a sequence of 32 next-LUT data bits 122 on consecutive clocks of the row-period, and a 32-bit latch 142 which can latch 32 bits in parallel from shift register 140 and drive them onto 32 distinct wires of the 2 Kbit wide LUT bus 40.
- New data is serially accumulated in the shift register 140 while previous data is being driven onto the LUT bus 40.
- All LUT data can be changed as often as every row-period.
- Both the serial loading of next-LUT data 122 and the parallel loading of current-LUT data 134 are separately controlled during each row-period (with shared control for all PEs).
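- The double buffering in the LUT data unit can be sketched as a 32-bit serial shift register paired with a 32-bit parallel latch (illustrative Python; class and method names are not from the patent):

```python
class LutDataUnit:
    """Sketch of the per-PE LUT data unit (FIG. 13)."""

    def __init__(self):
        self.shift_reg = [0] * 32   # accumulates next-LUT data serially
        self.latch = [0] * 32       # drives 32 wires of the LUT bus

    def clock(self, next_lut_bit):
        # serial load: one next-LUT bit per clock of the row-period
        self.shift_reg = self.shift_reg[1:] + [next_lut_bit]

    def load_current(self):
        # parallel transfer at a row-period boundary; latched bits keep
        # driving the LUT bus while new data accumulates in the register
        self.latch = list(self.shift_reg)

    def bus_bits(self):
        return list(self.latch)
```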
- the above-described scheme also provides a large lookup table shared by all PEs that can be quickly filled with a row of bit-field data.
- all lattice sites in the same sector can randomly access the set of lattice site data contained in the LUT 104.
- This provides a non-local communications mechanism.
- a similar operation is also very useful for data reformatting.
- a row of bit-field data to be reformatted is stored into the LUT.
- a set of predefined constant rows of data are then used as LUT inputs in order to permute this data within a row (or even between rows) in any desired manner.
- This kind of operation can be made more efficient if, in addition to an 8-input/8-output LUT, the same 2 Kbits of table data can also alternatively be used as an 11-input/1-output LUT. Since this only uses a total of 12 wires, whereas an 8-input/8-output LUT uses 16, there are 4 unused LUT wires in this case. These can be usefully assigned as output wires, containing additional copies of the single output value.
- the conditional bit 110 can still be used in the 11-input/1-output case. It simply replaces the single output bit of the LUT with one of the inputs.
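- The two views of the same 2 Kbits of table data can be sketched directly (illustrative Python): a 256-entry table with 8-bit outputs and a 2048-entry table with 1-bit outputs are simply different indexings of the same bits, since 256 x 8 = 2^11 = 2048.

```python
def lut_8x8(bits, idx8):
    # 2048 shared bits viewed as 256 entries of 8 bits each
    return bits[idx8 * 8:(idx8 + 1) * 8]

def lut_11x1(bits, idx11):
    # the same 2048 bits viewed as a 2048-entry, 1-bit-output table
    return bits[idx11]
```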
- Bit-serial processing is also fully compatible with the site-group shifting mechanism, and allows economical register use with time-multiplexed PEs.
- Bit-serial arithmetic hardware receives the bits of the numbers it operates on sequentially. For example, to multiply two unsigned integers, the bits of the multiplicand might first be sent into the serial multiplication unit, one bit at a time. Then the bits of the multiplier would be sent in one at a time, starting with the least significant bit (lsb) . As the multiplier bits enter the multiplication unit, bits of the product leave the multiplication unit.
- the hardware inside the multiplication unit is very simple.
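- One simple realization of such a unit is sketched below (illustrative Python; the names and the shift-and-add accumulator are not from the patent). The multiplicand is loaded serially; each multiplier bit then adds a partial product, emits one product bit, and shifts the accumulator, and n flush clocks produce the high half of the product.

```python
class SerialMultiplier:
    """Sketch of a bit-serial unsigned multiplier for n-bit operands.
    Multiplicand bits are loaded first (lsb first); multiplier bits then
    stream in lsb first, and product bits stream out lsb first."""

    def __init__(self, n):
        self.n = n       # operand width; the full product takes 2n clocks
        self.m = 0       # parallel copy of the multiplicand
        self.acc = 0     # running partial-product accumulator
        self.loaded = 0

    def load_multiplicand_bit(self, b):
        self.m |= b << self.loaded
        self.loaded += 1

    def step(self, multiplier_bit):
        # add the selected partial product, emit one product bit, shift
        self.acc += multiplier_bit * self.m
        out = self.acc & 1
        self.acc >>= 1
        return out

    def flush_bit(self):
        # after all multiplier bits, n more clocks drain the high bits
        out = self.acc & 1
        self.acc >>= 1
        return out
```

For example, multiplying 13 by 11 takes 4 load clocks, 4 step clocks and 4 flush clocks, yielding the 8 product bits of 143 lsb first.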
- integer data is stored together in DRAM rows, and serial arithmetic hardware is added to each PE.
- An appropriate data format for serial arithmetic is to have single DRAM rows hold data corresponding to many different bit fields for the same set of lattice sites. For example, one word of a row could contain the lsb of a 32-bit integer present at each of 64 lattice sites (i.e., the lsb bit-field for a site- group) . Other words within the row would contain each of the other bit-fields for the same site-group of integers.
- An exemplary data format for serial arithmetic is illustrated in FIG. 14B. Referring to FIG. 14A, in a single-bit-field per row data format 160,
- all words 161 in a given DRAM row contain data belonging to the same bit-field.
- Each word 161 contains data from a different site-group. Taken together, these words form a larger site-aggregate.
- In a single site-group per row data format (or numerical row format) 162, all words contain data from the same site-group, with each word belonging to a different bit-field.
- each PE sees the consecutive bits of an integer -- for example, PEO sees consecutive bits of one integer 168 and PE63 sees consecutive bits of another integer 170 -- in successive clocks, which is exactly the kind of format needed by serial arithmetic algorithms. Reading these words in other orders yields other useful serial bit orderings.
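- This bit ordering can be sketched as follows (illustrative Python): in the numerical row format, row word w holds bit w of the 32-bit integer at each of 64 sites, so delivering one word per clock gives PE p the consecutive bits of its own site's integer, lsb first.

```python
def stream_to_pes(row, num_pes=64, bits_per_int=32):
    """row[w] is the 64-bit word holding bit w of the integer at each
    of 64 lattice sites (bit-field w for the site-group)."""
    per_pe = [[] for _ in range(num_pes)]
    for w in range(bits_per_int):
        word = row[w]                       # one word per clock
        for p in range(num_pes):
            per_pe[p].append((word >> p) & 1)   # PE p takes bit p
    return per_pe
```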
- a number of DRAM rows belonging to the same site-group of lattice sites may also be processed before moving on to the next site-group. In this way, data can remain in PE registers during sequences of operations.
- the single site-group per row data format 162 puts site-groups of 32-bit integers together into single DRAM rows. By addressing the appropriate set of rows, shifted integer data can be brought together for a given site-group. Since each DRAM word is the portion of a bit-field belonging to this site-group, the rotation and inter-chip bit substitution hardware of FIG. 9 is perfectly suited to complete the shift of integer bit-field data seamlessly, exactly as described earlier. Data can also be quickly converted back and forth between single site-group per row format 162 and the single-bit-field per row format 160 (FIG. 14A) as necessary, using the LUT-based PE of FIG. 11.
- If the controller 30 is able to change the order of the LUT inputs at each clock (e.g., the permuter is a Benes network, and the controller changes the bits controlling the final butterfly involving the LUT inputs), then this format conversion only requires each bit of each number to pass through the PEs twice. If about 100 bits of storage are available within each bit-serial arithmetic processor, this conversion can be done in a single pass. Moreover, since numbers will mostly be handled arithmetically, such conversion shouldn't need to be done frequently.
- Single-input and single-output bit-serial arithmetic hardware can be integrated with the LUT based PE of FIG. 11. For example, eight copies of such serial hardware (with a total of eight inputs and eight outputs) could be configured by the controller 30 to replace the multiplexer 130 of FIG. 12, taking inputs 106 and transforming them into outputs 108. All serial units in all PEs would share a common configuration/control stream. Next-LUT data 122, I/O data 38, and control data 36 would all pass through the PE as usual. The shared LUT data on the LUT bus 40 would be available for use by the arithmetic hardware. This shared LUT could contain, for example, function tables used by CORDIC algorithms. Note that this bit-serial arithmetic processing approach would also work efficiently in a chip architecture with very few DRAM modules coupled to each set of PEs.
- Providing the processing nodes 12 with access to external memory makes it possible to perform large computations using small arrays of nodes. Even on large arrays of nodes, the usefulness of calculations (particularly in 3 or more dimensions) may depend crucially on how large a lattice can be emulated. External memory is also useful for holding extra lookup table data and other control information, and for accumulating analysis information.
- the master memory interface 22 serves as a high-speed interface to a (potentially large) external memory associated with each node. Communication between external memory and the DRAM modules 20 passes through the PE I/O port 38 (FIG. 11) . External memory can be regarded as an additional level in the memory granularity hierarchy discussed earlier. In order to emulate a very large lattice, each processing node can keep most of its sector data in external memory. This sector is partitioned in a shift-invariant manner into external site-aggregates, each consisting of the number of lattice sites that will be accessed together in the external memory.
- the update operation that is to be applied to the entire lattice can be performed on each external site-aggregate separately. Periodically shifted data for a particular external site- aggregate can be read into on-chip memory, processed, and then written back to external memory. If the update operation involves lattice sites with many bit-fields, some of which must be accessed multiple times in the course of the update, then completely processing one external site aggregate before moving on to the next may save a significant amount of time (since keeping the data on-chip greatly speeds up the repeated accesses) .
- the single site-group per row data format 162 discussed earlier makes it possible to have each DRAM row filled with data from just 64 lattice sites. This can make it convenient to perform numerical computations in which very large data objects are kept at each lattice site, and only a very small part of the lattice is on-chip at any given time.
- Shifting hardware and control can be simplified if some mild constraints are placed on the way that sites can be aggregated.
- a hierarchy of shift-invariant partitions is used to aggregate lattice sites that are processed together, and bit-field data are structured as a corresponding hierarchy of bit-aggregates in the memory. Shifting is performed hierarchically. Shifted bit-field data for a largest site-aggregate is accessed by addressing a largest bit-aggregate associated with a correspondingly shifted largest site-aggregate, and then performing any remaining shift on the addressed data. This remaining shift only involves data within the largest bit-aggregate, and is performed by a recursive application of the same technique of splitting the shift into a part that can be performed by addressing, and a part that is restricted to smaller site-aggregates.
- Shifting can be simplified if the site-aggregates that are grouped together to form a larger aggregate are all related by a translation along a single direction.
- the first 16 site-groups of the partition illustrated in FIG. 7 are all horizontal shifts of each other, and so could form such a single-direction site-aggregate.
- the vertical shifts of such a site-aggregate would form other single-direction site-aggregates, which together would constitute a shift-invariant partition of the lattice.
- the site-groups that form each aggregate are naturally ordered sequentially along the aggregation direction. If the corresponding words of a bit-field are similarly ordered, then periodic shifts along this direction only involve a rotation of this ordering.
- a particularly simple example of single-direction aggregation is the grouping of individual sites into the striped site-groups shown in FIG. 7. As already discussed, in performing bit-field shifts only a rotator is needed to reorder the bits within words.
- This data includes the next row address to be used by each of the DRAM modules 20, information about the order in which words within rows should be accessed for each DRAM module, information about word rotations for controlling the barrel shifter 84 (FIG. 9) and the mesh I/O unit 86, the common setting to be used for all permuters 102 (FIG. 11) and other PE configuration data, information about which DRAM module will be connected via I/O 38 to external memory through RDRAM master 24 (FIG. 2) or RDRAM slave 28, etc.
- the 2Kbits of control data can be viewed as a single microcode instruction. Provisions are made for encoding a sequence of similar operations on a group of consecutive rows within each DRAM module 20 as a single instruction in order to reduce the memory used for instruction storage.
- control and initialization data also pass through external I/O interfaces 28 and 34 (FIG. 2). These I/O channels are used for initializing memory contents and for real-time control and feedback.
- Instruction data are stored within the memory modules 20 of each node, and function as a set of microprograms. Execution of the current microprogram and scheduling of the next are overlapped: data are broadcast to all processing nodes about which microprogram will be executed next.
- Low- bandwidth data-I/O (including initial program loading into all nodes) can also use the serial-I/O interface 34.
- For higher bandwidth external I/O, data is accessed through the slave interface 26 of the distinguished node, and the DRAM on this node is memory mapped. Any data anywhere in the array of nodes can be shifted (under microprogram control) through the mesh I/O interface 14, so that it becomes accessible within the distinguished node. Data that is common to all nodes (or any subset of nodes) can be written once, and then rapidly distributed under microprogram control. This kind of data broadcast is important for distributing program data to all nodes.
- Conditional operations can be performed which depend upon lattice data. Each conditional operation involves using serial-I/O interface 34 to communicate a request to all other nodes, which may subsequently at a suitable decision point simultaneously initiate a new microprogram -- without the intervention of an external microprocessor. Some control latency can be hidden by speculative execution. The next microprogram is started at the decision point assuming no new program will be scheduled. This program is written in a manner that avoids overwriting current lattice state information as new state information is generated, at least until enough time has passed that it is known that no new program will be scheduled. Such execution can be interrupted if necessary, and a new microprogram started that ignores the partially completed new lattice state data.
- a nonvolatile memory such as a serial ROM can be connected to serial I/O line 34 to provide initialization data, making it possible to avoid the use of a microprocessor altogether.
- Controller status information and DRAM I/O data 38 may be placed on the serial-I/O interface 34 under program control. This data can be decoded by external logic to produce signals that interface with external circuitry (e.g., interrupt signals for a microprocessor). It might be convenient to have a simple conventional processor on-chip managing the serial-I/O interface 34, thereby making its protocols flexible and extendible.
- Such an on-chip processor could also be useful in system initialization.
- the foregoing efficiently supports a wide variety of virtual machine models.
- the simplest of these is a fixed-lattice machine having uniform bit-field data movement.
- Another supported model is the multi-resolution machine: a fixed lattice machine in which some bit-fields are represented with less resolution than others.
- This kind of model can be implemented by reusing the same bit-field data at several different nearby-shifted positions, rather than keeping separate data for all lattice sites. If the lower resolution data is not changed during site updating, then the processing remains equivalent to a simultaneous updating of all sites. If the lower resolution bits are changed, then their values at the end of each update of the lattice will depend upon the order in which they are paired with other lattice bits.
- a related model is the multi-grid machine, in which power-of-two subsets of lattice data interact more often and at greater range than other lattice data. For example, an entire 2D lattice might first be updated using a nearest neighbor interaction, then only sites with coordinates that are both even would be updated, using a second neighbor interaction along each axis, then only sites with coordinates that are multiples of four using a fourth neighbor interaction, etc. This kind of technique is sometimes used to speed up numerical lattice simulations.
- each power-of-two subset is an element of a shift-invariant partition of the lattice, and can be constructed out of the kind of shift-invariant striped partitions that have been used above.
- the controller 30 (FIG. 2) also suppresses the mesh-I/O unit substitution of data that won't be updated, permitting all of the communication resources to be reserved for bits that will actually participate in the update. Note that, when eight or fewer bits in a site-group are shifted, these can be moved arbitrarily far through the lattice before being substituted for bits in other nodes (the number of clocks used by the mesh communication pipeline is extended as necessary).
- a particular kind of multi-resolution model is a pyramid machine model.
- a 2D example of such a model might begin with a lattice filled with numerical data, with the goal being to calculate the sum of all of the numbers. This could be accomplished by partitioning the lattice into 2x2 blocks and calculating the sum for each block separately. These sums could then in turn be partitioned into 2x2 blocks, and the sum for each of these blocks calculated, and so on.
- data at two different resolutions interact, and the spatial distance between the lower-resolution sum-sites (which can be pictured as lying at the center of each 2x2 block of higher-resolution sum-sites) doubles at every step.
- the final steps are performed by masking updating of some sites using the conditional bit 110, and taking advantage of fast shifts of sparse data. This kind of calculation is useful for accumulating statistical information about a spatial system, finding extreme values of field data, and for other kinds of analysis.
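- The 2x2 block-summing recursion of the pyramid example can be sketched directly (illustrative Python; the mapping onto masked PE updates and sparse shifts is not shown):

```python
def pyramid_sum(grid):
    """Sum all numbers in a 2^k x 2^k grid by repeatedly replacing
    each 2x2 block with its sum, halving the grid each step."""
    while len(grid) > 1:
        n = len(grid) // 2
        grid = [[grid[2*r][2*c] + grid[2*r][2*c + 1] +
                 grid[2*r + 1][2*c] + grid[2*r + 1][2*c + 1]
                 for c in range(n)] for r in range(n)]
    return grid[0][0]
```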
- a crystal lattice model is a machine model in which the spatial arrangement of lattice data is not uniform, but has a regular crystalline structure. Regularly spaced subsets of the crystal lattice sites are called sublattices, and bit-fields are associated with sublattices. For example, a 2D checkerboard has two sublattices, which could be called the red sublattice and the black sublattice. Some bit-fields might be associated with both sublattices, and some only with the red sublattice. The black sublattice would then have no bit-fields associated with corresponding data. Some of the site-updating might involve the data associated with both sublattices, and might apply to all sites.
- a spatially scalable mesh architecture of the sort described here is also scalable as technology improves, allowing more devices to be put onto each chip.
- the most direct scaling involves simply putting several of the nodes described above onto each chip, arranged in a regular array. Only one copy of the direct RDRAM slave interface is needed per chip.
- the number of PEs may be adjusted to match advancing logic speed and changing DRAM parameters by altering the time-multiplexing factor for the PEs (i.e., the effective word size, as determined by the multiplexer 82 of FIG. 9) .
- Some computations would be more efficient if it were possible to use a smaller site-group. In particular, this would allow the use of smaller 3D sectors to efficiently emulate small 3D lattices.
- the effect of having smaller site-groups can be achieved by splitting the site-groups up into a set of smaller site-aggregates that together form a shift-invariant partition of the lattice.
- Each site-group then consists of several smaller site-aggregates, all of which are processed in parallel.
- the same amount of information is needed to control the permuting of the bits within a word in all cases.
- For a single 64-bit word, for example, 6 bits are needed to specify the rotation amount, and by choosing the aggregation of words into rows appropriately, the same rotation can be used for all words of the same bit-field during a given row-period.
- If the 64-bit word is divided up into four 16-bit aggregates, then 4 bits are needed for a given bit-field to specify a single fixed rotation of all 16-bit aggregates during a row-period.
- Two additional bits are needed to specify a rotation of the order of the four aggregates that comprise each 64-bit word, again totaling 6 bits of control information.
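- The equal control budget can be checked with a sketch (illustrative Python): rotating four 16-bit aggregates by a 4-bit amount and rotating the aggregate order by a 2-bit amount uses the same 6 bits of control as a plain 64-bit rotation.

```python
def rotate(seq, r):
    # periodic rotation of a sequence by r positions
    n = len(seq)
    return [seq[(i - r) % n] for i in range(n)]

def hierarchical_rotate(word, intra, inter):
    """word: 64 bits. Each of the four 16-bit aggregates is rotated by
    `intra` (4 control bits), then the aggregate order is rotated by
    `inter` (2 control bits) -- 6 control bits in total."""
    blocks = [word[i * 16:(i + 1) * 16] for i in range(4)]
    blocks = [rotate(b, intra) for b in blocks]   # rotate within aggregates
    blocks = rotate(blocks, inter)                # rotate aggregates as blocks
    return [bit for b in blocks for bit in b]
```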
- the amount of hardware used in a 64-bit barrel rotator 84 is also sufficient for the more general permutation.
- If the rotator is implemented as a butterfly network, then it is only necessary to change the control of the network to allow the 64-bit word to be split into smaller bit-aggregates that can be individually rotated, and also rotated as blocks within the word.
- This additional flexibility in the control of the butterfly network also removes some constraints on the control of the mesh-I/O unit 86 (FIG. 9) , which may make it slightly more complicated.
- Another possible enhancement concerns interconnect.
- the discussion has been limited to arrays of nodes in 1D, 2D and 3D, since physical arrays of nodes in more dimensions are not uniformly scalable.
- the same hardware described here with only the provision of additional communication resources, and a corresponding change to the mesh-I/O unit, can be used with any number of dimensions of interconnect.
- the physical interconnect does not limit the maximum number of dimensions of the lattice that can be emulated by a given array of nodes of the preferred embodiment, since each node can emulate a sector of a lattice with any desired number of dimensions, limited only by available memory.
- The embodiment described here is aimed at simultaneously achieving very high memory bandwidth, a single large on-chip address-space, and efficient sharing of inter-chip communications resources. Similar architectures based on data movement using shift-invariant partitioning can be adapted to other constraints.
- A particularly interesting example is a lattice-computer node design that is constrained to be essentially just an ordinary DRAM chip. In this case, a single large block of on-chip DRAM might be coupled to the PEs, with whole rows still accessed one word at a time. By providing storage for several rows of DRAM data along with the PEs, new PEs very similar to those outlined above could be constructed (but with only one serial arithmetic unit per PE).
- Aside from mesh-communications resources (i.e., pins), if the master RDRAM interface 24 in FIG. 2 is omitted, the result would be a memory chip with only a handful of extra pins.
- Correctly shifted bit-field data for the PE inputs would be accumulated one row at a time; then the LUT would be used to produce the output rows, one at a time, which would be stored back into DRAM.
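The row-at-a-time LUT step can be sketched as follows (a simplified model with made-up names): each bit position of a row acts as a PE that assembles one bit from each shifted bit-field row into a LUT index, and the LUT outputs are reassembled into output rows, one per output bit-field.

```python
def process_row(in_rows, lut, n_out, width=64):
    """Apply a lookup table at every bit position of a row.

    in_rows: one integer per input bit-field (width bits each).
    lut:     table of 2**len(in_rows) entries, each an n_out-bit value.
    Returns one integer per output bit-field."""
    out_rows = [0] * n_out
    for pos in range(width):
        index = 0
        for f, row in enumerate(in_rows):
            index |= ((row >> pos) & 1) << f
        value = lut[index]
        for o in range(n_out):
            out_rows[o] |= ((value >> o) & 1) << pos
    return out_rows
```

With two input bit-fields and `lut = [0, 1, 1, 0]` (XOR), `process_row([0b1100, 0b1010], lut, 1, width=4)` returns `[0b0110]`: every bit position computes the XOR of its two input bits in parallel.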
- The single memory-module version of the data-movement architecture discussed above uses more buffering and less parallelism than the 20 memory-module version. Intermediate architectures with a few coupled memory modules would also be interesting. These would also share the advantage of having little memory bandwidth dedicated to specific functions, such as control, and would have more parallelism.
- In multi-module embodiments, including the 20 memory-module embodiment detailed above, it may be useful to allow memory lines that aren't used by a PE to be connected to each other. Since all of the bit-field shifting is done by the memory modules 20, this would allow bit-field data to be shifted and copied independently of the other operations of the PEs.
- An FPGA with a direct RDRAM interface would provide a convenient way to connect a processing node to external circuitry -- for example, for image processing.
- An alternative would be to put some FPGA logic onto the same chip with the processing node, adding some reconfigurable I/O pins, and perhaps making the existing mesh-I/O pins reconfigurable.
- Such a hybrid lattice/FPGA chip would be particularly convenient for embedded applications, which would involve electronic interfacing and some buffering of data for synchronous lattice processing.
- The FPGA array would connect to the rest of the chip through the controller 30 of FIG. 2. It would be capable of overriding parts of the controller's state machine, in order to directly control the RDRAM interfaces and other on-chip resources. It could use the DRAM modules 20 simply as high-bandwidth on-chip memory, if desired.
- The design of the PE is quite independent of the mechanism described here for efficiently assembling groups of shifted lattice site bits.
- The same shift mechanism can be used with many fewer or many more bit-fields coming together at each PE.
- The basic elements stored and shifted and applied to each PE can also be larger than single bits.
- There is an extensive literature on SIMD PEs, which provides many alternatives for how to independently and identically process many parallel streams of data.
- The preferred embodiment described here couples one particular style of SIMD processing with a rather general data-field shift mechanism in a spatial lattice computer.
Abstract
Description
Claims
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU77595/00A AU7759500A (en) | 1999-08-12 | 2000-08-14 | A mechanism for efficient data access and communication in parallel computations on an emulated spatial lattice |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/373,394 | 1999-08-12 | ||
US09/373,394 US6205533B1 (en) | 1999-08-12 | 1999-08-12 | Mechanism for efficient data access and communication in parallel computations on an emulated spatial lattice |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001013256A1 true WO2001013256A1 (en) | 2001-02-22 |
WO2001013256A9 WO2001013256A9 (en) | 2002-08-01 |
Family
ID=23472228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/040633 WO2001013256A1 (en) | 1999-08-12 | 2000-08-14 | A mechanism for efficient data access and communication in parallel computations on an emulated spatial lattice |
Country Status (3)
Country | Link |
---|---|
US (1) | US6205533B1 (en) |
AU (1) | AU7759500A (en) |
WO (1) | WO2001013256A1 (en) |
Families Citing this family (83)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3391262B2 (en) * | 1998-05-11 | 2003-03-31 | 日本電気株式会社 | Symbol calculation system and method, and parallel circuit simulation system |
US7412462B2 (en) * | 2000-02-18 | 2008-08-12 | Burnside Acquisition, Llc | Data repository and method for promoting network storage of data |
US7233167B1 (en) * | 2000-03-06 | 2007-06-19 | Actel Corporation | Block symmetrization in a field programmable gate array |
US6268743B1 (en) | 2000-03-06 | 2001-07-31 | Actel Corporation | Block symmetrization in a field programmable gate array
US6285212B1 (en) | 2000-03-06 | 2001-09-04 | Actel Corporation | Block connector splitting in logic block of a field programmable gate array |
US6861869B1 (en) | 2000-03-06 | 2005-03-01 | Actel Corporation | Block symmetrization in a field programmable gate array |
US6636930B1 (en) * | 2000-03-06 | 2003-10-21 | Actel Corporation | Turn architecture for routing resources in a field programmable gate array |
US6915277B1 (en) * | 2000-05-10 | 2005-07-05 | General Electric Capital Corporation | Method for dual credit card system |
US6725230B2 (en) * | 2000-07-18 | 2004-04-20 | Aegis Analytical Corporation | System, method and computer program for assembling process data of multi-database origins using a hierarchical display |
US6754802B1 (en) | 2000-08-25 | 2004-06-22 | Micron Technology, Inc. | Single instruction multiple data massively parallel processor systems on a chip and system using same |
WO2002021323A2 (en) * | 2000-09-08 | 2002-03-14 | Avaz Networks | Hardware function generator support in a dsp |
EP1376380A1 (en) * | 2002-06-14 | 2004-01-02 | EADS Deutschland GmbH | Procedure for computing the Choleski decomposition in a parallel multiprocessor system |
US7577553B2 (en) * | 2002-07-10 | 2009-08-18 | Numerate, Inc. | Method and apparatus for molecular mechanics analysis of molecular systems |
US7478096B2 (en) * | 2003-02-26 | 2009-01-13 | Burnside Acquisition, Llc | History preservation in a computer storage system |
US7145049B2 (en) * | 2003-07-25 | 2006-12-05 | Catalytic Distillation Technologies | Oligomerization process |
US6990010B1 (en) | 2003-08-06 | 2006-01-24 | Actel Corporation | Deglitching circuits for a radiation-hardened static random access memory based programmable architecture |
US7394472B2 (en) * | 2004-10-08 | 2008-07-01 | Battelle Memorial Institute | Combinatorial evaluation of systems including decomposition of a system representation into fundamental cycles |
US7607129B2 (en) * | 2005-04-07 | 2009-10-20 | International Business Machines Corporation | Method and apparatus for using virtual machine technology for managing parallel communicating applications |
US20070067488A1 (en) * | 2005-09-16 | 2007-03-22 | Ebay Inc. | System and method for transferring data |
US20070180310A1 (en) * | 2006-02-02 | 2007-08-02 | Texas Instruments, Inc. | Multi-core architecture with hardware messaging |
US20070226455A1 (en) * | 2006-03-13 | 2007-09-27 | Cooke Laurence H | Variable clocked heterogeneous serial array processor |
US8656143B2 (en) | 2006-03-13 | 2014-02-18 | Laurence H. Cooke | Variable clocked heterogeneous serial array processor |
WO2007131190A2 (en) | 2006-05-05 | 2007-11-15 | Hybir Inc. | Group based complete and incremental computer file backup system, process and apparatus |
GB2446199A (en) | 2006-12-01 | 2008-08-06 | David Irvine | Secure, decentralised and anonymous peer-to-peer network |
JP4497184B2 (en) * | 2007-09-13 | 2010-07-07 | ソニー株式会社 | Integrated device, layout method thereof, and program |
WO2009146267A1 (en) * | 2008-05-27 | 2009-12-03 | Stillwater Supercomputing, Inc. | Execution engine |
US9501448B2 (en) | 2008-05-27 | 2016-11-22 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US8090896B2 (en) * | 2008-07-03 | 2012-01-03 | Nokia Corporation | Address generation for multiple access of memory |
US8755515B1 (en) | 2008-09-29 | 2014-06-17 | Wai Wu | Parallel signal processing system and method |
JP5438551B2 (en) * | 2009-04-23 | 2014-03-12 | 新日鉄住金ソリューションズ株式会社 | Information processing apparatus, information processing method, and program |
US8037404B2 (en) | 2009-05-03 | 2011-10-11 | International Business Machines Corporation | Construction and analysis of markup language document representing computing architecture having computing elements |
JP5817341B2 (en) * | 2011-08-29 | 2015-11-18 | 富士通株式会社 | Information processing method, program, and apparatus |
WO2013100783A1 (en) | 2011-12-29 | 2013-07-04 | Intel Corporation | Method and system for control signalling in a data path module |
CA2836137C (en) | 2012-12-05 | 2020-12-01 | Braeburn Systems Llc | Climate control panel with non-planar display |
AU2014318570A1 (en) * | 2013-09-13 | 2016-05-05 | Smg Holdings-Anova Technologies, Llc | Self-healing data transmission system to achieve lower latency |
US10331583B2 (en) | 2013-09-26 | 2019-06-25 | Intel Corporation | Executing distributed memory operations using processing elements connected by distributed channels |
MX357098B (en) | 2014-06-16 | 2018-06-26 | Braeburn Systems Llc | Graphical highlight for programming a control. |
US10356573B2 (en) | 2014-10-22 | 2019-07-16 | Braeburn Systems Llc | Thermostat synchronization via remote input device |
US10055323B2 (en) | 2014-10-30 | 2018-08-21 | Braeburn Systems Llc | System and method for monitoring building environmental data |
US10430056B2 (en) | 2014-10-30 | 2019-10-01 | Braeburn Systems Llc | Quick edit system for programming a thermostat |
US9703721B2 (en) * | 2014-12-29 | 2017-07-11 | International Business Machines Corporation | Processing page fault exceptions in supervisory software when accessing strings and similar data structures using normal load instructions |
US9569127B2 (en) * | 2014-12-29 | 2017-02-14 | International Business Machines Corporation | Computer instructions for limiting access violation reporting when accessing strings and similar data structures |
CA2920281C (en) | 2015-02-10 | 2021-08-03 | Daniel S. Poplawski | Thermostat configuration duplication system |
US10317867B2 (en) | 2016-02-26 | 2019-06-11 | Braeburn Systems Llc | Thermostat update and copy methods and systems |
US10317919B2 (en) | 2016-06-15 | 2019-06-11 | Braeburn Systems Llc | Tamper resistant thermostat having hidden limit adjustment capabilities |
MX2017011987A (en) | 2016-09-19 | 2018-09-26 | Braeburn Systems Llc | Control management system having perpetual calendar with exceptions. |
US10402168B2 (en) | 2016-10-01 | 2019-09-03 | Intel Corporation | Low energy consumption mantissa multiplication for floating point multiply-add operations |
US10795853B2 (en) | 2016-10-10 | 2020-10-06 | Intel Corporation | Multiple dies hardware processors and methods |
US10416999B2 (en) | 2016-12-30 | 2019-09-17 | Intel Corporation | Processors, methods, and systems with a configurable spatial accelerator |
US10572376B2 (en) | 2016-12-30 | 2020-02-25 | Intel Corporation | Memory ordering in acceleration hardware |
US10474375B2 (en) | 2016-12-30 | 2019-11-12 | Intel Corporation | Runtime address disambiguation in acceleration hardware |
US10558575B2 (en) | 2016-12-30 | 2020-02-11 | Intel Corporation | Processors, methods, and systems with a configurable spatial accelerator |
US10515046B2 (en) * | 2017-07-01 | 2019-12-24 | Intel Corporation | Processors, methods, and systems with a configurable spatial accelerator |
US10445234B2 (en) | 2017-07-01 | 2019-10-15 | Intel Corporation | Processors, methods, and systems for a configurable spatial accelerator with transactional and replay features |
US10515049B1 (en) | 2017-07-01 | 2019-12-24 | Intel Corporation | Memory circuits and methods for distributed memory hazard detection and error recovery |
US10387319B2 (en) | 2017-07-01 | 2019-08-20 | Intel Corporation | Processors, methods, and systems for a configurable spatial accelerator with memory system performance, power reduction, and atomics support features |
US10445451B2 (en) | 2017-07-01 | 2019-10-15 | Intel Corporation | Processors, methods, and systems for a configurable spatial accelerator with performance, correctness, and power reduction features |
US10469397B2 (en) | 2017-07-01 | 2019-11-05 | Intel Corporation | Processors and methods with configurable network-based dataflow operator circuits |
US10467183B2 (en) | 2017-07-01 | 2019-11-05 | Intel Corporation | Processors and methods for pipelined runtime services in a spatial array |
US11086816B2 (en) | 2017-09-28 | 2021-08-10 | Intel Corporation | Processors, methods, and systems for debugging a configurable spatial accelerator |
US10496574B2 (en) | 2017-09-28 | 2019-12-03 | Intel Corporation | Processors, methods, and systems for a memory fence in a configurable spatial accelerator |
US10380063B2 (en) | 2017-09-30 | 2019-08-13 | Intel Corporation | Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator |
US10445098B2 (en) | 2017-09-30 | 2019-10-15 | Intel Corporation | Processors and methods for privileged configuration in a spatial array |
US10445250B2 (en) | 2017-12-30 | 2019-10-15 | Intel Corporation | Apparatus, methods, and systems with a configurable spatial accelerator |
US10417175B2 (en) | 2017-12-30 | 2019-09-17 | Intel Corporation | Apparatus, methods, and systems for memory consistency in a configurable spatial accelerator |
US10565134B2 (en) | 2017-12-30 | 2020-02-18 | Intel Corporation | Apparatus, methods, and systems for multicast in a configurable spatial accelerator |
US10564980B2 (en) | 2018-04-03 | 2020-02-18 | Intel Corporation | Apparatus, methods, and systems for conditional queues in a configurable spatial accelerator |
US11307873B2 (en) | 2018-04-03 | 2022-04-19 | Intel Corporation | Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging |
US10921008B1 (en) | 2018-06-11 | 2021-02-16 | Braeburn Systems Llc | Indoor comfort control system and method with multi-party access |
US11200186B2 (en) | 2018-06-30 | 2021-12-14 | Intel Corporation | Apparatuses, methods, and systems for operations in a configurable spatial accelerator |
US10853073B2 (en) | 2018-06-30 | 2020-12-01 | Intel Corporation | Apparatuses, methods, and systems for conditional operations in a configurable spatial accelerator |
US10459866B1 (en) | 2018-06-30 | 2019-10-29 | Intel Corporation | Apparatuses, methods, and systems for integrated control and data processing in a configurable spatial accelerator |
US10891240B2 (en) | 2018-06-30 | 2021-01-12 | Intel Corporation | Apparatus, methods, and systems for low latency communication in a configurable spatial accelerator |
US10678724B1 (en) | 2018-12-29 | 2020-06-09 | Intel Corporation | Apparatuses, methods, and systems for in-network storage in a configurable spatial accelerator |
US10915471B2 (en) | 2019-03-30 | 2021-02-09 | Intel Corporation | Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator |
US10817291B2 (en) | 2019-03-30 | 2020-10-27 | Intel Corporation | Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator |
US11029927B2 (en) | 2019-03-30 | 2021-06-08 | Intel Corporation | Methods and apparatus to detect and annotate backedges in a dataflow graph |
US10965536B2 (en) | 2019-03-30 | 2021-03-30 | Intel Corporation | Methods and apparatus to insert buffers in a dataflow graph |
US10802513B1 (en) | 2019-05-09 | 2020-10-13 | Braeburn Systems Llc | Comfort control system with hierarchical switching mechanisms |
US11037050B2 (en) | 2019-06-29 | 2021-06-15 | Intel Corporation | Apparatuses, methods, and systems for memory interface circuit arbitration in a configurable spatial accelerator |
US11907713B2 (en) | 2019-12-28 | 2024-02-20 | Intel Corporation | Apparatuses, methods, and systems for fused operations using sign modification in a processing element of a configurable spatial accelerator |
CN111371622A (en) * | 2020-03-13 | 2020-07-03 | 黄东 | Multi-network isolation, selection and switching device and network resource allocation method |
US11925260B1 (en) | 2021-10-19 | 2024-03-12 | Braeburn Systems Llc | Thermostat housing assembly and methods |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5159690A (en) * | 1988-09-30 | 1992-10-27 | Massachusetts Institute Of Technology | Multidimensional cellular data array processing system which separately permutes stored data elements and applies transformation rules to permuted elements |
US5691885A (en) * | 1992-03-17 | 1997-11-25 | Massachusetts Institute Of Technology | Three-dimensional interconnect having modules with vertical top and bottom connectors |
US5848260A (en) * | 1993-12-10 | 1998-12-08 | Exa Corporation | Computer system for simulating physical processes |
1999
- 1999-08-12 US US09/373,394 patent/US6205533B1/en not_active Expired - Lifetime
2000
- 2000-08-14 AU AU77595/00A patent/AU7759500A/en not_active Abandoned
- 2000-08-14 WO PCT/US2000/040633 patent/WO2001013256A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
AU7759500A (en) | 2001-03-13 |
US6205533B1 (en) | 2001-03-20 |
WO2001013256A9 (en) | 2002-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6205533B1 (en) | Mechanism for efficient data access and communication in parallel computations on an emulated spatial lattice | |
KR100319768B1 (en) | Multi-Dimensional Address Generation in Imaging and Graphics Processing Systems | |
Veen | Dataflow machine architecture | |
US6470441B1 (en) | Methods and apparatus for manifold array processing | |
JPH04267466A (en) | Parallel processing system and data comparing method | |
WO2017120517A1 (en) | Hardware accelerated machine learning | |
US11714780B2 (en) | Compiler flow logic for reconfigurable architectures | |
JP2021507383A (en) | Integrated memory structure for neural network processors | |
US4939642A (en) | Virtual bit map processor | |
US20220179823A1 (en) | Reconfigurable reduced instruction set computer processor architecture with fractured cores | |
US11782729B2 (en) | Runtime patching of configuration files | |
Hockney | MIMD computing in the USA—1984 | |
US5991866A (en) | Method and system for generating a program to facilitate rearrangement of address bits among addresses in a massively parallel processor system | |
Wang et al. | FP-AMR: A Reconfigurable Fabric Framework for Adaptive Mesh Refinement Applications | |
Loshin | High Performance Computing Demystified | |
Nodine et al. | Paradigms for optimal sorting with multiple disks | |
US7516059B2 (en) | Logical simulation device | |
TW202227979A (en) | Compile time logic for detecting streaming compatible and broadcast compatible data access patterns | |
WO2022047403A1 (en) | Memory processing unit architectures and configurations | |
Johnsson | Massively parallel computing: Data distribution and communication | |
Williams | Voxel databases: A paradigm for parallelism with spatial structure | |
Pech et al. | A dedicated computer for Ising-like spin glass models | |
CN116774968A (en) | Efficient matrix multiplication and addition with a set of thread bundles | |
Schomberg | A transputer-based shuffle-shift machine for image processing and reconstruction | |
JP3726977B2 (en) | Two-dimensional PE array device, data transfer method, and morphological operation processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
AK | Designated states |
Kind code of ref document: C2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: C2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/14-14/14, DRAWINGS, REPLACED BY NEW PAGES 1/14-14/14; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: JP |