Monday, April 13, 2009

Supercomputer/Parallel Computing


Supercomputer

Abstract

Parallel computers come in several different architectures; the focus of my research has therefore been on which architectures are used today and why developers choose to follow certain avenues, with regard to speedup, and not examine others. This research will be a valuable part of the team's work, as it will help prevent "the wheel from being reinvented", so to speak. Some of the systems researched include Blue Gene and the Blue Gene family, the Cray X1 and the Earth Simulator.
Many other supercomputer systems may be considered in future research. The objective of this project is to provide relevant information to the team so that they may continue to work on their simulator. With the help of this research, the team hopes to create and propose something better than what already exists today.

Introduction

Computer technology has come a long way. We are able to do many different things and have built machines more powerful than might have been imagined 100 years ago, and we will continue to build bigger and better computers. Technology has become a vast influence on our everyday lives; some may even argue that it is impossible to live without it.
The potential for supercomputers to run complex and time-consuming applications is rapidly expanding; however, the market for supercomputers is very limited. Because of the cost of building them, the US Department of Defense and very large companies are the main customers.
Some of the supercomputers being built today are aimed at making our lives easier, more satisfying and more predictable. In terms of speed, the latest and greatest is not a Pentium 4, as a few people may think; it is a supercomputer named the Earth Simulator. According to top500.org, the Earth Simulator has just been rated the number one supercomputer in the world. It is a Japanese-made machine being used to predict Earth phenomena.
Amongst the supercomputer community, the big topic up for discussion is how the US is going to keep up. It has been suggested that if the US wants to remain competitive, it is going to have to learn from the Japanese. US companies may be inspired to gather more information and work even harder to produce better systems than the current Earth Simulator, and their attempts to outclass it could benefit society as a whole.
Blue Gene is still in the early stages of development, but according to www.top500.org it is predicted to surpass the Earth Simulator when completed. Could the final Blue Gene be the answer they have been looking for?
As the world advances rapidly in the technological field, it becomes harder for humans to compete with machines. What happens if computers become so good that they take the place of humans in everything? I must add, this is not completely possible, yet.
Prof. Dr. Hans-Werner Meuer states, in the abstracts from the Supercomputer '98 conference, that "In order to exploit not only parallelism but also introduce scalability and adaptively one has to go deeper into program structures and to change algorithmic features." IBM researchers claim that "High-performance supercomputers are built today by connecting together large numbers of nodes, each consisting of a conventional server system" (IBM Blue Gene team, 2001). However, this is not necessarily the case, as you will see with some of the different systems out there, several of which are mentioned throughout this paper.
"How good a particular computer is depends on the kind of problem you're trying to solve," Decker said. "A major problem with many recent U.S. supercomputers is that they have not been designed with scientific problems in mind. We need to work more closely with the vendors to do that." (Nance, S., 2002)
US companies have said that vector computers are a thing of the past; however, the ES has proved this wrong. I have a feeling that there is not going to be one specific architecture dominating the future of the supercomputer field.
"Perhaps just as important as the processors - or maybe more so - is the interconnect technology through which the many processors communicate with each other. Like the processors, some U.S. vendors have tried using less-expensive interconnects. The Earth Simulator uses new technology" (Nance, S., 2002).

Architectures

There are many different types of architectures for all types of computer systems. Parallel computers have several different architectures, some of which are no longer used today. According to Roosta (2000), Michael Flynn proposed classifying the different architectures by their characteristics: the number of processors, the number of executable programs, or the memory structure. Flynn's taxonomy of computers is still widely agreed upon today.

SISD

The Single Instruction Stream, Single Data Stream (SISD) architecture is a single computer with one Central Processing Unit (CPU), which fetches, executes and stores one instruction at a time. It contains a single Program Counter (PC) register, which enforces serial execution of instructions, serial meaning one fixed sequence or order of execution. Once the current instruction has been fetched and executed, the PC is updated to point to the next instruction that needs to be carried out.
This is a model of what the SISD architecture looks like (Roosta, 2000, p2)
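To make the serial fetch-execute cycle concrete, the following minimal Python sketch simulates a hypothetical one-accumulator SISD machine; the tiny instruction set and memory contents are invented for illustration and are not taken from Roosta.

```python
# A minimal sketch (not any vendor's design) of the SISD fetch-execute cycle:
# one program counter, one instruction stream, one data stream.
memory = {0: ("LOAD", 7), 1: ("ADD", 5), 2: ("STORE", 7), 3: ("HALT", None),
          7: 10}                      # address 7 holds a data value
acc, pc = 0, 0                        # single accumulator, single program counter

while True:
    opcode, operand = memory[pc]      # fetch the one instruction the PC points at
    pc += 1                           # the PC advances serially to the next instruction
    if opcode == "LOAD":
        acc = memory[operand]
    elif opcode == "ADD":
        acc += operand
    elif opcode == "STORE":
        memory[operand] = acc
    elif opcode == "HALT":
        break

print(memory[7])                      # 15: exactly one instruction is handled at a time
```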


SIMD

The Single Instruction Stream, Multiple Data Stream (SIMD) architecture is composed of one Central Processing Unit (CPU), which executes a single stream of instructions, and one or more processing elements (PEs).

(Roosta, 2000, p3)

The CPU generates control signals for each PE, and every PE then executes the same instruction on a different data element; this is how the multiple data stream is achieved. As stated in Roosta, this type of architecture is mainly used when problems exhibit a high degree of small-grain parallelism, and it can also "support vector processing, accomplished by giving vector element to individual processing elements for concurrent computation" (Roosta, S.H., 2000).
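The lockstep behaviour described above can be sketched in a few lines of Python; the "instructions" and data values below are purely illustrative, and each list element stands in for the private data of one PE.

```python
# A toy sketch of SIMD lockstep execution: one control unit broadcasts the same
# instruction to every processing element (PE), each holding a different data element.
data = [2, 4, 6, 8]                              # one element per PE (hypothetical values)

def broadcast(instruction, operand, data):
    # every PE applies the identical instruction to its own element in the same step
    return [instruction(x, operand) for x in data]

data = broadcast(lambda x, k: x * k, 10, data)   # "multiply by 10" issued once
data = broadcast(lambda x, k: x + k, 1, data)    # "add 1" issued once
print(data)                                      # [21, 41, 61, 81]
```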
In comparison to the serial execution of SISD, the speedup with SIMD architectures is vast. Dr. Gita Alaghband states that SIMD machines "are essential in the parallel world of computers" and categorizes SIMD architectures into three types: true distributed memory SIMD, true shared memory SIMD and pipelined SIMD.

True Distributed Memory

True distributed memory SIMD has a single control unit driving many processing elements, which act as the arithmetic units of the system. Each arithmetic unit has its own memory, and only that arithmetic unit has access to it.
The main advantage of this type of architecture is that it is very easy to add more arithmetic units to the computer, thereby adding more memory at the same time. However, the control unit spends a lot of time on memory exchanges, since it must manage them for every one of the many arithmetic units, so some time is wasted in this procedure.
This is a model of what the true distributed memory SIMD architecture looks like (Chau, J., et al., n.d)

True Shared Memory

True shared memory SIMD has been designed to take into account the connection between the processing elements and the memory. This system consists of memory modules, which replace the local memories. The memory modules are shared by all the processing elements, either through a switching unit or a network. In this way, each processing element can share the memory modules without needing access to the control unit. However, the disadvantage of this type of architecture is that it is not easy to add more memory.
This is a model of what the true shared memory SIMD architecture looks like. (Chau, J., et al., n.d)

Pipelined SIMD

Pipelined SIMD is made up of a pipeline of arithmetic units with shared memory (Alaghband, Dr. G. 2003). A pipeline is defined as "a direct channel by which information is privately transmitted" (www.dictionary.com). The pipeline operates in a first-in, first-out fashion, performing the operations the arithmetic units require. To make a pipelined SIMD architecture fast and efficient in terms of speed, the data must be spread across different memory modules so that the pipeline can receive the data as quickly as possible.

MISD

The Multiple Instruction Stream, Single Data Stream (MISD) architecture does not really exist; however, it can be thought of as a pipeline. If such a machine existed, it would consist of several individual processing units and one data item. Each processing unit would receive distinct instructions, which would then be executed on the same data item. Pipeline architectures perform "processing through a series of stages, each of which performs a particular function and produces an intermediate result" (Roosta, S.H., 2000). The intermediate results can be thought of as stages in the pipeline. "While one stage is executing, another stage is being loaded and the input of one stage is the output of the previous stage. The processor carries out many different computations concurrently." (Lomonosov, I., 2002)
This is a model of what the MISD architecture looks like (Roosta, 2000, p4)


MIMD

The Multiple Instruction Stream, Multiple Data Stream (MIMD) architecture contains multiple processors, which share resources. They can each execute different programs and operate independently and in parallel, hence the multiple streams. MIMD is a more complex architecture, and because of this complex structure and its parallel processing it is more efficient than other architectures. At present, MIMD machines usually contain fewer processors than a SIMD machine; by using fewer processors, the machine has the potential to be cheaper. Roosta claims that this type of MIMD machine can take advantage of mid-level to large-grain parallelism. MIMD machines "differ with regard to interconnection networks, memory addressing techniques, synchronization and control structures. A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently" (Lomonosov, I., 2002).
MIMD can be broken down further into two categories, shared memory and distributed memory, according to the way memory is accessed (Alaghband, Dr. G. 2003). The two categories use different architectures to access memory.

Shared Memory

Shared memory MIMD may use bus-based, extended, or hierarchical designs. A bus-based design has a shared central memory to which the processors are connected, with an additional bus that can be used for synchronization amongst the processors; bus-based shared memory can only support a small number of processors.
Extended shared memory divides memory into smaller independent parts to try to avoid conflict between processors. The independent memory units are then connected to each processor by an interconnecting network: "N processors are linked to M memory units which requires N times M switches" (Alaghband, Dr. G. 2003).
Hierarchical shared memory uses a hierarchy of buses to allow processors to access other processors' memory. Even processors on different boards can access each other's memory by communicating over inter-nodal buses, which allows this hierarchical shared memory structure to support many processors.

This is a model of what the shared memory MIMD architecture looks like (Aad van der Steen, 2000)
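As a rough analogy for the shared memory MIMD model, the sketch below uses Python threads as stand-in processors running their own instruction streams against one common memory; the shared counter, lock and thread count are invented for illustration and do not correspond to any real machine.

```python
# A rough sketch of the bus-based shared-memory MIMD idea using Python threads:
# independent instruction streams operating on one common memory.
import threading

shared_memory = {"total": 0}          # the single memory all "processors" see
bus_lock = threading.Lock()           # stands in for synchronization on the shared bus

def processor(my_values):
    for v in my_values:               # each thread runs its own instruction stream
        with bus_lock:                # serialize access to the shared location
            shared_memory["total"] += v

threads = [threading.Thread(target=processor, args=([i] * 100,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_memory["total"])         # 0*100 + 1*100 + 2*100 + 3*100 = 600
```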

Distributed Memory

Distributed memory MIMD may use hypercube or mesh interconnection schemes to access memory. Roosta says that the processors have private memory and access to global memory, which has the potential to reduce communication delays. In a distributed memory machine each processor has its own individual memory, and data is passed to other processors in the form of messages. Because of the huge number of processors available, it is not easy to connect them all directly to each other. "A way to avoid this multitude of direct connections is to connect each processor to just a few others. This type of design can be inefficient." (Alaghband, Dr. G. 2003) Given this inefficiency, two other schemes have been designed to improve it: the hypercube interconnection network and the mesh interconnection network.
The hypercube interconnection network has a number of processors equal to 2 to the power of N, and each processor is connected to exactly N other processors. N can also be referred to as the diameter of the system. "The diameter of the system is the minimum number of steps it takes for one processor to send a message to the processor that is the farthest away." (Alaghband, Dr. G. 2003)
Roosta states that the hypercube was one of the most widely accepted architectures, used by Intel Corporation, NCUBE Corporation, FPS and Thinking Machines Corporation. The disadvantage of the hypercube is that the number of processors must be a power of two, thus "a machine must be built that could potentially have many more processors than is really needed for the application." (Alaghband, Dr. G. 2003)
This is a model of what the hypercube interconnection network looks like. The diagram on the left has 8 processors, each connected to three neighbours; the diagram on the right has 16 processors, each connected to four of its neighbours. (Roosta, 2000, p71)
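The addressing rule behind the hypercube can be sketched directly: label the 2^N processors with N-bit numbers, link nodes whose labels differ in exactly one bit, and the diameter comes out as N. The dimension chosen below is arbitrary.

```python
# A small sketch of the hypercube scheme described above: with 2**n processors,
# node i is linked to the n nodes whose binary labels differ from i in one bit,
# and the diameter (longest shortest path) equals n.
n = 4                                         # dimension; 2**4 = 16 processors

def neighbours(i, n):
    return [i ^ (1 << bit) for bit in range(n)]   # flip one address bit per link

def distance(a, b):
    return bin(a ^ b).count("1")              # hops = number of differing bits

print(neighbours(0, n))                       # [1, 2, 4, 8]
print(max(distance(0, j) for j in range(2 ** n)))   # 4, i.e. the diameter equals n
```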

The mesh interconnection network has the basic structure and framework of a two-dimensional grid. The grid can have as many or as few processors as desired, and each processor is connected directly to its four neighbours, where they exist.
The grid can have either "wrap-around connections" or "no wrap-around connections" (Roosta, 2000). The only difference is that with wrap-around connections the first and last processors in a column or row are connected together, which gives every processor a full set of links and makes communication a little faster than with no wrap-around connections.
"One advantage of the mesh interconnection network over the hypercube is that the mesh system need not be configured in powers of two. A disadvantage is that the diameter of the mesh network is greater than the hypercube for systems with more than four processors." (Alaghband, Dr. G. 2003)
(Roosta, 2000, p70)
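The mesh neighbour rule, with and without wrap-around connections, can be sketched the same way; the grid size below is arbitrary.

```python
# A sketch of the two-dimensional mesh interconnect: each processor at (row, col)
# talks to up to four neighbours, and wrap-around joins the ends of each row and
# column (turning the mesh into a torus).
ROWS, COLS = 4, 4

def mesh_neighbours(r, c, wrap=False):
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    if wrap:
        return [((rr % ROWS), (cc % COLS)) for rr, cc in candidates]
    return [(rr, cc) for rr, cc in candidates if 0 <= rr < ROWS and 0 <= cc < COLS]

print(mesh_neighbours(0, 0))             # corner node: only 2 neighbours
print(mesh_neighbours(0, 0, wrap=True))  # with wrap-around: 4 neighbours
```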

Parallel Architectures

Flynn then goes further to talk about parallel architectures. "The granularity of an algorithm for a problem is one important criterion for determining what kind of parallel architecture is appropriate for that computational problem" (Roosta, 2000, p127). Granularity here refers to the number of processing elements, the size of memory, and the type of interconnecting network, as they differ from one implementation to another. Granularity can be classified as fine-grain, medium-grain or coarse-grain (Wojcik, V. 2003).
Flynn also states that SIMD machines are good for dealing with fine-grain algorithms because they need less hardware and less memory, and only require one control unit, as opposed to MIMD machines. MIMD machines are good for medium-grain algorithms; however, they involve considerably more overhead. On the other hand, MIMD machines allow different programs to be performed on the different processing elements concurrently, whereas SIMD is limited to one instruction at a time across the processing elements, each waiting for the others to be done before it may start.
Lastly, the time taken to execute a sequence of instructions differs greatly between the two types of machines. Since MIMD processors execute their instruction streams independently, they are able to process instructions faster. SIMD machines will clearly be slower, as instructions must be issued one at a time in a queue fashion, and each processor must wait for the others to finish before the next instruction can start.
So how do you classify performance maximization, and how do you measure performance? Sequential and parallel algorithms have to be evaluated differently. Sequential algorithms are evaluated based on their execution time, while parallel algorithms are evaluated depending on their size, their architecture and the number of processing elements that they contain. The major characteristics that affect parallel performance are:
• Clock speed
• Size and number of registers
• Number of concurrent paths to memory
• Instruction issue rate
• Memory size
• Ability to fetch/store vectors efficiently
• Number of duplicate arithmetic functional units
• Whether functions can be chained together
• Indirect addressing capabilities
• Handling of conditional blocks of code

Parallel and sequential algorithms can also be measured according to the quality of the algorithm itself. Some notations for doing this are Big-O notation and the related Theta and Omega notations.
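For reference, the standard textbook definitions behind these notations (not taken from the sources cited here) are:

```latex
% Standard asymptotic notation (textbook definitions)
f(n) = O(g(n))      \iff \exists\, c > 0,\ n_0 : 0 \le f(n) \le c \cdot g(n) \ \text{for all } n \ge n_0
f(n) = \Omega(g(n)) \iff \exists\, c > 0,\ n_0 : f(n) \ge c \cdot g(n) \ \text{for all } n \ge n_0
f(n) = \Theta(g(n)) \iff f(n) = O(g(n)) \ \text{and} \ f(n) = \Omega(g(n))
```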
Some parallel architectures are hybrids, which are defined as "something of mixed origin or composition; something, such as a computer or power plant, having two kinds of components that produce the same or similar results" (www.dictionary.com). The most common hybrids are SIMD-MIMD and dataflow.

SIMD-MIMD

The SIMD-MIMD hybrid is composed of one or more independent SIMD machines and/or one or more independent MIMD machines. An example of this type of machine is TRAC. Roosta states that there are three different types of hybrid systems: PASM, VLIW and MSIMD.

PASM

PASM is an abbreviation for Partitionable SIMD/MIMD system. It was designed at Purdue University and is a reconfigurable parallel system used mainly for image processing and pattern recognition. (Roosta, 2000, p31)

VLIW

VLIW is an abbreviation for Very Long Instruction Word systems. It has all the processing elements under a central control; however, each processing element can perform different operations on different sets of data. (Roosta, 2000, p33)

MSIMD

MSIMD is an abbreviation for Multiple SIMD systems. It is made up of several tree-connected and circuit-switched designs which are reconfigurable. These can also be broken down into smaller configurations. (Roosta, 2000, p34)

Data Flow

Dataflow architectures can be uniprocessor or multiprocessor. They depart from the conventional model that is "based on the principle of execution of a sequence of instructions by the control unit and the determination of the flow of control by the program" (Roosta, 2000, p35). Dataflow programs are normally described using digraphs (directed graphs), where the arcs represent where the results should be sent and the nodes represent the operations.
Dataflow allows operations to be performed as soon as their data is available; this is how an instruction is determined to be ready. Instructions are allowed to run in parallel as long as all the data needed for them is available.
The main difference in this type of architecture is that it does not use conventional methods to execute instructions, as in MIMD or SIMD. Conventional methods execute according to the flow of control given by the program. The disadvantage of dataflow is that there are many complex issues to deal with: there are many processing elements, and one must be assigned as the supervisor. As data is passed along the arcs it is contained within tokens, which must be stored, so dataflow requires some storage capability.
The advantage of dataflow is that it produces a high throughput, because it is able to exploit the maximum degree of concurrency. Dataflow architectures can be divided into three types: static architectures, reconfigurable static architectures and dynamic architectures.
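A toy illustration of the firing rule, where an operation executes as soon as all of its input tokens are present rather than when a program counter reaches it, is sketched below; the digraph computing (a + b) * (a - b) is invented for illustration.

```python
# A toy sketch of dataflow execution: each node fires as soon as all of its input
# tokens are available, rather than following a program counter.
graph = {
    "add": {"op": lambda x, y: x + y, "needs": ["a", "b"]},
    "sub": {"op": lambda x, y: x - y, "needs": ["a", "b"]},
    "mul": {"op": lambda x, y: x * y, "needs": ["add", "sub"]},
}
tokens = {"a": 7, "b": 3}                      # initial data tokens on the input arcs

fired = set()
while len(fired) < len(graph):
    for name, node in graph.items():
        ready = name not in fired and all(src in tokens for src in node["needs"])
        if ready:                              # data availability, not control flow,
            args = [tokens[src] for src in node["needs"]]    # triggers the operation
            tokens[name] = node["op"](*args)
            fired.add(name)

print(tokens["mul"])                           # (7 + 3) * (7 - 3) = 40
```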

Static

Static dataflow architectures allow only one token on an arc at a time, so that it is always clear which set of inputs an instruction's data belongs to. With static dataflow, the complexity of executing the program is taken away from the machine and given to the compilation unit. Computations to be performed are held in a storage area called the activity store; they are queued until an instruction is in the ready state, and then it can be executed. Static dataflow does not support arrays, procedure calls or repetition. (Roosta, 2000, p38)

Reconfigurable Static

Reconfigurable static dataflow is composed of a number of processors. When the program is initially loaded, these processors are connected logically by an interconnection network, and this decision is made by the compiler. This architecture requires at least a minimum number of processors and a physical connection to exist. (Roosta, 2000, p39)

Dynamic

Dynamic dataflow uses a feature called tagged tokens. This is its main feature: it allows arcs to carry several tokens at once, where each token carries a tag to distinguish it from the others. An operation is triggered as soon as tokens carrying matching tags have arrived at its inputs.
The advantage of using tagged tokens is that each arc can carry multiple tokens, allowing more to be transferred at once. It helps to decrease the traffic associated with tokens having to wait for others to finish transmitting before another can be sent. It also eliminates the need for acknowledgement signals, which frees up the communication channel and allows programs to be executed and evaluated concurrently. (Roosta, 2000, p43)

The table below summarizes the concepts, complexity and efficiency of all the architectures:
(Roosta, 2000 p53).
System        Concept    Generality   Complexity   Efficiency
SIMD          Easy       Moderate     High         High
MIMD          Hard       Good         High         Moderate
Pipeline      Easy       Poor         Low          High
Associative   Moderate   Poor         Moderate     High
Data flow     Hard       Good         High         High

In general, the architectures used in supercomputers today are mostly SIMD and MIMD, along with some hybrid systems such as dataflow machines. Recently, SISD machines have also become more widely used in the form of cluster computers. A cluster "is loosely defined as a parallel system comprising a collection of stand-alone computers connected by a network" (www.dictionary.com). Clusters are said to offer many advantages over current supercomputers, mostly cost and availability; however, they do not offer the power and the performance that current supercomputers offer. "Supercomputers generally conform to the SIMD (Single Instruction, Multiple Data) model. That is, multiple processors simultaneously carry out the same sequence of instructions on different sets of data" (Hugo, I., 2001).
The other computer architectures are not used because they do not allow for concurrent data transfers. Given a useful architecture, different things are being done to try to maximize performance. Architectures such as SMASH and TRIPS are being developed for certain systems to improve performance and outdo their competitors. The TRIPS architecture is the only one that explicitly allows the hardware of the system to be altered to meet special requirements.
"TRIPS provides large pools of execution resources and exposing communication latencies throughout the components of the system and delivering system level configurability to applications and runtime systems. In this way, the system can adjust the hardware to meet the varying demands of each application" (Unknown, n.d.).
Conversely, Blue Gene is being made to be adaptable to many different applications. Please refer to the Blue Gene, TRIPS and SMASH sub-sections for more information.

Communication Channels

There are two types of communication channels, serial and parallel, both of which transmit bits via hardware. With serial communication, one transfer starts and must finish before any other bits can be sent, so the order is strictly sequential. With parallel communication, several bits can be transmitted at the same time, and transfers can proceed concurrently, allowing more information to be processed faster.
Channel skew does not occur with serial communication, but it does occur with parallel communication. Channel skew is a distortion that occurs when signals need to travel a long distance over wires: because of the long delay, signals are distorted and can arrive at unpredictable times. The longer the wire, the more distortion there is and the greater the likelihood that the signal will be dropped, and there is also no way of knowing when the signal will reach its destination.
The wires are also limited by their bandwidth. Shorter wires have a higher bandwidth, and as the bandwidth increases, so does the frequency, which creates more distortion. Conversely, as the bandwidth decreases, so does the frequency, and thus the distortion also decreases.
When you look at the advantages and disadvantages of both types of channels, parallelism clearly looks better on the surface. On the other hand, I have not come across any information stating that serial communication is bad. Serial channels are slower, but they guarantee that the data transferred arrives in the order it was sent. Some serial channels do not provide sufficient bandwidth to allow transfer of data at a suitable rate and a realistic cost, and serial channels do have problems with speed and distortion.
Conversely, parallel channels are faster, but distortions occur because of channel skew problems, which have not been fully resolved. Also, signals may not arrive in the order in which they were sent, and there is no guarantee that they will arrive at all, as there is no telling how long a signal will take to reach its destination. Another problem with parallel channels is the cost.
From my research, I have found that there is no stated preference as to which type of communication channel to use. This is because both types of channels can use selector or multiplexing methods. One of the systems, the Earth Simulator, actually uses both parallel and serial channels, and converts the signals via a switching circuit.
In regards to speedup, parallelism is the better choice if you have short wires, even with channel skew problems. Burst mode is a way to help reduce channel skew, but it does not get rid of it: "Burst mode is often referred to a low cost simultaneous sampling method. Bursting is a way to minimize channel to channel skew but it does not eliminate it" (LinuxGuruz, n.d). However, if you have longer wires, serial channels are the way to go, to eliminate the potential channel skew problems.
Multiplexing channels and selector channels are logical views of how data is transferred, and serial and parallel channels can use either. A selector channel transfers one stream of data at a time, using the whole bandwidth of the wire to do so. A multiplexing channel transmits several streams at a time, each using a portion of the bandwidth. Multiplexing can be done by various methods; the one used for computers is time-division multiplexing (TDM).
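A simple sketch of the TDM idea, with several byte streams taking turns in fixed slots on one channel, is shown below; the streams and framing are invented for illustration and do not model any particular hardware.

```python
# A simple sketch of time-division multiplexing (TDM): several byte streams share
# one channel by taking turns in fixed time slots.
from itertools import zip_longest

streams = [b"AAAA", b"BBBB", b"CCCC"]          # three senders sharing one channel

def tdm_multiplex(streams, pad=b"-"):
    frames = zip_longest(*streams, fillvalue=pad[0])   # one slot per stream per frame
    return bytes(byte for frame in frames for byte in frame)

def tdm_demultiplex(signal, n_streams):
    return [signal[i::n_streams] for i in range(n_streams)]   # pick out each slot

line = tdm_multiplex(streams)
print(line)                        # b'ABCABCABCABC'
print(tdm_demultiplex(line, 3))    # [b'AAAA', b'BBBB', b'CCCC']
```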
Many of the IBM computers use a technique called block multiplexing, which was first developed for the IBM 370. The predecessor to the 370 was the 360, whose architecture allowed for either a selector or a multiplexer channel. When the 370 was being developed, IBM wanted to create a channel with the speed of a selector channel but with multiplexing capabilities, and thus block multiplexing was created.
In a byte multiplexer, a few bytes at a time are transferred from storage to the specific device that has requested them, whereas block multiplexing is a technique that allows multiple transfers to be in progress concurrently on one channel, to increase throughput.
The block multiplexer channel, according to the article 370 Systems, is said to be "more efficient than the old selector channels for high speed devices". This technology is the basis of many of the IBM supercomputers today. "Today, the strategic IBM mainframe environment is delivered via the IBM Enterprise Systems Architecture/370" (Gribbin, M. 1989).
Selector channels facilitate the use of switches, whereas multiplexers make it difficult to use a switch. "A switch is likely to be in a wrong position when you want to transmit a message. This introduces delays. Even if you could flip the switch without delays it still has latency" (Wojcik, V. 2003).


Blue Gene

Blue Gene is an IBM project to build the world's fastest supercomputer. The idea was conceived in 1995 by Monty Denneau and Peter Hochschild. The original Blue Gene machine was supposed to be used for high-end, photographic-quality computer graphics using a procedure called ray tracing; however, plans were changed to make Blue Gene into a human protein folding machine.
According to Hugo (2001), "Blue Gene was announced on December 6th 1999, as a 100 million dollar initiative to build the world's fastest supercomputer, aimed at the field of computational biology." The machine is proposed to be completed in 2006, making it a five-year project. IBM states that this will be an incremental project, with the "early studies being carried out on currently available hardware platforms" (IBM Blue Gene team, 2001).
The project was spawned by advances in biology. This supercomputer is being made to simulate human protein folding, which is a very complex mechanism: "Protein folding is the way that proteins fold themselves into unique patterns" (IBM Research, 1999). The Blue Gene project is one of IBM's biggest challenges, the previous being the Deep Blue project, which proved very successful. "Learning more about how proteins fold is expected to give medical researchers better understanding of diseases, as well as potential cures" (IBM Research, 1999).
If Blue Gene is successful, this massive machine will help to increase our understanding of certain biological events and help to validate the models and simulations that have been constructed to date. Since Blue Gene is going to be an incremental process, the stepping stones used, and Blue Gene itself, will be known as the Blue Gene family of supercomputers. This family of supercomputers will be "capable of handling a broad set of scientific and technical challenges in such fields as nanotechnology, life sciences and weather modeling" (Moran Alterio, J., 2001).
IBM's main goal for the Blue Gene project is to advance the understanding of protein folding and to "explore novel ideas in massively parallel machine architecture and software. Major areas of investigation include: how to make such massively parallel machines more usable and how to achieve performance targets, with reasonable cost, through novel machine architectures" (IBM Blue Gene team, 2001).
According to Denneau, in the Alumni News, the biggest challenges so far have been power consumption and latency problems.
IBM hopes to create a petaflop supercomputer, which can only be done through the use of parallelism. They claim that "such a machine would need to have high-bandwidth, low-latency communications; otherwise it would be essentially limited to data-parallel operation, a mode resulting in unacceptably long turn-around time for jobs, each of which would be run on a low-power subset of the machine. A conventional implementation of a petaflop/s machine would be too massive, expensive, and unreliable. Because of the communications requirement, solutions involving huge numbers of loosely coupled personal computers are too inefficient for the desired applications" (IBM Blue Gene team, 2001).
If they are successful in creating high-bandwidth, low-latency communications using integrated processors, it will be a breakthrough in helping to minimize the von Neumann bottleneck.
Denneau states that the plan for Blue Gene is for each chip to contain 128 processors. Each of these will run at a gigaflop, and every four processors will share a data cache and a floating-point unit, making each chip capable of performing at 32 gigaflops. The proposal for Blue Gene is that it will contain 36 chips on a single board, and eight of these boards will be placed in a rack, which Denneau says could still be rearranged due to the space limitations for technicians working on the machine. The final Blue Gene is supposed to have 64 racks linked together.
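Taking Denneau's figures at face value, the aggregate peak of the proposed machine can be worked out directly; this is back-of-the-envelope arithmetic on the numbers quoted above, not an official IBM figure.

```python
# Back-of-the-envelope peak for the proposed Blue Gene, using only the figures
# quoted above; this is the author's arithmetic, not an IBM specification.
usable_gflops_per_chip = 32   # 128 one-gigaflop processors, four sharing each FPU
chips_per_board = 36
boards_per_rack = 8
racks = 64

peak_gflops = usable_gflops_per_chip * chips_per_board * boards_per_rack * racks
print(peak_gflops)            # 589824 gigaflops, i.e. roughly 0.6 petaflops
```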
This huge machine will also include a cooling tank measuring "8 ft high and 15 ft long, affectionately known as Bertha." It will contain "one-foot diameter pipes, pumping 1000 gallons of water per minute via a gigantic turbine water cooler, Blue Gene must demand no more than 2 million watts, the maximum available at IBM Watson" (IBM Blue Gene team, 2001). IBM claims that the "Blue Gene system is very similar to that offered by large-scale clusters today. Each chip is a shared memory multiprocessor that runs simultaneous multiple threads. Threads running on the same chip communicate via shared memory. Communication between chips uses message passing." (IBM Blue Gene team, 2001) The question to think about, then, is whether, upon completion of the final Blue Gene, it will be in the same league as the current Earth Simulator, or whether IBM will have wasted five years of development and construction.
IBM is hoping that by making these massively parallel computers, which use less power and require less space, it can make them more affordable by using parts which cost less, and overall make such machines more widely used. "The architecture is also proving to be readily adaptable to a range of applications, and will be more affordable than current supercomputing resources due to its smaller physical size and power efficiency." (IBM Research News, 2003)
The cost of building a supercomputer ultimately lies in the cost of the parts needed. The dominant cost is in the storage and in the logic required to move data from storage to the processors; in other words, the cache and the buses. If IBM is able to meet its goals in the main areas of interest it has identified, these supercomputers could see a significant decrease in price, making them more appealing to businesses.


On the left is what the Blue Gene Chip is made up of, and to the right is the chip configuration. (Unknown, 2004)


Blue Gene/L

On November 9, 2001, a partnership between IBM and the Department of Energy's National Nuclear Security Administration's Lawrence Livermore National Laboratory was announced to expand Blue Gene (IBM Research, n.d.). The expansion includes a new machine, named Blue Gene/L, which will precede the release of Blue Gene. According to IBM, Blue Gene/L is expected to be completed in 2005 and will help to predict "physical phenomena of natural interest" (IBM Research, n.d.).
Blue Gene/L will consist of more than 65,000 nodes, and each chip will have two microprocessors: one used for communication between nodes and the other used for computing, i.e. performing calculations. Each node will work on a small part of the problem to help solve the large problem that the machine has been given. "One of processors will be dedicated to communication between the nodes, freeing the other for performing calculations." (Moran Alterio, J., 2001). The advantage of having one processor handle all the communication is that it is much faster than having each processor interleave communication with its calculations, and it frees the other processor to concentrate on computing.
Blue Gene/L's network system consists of three different architectures to connect the 65,000-plus nodes. The first, a three-dimensional torus, is the main communication backbone; the second is a global tree, and the third is an Ethernet.
3D torus, global tree and Ethernet: these are the three types of network configurations that the Blue Gene/L machine uses. (Bhanot, G., et al., n.d.)
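The torus addressing scheme can be sketched generically as follows; the dimensions used are made up for illustration and are not the actual Blue Gene/L values.

```python
# A sketch of 3D-torus addressing of the kind described above: each node at
# (x, y, z) has six neighbours, with wrap-around in every dimension. The torus
# dimensions below are hypothetical, not the BlueGene/L values.
DIMS = (8, 8, 8)

def torus_neighbours(x, y, z):
    nbrs = []
    for axis, delta in ((0, 1), (0, -1), (1, 1), (1, -1), (2, 1), (2, -1)):
        coord = [x, y, z]
        coord[axis] = (coord[axis] + delta) % DIMS[axis]   # wrap around the torus
        nbrs.append(tuple(coord))
    return nbrs

print(torus_neighbours(0, 0, 0))
# [(1, 0, 0), (7, 0, 0), (0, 1, 0), (0, 7, 0), (0, 0, 1), (0, 0, 7)]
```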

IBM has developed a new type of architecture for Blue Gene/L, which they have called SMASH, standing for simple, many and self-healing (IBM Research, n.d.). This architecture is supposed to be more adaptable than the architecture originally proposed for the Blue Gene project, and it is hoped that it will also be more affordable.
This is how the Blue Gene/L machine is configured: the system is built up from chips, 8 chips are put on a board, 128 boards are put on a rack, and the racks are put into cabinets, for a total of 64 cabinets. (Bhanot, G., et al., n.d.)
This is a diagram to show the composition of the BlueGene/L node configuration. (Bhanot, G.et al., n.d.)

SMASH

The SMASH architecture stands for Simple, Many and Self-Healing. This architecture differs from many other architectures in three ways, stated in the IBM Research News article as:
• "It dramatically simplifies the number of instructions carried out by each processor, allowing them to work faster and with significantly lower power and chip surface requirements (the traditional approach is to add complex features to gain performance)"
• "It will facilitate a massively parallel system capable of more than 8 million simultaneous threads of computation (compared to the maximum of 5000 threads today)"
• "It will make the computer self-stabilizing and self-healing -- automatically able to overcome failures of individual processors and computing threads".

The SMASH architecture is claimed to have been thought of by Monty Denneau, who was one of the major designers of Blue Gene. One of the biggest problems with Blue Gene was latency, and Denneau states that these problems can be solved by using the SMASH architecture: "Denneau plans to solve these problems by using a new technology that combines processor and memory on one chip so that latency is reduced to just 10 nanoseconds. There will be only minimal cache - primarily DRAM will be used. To compensate for the resulting slower memory speeds, the machine will use multi-threading. Denneau is particularly excited about using embedded DRAM for the first time on a large scale, thereby avoiding the DRAM bottlenecks that cripple current machines." (Alumni News, 2001)
Since SMASH is self-healing, the architecture has been developed so that it will not crash and will not have to start the computation process all over again. The system checks for faulty processors and communication threads. If an error is detected within one of the threads or processors, the system will start that part of the calculation again; if this is not feasible, it will pass the computation on to a good processor or thread and ignore the faulty one. "Each of Blue Gene's clusters will perform 4 tasks, or threads, simultaneously. If one is busy, the next one will pick up the thread" (Alumni News, 2001).
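The retry-then-reassign behaviour described above can be sketched in a few lines; the workers, tasks and fault simulation below are invented for illustration and are not IBM's implementation.

```python
# A very rough sketch of the self-healing behaviour described above: retry a failed
# piece of work, and if that is not feasible, hand it to a healthy worker and drop
# the faulty one. Workers and failure detection are simulated.
def run_on(worker, task):
    if worker in faulty:                      # simulated hardware fault
        raise RuntimeError(f"worker {worker} failed on task {task}")
    return task * task                        # stand-in for the real computation

workers = [0, 1, 2, 3]
faulty = {2}                                  # pretend worker 2 has a broken thread
results = {}

for task in range(8):
    worker = workers[task % len(workers)]
    try:
        results[task] = run_on(worker, task)
    except RuntimeError:
        try:                                  # first, restart that part of the calculation
            results[task] = run_on(worker, task)
        except RuntimeError:                  # if that fails, reassign and ignore the worker
            workers.remove(worker)
            results[task] = run_on(workers[task % len(workers)], task)

print(results)                                # all eight tasks completed despite the fault
```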
This is a way of cutting costs, because it is too expensive to keep every processor working at optimum power and speed all the time. There are so many processors within such a machine that if some of them are broken, there will not be a significant decrease in performance. The cost is reduced by ignoring the broken processors instead of shutting the whole machine down to make such a minor fix.

TRIPS

TRIPS stands for Tera-op Reliable Intelligently-adaptive Processing System, and was designed by the University of Texas at Austin in collaboration with IBM Research in Austin. The project was also supported by the Defense Advanced Research Projects Agency (DARPA) and is proposed to be completed in 2010, with fully running prototypes by December 2005.
The idea behind this design is putting the performance of a supercomputer on a single chip (IBM Research, 2003). TRIPS will consist of multiple chips, each one having the TRIPS architecture, so it will be a recursive, hierarchical system. Each TRIPS chip will contain eight processors. The processors will make use of a "Grid Processor Architecture (GPA) consisting of an 8x8 array of ALUs, a local register file, local instruction and data caches, and control circuits" (Unknown, n.d.). Each chip will include a "collection of on-chip secondary memory arrays, and off-chip channels to external DRAM and other TRIPS chips" (Unknown, n.d.).
The TRIPS chips and the processors are configurable to allow for efficiency changes. "The fundamental innovation in the TRIPS design is its block-oriented execution. Instead of operating on only a few computations at a time, the TRIPS processor operates on large blocks of computations mapped to an array of execution units on the chip. This approach allows many more instructions to execute in parallel, thus offering higher performance." (IBM Research, 2003)
This type of chip has not been targeted at a particular market segment, in the hope that it will be affordable for a larger market, not just the current supercomputer community. "The chip also includes a sensor network and a small embedded monitor processor to dynamically detect application behavior and changes in system behavior. This information is fed back to the runtime system, the application, and the compiler for on-line optimization." (Unknown, n.d.)
Some of the main goals identified by the University of Texas at Austin and IBM Austin for the TRIPS system are:
Technology-scalable - The team developing TRIPS created a new architecture called the Grid Processor Architecture (GPA), which is composed of an array of Arithmetic Logic Units (ALUs) connected by a network so that "critical dataflow paths are placed along nearby ALU's" (Unknown, n.d.). The Grid Processor Architecture was designed as a replacement for the conventional processors currently used in supercomputer architectures.
Malleable architectures - The team developing TRIPS designed the new architecture so that the GPAs and the on-chip memory systems are configurable, allowing them to be tailored to the needs of the workload or to external conditions.
Dynamic adaptivity - This is used to "respond to changing workloads and conditions by on-chip sensors and morphware" (Unknown, n.d.), which help control the runtime operations.
Application diversity - This is planned to support various workloads and external conditions and to be portable.
One of the main components developed to reach the TRIPS team's goals is the Grid Processor Architecture (GPA). The University of Texas team working on TRIPS proposes it as a substitute for the processors currently used in many supercomputers. The GPA is made up of multiple arithmetic logic units (ALUs) connected via a thin network, and instructions are executed using dataflow. Chaining these ALUs together efficiently and effectively will minimize both logic and wire overheads on the critical paths. (Unknown, n.d.)
Another main component of the TRIPS research is scalable on-chip memory. The key to a good memory system is the cache, which holds frequently used pieces of data so that the time taken to retrieve them is minimized.
The TRIPS team is attempting to build a combined on-chip memory system, which is currently presenting a few challenges in keeping access times short when distances to memory are longer. "We are exploring a non-uniform cache architecture (NUCA) to assemble secondary TRIPS memory banks into a large level-2 cache." (Unknown, n.d.)
Lastly, the TRIPS architecture is designed to monitor its applications. This is done by introducing a monitor processor into the TRIPS architecture to "collect information about the application and its environment, synthesize that information, implement low-level policies to manage the resources, and deliver the necessary information to the application and the runtime system" (Unknown, n.d.).

This is the structure of the TRIPS chip architecture. (Unknown, n.d)

Earth Simulator

The machine named the Earth Simulator (ES), developed by the Japanese, has made a significant impact on today's supercomputer technology. It was developed as a national project by three government agencies working together. The idea behind the ES's development was to help predict earthquakes, hurricanes, typhoons and many other phenomena, and the machine is predicted to save Japanese communities millions of yen in damages.
The ES is a high-speed vector parallel computer, originally designed in 1997, which took five years to complete.
"Vector technology is based on using specially built parts and systems to solve complex calculations. In contrast, the method generally used in the United States, called "massively parallel," is based on off-the-shelf components. One advantage of the vector technology process is that it requires fewer processors" (McDonald, T., 2002).
It currently claims the number one position on the top500.org list. The ES was turned on for the first time on March 1st, 2002, at Japan's Marine Science and Technology Center.
This supercomputer was built for two main purposes: firstly, to study the atmospheric and oceanographic sciences, and secondly, to study the solid earth. This will help predict changes in and on the Earth using satellite technology. "The Earth Simulator is capable of creating a virtual Earth, using information sent from satellites and other observation points, so that scientists will be able to predict changes" (Green, S. 2002). Since the ES can be described as a "virtual Earth", there is a possibility that scientists will be able to predict the future and simulate years down the road to see what humans are doing to the Earth. This way, we may be able to save the Earth from destruction by humans.
"The system will contribute to analyze and predict environmental changes on the earth through the simulation of various global scale environmental phenomena such as global warming, El Nino effect, atmospheric and marine pollution, torrential rainfall and other complicated environmental effects. It will also provide an outstanding research tool in explaining terrestrial phenomena such as tectonics and earthquakes." (NEC Press, n.d.)
US companies have tended to focus on building supercomputers out of more, and cheaper, parts put together, so that they can save money and make more profit from the computer. "Many computing experts touted the ability to pile more and more cheap commodity systems together to scale or achieve ever higher levels of performance, as the solution to the future of supercomputing" (Nance, S., 2002). They had abandoned the idea of using vector processing because they felt that it was too costly and the technology was too outdated.
The architecture of the ES is MIMD with distributed memory: the vector parallel processors within a node are connected and share memory. The ES is constructed of 640 processor nodes connected by an interconnection network consisting of a single-stage crossbar, and each processor node has eight vector processors. The performance is estimated to reach 40 Tflops.
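As a rough cross-check of the quoted 40 Tflops figure, the arithmetic below assumes a peak of 8 gigaflops per vector processor; that per-processor figure is my assumption and is not stated in the sources cited above.

```python
# Rough cross-check of the Earth Simulator's quoted ~40 Tflops peak. The 8 Gflops
# per vector processor figure is an assumption, not taken from the cited sources.
nodes = 640
vector_processors_per_node = 8
gflops_per_vector_processor = 8      # assumed peak per arithmetic processor

peak_gflops = nodes * vector_processors_per_node * gflops_per_vector_processor
print(peak_gflops / 1000)            # 40.96 Tflops, consistent with the ~40 Tflops quoted
```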


This is the Earth Simulator's system configuration. (JAMSTEC / Earth Simulator Center, n.d.)

The ES is capable of operating concurrent pipelines and single vector instructions. The ES has a large number of interconnected nodes and several data-transfer modes. The maximum aggregate bandwidth is about 8 TB/s, and the ES consists mainly of two kinds of cabinets: the processor node cabinets and the interconnection network cabinets. (Márton, S., n.d.)
The Processor Node (PN) cabinet takes care of all of the processors' components and computations, while the Interconnection Network (IN) cabinet is used for communications. The two cabinets are capable of converting signals from parallel to serial and from serial to parallel via the switching circuit; the PN cabinet accepts serial signals, while the IN cabinet accepts parallel signals.
"The signal from PN changes from Serial to Parallel and is inputted into a switching circuit. The Signal which came out from the switching circuit is performed Parallel/Serial conversion. Then, It is sent to the PN cabinet. The number of Cables which connects PN cabinet and IN cabinet is 640x130=83200, and the total extension is 2,400Km." (JAMSTEC / Earth Simulator Center, n.d.)



On the left is the PN Cabinet configuration and on the right is the IN Cabinet configuration. (Dongarra, J, n.d.)

The ES is so big that it needs a cooling system; the Japanese use air cooling, with "heat pipe technology" (Dongarra, J., n.d.).
In conclusion, the power and the advancements that the ES brings to the forefront of supercomputer technology are overwhelming. The Earth Simulator has made specific advances, "particularly through being able to combine individual simulations at different scales into one holistic picture" (Nance, S., 2002).


This is the processor node configuration of the ES and the MMU Package. (Dongarra, J, n.d.)

Cray X1

The Cray Research Company was founded by Seymour Cray in 1972. In 1996 the company was bought by SGI, and it was "considered a separate business unit within SGI from 1999. Sold to Tera Computer Corporation in 2000" (Gannett, F., 2002).
Seymour Cray's idea of making vector supercomputers made him a legend in the supercomputer market between 1978 and 1995. The Cray X1 is the newest release in the Cray family and has achieved a speed of 11.5 billion calculations per second. According to (Unknown, 2003), as of November 2003 the Cray X1 was ranked by top500.org as having the most powerful processors. The listed Cray X1 consists of only 252 application processors; however, according to (Unknown, 2003), it was built to contain up to 4,096 processors.
"Cray Inc.'s customers include many of the world's leading automotive, aerospace, petroleum and chemical/pharmaceutical companies; weather forecasting and global climate research centers; major academic research centers; and government agencies in the U.S. and abroad" (Cray Research, n.d.). This shows that there is a wide market for supercomputers, but the fields that require them are wealthy ones; it is hard for most companies to invest that much money in a supercomputer until the price decreases.
The Cray machines are certainly not the fastest of the supercomputers today, but they have been around for a long time and have made some major advances, the most important being the vector line of supercomputers. For example, one of the first Cray supercomputers, the X-MP, the successor to the original Cray-1, introduced dual processing to the vector line of computers. As each successor was developed, it doubled the speed and processing power of its predecessor. Vector computers do not contain data caches in their CPUs, because buffers act as the cache; caching is effectively done only in the registers. The power of such a computer comes from the many small processors that are available, working together to provide the necessary power for the supercomputer.
For Cray's massively parallel machines, a whole new "message passing" programming method was developed to overcome the communication and co-ordination problems inherent in such a loosely bound architecture. The T3E, for example, had no I/O capability of its own; instead it was attached to, and hosted by, a YMP or C90 front-end.
Without fast inter-CPU communication, MPP problems can easily become stalled at the inevitable serial regions and co-ordination points of any program. Many topologies for CPU interconnection had been tried, trees, hypercubes and meshes being the most popular; the idea of using a 3D torus, with each CPU connected to its nearest six neighbours, north and south (y direction), east and west (x), and front and back (z), surprised the industry with its simplicity and elegance.

Conclusion

In conclusion, there are a number of supercomputer architectures out in the world today, all with their own bells and whistles. Each company has its own theory on how to increase the performance and speed of its supercomputers, with the goal of making its system the top machine according to top500.org.
At present the ES has the most advanced technology and is the fastest supercomputer. Blue Gene is a major competitor in the making and is proposed to outclass the Earth Simulator once it is completed in 2006. The biggest question is whether IBM's efforts and the creation of the Blue Gene supercomputer family are going to be a success, or whether the company will have wasted five years of development.
Some future topics of interest are the Cray X1 and Cray T3E supercomputers, OctigaBay, and the Virginia Tech Apple supercomputer. I will leave these for you to look up and ponder.
I believe that in the future there is not going to be any one specific architecture used by everyone. Supercomputers will be built either for an intended purpose or with a general configuration that allows certain parts of the hardware to be tweaked for improved performance. I believe Vlad said it correctly: "Parallel computing will take off if EVERYBODY starts using it." That is how the desktop computer got started.

Appendix

Aad van der Steen, (2000). Shared-Memory MIMD machines. Retrieved February 26th 2004 from http://www.euroben.nl/reports/web00/sm-mimd.html

Alaghband, Dr. G. (2003), Parallel Computing and Architectures, Retrieved January 9th, 2004 from http://carbon.cudenver.edu/~galaghba/mimd.html

Alaghband, Dr. G. (2003), Parallel Computing and Architectures, Retrieved January 9th, 2004 from http://carbon.cudenver.edu/~galaghba/simd.html

Alumni News, (2001), Monty Denneau Designing Blue Gene, Retrieved January 28th, 2004 from http://www.cs.uiuc.edu/news/alumni/su01/denu.html

BBC News, (2002), Japanese supercomputer simulates Earth, Retrieved March 5th, 2004 from http://news.bbc.co.uk/1/hi/sci/tech/1951265.stm

Bhanot, G., Chen, D., Gara, A., Vranas, P., (n.d.), The BlueGene/L Supercomputer. Retrieved January 17th, 2004 from http://lnssrv.mit.edu/lattice02/2002_06_27_plenary/Chen.pdf

Blumrich, M., Chen, D., Coteus, P., Gara, A., Giampapa, M., Heidelberger, P., Singh, S., Steinmacher-Burow, B., Takken, T., Vranas, P., (2003), IBM Research Report: Design and Analysis of the BlueGene/L Torus Interconnection Network, Retrieved January 15th, 2004 from http://domino.watson.ibm.com/library/cyberdig.nsf/papers/654C703D8F16D55285256DFE005329A4/$File/rc23025.pdf

Cary, D. (updated 2004), Molecular Nanotechnology, Retrieved February 2004 from http://rdrop.com/~cary/html/nanotech.html

Chau, J., Fakhrai, S., Burney, M., (n.d.), SIMD Chip Architecture Graphics Architecture. Retrieved January 17th, 2004 from http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/xbox-too/detail.html

Cray Research, (n.d.), About Cray Inc. Retrieved April 9th, 2004 from http://www.cray.com/company/index.html

Dongarra, J., (n.d.), Notes on the Earth Simulator, Retrieved March 8th, 2004 from http://www.cs.utk.edu/~dongarra/esc.pdf

Gannett, F., (2002), Cray supercomputer FAQ and other documents. Retrieved April 2nd, 2004 from http://www.spikynorman.net

Geralds, J., (1999), SMASH is IBM's Blue Gene differentiator, Retrieved February 10th, 2004 from http://parallel.rz.uni-mannheim.de/sc/seminar98/europe/abstracts.html and p://www.vnunet.com/News/104189

Green, S., (2002), Quick Thinking, Retrieved January 22nd, 2004 from http://www.smh.com.au/articles/2002/05/03/1019441438865.html

Gribbin, M. (1989), Development of 360/370 Architecture - A Plain Man's View. Retrieved January 23rd, 2004 from http://pucc.princeton.edu/~melinda/gribbin.doc

Godo, L., (1996), SMASH Project Summary, Retrieved March 1st, 2004 from http://www.iiia.csic.es/Projects/smash/smash-summary.htm

Hugo, I., (2001), Supercomputer Futures, Retrieved January 2003 from http://www.it-analysis.com/article.php?articleid=1913

Hugo, I., (2001), Supercomputer Problem Space, Retrieved January 26th, 2004 from http://www.it-analysis.com/article.php?articleid=1914

IBM Blue Gene team, (2001), Blue Gene: A vision for protein science using a petaflop supercomputer, Retrieved January 22nd, 2004 from http://www.research.ibm.com/journal/sj/402/allen.html

IBM Research, (1999), IBM Announces $100 Million Research Initiative to build World's Fastest Supercomputer, Retrieved from http://www.research.ibm.com/resources/news/19991206_bluegene.shtml

IBM Research, (n.d.), IBM and Department of Energy's NNSA Partner to Expand IBM's Blue Gene Research Project, Retrieved March 15th, 2004 from http://www.research.ibm.com/resources/news/20011109_bluegene.shtml

IBM Research, (2003), IBM Research to Collaborate with The University of Texas at Austin on Single-Chip Embedded Supercomputer, Retrieved March 25th, 2004 from http://www.research.ibm.com/resources/news/20030827_trips.shtml

IBM Research News, (2003), IBM's Blue Gene Bursts onto Top 500 Supercomputer List, Retrieved January 28th, 2004 from http://www.research.ibm.com/resources/news/20031114_bluegene.shtml

JAMSTEC / Earth Simulator Center, (n.d.), Earth Simulator, Retrieved March 17th, 2004 from http://www.es.jamstec.go.jp/esc/eng/ES/hardware.html

Lazou, C., (1999-2003), Japanese Earth Simulator: A Challenge and an Opportunity, Retrieved March 5th, 2004 from http://www.top500.org/news/es.html

LinuxDevices.com, (2003), Inside IBM's BlueGene/L supercomputer, Retrieved March 15th, 2004 from http://www.linuxdevices.com/articles/AT7249538153.html

LinuxGuruz, (n.d.), LinuxGuruz, Retrieved February 6th, 2004 from http://www.linuxguruz.com/foldoc/foldoc.php?Multiplexor+Channel

Lomonosov, I., (2002), Parallel Programming, Retrieved January 29th, 2004 from http://cs.hiram.edu/~irina/parallel/Ch1.ppt

Márton, S., (n.d.), The Earth Simulator, Retrieved March 30th, 2004 from http://www.cab.u-szeged.hu/~bohus/02nap/earthsim.pdf

McDonald, T., (2002), Japanese supercomputer dethrones IBM's ASCI White, Retrieved March 17th, 2004 from http://www.newsfactor.com/perl/story/17379.html

Moran Alterio, J., (2001), IBM expanding Blue Gene computing initiative, Retrieved January 26th, 2004 from http://www.thejournalnews.com/newsroom/110901/09ibm.html

Nair, R., (2002), Effect of Increasing chip density on the evolution of computer architecture, IBM Journal of Research and Development: Vol 46, Retrieved March 3rd, 2004 from http://researchweb.watson.ibm.com/journal/rd/462/nair.html

Nance, S., (2002), Success of Japan's Earth Simulator to Spur Big Changes, Retrieved March 5th, 2004 from http://www.csm.ornl.gov/PR/GT090902.html

NEC Press, (n.d.), Great Moments in Computer History, Retrieved February 2004 from http://www.marznetproductions.com/computing/00s/ess.htm

Pool, R., (n.d.), Assembling life's building blocks, Retrieved January 17th, 2004 from http://domino.research.ibm.com/comm/wwwr_thinkresearch.nsf/pages/bluegene499.html

Roosta, S.H. (2000). Parallel Processing and Parallel Algorithms: Theory and Computation. Springer-Verlag New York, Inc.

Salamone, S., (2003), Blue Gene is cool for 2006, Retrieved January 17th, 2004 from http://bio-itworld.com/news/071503_report2898.html

top500.org, (n.d.), Top 500 Supercomputer Site, Retrieved March 26th, 2004 from http://www.top500.org/ORSC/2002/contents.html

Tanenbaum, A. S., with contributions from Goodman, J.R. (1999). Structured Computer Organization: Fourth Edition. Prentice Hall, New Jersey.

Unknown, (2000), Cray Research and Cray computers FAQ Part 1, Retrieved March 27th, 2004 from http://www.spikynorman.dsl.pipex.com/CrayWWWStuff/Cfaqp1.html#TOC1

Unknown, (2000), 370 Systems, Retrieved March 24th, 2004 from http://www.punch-card.co.uk/37001.htm

Unknown, (2002), Blue Gene Project Update, Retrieved January 28th, 2004 from http://researchweb.watson.ibm.com/bluegene/BG_External_Presentation_January_2002.pdf

Unknown, (2003), Cray Inc. Reports Cray X1 Supercomputer Processors Are Most Powerful on 'Top500' List, Retrieved March 28th, 2004 from http://www.cray.com/media/2003/november/top500_nov03.html

Unknown, (n.d.), The Tera-op Reliable Intelligently adaptive Processing System, Retrieved March 26th, 2004 from http://www.cs.utexas.edu/users/cart/trips

Unknown, (2004), Meaning of Life, Retrieved January 15th, 2004 from http://www.electronicsweekly.co.uk/default.asp?=/issue/articleview.asp?vpath=/articles/2000/07/12/tech05.htm

Winslow, R.L., Scollan, D.F., Greenstein, J.L., Yung, C.K., Baumgartner, Jr. W., Bhanot, G., Gresh, D.L., Rogowitz, B.E., (2001), Mapping, modeling and visual exploration of structure-function relationships in the heart, IBM Systems Journal, Vol 40, No 2, 2001, Retrieved January 17th, 2004 from http://www.research.ibm.com/journal/sj/402/winslow.pdf

Wojcik, V. & Martin, J., An Algorithm for Adaptive Maximization of Speedup V.7, September 25th, 2003

Wojcik, V. & Martin, J., An Algorithm for Adaptive Maximization of Speedup V.12, December 29th, 2004

Wojcik, V. (2003), Taxonomy of supercomputers, Retrieved April 1st, 2004 from http://www.cosc.brocku.ca/Offerings/3P93

Wojcik, V. & Martin, J., (2004), On Adaptive Maximization of Speedup, PowerPoint presentation, January 2004.
