THE UNIVERSITY OF CHICAGO

A PARTICLE IN CELL PERFORMANCE MODEL ON THE CS2

A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN CANDIDACY FOR THE DEGREE OF BACHELOR OF ARTS IN COMPUTER SCIENCE WITH HONORS

DEPARTMENT OF COMPUTER SCIENCE

BY MANDY LA

CHICAGO, ILLINOIS
JUNE 2021
Copyright 2021 by Mandy La. All Rights Reserved.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
1 INTRODUCTION
2 THE CS2 WAFER SCALE ENGINE
3 PARTICLE IN CELL
4 DESIGN
   4.1 Static Particle Method
   4.2 Dynamic Particle Method
5 PARTICLE IN CELL ON CS2
   5.1 Parameters
   5.2 Preprocessing
   5.3 Charge Density
   5.4 Finite Difference
   5.5 Electric Field
   5.6 Interpolate
   5.7 Particle Update
6 PERFORMANCE MODEL AND RESULTS
   6.1 Problem Size
   6.2 Performance Model
7 DISCUSSION
   Additional Complications: Load Imbalance, Particle Data Transfers
   Mapping Modifications
   Comparing Against Other Machines
8 CONCLUSION
REFERENCES
LIST OF FIGURES

2.1 The CS2 (left) houses the WSE2 (right) along with cooling mechanisms and power supply. All compute hardware is located on the WSE2. The WSE2 is shown next to a hockey puck for size comparison. The WSE2 is 8.5 inches on both sides. The CS2 is 15 Rack Units. Photos courtesy of Cerebras Systems.
2.2 The WSE consists of a 2D grid of tiles. Each tile holds a processor, its memory, and a router. The router connects the tile to its North, South, East, and West neighbors.
4.1 A direct mapping of a 3-by-3 grid of mesh cells onto an array of PEs.
5.1 Particle represented as a struct in C. Particle.position[0] represents the x-coordinate of the particle, and Particle.position[1] represents the y-coordinate. Particle.velocity can be interpreted analogously.
5.2 The blue square represents the PE's region. The white circle represents a particle with a global location of (px, py). The green circle represents the bottom left node of the PE, located globally at (nx, ny). In preprocessing, we will calculate the offset position as (px − nx, py − ny) so that the coordinates are relative to the bottom left node of the PE.
5.3 The blue square represents the PE's region. The white circle represents a particle with a global location of (px, py). The green circle represents the bottom left node of the PE, located globally at (nx, ny). In preprocessing, we will calculate the offset position as (px − nx, py − ny) so that the coordinates are relative to the bottom left node of the PE.
6.1 Plot of theoretical and observed performance based on problem size. Np = 408 is the maximum problem size.
Load imbalance can have a dramatic impact on performance. This graph compares the theoretical performance given different ratios of load imbalance. Mp = maximum number of particles on any given PE. Np = average number of particles per PE.
PIC performance after accounting for particle movement.
A 9-by-9 grid of mesh cells tessellated onto a 3-by-3 grid of PEs. The mesh grid is folded onto the array of PEs. This mapping preserves adjacencies of mesh cells, keeping communication patterns consistent with non-tessellated mappings.
Performance worsens with larger B. Blocking produces better performance than tessellation. In all cases, blocking and tessellation perform worse than the original one-mesh-cell-per-PE mapping. Thus, we recommend only using blocking or tessellation when the mesh size is larger than 850,000.
A log scale graph comparing the performance of the Tesla K40, GTX 90, and Quadro K620 to the CS2.
LIST OF TABLES

2.1 The newly announced CS2 more than doubles the potential of the first-generation CS1. Although the physical size of the system stays the same, the massive increase in core count is thanks to the decrease in IC process, allowing Cerebras Systems to fit 2.1 times more PEs on the WSE.
2.2 The CS2 compared to the IBM Blue Gene/Q petascale supercomputer located at the Argonne National Laboratory. Although now retired, the IBM Blue Gene/Q ranked number 22 on the Top500 list of supercomputers in November.
5.1 A list of relevant input parameters and the notation used to denote them.
Number of operations per PE by step.
Number of cycles per PE needed to complete each step.
Performance summary of the PIC program on the CS2. Assumes all PEs are utilized.
Updated PIC performance model taking into account load imbalance and particle data transfer.
Compares the total work and total latency between the original mapping, block mapping, and tessellation mapping.
ACKNOWLEDGMENTS

I could not have successfully completed this thesis without the help and support of the following people. I would like to thank my advisor, Professor Andrew A. Chien, for his guidance through each stage of my research and writing process. Your enthusiasm for innovation in computer science inspires me to continue learning and working hard. I would like to acknowledge my colleagues at Cerebras Systems for generously sharing information about the CS1 and CS2 with our research team. I would particularly like to thank Ilya Sharapov for the many productive discussions we have had regarding the ideas expressed in this thesis, Robert Schreiber for being our key advocate, and Michael James for lending us his technical expertise. In addition, I would like to thank my friends and family for their constant support. You might not always understand what I am working on, but your presence is always comforting. Thank you, all.
ABSTRACT

As the once rapid improvement of general purpose computers slows due to the end of Moore's Law and Dennard Scaling, computer scientists must find novel methods to meet the world's hunger for computing power. The CS2 by Cerebras Systems approaches this problem from a fresh perspective. The CS2 packs 850,000 processing elements (PEs) onto one 462 sq cm Wafer Scale Engine. The combination of highly parallel computation, fast distributed memory, and efficient communication makes the CS2 a fitting machine to run one of the most widely used algorithms in computational plasma physics, Particle In Cell simulations. We discuss two viable mappings of Particle In Cell onto the CS2, the Static Particle Method and the Dynamic Particle Method. Using the Dynamic Particle Method, we present a theoretical peak performance model and an empirical performance model based on a partial implementation. The CS2 achieves 3.67 PFLOPS and 1.72 PFLOPS in the theoretical and empirical performance models, respectively. We also consider additional complications such as load imbalance and particle movement between PEs. Accounting for these complications, the CS2 achieves 1.66 PFLOPS in the theoretical performance model. This is three orders of magnitude greater performance than GPUs and CPUs on the market today. Our findings show the potential of the CS2 for computing Particle In Cell codes, and we hope to inspire future implementations of Particle In Cell on the CS2.
CHAPTER 1
INTRODUCTION

As general purpose computers have reached the end of the monumental age of rapid and consistent improvement driven by Moore's Law and Dennard Scaling, computer scientists must look to more specialized computing devices. These devices are called accelerators and achieve higher performance by exploiting the structure and operations of specific computations. Graphics processing units (GPUs) are one example of an accelerator. GPUs are built with more arithmetic-logic units (ALUs) than central processing units (CPUs) in order to exploit parallelism and achieve higher operation rates. This is a desirable trait for computations like machine learning, dense and sparse linear algebra, spectral methods, and more.[1, 3] These are highly relevant and impactful classes of computation in the modern age of technology. For example, machine learning expands the use of computers from just computing to inferring. Given large amounts of data, which we have ample access to in the modern day, the computer is able to not only perform operations but use the results of those operations to inform future computations. The uses of this technology are unbounded and have been steadily creeping into our lives for over a decade. Some examples include self-driving cars, personalized ads online, virtual assistants, and facial recognition software. Such a variety of potential uses drives companies to build bigger, better, and faster accelerators. Currently, the primary tactic for solving large parallel computations consists of clusters of GPUs wired together and attached to an external memory system. Performance on GPU clusters does not scale proportionally, and much is lost in these cluster systems.[4] Thus computer engineers are driven to explore novel ideas that will allow their accelerators to tackle even the largest of problems. One such approach, and the focus of this thesis, is wafer-scale processors.
The word wafer here refers not to a sweet, thin cookie, but to a wafer of thin silicon used for the fabrication
of integrated circuits. A wafer is around 300 mm in diameter and typically yields many microcircuits that are later separated by wafer dicing and packaged as integrated circuits. A typical GPU is at most around 900 mm². A wafer-scale processor is one that utilizes the entire 70,650 mm² wafer to make a single large chip. The motivation for this thesis is to demonstrate the potential of a wafer-scale processor to achieve impressive performance on impactful, real-world applications.
CHAPTER 2
THE CS2 WAFER SCALE ENGINE

The CS2, a system created by Cerebras Systems that packs powerful computing onto a single 462 sq cm Wafer Scale Engine (WSE), is a promising new approach to parallel accelerators. Instead of cutting up a silicon wafer into tiny microprocessors, Cerebras Systems packs an impressive 850,000 PEs onto one large chip. Figure 2.1 shows the CS2 and the WSE it contains.

Figure 2.1: The CS2 (left) houses the WSE2 (right) along with cooling mechanisms and power supply. All compute hardware is located on the WSE2. The WSE2 is shown next to a hockey puck for size comparison. The WSE2 is 8.5 inches on both sides. The CS2 is 15 Rack Units. Photos courtesy of Cerebras Systems.

The strength of the WSE lies in its impressively low memory latency and network communication cost. The network fabric can communicate with all 850,000 PEs without ever leaving the chip. This mechanism gives the CS2 a memory bandwidth of 20 Petabytes/second and a fabric bandwidth of 27.5 Petabytes/second. These are the numbers marketed by Cerebras Systems and confirmed by independent benchmarking done by our research group. Table 2.1 gives an overview of the CS2 compared to its first-generation version, the CS1.

Table 2.1: The newly announced CS2 more than doubles the potential of the first-generation CS1. Although the physical size of the system stays the same, the massive increase in core count is thanks to the decrease in IC process, allowing Cerebras Systems to fit 2.1 times more PEs on the WSE2.

On this oversized chip is a highly parallel, distributed memory architecture. All memory within the CS2 is on-chip, producing more memory bandwidth, single cycle memory latency, and lower energy cost for memory access. Each PE sits on what Cerebras calls a tile. Every tile contains the processor, memory, and a router. Figure 2.2 diagrams the structure of a tile. There is no shared memory that all PEs can access. There is no central PE that directs other PEs. Each of the 850,000 PEs works in parallel, managing its own memory and communicating with the others through the 2D mesh interconnection fabric. In addition, each PE can be individually programmed. The router is bidirectionally linked to the processor and to the routers of the four neighboring tiles. Each of the five links can be used in parallel on every cycle. The router has hardware queues for its connection to the core and for each of a set of virtual channels, avoiding deadlock. There are 24 virtual channels on each PE, and all or a subset of them can be configured at compile time. No runtime software is involved with communication. All arriving data is deposited by the hardware directly into memory or another desired location. Routing is configured during compilation and sets up any routes needed for communication
between potentially distant PEs.

Figure 2.2: The WSE consists of a 2D grid of tiles. Each tile holds a processor, its memory, and a router. The router connects the tile to its North, South, East, and West neighbors.

Each tile holds a modest 48 KB of static random-access memory (SRAM). This novel PE architecture enables fast, local coupling of communication with computation, allowing the CS2 to move three bytes to and from memory for every floating point operation (FLOP). Compute rate, memory bandwidth, and communication bandwidth are closely comparable on the CS2, freeing it from the barrier of slow memory access in von Neumann architectures. Clearly, the CS2 introduces a new and capable chip layout. The instruction set of the CS2 is also quite unique. It can operate on 16-bit integer, 16-bit floating point, and 32-bit floating point types. For 16-bit operands, floating point adds, multiplies, and fused multiply-accumulates (FMAC) can occur in a 4-way SIMD manner. The instruction set supports SIMD operations across subtensors of four-dimensional tensors. The CS2 includes hardware for tensor addressing, allowing the instruction set to efficiently access tensor data
in memory. This replaces the need for nested loops, eliminating any loop overhead. Tensor operands can have more than four elements, in which case the instruction executes for multiple cycles. Instructions with tensor operands can run synchronously or as an asynchronous background thread. There is no context switch overhead. The background thread uses registers and memory as assigned by the programmer or compiler in the instruction. These may not be overwritten until the thread terminates. Any subsequent computations can be delayed until the thread terminates. The core supports nine concurrent threads of execution. Scheduling is performed directly by the hardware, allowing efficient and simultaneous data movement.

Table 2.2: The CS2 compared to the IBM Blue Gene/Q petascale supercomputer located at the Argonne National Laboratory. Although now retired, the IBM Blue Gene/Q ranked number 22 on the Top500 list of supercomputers in November.

The WSE has already shown much promise in tackling large, complex parallel problems. In late 2020, Rocki et al. demonstrated the breakthrough performance on regular finite difference (stencil) problems achieved by the CS1. They implemented a BiCGStab solver for a linear system arising from the 7-point discretization of a partial differential equation on a 3D mesh. For a 600-by-595-by-1536 mesh, they measured a run time of 28.1 microseconds per iteration. Every iteration required 44 operations per mesh point, resulting in an achieved
performance of 0.86 PFLOPS.[7] From these numbers, we estimate a clock rate of roughly 1 GHz for the CS1 and assume the same for the CS2. In this thesis, we present another application that benefits from the highly parallel, distributed memory, wafer scale architecture. We present a performance model of a Particle In Cell program and compare it to implementations of Particle In Cell on GPU clusters.
CHAPTER 3
PARTICLE IN CELL

Particle In Cell, commonly referred to as PIC, is an algorithm used in plasma physics to simulate particle movement and interaction. PIC simulation is one of the most widely used methods in computational plasma physics. It has been used to successfully study laser-plasma interactions, electron acceleration, and ion heating in the auroral ionosphere, magnetohydrodynamics, magnetic reconnection, and more.[5] As one may recall from an introductory physics class, charged particles repel or attract each other due to an electric field that they create. Calculating the electric field and forces for a system of two particles is a simple matter of solving a few equations. However, because the cost grows quadratically with the number of particles, any system beyond a dozen particles quickly becomes cumbersome. PIC introduces a shortcut that reduces the computation cost by modeling interaction through an electric field, and by localizing the electric field computation onto an Eulerian (stationary) grid of mesh cells. Particles sit within a space that is divided into a grid of cells. The following is a brief overview of the PIC algorithm. A more detailed description of each step is given in Chapter 5.

1. Find the charge density within each cell by counting how many particles are in the cell and weighting them by their distance from the corner nodes.
2. From the charge density, calculate the electric potential at each node; this involves solving a finite difference equation.
3. Based on the electric potential, compute the electric field.
4. Interpolate the electric field of the cell onto each particle, then compute acceleration and update position.
5. Repeat.

Due to the locality of computations created by separating the space into cells, PIC can benefit greatly from parallelization. PIC has been previously implemented on GPUs and GPU clusters, offering a significant performance gain compared to CPUs.[8] However, the forced locality introduces an increased need for communication, a stated strength of the CS2. Calculation of electric potentials relies on values from adjacent nodes, and particle updates rely on the electric fields held at particular nodes. The combination of highly parallel computation, efficient communication, and the demonstrated ability to perform fast stencil computations makes the CS2 a fitting machine to run PIC codes. Even so, a key feature of PIC is the ability to achieve parallelism across the mesh as well as the particles. These two separate bases of parallelism present challenges to achieving good scaling when they fail to align spatially. In the next section, we discuss implementation design and mappings that address these challenges.
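The per-timestep structure of the five steps above can be sketched in C. The routine names here are placeholders of our own, not a CS2 API; each stands in for the real computation described later in this thesis:

```c
#include <assert.h>

/* Placeholder routines for steps 1-4 of the PIC loop. */
static void charge_density(void)    { /* step 1: deposit charge onto nodes */ }
static void finite_difference(void) { /* step 2: solve for electric potential */ }
static void electric_field(void)    { /* step 3: differentiate the potential */ }
static void particle_update(void)   { /* step 4: interpolate field, push particles */ }

/* Step 5: repeat. Returns the number of timesteps completed. */
int pic_run(int timesteps) {
    int done = 0;
    for (int t = 0; t < timesteps; t++) {
        charge_density();
        finite_difference();
        electric_field();
        particle_update();
        done++;
    }
    return done;
}
```
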
CHAPTER 4
DESIGN

Mapping this problem onto the CS2 is not trivial and has a large impact on performance. A key feature of PIC is the ability to achieve parallelism across the mesh as well as the particles. In the ideal case when particle and mesh parallelism align, meaning particle workload and mesh workload are both balanced across the PEs of the machine, the CS2 will be able to achieve its best performance. We consider potential mappings to achieve this. We proceed with a straightforward mesh mapping by directly mapping the 2D grid of mesh cells onto the 2D grid of PEs on the CS2. For simplicity, each PE will be responsible for one mesh cell. An example mapping of a 3-by-3 mesh onto a 3-by-3 grid of PEs is shown in Figure 4.1. This mapping is intuitive for local calculations which, luckily, comprise most of our PIC steps. As we will see later, the Charge Density and Interpolate steps are quite convenient in this case.

Figure 4.1: A direct mapping of a 3-by-3 grid of mesh cells onto an array of PEs.

Now that we have assigned regions to PEs, we turn to the problem of assigning particles to PEs. Recall that the CS2 is a distributed memory system, and therefore cannot rely
on some shared memory to keep track of all the particles. Particle data will be distributed amongst the PEs. We will consider two potential methods.

1. Static Particle Method: Similar to how we evenly divided our space to assign to PEs, we can evenly divide our list of particles and assign groups to PEs.
2. Dynamic Particle Method: Somewhat more intuitively, we can assign particles to the PE in whose region they are currently located.

We now consider the benefits and drawbacks of these two methods.

4.1 Static Particle Method

The mapping of the Static Particle Method is consistent with our region mapping. It divides the particles up evenly and assigns them to the PEs. Parallelism across the mesh and parallelism across the particles are both achieved. This would help ensure an even distribution of workload. However, unlike regions, particles are dynamic. This means that in order to update particle positions, the PE that owns the particle would need to locate and communicate with the PE that is responsible for the particle's region. This would strain the communication fabric since at every timestep and for every particle on every PE, two PEs that are potentially very far away from each other would need to communicate. The added communication work for each particle would cause the performance of our program to scale poorly with more particles and more mesh cells. In addition, charge density would also be difficult to calculate, since each PE would need to perform an all-to-all communication in order to account for all particles in its region.

4.2 Dynamic Particle Method

The Dynamic Particle Method is more intuitive in the sense that as the particles move around in the 2D mesh, the particle data follows by moving around in our 2D grid of PEs.
This method avoids the expensive communication present in the Static Particle Method. Particle acceleration now becomes a local computation. For each particle in its region, the PE computes acceleration based on the electric field of its region and the position of the particle within its region. Both of these are conveniently stored locally. However, a particle will occasionally move out of its PE's region. In this case, the PE will need to transfer the particle's data to the PE responsible for the new region. Evidently, some amount of communication is still necessary, although certainly not as much as the Static Particle Method requires. It is not unreasonable to assume that the timestep is small enough that no particle will move farther than one region away from its last location. In other words, a particle will only ever move to an adjacent region, never further. A PE will only ever have to transfer data to a direct neighbor, decreasing communication latency. The movement of particles also creates the additional complexity of finding a data structure that allows PEs to efficiently add and remove particles from their local lists. Data structure type and efficiency will impact the performance of the program. A linked list would be a good candidate for this. In our performance model, we assume the use of an efficient data structure. A larger problem this method faces is imbalance of workload across PEs. PEs with more particles in their regions will have to do more work, leaving other PEs to sit idle for some time. This is an example of particle and mesh parallelism failing to align. There are load balancing tactics that can be used to alleviate this problem. Overloaded PEs can pass some of their workload off to a nearby idle PE. This would improve our rate of performance, but add overhead in workload management.
Another possible tactic is to pause the program when a certain threshold of imbalance is reached, and repartition the grid into a nonuniform Cartesian grid with more density in high-concentration particle regions. In other words, regions with higher particle density will be further divided and spread out onto more PEs, balancing the workload. The rate of performance would benefit, but a large overhead would
be added every time the program halts to repartition the grid. Workload sharing and repartitioning will not be considered in our performance model, but may be an interesting direction for future work. There is also the concern of overloading the memory on a PE. PEs only have 48 KB of SRAM to work with. In order to saturate the CS2, we will certainly be working with more than 48 KB worth of particle data. It is possible that particles will congregate on a single PE and overload its memory. To address this, we will need to be conservative with the number of particles we put on the CS2. An estimate of the maximum problem size is discussed in Chapter 6. Lastly, the Dynamic Particle Method restricts the types of boundary conditions we can have. It would be very inefficient to implement a periodic boundary (one where particles that move past the edge of the space wrap around and appear on the opposite side). This is because when a particle wraps around, the PE would have to transfer the particle data across the PE mesh. For this reason, the Dynamic Particle Method is not a good fit for problems that utilize a periodic boundary condition. Luckily, periodic boundaries are not commonly used in particle physics simulations, because they simulate a sort of infinite, repeating space, not often present in the real world. Ultimately, we believe that the Dynamic Particle Method will produce better performance. The communication overhead required by the Static Particle Method is not worth its simplicity. In the next section, we move on to an analysis of PIC using the Dynamic Particle Method.
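The linked-list candidate mentioned above can be sketched as a minimal singly linked list with O(1) insertion and O(n) removal by ID. The type and function names are ours, and a production implementation would likely pool-allocate nodes within the PE's 48 KB of SRAM rather than call malloc:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal per-PE particle list; the field layout mirrors the particle
 * struct introduced in Chapter 5. Illustrative only, not a CS2 API. */
typedef struct PNode {
    int id;
    float position[2], velocity[2];
    struct PNode *next;
} PNode;

/* O(1) insert at the head, e.g. when a particle arrives from a neighbor. */
static PNode *list_push(PNode *head, PNode *p) {
    p->next = head;
    return p;
}

/* Unlink and return the particle with the given id (NULL if absent),
 * e.g. when a particle leaves for a neighboring PE's region. */
static PNode *list_remove(PNode **head, int id) {
    for (PNode **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            PNode *found = *pp;
            *pp = found->next;
            found->next = NULL;
            return found;
        }
    }
    return NULL;
}
```
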
CHAPTER 5
PARTICLE IN CELL ON CS2

In this section, we analyze a theoretical performance model accounting for potential latencies and using peak computation rates achievable on the CS2. In addition, we present a partial implementation of PIC on the CS2 to provide an empirical, but currently imperfect, performance model.

5.1 Parameters

The performance of any PIC program depends mainly on the number of particles and the mesh size. In our performance model we will be considering an input size of N particles on an Mx-by-My mesh. Mapping PIC onto the CS2, we will only need to consider Np, the number of particles on a specified PE, and the Px-by-Py array of PEs we will be working with. In the ideal case where PEs have perfectly balanced workloads, we will have Np = N / (Px * Py). Unless otherwise specified we will use a mesh size of Mx*My = Px*Py = 850,000, which is the number of PEs on the CS2. Other relevant input information includes the timestep (time that passes between each iteration), the number of iterations, and the space dimensions (area of space particles are able to occupy). We will denote these parameters as dt, I, and X-by-Y respectively. Although these inputs are important in a full run of PIC, they will rarely come up in our performance model since we will only be measuring a single iteration and scaling up from there. Let us also establish the representation of a particle in memory. We will represent a particle as a struct detailing the particle ID, the particle's current position, and the particle's current velocity.
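As a quick check of the balanced-load relation Np = N / (Px * Py) above (the function name is ours; the 408-million-particle and 850,000-PE figures come from Chapter 6):

```c
#include <assert.h>

/* Average particles per PE under perfect load balance: Np = N / (Px * Py). */
static long particles_per_pe(long n_particles, long px, long py) {
    return n_particles / (px * py);
}
```
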
Table 5.1: A list of relevant input parameters and the notation used to denote them.

Figure 5.1: Particle represented as a struct in C. Particle.position[0] represents the x-coordinate of the particle, and Particle.position[1] represents the y-coordinate. Particle.velocity can be interpreted analogously.

5.2 Preprocessing

As decided in Section 4, we will adopt the Dynamic Particle Method. Given a list of particles as our input parameter, we must create Px*Py lists to load onto the CS2. Each PE will receive a reduced list consisting of the particles residing in its region. Positions of particles on this list must also be offset by the PE's lower left boundary. This makes the position of the particle relative to the PE, and is necessary in order to avoid the PE needing to reference its own location on the WSE. This saves computation in the Charge Density and Interpolation steps. Now we are ready to load the preprocessed input data onto the CS2.
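The particle record of Figure 5.1, together with the preprocessing offset described above, can be written as follows. The CS2 would use 16-bit words as noted in Chapter 6; portable 32-bit C types are shown here purely for illustration:

```c
#include <assert.h>

/* Particle as described in Figure 5.1. */
typedef struct Particle {
    int   id;           /* particle ID */
    float position[2];  /* [0] = x-coordinate, [1] = y-coordinate */
    float velocity[2];  /* [0] = x-velocity,   [1] = y-velocity   */
} Particle;

/* Preprocessing: make the position relative to the PE's bottom left
 * node, located globally at (nx, ny). */
static void offset_position(Particle *p, float nx, float ny) {
    p->position[0] -= nx;
    p->position[1] -= ny;
}
```
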
Figure 5.2: The blue square represents the PE's region. The white circle represents a particle with a global location of (px, py). The green circle represents the bottom left node of the PE, located globally at (nx, ny). In preprocessing, we will calculate the offset position as (px − nx, py − ny) so that the coordinates are relative to the bottom left node of the PE.

5.3 Charge Density

The charge density at each node is calculated as the sum of all charges in adjacent cells weighted by their distance from the node. Each PE is responsible for a rectangular region with four corner nodes. For example, the green node in the bottom left corner of the PE region in Figure 5.3 would receive a contribution from the particle (colored white) equal to its charge multiplied by A_G divided by the area of the entire region, where A_G = (Sx − x)(Sy − y). The purple, red, and yellow nodes would receive a contribution from the white particle calculated analogously with A_P, A_R, and A_O respectively. Notice that nodes closer to the particle receive a more heavily weighted contribution, and nodes farther away receive a lighter contribution. In addition, the sum of all four contributions produced by a particle is equal to its charge. The PE calculates and sums the contribution for every particle in its region for each of the four corner nodes.

Figure 5.3: The blue square represents the PE's region. The white circle represents a particle with a global location of (px, py). The green circle represents the bottom left node of the PE, located globally at (nx, ny). In preprocessing, we will calculate the offset position as (px − nx, py − ny) so that the coordinates are relative to the bottom left node of the PE.

Because these corners overlap with adjacent PEs, we then have to add up the contributions of the four adjacent PEs to arrive at the final charge density of the node. This can be done in post-processing, since the data will have to be reloaded between this step and the next. In total, this requires 10 operations per particle. Theoretically, with proper data access patterns and thoughtful use of FMAC operations, we should be able to achieve 4 operations per cycle. We would then be able to complete this step in 2.5*Np cycles. Our implementation of this step does not quite reach this performance. Running the program on a CS2 simulator showed that we are able to complete this step in 10*Np cycles, achieving only 1 operation per cycle per PE. The culprit of the inefficiency is the accumulation step. Since we are summing into a register, we do not receive the speedup that tensor operations allow. This slows our program down due to the need to read and write
into the same register Np times. However, there are likely more efficient ways to implement this step using FMAC operations that we have yet to explore. Therefore, we present this implementation only as an empirical example and will continue considering the theoretical performance of 2.5*Np cycles.

5.4 Finite Difference

Using the charge densities, we can compute the electric potential at each node. This is done by solving the following system of equations for φ_{i,j}, the electric potential at node (i, j). Note that Δx represents the width of a mesh cell, Δy represents the height of a mesh cell, n_0 represents the average charge density, and n_{i,j} represents the charge density at node (i, j).

(φ_{i−1,j} − 2φ_{i,j} + φ_{i+1,j}) / Δx² + (φ_{i,j−1} − 2φ_{i,j} + φ_{i,j+1}) / Δy² = n_{i,j} − n_0

To solve this differential equation, we use a finite-difference method. We approximate the performance of a 2D iterative solver on the CS2 by implementing a stencil program that performs a single iteration. This program performs 9 operations. Unfortunately, the iterative nature of this program means it will at most be able to perform one operation per cycle, since it must wait for input from other PEs. Thus it will require 9 cycles plus 11 cycles of communication latency. With theoretical peak performance, this step should take 20 cycles. However, our implementation of this step runs in 30 cycles. We estimate that iterative solvers take less than 50 iterations to converge. This is supported by estimates made by Marjanovic et al. at the High Performance Computing Center Stuttgart.[6] Although a rough approximation, this gives us a generous estimate of 800 operations and 1,500 cycles to complete this step.
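One relaxation sweep of the five-point stencil above can be sketched serially as follows. On the CS2 each PE holds one node and exchanges boundary values with its four neighbors instead of indexing a shared array; the array layout and names here are our own:

```c
#include <assert.h>
#include <math.h>

/* One Jacobi sweep of the 5-point discretization above on an nx-by-ny
 * node grid with spacing dx, dy. rhs[k] holds n_{i,j} - n_0 for node k.
 * Interior nodes are updated; boundary values stay fixed. */
static void jacobi_sweep(int nx, int ny, float dx, float dy,
                         const float *rhs, const float *phi,
                         float *phi_new) {
    float cx = 1.0f / (dx * dx);
    float cy = 1.0f / (dy * dy);
    float diag = 2.0f * (cx + cy);
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            int k = j * nx + i;
            /* Solve the stencil equation at node k for phi_{i,j},
             * holding the four neighbor values from the last sweep. */
            phi_new[k] = (cx * (phi[k - 1] + phi[k + 1])
                        + cy * (phi[k - nx] + phi[k + nx])
                        - rhs[k]) / diag;
        }
    }
}
```
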
5.5 Electric Field

Now that we have the electric potential at each node, calculating the electric field is relatively simple. At every node, we solve the following two equations:

E_{x,i} = −(φ_{i+1,j} − φ_{i−1,j}) / (2Δx)
E_{y,j} = −(φ_{i,j+1} − φ_{i,j−1}) / (2Δy)

We implemented this by loading the electric potential at each node onto its corresponding PE. Each PE then communicates with its direct North and South neighbors to receive φ_{i,j+1} and φ_{i,j−1} to calculate E_y, and with its East and West neighbors to receive φ_{i+1,j} and φ_{i−1,j} to calculate E_x. This gives us the electric field at each node. The theoretical performance and the measured performance are the same for this step. We implemented this step with 5 operations. It takes 14 cycles to run. The bulk of the runtime of this program is consumed by communication latency.

5.6 Interpolate

To find the electric field at a particle's location, we interpolate the electric fields at the corner nodes of its region. This is often referred to as a bilinear interpolation. This step is essentially the inverse of the Charge Density step. Looking again at Figure 5.3, the green node will have a contribution to the electric field at the particle equal to its electric field multiplied by A_G divided by the area of the entire region, where A_G = (Sx − x)(Sy − y). The white particle would receive a contribution from the purple, red, and yellow nodes calculated analogously with A_P, A_R, and A_O respectively. This step requires 23 operations per particle. With peak performance and using FMAC operations, we could theoretically complete this step in 3.75*Np cycles. Our implementation of this program completes in approximately 6*Np cycles, achieving between 2 and 4 operations per cycle. Despite requiring more operations, this program is more efficient
than the Charge Density program, since we are no longer accumulating Np times into a single register. We now only read from that register Np times and write into Np particle structs. However, our implementation suffers from inefficient memory bank access patterns.

5.7 Particle Update

The particle update step is quite straightforward. Given the acceleration of each particle, we update position and velocity according to these elementary physics equations:

v_f = v_0 + a·t

x_f = x_0 + v·t

This step takes 8 operations per particle: 2 operations each to update x-velocity, y-velocity, x-position, and y-position. On this program, the CS2 is able to achieve 8 operations per cycle, the theoretical peak performance, using FMAC. We complete this step in Np cycles.
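A serial sketch of this step, using the Particle struct representation shown earlier in the document. The acceleration array and the dt parameter are illustrative assumptions, and the update order (velocity first, then position from the updated velocity) is one common convention, not necessarily the one used on the CS2.

```c
#include <stddef.h>

/* Serial sketch of the particle update: v_f = v_0 + a*t, x_f = x_0 + v*t,
 * i.e. 2 operations per component and 8 operations per particle.
 * The struct mirrors the document's Particle representation. */
typedef struct {
    float position[2]; /* x, y */
    float velocity[2]; /* vx, vy */
} Particle;

void update_particles(Particle *p, const float (*accel)[2], size_t np, float dt)
{
    for (size_t k = 0; k < np; k++)
        for (int d = 0; d < 2; d++) {
            p[k].velocity[d] += accel[k][d] * dt;      /* v_f = v_0 + a*t */
            p[k].position[d] += p[k].velocity[d] * dt; /* x_f = x_0 + v*t */
        }
}
```

The two fused multiply-add forms per component are what let the CS2 reach 8 operations per cycle with FMAC on this step.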
CHAPTER 6
PERFORMANCE MODEL AND RESULTS

6.1 Problem Size

First, it is productive to discuss the problem size that will saturate the CS2 without overloading its memory. The CS2 holds 850,000 PEs in a rectangular grid. Our analysis assumes that each PE corresponds to one mesh cell. This means that the finest mesh we can support is a 900-by-900 grid on the CS2. It is possible to restructure the mapping so that PEs can handle multiple mesh cells, or even a column of mesh cells in a 3D PIC implementation. This is further discussed in Section 7.

Each of the 850,000 PEs on the CS2 has 48KB of on-chip memory. While this makes memory latency very low, it severely limits the size of problems we can put on the CS2. Just how limited is our problem size? The piece of computation that is most memory intensive is the particle update step, where we update particle position and velocity based on acceleration. In this step, we need to keep a particle ID, x-coordinate, y-coordinate, x-velocity, and y-velocity for each particle. Assuming each piece of data is one word (16 bits), we need to store 10 bytes per particle. 10 bytes per particle and a maximum of 48KB on the PE means that we can have a maximum of 4800 particles on any single PE. The CS2 has approximately 850,000 PEs, bringing us to a total of 4.08 billion particles over the entire chip.

In order to avoid overloading any single PE due to particle congregations, we need to restrict these numbers even further. We will reduce the number of particles loaded on a single PE at the start of a simulation to 10% of the calculated maximum. This brings us down to 480 particles on any single PE and 408 million particles across the entire chip.
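The sizing argument above can be double-checked with a few lines of arithmetic. Note this sketch takes 48 KB as 48,000 bytes, which is the convention that reproduces the 4800-particle figure in the text; the function names are our own.

```c
/* Per-particle storage: 5 one-word (16-bit) fields
 * (id, x, y, vx, vy) -> 10 bytes per particle. */
long particles_per_pe_max(void)   { return 48000L / 10L; }
long particles_per_pe_start(void) { return particles_per_pe_max() / 10L; }
long particles_chip_total(void)   { return particles_per_pe_start() * 850000L; }
```

These reproduce the 4800 particles per PE, the 480-particle starting load, and the 408 million particle chip-wide total quoted above.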
Table 6.1: Number of Operations per PE by step.

6.2 Performance Model

Putting all the steps together, we approximate that a PIC implementation on the CS2 will require 41·Np operations per PE, or (41·Np + 405) × 850,000 across the entire system. Theoretically, this program could run in 7.25·Np cycles on the CS2. However, our implementation takes 17·Np cycles, not including preprocessing or intermediate setup between steps. Using our estimate of a 1 GHz clock rate, the total runtime of this PIC program will be (7.25·Np) × 10⁻⁹ seconds and (17·Np) × 10⁻⁹ seconds in the theoretical and observed performance models, respectively. This gives us a performance rate of ((41·Np + 405) × 850,000)/(7.25·Np) GFLOPS and ((41·Np + 405) × 850,000)/(17·Np) GFLOPS, respectively. These numbers are summarized in Table 6.3.

Figure 6.1 shows the theoretical and observed performance given a range of problem sizes. The CS2 performs better with larger problem sizes because it is able to utilize more of its compute power. On our maximum problem size of 408 million particles, the CS2 can achieve 3.67 PFLOPS in our theoretical performance model and 1.72 PFLOPS in our observed performance model.
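For concreteness, the per-PE cycle counts above convert to wall-clock time as follows. This is a sketch using the reconstructed formulas and the 1 GHz clock assumption; Np here is the per-PE particle count (480 at the maximum problem size), and the function names are our own.

```c
/* Per-PE operation and cycle counts from the performance model,
 * plus the 1 GHz clock-rate assumption used in the text. */
double ops_per_pe(double np)          { return 41.0 * np + 405.0; }
double cycles_theoretical(double np)  { return 7.25 * np; }
double cycles_observed(double np)     { return 17.0 * np; }
double runtime_seconds(double cycles) { return cycles * 1e-9; } /* 1 GHz */
```

At Np = 480, the theoretical model predicts 3,480 cycles per timestep, i.e. about 3.5 microseconds, and the observed model 8,160 cycles.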
Table 6.2: Number of cycles per PE needed to complete each step.

Table 6.3: Performance Summary of PIC program on the CS2. Assumes all PEs are utilized.
Figure 6.1: Plot of theoretical and observed performance based on problem size. Np = 408 is the maximum problem size.
CHAPTER 7
DISCUSSION

7.1 Additional Complications

Load Imbalance

There are some additional complications that are worth addressing, the first of which is load imbalance. It is usually unnatural for particles to congregate. In an open space, particles will repel and attract each other in such a way that causes them to spread out, not congregate. Regardless, it is unlikely that we will have a perfect balance of particles between PEs at every iteration. Load imbalance is an important factor both in performance and in determining valid input sizes that reduce the probability of overloading PE memory. The performance of the CS2 is only as fast as the slowest PE. In the case of PIC, the slowest PE is the PE with the most particles in its region. Figure 7.1 shows how load imbalance can have a negative impact on performance.

Figure 7.1: Load imbalance can have a dramatic impact on performance. This graph compares the theoretical performance given different ratios of load imbalance. Mp = maximum number of particles on any given PE. Np = average number of particles per PE.

Particle Data Transfers

The second complication we consider is the possibility of particles moving across PE boundaries. When this happens, the particle data must be transferred to the PE whose region it is now in. Typically, with a reasonably small timestep, particles will not move farther than one cell size in one iteration. In fact, with a small enough timestep, particle movement across boundaries is not likely at all. However, if a particle does move out of its current region, it will end up in one of the adjacent regions, and the particle's data will only need to be transferred to one of the neighboring PEs.

A program that performs this particle data transfer would first have to check whether the particle's position is out of the PE's boundaries in the north, east, west, or south directions, and if it is, send the particle's data to the appropriate PE. In our implementation of this step, it takes 14 operations to check a particle's position, and 5 operations to send a particle's data out if necessary. The extra 5 operations it takes to send particle data out are negated by the fact that particles that are out of bounds will often not need to complete the 14 operations of conditional checks. Once one of the conditionals is satisfied, the PE will send the data out without finishing the rest of the conditional checks. Therefore we estimate that this step will take 14*Np operations overall. Our run of this program takes 20 cycles per particle.

In addition, once a particle's data is transferred out of the PE's memory, it must be removed to avoid performing future updates on it. The naive method would be to simply mark the particle by zeroing out the particle ID. Any new particles transferred into the region would simply be appended to the end of the PE's particle list. This is an inefficient use of
space, and many clever data structures such as linked lists would be better candidates. However, due to the low-level programming required to work with the CS2, we prefer a method that is simple to model. To avoid filling up our particle list with null particles and running out of space, we introduce a condense step. In the condense step, the PE shifts the next active particle's data into the last zeroed-out particle space. Using a two-pointer method, the PE checks whether the particle ID is zeroed out, and if so, places a pointer there. With the second pointer, it continues along the particle list until it reaches an active particle, then moves the active particle's data into the first pointer's location. In the worst case, the PE moves every piece of particle data once, and eight operations will be needed per particle (one conditional operation and one move operation per piece of particle data). Taking advantage of the CS2's SIMD architecture, which allows it to move four words per cycle, we estimate that this step will take three cycles per particle. How often we need to perform a condense step depends on how quickly PEs fill up their particle lists. We settle for an overestimate and perform a condense step at every iteration.

Factoring these complications in, we update our performance model. Instead of taking the average number of particles (Np), we instead take the maximum number of particles on any given PE (Mp).

Table 7.1: Updated PIC performance model taking into account load imbalance and particle data transfer.

Figure 7.2: PIC Performance after accounting for particle movement.

7.2 Mapping Modifications

We began the formulation of our performance model with the assumption that each PE will be assigned one mesh cell. This restricts the size and shape of the mesh to the number and arrangement of PEs on the CS2. There are modifications that can be made to accommodate larger mesh sizes.

The first modification is blocking. We can place a rectangular array (or block) of mesh cells onto a single PE. The overall layout of the mapping would be preserved. However, it would impact the PE's workload. Suppose we use blocks of size B-by-B. The maximum number of particles would remain the same. The Charge Density step would gain added complexity with more nodes to keep track of. However, the overall number of operations and latency would remain the same, since it relies solely on the number of particles on the PE. The Finite Difference step would acquire additional communication latency. Each PE would have to communicate 4·B values and receive 4·B values instead of just 4 each. The number of operations performed on each PE would also increase by a factor of B² (the number of mesh cells on the PE). Assuming we can achieve 4 operations per cycle, this would imply an increase in latency by a factor of 2B + B² for this step. The Electric Field step would receive the same increase in work and latency. Similar to the Charge Density step, the Interpolation and Particle Update steps' performance would remain unchanged. Table 7.2 shows the performance result of blocking.

The second modification we discuss is tessellation. Similar to blocking, tessellation places multiple mesh cells on a single PE. However, instead of a continuous block of mesh cells, we fold the mesh onto the PEs, achieving continuity across PEs, but not within PEs. This folding is better described in a diagram. Figure 7.3 shows a tessellation of a 9-by-9 mesh grid onto a 3-by-3 array of PEs. Tessellation, like blocking, would have no effect on the work or latency of the Charge Density, Interpolation, or Particle Update steps. However, tessellation would require B² times more values communicated in both the Finite Difference and Electric Field steps, as well as B² more operations to perform. Assuming a performance of 4 operations per cycle, this would imply an increase in latency by a factor of 0.5·B². Figure 7.4 shows how performance is affected by the block size, B.
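Returning briefly to the condense step described in Section 7.1: it amounts to a standard two-pointer compaction, which can be sketched serially as below. The struct layout (one 16-bit word per field, with a zeroed ID marking a removed particle) follows the text, while the function name is our own; the real implementation would also exploit the CS2's four-word-per-cycle SIMD moves.

```c
#include <stdint.h>

/* One particle: five one-word (16-bit) fields, 10 bytes total.
 * An id of 0 marks a particle whose data has been transferred away. */
typedef struct {
    uint16_t id;
    int16_t position[2];
    int16_t velocity[2];
} Particle;

/* Two-pointer condense: shift every active particle down over the
 * zeroed-out slots, preserving order. Returns the new active count. */
int condense(Particle *list, int n)
{
    int dst = 0; /* next slot to fill (first pointer) */
    for (int src = 0; src < n; src++) {
        if (list[src].id != 0) {  /* active particle found by second pointer */
            if (dst != src)
                list[dst] = list[src];
            dst++;
        }
    }
    return dst;
}
```

In the worst case every particle is moved once, matching the eight-operations-per-particle bound given in the text.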
Figure 7.3: A 9-by-9 grid of mesh cells tessellated onto a 3-by-3 grid of PEs. The mesh grid is folded onto the array of PEs. This mapping preserves adjacencies of mesh cells, keeping communication patterns consistent with non-tessellated mappings.

Table 7.2: Compares the Total Work and Total Latency between the original mapping, block mapping, and tessellation mapping.

Figure 7.4: Performance worsens with larger B. Blocking produces better performance than tessellation. In all cases, blocking and tessellation perform worse than the original one-mesh-per-PE mapping. Thus, we recommend only using blocking or tessellation when the mesh size is larger than 850,000.

7.3 Comparing Against Other Machines

First, we isolate the Finite Difference step to compare its performance to more traditional HPCG code running on a CPU. We would like to see a performance improvement proportional to the number of PEs, but anticipate something closer to 1000 times. Using the HPCG benchmark program provided by the Innovative Computing Laboratory at the University of Tennessee [2], we benchmarked the performance of an Intel Core i7-8700K CPU with 6 cores running at 3.7 GHz. We chose a problem size consisting of a 208-by-312-by-208 matrix, ensuring that it would fit nicely into 16 GB of memory. This run achieved a performance rate of 3.53 GFLOPs per second. The CS2 performs the Finite Difference step with this problem size in 1,500 cycles, performing 680 million floating point operations, resulting in a performance rate of roughly 453 TFLOPs per second. This is a somewhat unsatisfying comparison, seeing as the HPCG benchmark implements a 3D iterative solver running on a 6-core CPU while our model is of a 2D iterative solver running on 850,000 PEs. However, it is difficult to find or access
machines that can fairly compare to the CS2. A common, and widely studied, parallel device is the GPU. Like the CS2, GPUs are highly parallel accelerators. Unlike the CS2, however, GPUs utilize off-chip, shared memory. GPUs also lack the wafer-scale aspect that differentiates the CS2 from almost all other machines and delivers its unmatched communication speeds. With this said, GPUs are much more commonplace than CS2s currently are and have inspired many PIC implementations. We compare our performance model to a PIC implementation on the Kepler GPU architecture by Shah et al. From a run of their PIC implementation, they measured the Nvidia Tesla K40, the GTX 690, and the Quadro K620 to perform TFLOPs, 117 GFLOPs, and 25 GFLOPs per second, respectively, on an input size of 5.24 million particles on a 512-by-512 mesh [8]. The CS2 outperforms the Tesla K40 with a 1000x speedup. Again, this is not an entirely fair comparison. The CS2 costs much more than a Tesla K40, and is much larger with more PEs. However, it is important to note that the performance of GPU clusters does not scale proportionally to the number of GPUs in the system. Doubling the number of GPUs used will not double the performance. For this reason, the CS2 as a whole system will perform better than a GPU cluster with a comparable processor count.
Figure 7.5: A log-scale graph comparing the performance of the Tesla K40, GTX 690, and Quadro K620 to the CS2.
CHAPTER 8
CONCLUSION

In this thesis we explored the viability of PIC codes on the CS2. We considered possible ways to map the problem onto the machine, and presented theoretical performance models. We also partially implemented PIC on the CS2 to get empirical performance results. At the maximum problem size of 408 million particles and 850,000 mesh cells, the CS2 has the potential to achieve 3.67 PFLOPS. In our implementation, we estimate a performance of 1.72 PFLOPS on the same problem size. Because this is only a partial implementation, it is worthwhile to consider both the empirical and theoretical numbers due to potential performance improvements in future implementations. We also account for potential complications such as load imbalance and particle movement across PE regions. Factoring this in and estimating the additional latency of a particle transfer step, we estimate a performance of 1.66 PFLOPS. This is a decrease from our initial theoretical estimates, but it still dominates in comparison to other machines such as CPUs and GPUs.

The partial implementation and empirical results we gathered showcase the potential of the CS2 beyond theoretical projections. They set the baseline for future implementations. If the numbers presented in this paper aim to accomplish anything, it is to inspire future implementations and investigations into PIC on the CS2. The design and mapping considerations are essential elements to any PIC implementation on the CS2 and will serve future research on this topic well. Clearly, a complete implementation of PIC with attention to efficiency will allow for more concrete performance data that is more readily comparable to other PIC performance models, and will solidify the strength of the CS2 on PIC computations. A feature of a more efficient implementation may be workload sharing between PEs to mitigate latency introduced by workload imbalance. We strongly encourage future research to pick up where this thesis ends.
More informationHow A V3 Appliance Employs Superior VDI Architecture to Reduce Latency and Increase Performance
How A V3 Appliance Employs Superior VDI Architecture to Reduce Latency and Increase Performance www. iprocom.com/i t Contents Overview...3 Introduction...3 Understanding Latency...3 Network Latency...3
More informationSystem Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology  2. Terminology  1
System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect
More informationOpenCL Programming for the CUDA Architecture. Version 2.3
OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling JingChiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationFPGAbased Multithreading for InMemory Hash Joins
FPGAbased Multithreading for InMemory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
More informationAssociation Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
More informationPerformance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
More informationAnalysis of the UNH Data Center Using CFD Modeling
Applied Math Modeling White Paper Analysis of the UNH Data Center Using CFD Modeling By Jamie Bemis, Dana Etherington, and Mike Osienski, Department of Mechanical Engineering, University of New Hampshire,
More informationFour Keys to Successful Multicore Optimization for Machine Vision. White Paper
Four Keys to Successful Multicore Optimization for Machine Vision White Paper Optimizing a machine vision application for multicore PCs can be a complex process with unpredictable results. Developers need
More informationDistributed Dynamic Load Balancing for IterativeStencil Applications
Distributed Dynamic Load Balancing for IterativeStencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationA SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS
Mihai Horia Zaharia, Florin Leon, Dan Galea (3) A Simulator for Load Balancing Analysis in Distributed Systems in A. Valachi, D. Galea, A. M. Florea, M. Craus (eds.)  Tehnologii informationale, Editura
More informationHigh Performance Computing. Course Notes 20072008. HPC Fundamentals
High Performance Computing Course Notes 20072008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define  it s a moving target. Later 1980s, a supercomputer performs
More informationJan F. Prins. Workefficient Techniques for the Parallel Execution of Sparse Gridbased Computations TR91042
Workefficient Techniques for the Parallel Execution of Sparse Gridbased Computations TR91042 Jan F. Prins The University of North Carolina at Chapel Hill Department of Computer Science CB#3175, Sitterson
More informationEvaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
More informationMultiBlock Gridding Technique for FLOW3D Flow Science, Inc. July 2004
FSI02TN59R2 MultiBlock Gridding Technique for FLOW3D Flow Science, Inc. July 2004 1. Introduction A major new extension of the capabilities of FLOW3D  the multiblock grid model  has been incorporated
More informationA Strategy for Teaching Finite Element Analysis to Undergraduate Students
A Strategy for Teaching Finite Element Analysis to Undergraduate Students Gordon Smyrell, School of Computing and Mathematics, University of Teesside The analytical power and design flexibility offered
More informationTurbomachinery CFD on manycore platforms experiences and strategies
Turbomachinery CFD on manycore platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 2729
More informationThe Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCIGA.3033012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationYALES2 porting on the Xeon Phi Early results
YALES2 porting on the Xeon Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN  Demijournée calcul intensif, 16 juin
More informationIntroduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it
t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
More informationWindows Server Performance Monitoring
Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly
More informationMachine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationThe Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.
White Paper 0213133 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,
More informationAccelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
More informationGPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
More informationResource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
More informationAdaptive Tolerance Algorithm for Distributed TopK Monitoring with Bandwidth Constraints
Adaptive Tolerance Algorithm for Distributed TopK Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of WisconsinMadison Department of Computer Sciences {bauer, srini}@cs.wisc.edu
More informationReady Time Observations
VMWARE PERFORMANCE STUDY VMware ESX Server 3 Ready Time Observations VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture SharedMemory
More informationInfiniBand Clustering
White Paper InfiniBand Clustering Delivering Better Price/Performance than Ethernet 1.0 Introduction High performance computing clusters typically utilize Clos networks, more commonly known as Fat Tree
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk OxfordMan Institute of Quantitative Finance Oxford University Mathematical Institute Oxford eresearch
More informationThe Bus (PCI and PCIExpress)
4 Jan, 2008 The Bus (PCI and PCIExpress) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the
More informationModeling an AgentBased Decentralized File Sharing Network
Modeling an AgentBased Decentralized File Sharing Network Alex Gonopolskiy Benjamin Nash December 18, 2007 Abstract In this paper we propose a distributed file sharing network model. We take inspiration
More informationCompact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More information