
GPU-Accelerated Supercomputer Targets Tumors

A team of researchers from the Helmholtz-Zentrum Dresden-Rossendorf (HZDR) research lab in Germany is using the Titan supercomputer at Oak Ridge National Laboratory to advance laser-driven radiation treatment of cancerous tumors.

Doctors have recently begun using beams of heavy particles, such as protons or ions, to treat cancerous tumors. These beams can deposit most of their energy inside the tumor while leaving the surrounding healthy tissue unharmed. Unfortunately, the beams are generated by large particle accelerators, which makes the treatment cost-prohibitive for many patients.

The German lab is developing a new therapeutic approach using high-powered lasers instead of inconvenient and expensive particle accelerators.

Proton density after laser impact on a spherical solid-density target: irradiated by an ultra-short, high-intensity laser (not in picture), the intense electromagnetic field rips electrons away from their ions and creates a plasma.

Image Credits: Axel Huebl, HZDR, David Pugmire, ORNL

HZDR researcher Michael Bussmann explains in a recent blog post that the team is only able to run such complex simulations because it has access to GPU-accelerated supercomputers.

The team does all of its calculations on Titan’s Tesla GPUs at a rate 10 to 100 times faster than what is possible on CPU-only machines. “We no longer think of simulations in terms of CPU hours but rather frames per second,” Bussmann said, describing the effect this speed-up has had on the team’s research.

Using Machine Learning to Optimize Warehouse Operations

With thousands of orders placed every hour, each assigned to a pick list, Europe's leading online fashion retailer Zalando is using GPU-accelerated deep learning to identify the shortest possible route to products in its 1.3-million-square-foot distribution center.

Two schematics of a rope-ladder warehouse zone with picks. Blue shelves hold items to be picked; the goal is to find the shortest possible route that allows a worker to visit all blue shelves while starting and ending at the depot.

Calvin Seward, a data scientist focused on warehouse logistics, shares how his team used the Caffe deep learning framework and Tesla K80 GPUs to train a deep neural network that eliminates a processing bottleneck, which in turn enabled the company to split work between workers more efficiently.

How GPUs are Revolutionizing Machine Learning

NVIDIA announced that Facebook will accelerate its next-generation computing system with the NVIDIA Tesla Accelerated Computing Platform, enabling it to drive a broad range of machine learning applications.

Facebook is the first company to train deep neural networks on the new Tesla M40 GPUs, introduced last month. The M40 will play a large role in the company's new open-source "Big Sur" computing platform, Facebook AI Research's (FAIR) purpose-built system designed specifically for neural network training.

Big Sur, Facebook's Open Rack V2-compatible 8-GPU server, is two times faster than Facebook's existing system. It will enable the company to train twice as many neural networks, which in turn will help develop more accurate neural network models and new classes of advanced applications.

Training the sophisticated deep neural networks that power applications such as speech translation and autonomous vehicles requires a massive amount of computing performance.

With GPUs accelerating the training times from weeks to hours, it’s not surprising that nearly every leading machine learning researcher and developer is turning to the Tesla Accelerated Computing Platform and the NVIDIA Deep Learning software development kit.

A recent article on WIRED explains how GPUs have proven to be remarkably adept at deep learning and how large web companies like Facebook, Google and Baidu are shifting their computationally intensive applications to GPUs.

Artificial intelligence is on the rise, and it's powered by GPU-accelerated machine learning.

Getting Started with OpenACC

This week NVIDIA released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free, renewable license to the PGI C, C++ and Fortran compilers with support for OpenACC.

Figure 1: LS-DALTON benchmark on the Oak Ridge Titan supercomputer, AMD CPU vs. Tesla K20X GPU. Test input: Alanine-3 on the CCSD(T) module. Additional information: NICAM, COSMO.

OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.

When I program with and teach OpenACC, I like to use a four-step cycle to progressively accelerate the code.

  1. Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
  2. Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
  3. Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU.
  4. Optimize: The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes the gains can be bigger.

It's really important after each of these steps to verify that the program still produces correct results, as it's much easier to debug a mistake caused by one small change than to wait until the end, after making many changes, to discover something is wrong. This process is a cycle, so once you have an important hotspot running well with OpenACC, go back to the beginning and find the next place to express parallelism.
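As a concrete illustration of that verification step, a minimal check might compare the current result against one saved from a known-good (for example, serial) run. The sketch below is mine, not part of the benchmark: the function name, array arguments, and the idea of keeping a reference copy are all assumptions.

#include <math.h>
#include <stdio.h>

/* Report the largest absolute difference between the accelerated result
   and a reference result saved from a known-good run. */
void check_result(int n, int m, double A[n][m], double A_ref[n][m])
{
  double max_diff = 0.0;
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      max_diff = fmax(max_diff, fabs(A[j][i] - A_ref[j][i]));
  printf("max difference vs. reference: %e\n", max_diff);
}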

In this post I’ll review steps 2, 3, and 4 using a simple benchmark code, but first, let’s install the NVIDIA OpenACC Toolkit. If you already have an OpenACC compiler installed on your machine, you can skip this step.

Installing the OpenACC Toolkit

You can download the OpenACC Toolkit from http://www.nvidia.com/openacc by clicking the download link. To install it, you need a 64-bit Linux OS and a CUDA-capable GPU. Once you've downloaded the package, untar it into a temporary directory and run the install script. When asked whether to install the CUDA Toolkit components, be sure to answer yes, as they are required by the OpenACC compiler. If you've never installed the CUDA Toolkit on your machine, you will need to take one additional step: installing the NVIDIA CUDA driver. The driver is included in the installation directory, which defaults to /opt/pgi. Within that directory, go to linux86-64/2015/OpenACC/driver and run bash ./*.run, which will walk you through installing the driver; alternatively, head to https://developer.nvidia.com/cuda-downloads to install the CUDA Toolkit and driver via your Linux package manager. If you have any trouble with this step, please post a question on the NVIDIA forums.
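As a rough sketch of those steps on the command line (the archive name and the layout of the extracted directory are placeholders and will differ depending on the version you download):

$ mkdir /tmp/openacc-toolkit && cd /tmp/openacc-toolkit
$ tar -xzf ~/Downloads/nvidia-openacc-toolkit.tar.gz   # archive name is a placeholder
$ sudo ./install                                       # run the install script; answer "yes" to the CUDA Toolkit components
$ cd /opt/pgi/linux86-64/2015/OpenACC/driver           # default install location
$ sudo bash ./*.run                                    # installs the NVIDIA CUDA driver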

Before moving forward, be sure to take a look at the documentation included in the Toolkit installation directory under the linux86-64/2015/OpenACC/doc directory.

Jacobi Iteration

Laplace's Equation on a ring with boundary conditions.
Figure 2: Laplace’s Equation on a ring with boundary conditions. Laplace Equation Image via WikiMedia Commons

The benchmark I will use is a simple C code that solves the 2D Laplace equation with the iterative Jacobi solver. Iterative methods are a common technique to approximate the solution of elliptic PDEs, like the Laplace equation, within an allowable tolerance. In the example I’ll perform a simple stencil calculation where each point calculates its value as the mean of its neighbors’ values. The Jacobi iteration continues until either the maximum change in value between two iterations drops below the tolerance level or it exceeds the maximum number of iterations. For the sake of consistent comparison all the results I show are for 1000 complete iterations. Here’s the main Jacobi iteration loop.

while ( error > tol && iter < iter_max ) {
  error = 0.0;
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                            A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }

  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
  iter++;
}

The outer loop iterates for 1000 iterations or until the results converge to within an acceptable tolerance (whichever comes first). I’ll refer to this loop as the convergence loop. The two j, i loop nests within the convergence loop do the main calculation. The first loop calculates the new values for each point in the matrix as the mean of the neighboring values. The second loop nest copies these values into the A array for the next iteration. Note that the code calculates an error value for each element, which is how much the value changed between iterations, and finds the maximum value of error to determine whether the solution has converged.

Express Parallelism

In this code it’s clear that the convergence loop cannot be parallelized because of the dependence of each iteration on the results of the previous. This is known as a data dependency. The two inner loop nests over i and j, however, can be parallelized, as each iteration writes its own unique value and does not read from the values calculated in other iterations. These loops can be executed forward, backward, in a random order, or all simultaneously and the results will be the same (modulo floating point error). Therefore, the two loop nests are candidates for acceleration. I’ll use the OpenACC kernels directive to accelerate them. The kernels directive tells the compiler to analyze the code in the specified region to look for parallel loops. In the following code, I’ve placed a kernels region within the convergence loop, around the two loop nests that perform the calculation and value swapping, since this is where the compiler is most likely to find parallelism.

while ( error > tol && iter < iter_max ) {
  error = 0.0;
  #pragma acc kernels
  {
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(A[j][i] - Anew[j][i]));
      }
    }

    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  }

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
  iter++;
}

Having inserted an OpenACC directive I now want to build the code. For this I’ll use the PGI C/C++ compiler, pgcc, which is included in the OpenACC Toolkit. I want the compiler to generate code specific to NVIDIA GPUs, so I add the -ta=tesla compiler option (ta means target accelerator and tesla targets NVIDIA Tesla GPUs). I also use the optional -Minfo=accel compiler flag to tell the compiler to provide feedback about the code it generates. Below is the output from this command.

$ pgcc -acc -ta=tesla -Minfo=accel laplace2d-kernels.c
main:
     56, Generating copyout(Anew[1:4094][1:4094])
         Generating copyin(A[:][:])
         Generating copyout(A[1:4094][1:4094])
         Generating Tesla code
     58, Loop is parallelizable
     60, Loop is parallelizable
         Accelerator kernel generated
         58, #pragma acc loop gang /* blockIdx.y */
         60, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         64, Max reduction generated for error
     68, Loop is parallelizable
     70, Loop is parallelizable
         Accelerator kernel generated
         68, #pragma acc loop gang /* blockIdx.y */
         70, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Notice from the compiler output that the compiler identified two regions of code for generating an accelerator kernel. (A kernel is simply a function that is run in parallel on the accelerator / GPU). In this case, the loop nests starting at lines 58 and 68 have been identified as safe to parallelize and the compiler has generated kernels for these loops. The compiler also analyzed which arrays are used in the calculation and generated code to move A and Anew into GPU memory. The compiler even detected that it needs to perform a max reduction on the error variable.

A reduction is an operation that takes many input values and combines them into a single result using a binary operator. In this case, each loop iteration calculates an error value, but the iteration logic only needs one result: the maximum error. Reductions can be tricky in some parallel programming languages, but fortunately the OpenACC compiler handled the reduction for me.

Now that I've built the code, I can expect it to run significantly faster on my GPU, right? Figure 3 compares the performance of this code running on multiple cores of an Intel Xeon E5-2698 CPU to its performance on an NVIDIA Tesla K40 GPU. The CPU code was parallelized using OpenMP and compiled with PGI 15.5 and -fast optimizations.

OpenACC Jacobi Iteration after adding a kernels directive.
Figure 3: OpenACC Jacobi Iteration after adding a kernels directive (rightmost bar), compared to the original serial code on the CPU and accelerated with OpenMP (2-16 CPU threads).

Figure 3 clearly shows a speedup from increasing the number of CPU threads—although not for more than 8 threads—but what about the GPU performance? The OpenACC code on the GPU slightly outperforms the original, single-threaded version, but not the multi-threaded version. I used the NVIDIA Visual Profiler to help understand what's happening. Figure 4 shows the GPU timeline, zoomed in to just 2 iterations of the convergence loop. The profiler shows the kernels of the two loop nests in green and purple, but the run time of the kernels is overshadowed by the tan boxes that represent PCIe data copies (host to device and device to host). Now that I know the problem, I can fix it.

NVIDIA Visual Profiler of OpenACC Jacobi Iteration after adding kernels directive.
Figure 4: Running the NVIDIA Visual Profiler on the OpenACC Jacobi iteration code after adding a kernels directive shows that run time is dominated by PCI-Express data transfers.
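If you want to collect a timeline like this for your own runs, one approach (a sketch; the executable and file names are placeholders, and it assumes the CUDA command-line profiler nvprof is on your path) is to record a profile and then open it in the Visual Profiler:

$ nvprof -o laplace2d.timeline ./a.out   # collect a GPU timeline while the program runs
$ nvvp laplace2d.timeline                # inspect the timeline in the NVIDIA Visual Profiler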

Express Data Locality

Why is PCIe data transfer time overwhelming the speed-up from GPU execution of the Jacobi iteration? Looking closely at the Visual Profiler timeline it appears that the code copies data to the GPU each time it enters the kernels region, and copies the data back to the CPU each time it exits the kernels region. Looking at the algorithm, however, I only really need the initial value of A copied to the device at the beginning of the convergence loop and the final value copied out at the end. The data doesn’t change between iterations of the while loop, so I can leave the arrays on the device. As the programmer, it’s easy for me to see this because I know the algorithm, but it’s very difficult for the compiler. Compilers can’t afford to produce wrong answers, so they’d rather copy the data too often than not copy it and produce incorrect results. In fact, the OpenACC specification requires the compiler to default to copying arrays to and from the device at every kernels region when the programmer doesn’t specify how to handle them. Fortunately, OpenACC provides a way to tell the compiler when data really needs to move. The OpenACC data construct specifies data movement or residency options for a region of code, using clauses that inform the compiler explicitly how to handle arrays, as shown in the following code.

#pragma acc data copy(A) create(Anew)
while ( error > tol && iter < iter_max )
{
  error = 0.0;

  #pragma acc kernels
  {
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(A[j][i] - Anew[j][i]));
      }
    }

    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  }

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
  iter++;
}

I added a data region around the while loop, since this is where I can reuse the data on the GPU. I used the copy clause to tell the compiler that it should copy the A array to and from the device as it enters and exits the region, respectively. Since the Anew array is only used within the convergence loop, I tell the compiler to create temporary space on the device, since I don't care about the initial or final values of that array. There are other data clauses that I don't use in this example (a short sketch of how they fit together follows the list):

  • copyin initializes the value of the array but does not copy the results back;
  • copyout allocates uninitialized space on the device and copies the final values back; and
  • present tells the compiler to assume that the array was moved or allocated on the device somewhere else in the program.
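To make these clauses concrete, here is a minimal sketch of how they might appear together. The function, array names, and sizes are made up for illustration and are not part of the Jacobi example.

/* x: read on the device, never copied back (copyin)
   y: computed on the device, only final values copied back (copyout)
   tmp: device-only scratch space (create)
   z: assumed already on the device from an enclosing data region (present) */
void scale_and_add(int n, const float *restrict x, float *restrict y,
                   float *restrict tmp, const float *restrict z)
{
  #pragma acc data copyin(x[0:n]) copyout(y[0:n]) create(tmp[0:n]) present(z[0:n])
  {
    #pragma acc kernels
    for (int i = 0; i < n; i++) {
      tmp[i] = x[i] * z[i];
      y[i]   = tmp[i] + x[i];
    }
  }
}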

Returning to the Jacobi code: this single-line data directive makes a big difference to performance. Figure 5 shows that with data locality properly expressed, the code runs six times faster than a full CPU socket, with almost no data motion.

OpenACC Jacobi Iteration after adding data region.
Figure 5: OpenACC Jacobi Iteration after adding a `data` region to optimize data locality (rightmost bar), compared to the original serial code on the CPU and accelerated with OpenMP (2-16 CPU threads).

Figure 6 shows the NVIDIA Visual Profiler timeline, which shows that there’s no data motion between convergence iterations, just as expected.

NVIDIA Visual Profiler of OpenACC Jacobi Iteration after adding a data directive.
Figure 6: NVIDIA Visual Profiler timeline for the OpenACC Jacobi iteration after adding a `data` region to optimize locality.

The fourth and final step in the cycle is to optimize the loops. The PGI compiler already does a great job generating fast code for this simple example, so there is not much more to gain, but a small optimization is still possible, as the next section shows. If you'd like to see more about optimizing your OpenACC code, you should check out my Advanced OpenACC Programming session from GTC.

Optimize

After some experimentation I found that using the loop tile clause reduces the runtime by a few percent. The loop directive is a way to give the compiler additional information about the following loop, such as informing the compiler that all loop iterations are data independent, or guiding the compiler's optimization within the loop.

It turns out that there’s a lot of data reuse between loop iterations in the i and j dimensions of A, so we’ll use the tile directive to tell the compiler it should operate on tiles of the array, rather than parallelizing strictly over columns. If I wanted to do this manually, I’d have to add two more loops to the calculation to break the work up into chunks (as one might when optimizing for CPU cache blocking), but the tile clause tells the compiler to do it automatically. On my machine, a 32×4 tile gives the best performance, reducing the runtime by an additional 6%. Below is the resulting code. Notice that I added a device_type(nvidia) clause to my loop directives, which informs the compiler to only apply this optimization on NVIDIA GPUs, where I’ve confirmed that it’s beneficial. Using the device_type clause helps keep my code portable to other devices.

#pragma acc data copy(A) create(Anew)
while ( error > tol && iter < iter_max )
{
  error = 0.0;

  #pragma acc kernels
  {
    #pragma acc loop tile(32,4) device_type(nvidia)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(A[j][i] - Anew[j][i]));
      }
    }

    #pragma acc loop tile(32,4) device_type(nvidia)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  }

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
  iter++;
}

Performance Portability from GPUs to CPUs with OpenACC

OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.

And starting today, with the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs. The new PGI Fortran, C and C++ compilers for the first time allow OpenACC-enabled source code to be compiled for parallel execution on either a multicore CPU or a GPU accelerator. This capability provides tremendous flexibility for programmers, enabling applications to take advantage of multiple system architectures with a single version of the source code.

“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”

OpenACC Portable Performance
As the chart above shows, performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to that of MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups.

Key benefits of running OpenACC on multicore CPUs include:

  • Effective utilization of all cores of a multicore CPU or multi-socket server for parallel execution
  • Common programming model across CPUs and GPUs in Fortran, C, and C++
  • Rapid exploitation of existing multicore parallelism in a program using the KERNELS directive, which enables incremental optimization for parallel execution
  • Scalable performance across multicore CPUs and GPUs

PGI’s compiler roadmap, shown below, includes plans to support all of the compute processors that are likely to be viable building blocks for Exascale systems.

PGI Roadmap

How to Compile OpenACC Applications for Multicore CPUs

Passing the flag -ta=multicore on the PGI compiler (pgcc, pgc++ or pgfortran) command line tells the compiler to generate parallel multicore code for OpenACC compute regions, instead of the default of generating parallel GPU kernels. The parallel multicore code will execute in much the same fashion as if you had used OpenMP omp parallel directives instead of OpenACC compute regions.
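For example, the Jacobi code from the first half of this post could be rebuilt for the multicore target with a command like the following (the source file name is the one used earlier; substitute your own):

$ pgcc -acc -ta=multicore laplace2d-kernels.c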

Adding -Minfo or -Minfo=accel will enable compiler feedback messages, giving details about the parallel code generated, as in the following.

    ninvr:
       59, Loop is parallelizable
           Generating Multicore code
       59, #pragma acc loop gang
    pinvr:
       90, Loop is parallelizable
           Generating Multicore code
       90, #pragma acc loop gang

You can control how many threads the program will use to run the parallel compute regions with the environment variable ACC_NUM_CORES. The default is to use all available cores on the system. For Linux targets, the runtime will launch as many threads as physical cores (not hyper-threaded logical cores). OpenACC gang-parallel loops run in parallel across the threads. If you have an OpenACC parallel construct with a num_gangs(200) clause, the runtime will take the minimum of the num_gangs argument and the number of cores on the system, and launch that many threads. That avoids the problem of launching hundreds or thousands of gangs, which makes sense on a GPU but which would overload a multicore CPU.
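As a quick illustration (the executable name and core count are arbitrary), controlling the thread count from the shell looks like this:

$ export ACC_NUM_CORES=16   # run OpenACC compute regions on 16 cores
$ ./a.out
$ unset ACC_NUM_CORES       # default: one thread per physical core
$ ./a.out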

Single Programming Model, Portable High Performance

The goal of OpenACC is to have a single programming model that allows developers to write a single program that runs with high performance in parallel across a wide range of target systems. For the last few years, PGI has been developing and delivering OpenACC compilers targeting NVIDIA Tesla and AMD Radeon GPUs, but performance portability requires being able to run the same program with high performance in parallel on non-GPU targets, and in particular on multicore and manycore CPUs. So the first reason to use OpenACC with -ta=multicore is if you have an application that you want to run on systems with GPUs, and also on systems without GPUs but with multicore CPUs. This lets you develop your program once, without compile-time conditionals (ifdefs) or special modules for each target and the increased development and maintenance cost they bring.
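As a sketch of what single-source portability looks like in practice (the file and binary names are hypothetical), the same OpenACC source file is simply compiled twice with different target flags, with no ifdefs in the code:

$ pgcc -acc -ta=tesla     -Minfo=accel app.c -o app_gpu        # GPU build
$ pgcc -acc -ta=multicore -Minfo=accel app.c -o app_multicore  # multicore CPU build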

Even if you are only interested in GPU-accelerated targets, you can do parallel OpenACC code development and testing on your multicore laptop or workstation without a GPU. This can separate algorithm development from GPU performance tuning. Debugging is often easier on the host than with a heterogeneous binary with both host and GPU code.

Working Through an Example: Please do Try This at Home!

To demonstrate the performance shown in the chart above, you can download the version of miniGhost used to generate the performance numbers from the PGI website.

To build the OpenMP version for execution on multicore, issue the following commands.

% make build_omp
…
mg_stencil_3d7pt:
   197, Parallel region activated
   200, Parallel loop activated with static block schedule
   202, Generated 4 alternate versions of the loop
        Generated vector sse code for the loop
        Generated 5 prefetch instructions for the loop
   213, Barrier
   216, Parallel loop activated with static block schedule
   218, Mem copy idiom, loop replaced by call to __c_mcopy8
   224, Barrier
        Parallel region terminated
…
% export MP_BIND=yes; make NCPUS=32 run_omp
env OMP_NUM_THREADS=32 time sh -x ./miniGhost.run ./miniGhost.omp >& miniGhost.omp.log
grep elapsed miniGhost.omp.log
8527.57user 5.96system 4:27.43elapsed 3190%CPU (0avgtext+0avgdata 6650048maxresident)k

This example is using the PGI OpenMP compiler, but the OpenMP time in the chart above uses the Intel OpenMP compiler. You’ll see about the same execution time using either of these two OpenMP compilers.

To build the OpenACC version for multicore using PGI, issue the following commands.

% make build_multicore
…
mg_stencil_3d7pt:
   219, Loop is parallelizable
        Generating Multicore code
   219, !$acc loop gang
   220, Loop is parallelizable
   221, Loop is parallelizable
   232, Loop is parallelizable
        Generating Multicore code
   232, !$acc loop gang
   233, Loop is parallelizable
   234, Loop is parallelizable
…
% export MP_BIND=yes; make NCPUS=32 run_multicore
env ACC_NUM_CORES=32 time sh -x ./miniGhost.run ./miniGhost.multi >& miniGhost.multi.log
grep elapsed miniGhost.multi.log
8006.06user 4.88system 4:14.04elapsed 3153%CPU (0avgtext+0avgdata 6652288maxresident)k

Finally, to build the OpenACC version for execution on an NVIDIA GPU using PGI, issue the following commands.

% make build_tesla
…
mg_stencil_3d7pt:
   216, Generating present(work(:,:,:),grid(:,:,:))
   219, Loop is parallelizable
   220, Loop is parallelizable
   221, Loop is parallelizable
        Accelerator kernel generated
        Generating Tesla code
   220, !$acc loop gang, vector(2) ! blockidx%y threadidx%y
   221, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
   232, Loop is parallelizable
   233, Loop is parallelizable
   234, Loop is parallelizable
        Accelerator kernel generated
        Generating Tesla code
   233, !$acc loop gang ! blockidx%y
   234, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
…
% make DEV_NUM=0 run_tesla
env ACC_DEVICE_NUM=0 time ./miniGhost.run ./miniGhost.tesla >& miniGhost.tesla.log
grep elapsed miniGhost.tesla.log
122.25user 30.12system 2:32.61elapsed 99%CPU (0avgtext+0avgdata 7542656maxresident)k

OpenACC Data Clauses on Multicore

In the OpenACC execution model, the multicore CPU is treated like an accelerator device that shares memory with the initial host thread. With a shared-memory device, most of the OpenACC data clauses (copy, copyin, copyout, create) are ignored, and the accelerator device (the parallel multicore) uses the same data as the initial host thread. Similarly, update directives and most OpenACC data API routines will not generate data allocation or movement. Other data clauses are still honored, such as private and reduction, which may require some dynamic memory allocation and data movement, but no more than the corresponding OpenMP data clauses.

When using OpenACC with a GPU, data gets copied from the system memory to device memory (and back). The user is responsible for keeping the two copies of data coherent, as needed. When using OpenACC on a multicore, there is only one copy of the data, so there is no coherence problem. However, the GPU OpenACC program can produce different results than a multicore OpenACC program if the program depends on the parallel compute regions updating a different copy of the data than the sequential initial host thread regions.

#pragma acc data create(a[0:n]) present(x[0:n],b[0:n])
{
    // following loop executed on device
    #pragma acc parallel loop
    for(i=0;i<n;++i) a[i] = b[i];

    // following loop executed on host
    for(i=0;i<n;++i) a[i] = c[i];

    // following loop executed on device
    #pragma acc parallel loop
    for(i=0;i<n;++i) x[i] = a[i];
    ...
}

On a GPU, the above code fragment allocates a copy of the array a on the device. It then fills in the device copy and the host copy with different values. The last loop will get the values from the device copy of a, so it's equivalent to x[i] = b[i]. When compiled for a multicore CPU, the first two loops are both executed on the host, the first with all multicore threads and the second with a single thread. Both loops update the same copy of a, and the last loop is equivalent to x[i] = c[i].
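If, in a case like this, you want the GPU version to see the host's assignment (so that the final loop reads c[i] on both targets), an update directive can re-synchronize the two copies. A hedged sketch, reusing the fragment above; on a multicore target the update is simply a no-op, since there is only one copy of the data:

    // after the host loop, refresh the device copy of a from the host copy
    for(i=0;i<n;++i) a[i] = c[i];
    #pragma acc update device(a[0:n])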

Requirements and Limitations

PGI compilers on Linux, Windows, and Mac OS X support OpenACC for multicore. It works with any supported PGI target, including targets for which GPUs are not supported. This feature will work with any valid PGI license.

There are a few limitations in this release, which will be removed in future releases. In this release, the collapse clause is ignored, so only the outer loop is parallelized. The worker level of parallelism is ignored; PGI is still exploring how best to generate parallel code that includes gang, worker and vector parallelism. Also, no optimization or tuning of the loop code is done. For instance, when compiling for a GPU, the compiler will reorder loops to optimize array strides for the parallelism profile. None of this is implemented in the multicore target in this release. Finally, the vector level of parallelism is not being used to generate SIMD code in this release. PGI expects application performance will improve as these limitations are relaxed.
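To illustrate the collapse limitation mentioned above: in a loop nest like the following sketch (the loop bounds and body are illustrative), this release parallelizes only the outer j loop across CPU threads, even though the collapse clause asks for both loops to be combined:

#pragma acc parallel loop collapse(2)  // collapse ignored by -ta=multicore in this release
for (int j = 0; j < n; j++)            // parallelized across CPU threads
  for (int i = 0; i < m; i++)          // runs within each thread
    b[j*m + i] = 2.0f * a[j*m + i];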

Conclusion

The PGI 15.10 release allows you to generate multicore CPU code from your OpenACC programs, enabling truly performance-portable parallel programs across CPUs and GPUs.

Register to download a free trial of the PGI 15.10 compilers and check it out for yourself. If you’re new to OpenACC, you can register for a free online OpenACC training course. To get started developing with OpenACC, try the NVIDIA OpenACC Toolkit, and read this introductory Parallel Forall post on OpenACC. A detailed article on OpenACC for CPUs with PGI 15.10 will be included in an upcoming PGInsider Newsletter from PGI.

Learn Why Virtual GPUs Are the Talk of Citrix Synergy

Graphics acceleration is a hot topic at this year’s Citrix Synergy conference for virtualization, mobility, networking and cloud solutions, May 12-14 in Orlando, Fla.

That's because customers like Bell Helicopter, City of Waukesha, Drake University and PSA Peugeot are sharing stories of improved mobility, scalability and employee productivity due to NVIDIA GRID graphics acceleration for virtual desktop infrastructure (VDI).

Virtualization experts like Rachel Berry @rhbBSE, Thomas Poppelgaard @POPPELGAARD and Dane Young @youngtech will highlight VDI with NVIDIA GRID in their talks.

Among them, you can take a “Deep Dive on HDX 3D Pro.” Uncover “The Anatomy of a High-Performance, GPU-enabled Virtual Desktop.” Or learn “How to be Successful with GPU Virtualization 2.0.”

Partners including Cisco, Dell, HP and NetApp will also be talking about GRID.

Taking off: Customers like Bell Helicopter are sharing stories of improved mobility, scalability and employee productivity due to NVIDIA GRID graphics acceleration for VDI.

Stop by Cisco’s booth to learn about the Cisco Unified Computing System with NVIDIA GRID and Tesla.

Visit Dell for a discussion of dedicated GPU, vGPU and reference architecture.

Hear from HP on how to enable the most productive workplace for your users’ accelerated graphics for VDI.

Or learn from NetApp about remote visualization with FlexPod for Citrix and NVIDIA GRID.

Finally, Citrix, Dell and NVIDIA will show – live – how easy it is to set up VDI with NVIDIA GRID with the "#60in60 Challenge." Four small teams – one each from Dell and NVIDIA, plus two groups of Citrix Technology Professionals led by Dane Young and Thomas Poppelgaard – will set up 60 Citrix XenDesktop with NVIDIA GRID vGPU virtual desktops.

Each team will get just 60 minutes. Watch them battle it out in the Solutions Expo Hall May 12 at 7pm, May 13 at 1pm and May 14 at 3pm.

Want to set up your own GPU-enabled server in 60 minutes or less? Download the vGPU deployment guide to learn how.

So if you’re attending Citrix Synergy, stop by NVIDIA booth 305 to see live demonstrations of NVIDIA GRID in action. And tweet us to win an NVIDIA SHIELD. Just mention @NVIDIAGRID and use #citrixsynergy in your tweet.

The Great Data Center Migration: Why Virtualized 3D Graphics Are Moving to GPUs

Behind the walls of data centers around the world, a migration’s taking place. Virtual delivery of 3D graphics is moving from the CPU to the GPU.

This was clear last week at Citrix Synergy, a conference for virtualization, mobility, networking and cloud solutions.

In separate talks, virtualization experts Rachel Berry, Thomas Poppelgaard and Dane Young each featured NVIDIA GRID vGPU graphics acceleration. It was also in sessions and demos throughout the show, including those from our partners Cisco, Dell, HP and NetApp.

Traditional virtual desktop infrastructure (VDI) offerings relied solely on the support of server CPUs. But limits imposed by the CPU made it nearly impossible to get a satisfactory user experience from virtualized, interactive, media-rich applications.

As a result, virtualization had worked well only for some—primarily task workers and certain knowledge workers. Left out were those with more graphically intense workloads—graphic designers, developers, and video producers and editors.

That’s now changing. GRID technology is opening new pathways for these users by offloading graphics processing from the CPU to the GPU.

Dell, Citrix and NVIDIA technologies offer a powerful combination to get this done. With Dell PowerEdge R730 servers running Citrix XenDesktop 7 and NVIDIA GRID vGPUs, IT staff can deliver rich, PC-graphics experiences and applications to more users. Meanwhile, applications and critical data remain protected and secure in the data center.

60in60 Challenge
Easy as VDI: Teams from Dell, NVIDIA and Citrix raced to set up 60 virtual desktops in 60 minutes.

At Citrix Synergy, Dell, Citrix and NVIDIA showed just how easy it is to set up VDI with NVIDIA GRID with the “#60in60 Challenge.” Four small teams—from Dell, NVIDIA and two groups of Citrix Technology Professionals—raced to set up 60 Citrix XenDesktop with NVIDIA GRID vGPU virtual desktops.

Each team had just 60 minutes using off-the-shelf hardware and software. After three rounds over the course of three days, all the teams finished within minutes of each other. NVIDIA’s Team Green achieved the fastest time with 60 desktops in 53 minutes.

Want to set up your own GPU-enabled server? Download the GRID vGPU deployment guide to learn how.

GRID delivers on the promise of instant access to, and collaboration on, powerful applications while users are on the go. Plus, GRID allows many virtual machines to share the power of a single GPU, with no compromises in performance.

Learn how customers like Bell Helicopter, City of Waukesha, Drake University and PSA Peugeot are using VDI with NVIDIA GRID graphics acceleration to improve mobility, scalability and employee productivity.

With access to such high-quality virtualized graphics, whether delivered via desktop or to devices far afield, the great data center migration looks to continue.


