Programming
Python Primer for the Impatient
Python is one of the scripting languages supported by GLSL Hacker. Here is a quick introduction to the essential notions and syntax of Python programming language. All following notions are general and are not specific to GLSL Hacker. Thanks to this primer, you will be able to quickly tweak and hack GLSL Hacker demos. GLSL Hacker 0.5.0 is available with a Python 2.7 plugin.
The reference manual of Python 2 can be found HERE.
This primer does not cover advanced topics like object oriented programming. It follows the same line than Lua Primer fro the Impatient: providing a quick way to read and write basic Python programs or scripts.
1 – Comments
Python supports two kinds of comments: single line comments and multi-line comments.
Single line comment: #
# this is a single line comment in Python
Multi-line comment with triple quotes: “””
""" this is a multi-line comment in Python """
2 – Variables
Variables in Python have a type but there is no type declaration. Common types are numbers (float, integer), strings, lists, tuples and dictionnaries. Variables defined in a function have local scope while Variables defined outside functions have global scope.
num_vertices = 0 # integer variable width = 10.25 # float variable mesh_name = "my_kool_mesh" # string variable
Tuples use parentheses while lists use brackets. Tuples can’t be updated, their size can’t be changed while lists can be updated, elements can be added or removed.
shader_tuple = ('Vertex', 'Fragment', 'Geometry') # tuple program_list = ['prog01', 'prog02', 'prog03'] # list program_list.append('prog04') program_list.append('prog05')
Dictionnaries is a kind of hash table and are made up of pairs of key:value. The key is usually a number or a string.
node_dict = {} # dictionnary node_dict['node01'] = 100 node_dict['node02'] = 101 node_dict['node03'] = 102
To sum up:
my_tuple = () my_list = [] my_dict = {}
You can convert a type to another type with float(), int(), str(), tuple() or dict():
x = 10 x_str = str(x)
To concatenate strings, just use the + operator:
space = " " app_name = "GLSL" + space + "Hacker"
3 – Indentation
Unlike C, PHP or Lua, Python does not have braces to delimit blocks of code like functions or tests. To delimit blocks of code, Python uses indentation and indentation is severely inspected by Python, The smallest difference in indentation leads to fatal compilation error! So be careful with indentation and use a correct text editor to handle indentation properly.
4 – Functions
Functions in Python can take multiple arguments and can return multiple results.
def myKoolFunc(a, b, c): # indentation!!! # Do something useful with a, b and c: sum = a+b+c avg = sum/3 return sum, avg x, y = myKoolFunc(1, 2, 3)
5 – Operators
Comparison: The comparison operators are the same than in C language:
- equal: (a == b)
- not equal: (a != b)
- greater than: (a > b)
- greater than or equal: (a >= b)
- lesser than: (a < b)
- lesser than or equal: (a <= b)
Logical:
- and: (a and b)
- or: (a or b)
Bitwise:
- binary AND: (a & b)
- binary OR: (a | b)
- binary left shift: a << b
- binary right shift: a >> b
- binary XOR: (a ^ b)
- binary complement: ~a
6 – Control Structures
Conditional tests:
if (a == x): # do something elif (a == y): # do something else: # do something
Loops
for:
for i in range(0, 10): # do something for i in "GLSL Hacker": print(i) # print current letter for v in range(len(vertices_list)): Update_Position(v)
while:
i=0 while (i < 10): # do something i += 1
7 – Built-in functions
Python comes with a ton of modules and that’s one of the strength of Python:
Fans of Python use the phrase batteries included to describe the standard library, which covers everything from asynchronous processing to zip files. The language itself is a flexible powerhouse that can handle practically any problem domain. Build your own web server in three lines of code. Build flexible data-driven code using Python’s powerful and dynamic introspection capabilities and advanced language features such as meta-classes, duck typing and decorators.
There are so many modules and functions in Python standard library and I’m not going to list them here. Nonetheless, I can give you an example of the use of platform module to get the version of Python. To use a module, use the import keyword:
import platform py_version = str(platform.python_version())
There is a demo in GLSL Hacker Code Sample Pack that shows the use of the platform module:

GLSL Hacker – Python platform module test
import math s = math.sin(radians) c = math.cos(radians) p = math.pow(x, y)
There’s also a random module:
import random # possible values for y: 2, 3, 4 and 5. y = math.randint(2, 5)
The string module is also very useful…
What’s new in the .NET Framework 4.5.2
- New APIs for ASP.NET apps. The new HttpResponse. AddOnSendingHeaders and HttpResponseBase. AddOnSendingHeaders methods let you inspect and modify response headers and status code as the response is being flushed to the client app. Consider using these methods instead of the PreSendRequestHeaders and PreSendRequestContent events; they are more efficient and reliable.
The HostingEnvironment. QueueBackgroundWorkItem method lets you schedule small background work items. ASP.NET tracks these items and prevents IIS from abruptly terminating the worker process until all background work items have completed. This method can’t be called outside an ASP.NET managed app domain.
The new HttpResponse. HeadersWritten and HttpResponseBase. HeadersWritten properties return Boolean values that indicate whether the response headers have been written. You can use these properties to make sure that calls to APIs such as HttpResponse. StatusCode (which throw exceptions if the headers have been written) will succeed.
- Resizing in Windows Forms controls. This feature has been expanded. You can now use the system DPI setting to resize components of the following additional controls (for example, the drop-down arrow in combo boxes):
ComboBox
ToolStripComboBox
ToolStripMenuItem
Cursor
DataGridView
DataGridViewComboBoxColumnThis is an opt-in feature. To enable it, set the EnableWindowsFormsHighDpiAutoResizing element to true in the application configuration (app.config) file:
<appSettings> <add key="EnableWindowsFormsHighDpiAutoResizing" value="true" /> </appSettings>
- New workflow feature. A resource manager that’s using the EnlistPromotableSinglePhase method (and therefore implementing the IPromotableSinglePhaseNotification interface) can use the new Transaction. PromoteAndEnlistDurable method to request the following:
- Promote the transaction to a Microsoft Distributed Transaction Coordinator (MSDTC) transaction.
- Replace IPromotableSinglePhaseNotification with an ISinglePhaseNotification, which is a durable enlistment that supports single phase commits.
This can be done within the same app domain, and doesn’t require any extra unmanaged code to interact with MSDTC to perform the promotion. The new method can be called only when there’s an outstanding call from System.Transactions to the IPromotableSinglePhaseNotification Promote method that’s implemented by the promotable enlistment.
- Profiling improvements. The following new unmanaged profiling APIs provide more robust profiling:
COR_PRF_ASSEMBLY_REFERENCE_INFO Structure
COR_PRF_HIGH_MONITOR Enumeration
ICorProfilerCallback6::GetAssemblyReferences Method
ICorProfilerInfo5::GetEventMask2 Method
ICorProfilerInfo5::SetEventMask2 Method
ICorProfilerAssemblyReferenceProvider::AddAssemblyReference MethodPrevious ICorProfiler implementations supported lazy loading of dependent assemblies. The new profiling APIs require dependent assemblies that are injected by the profiler to be loadable immediately, instead of being loaded after the app is fully initialized. This change doesn’t affect users of the existing ICorProfiler APIs.
- Debugging improvements. The following new unmanaged debugging APIs provide better integration with a profiler. You can now access metadata inserted by the profiler as well as local variables and code produced by compiler ReJIT requests when dump debugging.
ICorDebugProcess7::SetWriteableMetadataUpdateMode Method
ICorDebugILFrame4::EnumerateLocalVariablesEx Method
ICorDebugILFrame4::GetLocalVariableEx Method
ICorDebugILFrame4::GetCodeEx Method
ICorDebugFunction3::GetActiveReJitRequestILCode Method
ICorDebugILCode2::GetInstrumentedILMap Method - Event tracing changes. The .NET Framework 4.5.2 enables out-of-process, Event Tracing for Windows (ETW)-based activity tracing for a larger surface area. This enables Advanced Power Management (APM) vendors to provide lightweight tools that accurately track the costs of individual requests and activities that cross threads. These events are raised only when ETW controllers enable them; therefore, the changes don’t affect previously written ETW code or code that runs with ETW disabled.
About CUDA – More Than A Programming Model
The CUDA compute platform extends from the 1000s of general purpose compute processors featured in our GPU’s compute architecture, parallel computing extensions to many popular languages, powerful drop-in accelerated libraries to turn key applications and cloud based compute appliances. CUDA extends beyond the popular CUDA Toolkit and the CUDA C/C++ programming language, we invite you to explore the CUDA Ecosystem and learn how you can accelerate your applications.
Widely Used By Researchers
Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and supported by an installed base of over 500 million CUDA-enabled GPUs in notebooks, workstations, compute clusters and supercomputers.
Real People – Real Success Stories
Many researchers and developers have used the CUDA Platform to push the state of the art of their work, read some of their stories in the CUDA In Action Spotlight Series.
Acceleration For All Domains
Learn more about GPU-accelerated applications available for astronomy, biology, chemistry, physics, data mining, manufacturing, finance, and more on the software solutions page and industry solutions page. Check out our dedicated Geo-Intelligence for Developers page. Read some real industrial application case studies.
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
How to get started
Software developers, scientists and researchers can add support for GPU acceleration in their own applications using one of three simple approaches:
- Drop in a GPU-accelerated library to replace or augment CPU-only libraries such as MKL BLAS, IPP, FFTW and other widely-used libraries
- Automatically parallelize loops in Fortran or C code using OpenACC directives for accelerators
- Develop custom parallel algorithms and libraries using a familiar programming language such as C, C++, C#, Fortran, Java, Python, etc.
Start accelerating your application today, learn how by visiting the Getting Started Page.
Learn More
Deep Learning for Computer Vision with MATLAB and cuDNN
Deep learning is becoming ubiquitous. With recent advancements in deep learning algorithms and GPU technology, we are able to solve problems once considered impossible in fields such as computer vision, natural language processing, and robotics.
Deep learning uses deep neural networks which have been around for a few decades; what’s changed in recent years is the availability of large labeled datasets and powerful GPUs. Neural networks are inherently parallel algorithms and GPUs with thousands of cores can take advantage of this parallelism to dramatically reduce computation time needed for training deep learning networks. In this post, I will discuss how you can use MATLAB to develop an object recognition system using deep convolutional neural networks and GPUs.
Why Deep Learning for Computer Vision?
Machine learning techniques use data (images, signals, text) to train a machine (or model) to perform a task such as image classification, object detection, or language translation. Classical machine learning techniques are still being used to solve challenging image classification problems. However, they don’t work well when applied directly to images, because they ignore the structure and compositional nature of images. Until recently, state-of-the-art techniques made use of feature extraction algorithms that extract interesting parts of an image as compact low-dimensional feature vectors. These were then used along with traditional machine learning algorithms.
Enter Deep learning. Deep convolutional neural networks (CNNs), a specific type of deep learning algorithm, address the gaps in traditional machine learning techniques, changing the way we solve these problems. CNNs not only perform classification, but they can also learn to extract features directly from raw images, eliminating the need for manual feature extraction. For computer vision applications you often need more than just image classification; you need state-of-the-art computer vision techniques for object detection, a bit of domain expertise, and the know-how to set up and use GPUs efficiently. Through the rest of this post, I will use an object recognition example to illustrate how easy it is to use MATLAB for deep learning, even if you don’t have extensive knowledge of computer vision or GPU programming.
Example: Object Detection and Recognition
The goal in this example is to detect a pet in a video and correctly label the pet as a cat or a dog. To run this example, you will need MATLAB®, Parallel Computing Toolbox™, Computer Vision System Toolbox™ and Statistics and Machine Learning Toolbox™. If you don’t have these tools, request a trial at www.mathworks.com/trial. For this problem I used an NVIDIA Tesla K40 GPU; you can run it on any MATLAB compatible CUDA-enabled NVIDIA GPU.
Our approach involves two steps:
- Object Detection: “Where is the pet in the video?”
- Object Recognition: “Now that I know where it is, is it a cat or a dog?”
Figure 1 shows what the final result looks like.
Using a Pretrained CNN Classifier
The first step is to train a classifier that can classify images of cats and dogs. I could either:
- Collect a massive amount of cropped, resized and labeled images of cats and dogs in a reasonable amount of time (good luck!), or
- Use a model that has already been trained on a variety of common objects and adapt it for my problem.

For this example, I’m going to go with option (2) which is common in practice. To do that I’m going to first start with a pretrained CNN classifier that has been trained on the ImageNet dataset.
I will be using MatConvNet, a CNN package for MATLAB that uses the NVIDIA cuDNN library for accelerated training and prediction. [To learn more about cuDNN, see this Parallel Forall post.] Download and install instructions for MatConvNet are available on its home page. Once I’ve installed MatConvNet on my computer, I can use the following MATLAB code to download and make predictions using the pretrained CNN classifier. Note: I also use the cnnPredict()
helper function, which I’ve made available on Github.
%% Download and predict using a pretrained ImageNet model % Setup MatConvNet run(fullfile('matconvnet-1.0-beta15','matlab','vl_setupnn.m')); % Download ImageNet model from MatConvNet pretrained networks repository urlwrite('http://www.vlfeat.org/matconvnet/models/imagenet-vgg-f.mat', 'imagenet-vgg-f.mat'); cnnModel.net = load('imagenet-vgg-f.mat'); % Load and display an example image imshow('dog_example.png'); img = imread('dog_example.png'); % Predict label using ImageNet trained vgg-f CNN model label = cnnPredict(cnnModel,img); title(label,'FontSize',20)
The pretrained CNN classifier works great out of the box at object classification. The CNN model is able to tell me that there is a beagle in the example image (Figure 2). While this is certainly a great starting point, our problem is a little different. I want to be able to (1) put a box around where the pet is (object detection) and then (2) label it accurately as a dog or a cat (classification). Let’s start by building a dog vs cat classifier from the pretrained CNN model.
Training a Dog vs. Cat Classifier
The objective is simple. I want to solve a simple classification task: given an image I’d like to train a classifier that can accurately tell me if it’s an image of a dog or a cat. I can do that easily with this pretrained classifier and a few dog and cat images.
To get a small collection of labeled images for this project, I went around my office asking colleagues to send me pictures of their pets. I segregated the images and put them into separate ‘cat’ and ‘dog’ folders under a parent called ‘pet_images’. The advantage of using this folder structure is that the imageSet
function can automatically manage image locations and labels. I loaded them all into MATLAB using the following code.
%% Load images from folder % Use imageSet to load images stored in pet_images folder imset = imageSet('pet_images','recursive'); % Preallocate arrays with fixed size for prediction imageSize = cnnModel.net.normalization.imageSize; trainingImages = zeros([imageSize sum([imset(:).Count])],'single'); % Load and resize images for prediction for ii = 1:numel(imset) for jj = 1:imset(ii).Count trainingImages(:,:,:,jj) = imresize(single(read(imset(ii),jj)),imageSize(1:2)); end end % Get the image labels trainingLabels = getImageLabels(imset); summary(trainingLabels) % Display class label distribution
Feature Extraction using a CNN
What I’d like to do next is use this new dataset along with the pretrained ImageNet to extract features. As I mentioned earlier, CNNs can learn to extract generic features from images. These features can be used to train a new classifier to solve a different problem, like classifying cats and dogs in our problem.
CNN algorithms are compute-intensive and can be slow to run. Since they are inherently parallel algorithms, I can use GPUs to speed up the computation. Here is the code that performs the feature extraction using the pretrained model, and a comparison of multithreaded CPU (Intel Core i7-3770 CPU) and GPU (NVIDIA Tesla K40 GPU) implementations.
%% Extract features using pretrained CNN % Depending on how much memory you have on your GPU you may use a larger % batch size. I have 400 images, so I choose 200 as my batch size cnnModel.info.opts.batchSize = 200; % Make prediction on a CPU [~, cnnFeatures, timeCPU] = cnnPredict(cnnModel,trainingImages,'UseGPU',false); % Make prediction on a GPU [~, cnnFeatures, timeGPU] = cnnPredict(cnnModel,trainingImages,'UseGPU',true); % Compare the performance increase bar([sum(timeCPU),sum(timeGPU)],0.5) title(sprintf('Approximate speedup: %2.00f x ',sum(timeCPU)/sum(timeGPU))) set(gca,'XTickLabel',{'CPU','GPU'},'FontSize',18) ylabel('Time(sec)'), grid on, grid minor


As you can see the performance boost you get from using a GPU is significant, about 15x for this feature extraction problem.
The function cnnPredict
is a wrapper around MatConvNet’s vl_simplenn
predict function. The highlighted line of code in Figure 5 is the only modification you need to make to run the prediction on a GPU. Functions like gpuArray
in the Parallel Computing Toolbox make it easy to prototype your algorithms using a CPU and quickly switch to GPUs with minimal code changes.

Train a Classifier Using CNN Features
With the features I extracted in the previous step, I’m now ready to train a “shallow” classifier. To train and compare multiple models interactively, I can use the Classification Learner app in the Statistics and Machine Learning Toolbox. Note: for an introduction to machine learning and classification workflows in MATLAB, check out this Machine Learning Made Easy webinar.
Next, I will directly train an SVM classifier using the extracted features by calling the fitcsvm
function using cnnFeatures
as the input or predictors and trainingLabels
as the output or response values. I will also cross-validate the classifier to test its validation accuracy. The validation accuracy is an unbiased estimate of how the classifier would perform in practice on unseen data.
%% Train a classifier using extracted features % Here I train a linear support vector machine (SVM) classifier. svmmdl = fitcsvm(cnnFeatures,trainingLabels); % Perform crossvalidation and check accuracy cvmdl = crossval(svmmdl,'KFold',10); fprintf('kFold CV accuracy: %2.2f\n',1-cvmdl.kfoldLoss)
svmmdl
is my classifier that I can now use to classify an image as a cat or a dog.
Object Detection
Most images and videos frames have a lot going on in them. In addition to a dog, there may be a tree or a raccoon chasing the dog. Even with a great image classifier, like the one I built in the previous step, it will only work well if I can locate the object of interest in an image (dog or cat), crop the object and then feed it to a classifier. The step of locating the object is called object detection.
For object detection, I will use a technique called Optical Flow that uses the motion of pixels in a video from frame to frame. Figure 6 shows a single frame of video with the motion vectors overlaid.

The next step in the detection process is to separate out pixels that are moving, and then use the Image Region Analyzer app to analyze the connected components in the binary image to filter out the noisy pixels caused by the camera motion. The output of the app is a MATLAB function (I’m going to call it findPet) that can locate where the pet is in the field of view.
Tying the Workflow Together
I now have all the pieces I need to build a pet detection and recognition system.
To quickly recap, I can:
- Detect the location of the pet in new images;
- Crop the pet from the image and extract features using a pretrained CNN;
- Classify the features using an SVM classifier.
Pet Detection and Recognition
Tying all these pieces together, the following code shows my complete MATLAB pet detection and recognition system.
%% Tying the workflow together vr = VideoReader(fullfile('PetVideos','videoExample.mov')); vw = VideoWriter('test.avi','Motion JPEG AVI'); opticFlow = opticalFlowFarneback; open(vw); while hasFrame(vr) % Count frames frameNumber = frameNumber + 1; % Step 1. Read Frame videoFrame = readFrame(vr); % Step 2. Detect ROI vFrame = imresize(videoFrame,0.25); % Get video frame frameGray = rgb2gray(vFrame); % Convert to gray for detection bboxes = findPet(frameGray,opticFlow); % Find bounding boxes if ~isempty(bboxes) img = zeros([imageSize size(bboxes,1)]); for ii = 1:size(bboxes,1) img(:,:,:,ii) = imresize(imcrop(videoFrame,bboxes(ii,:)),imageSize(1:2)); end % Step 3. Recognize object % (a) Extract features using a CNN [~, scores] = cnnPredict(cnnModel,img,'UseGPU',true,'display',false); % (b) Predict using the trained SVM Classifier label = predict(svmmdl,scores); % Step 4. Annotate object videoFrame = insertObjectAnnotation(videoFrame,'Rectangle',bboxes,cellstr(label),'FontSize',40); end % Step 5. Write video to file writeVideo(vw,videoFrame); fprintf('Frames processed: %d of %d\n',frameNumber,ceil(vr.FrameRate*vr.Duration)); end close(vw);
Conclusion
Solutions to real-world computer vision problems often require tradeoffs depending on your application: performance, accuracy, and simplicity of the solution. Advances in techniques such as deep learning have significantly raised the bar in terms of the accuracy of tasks like visual recognition, but the performance costs were too significant for mainstream adoption. GPU technology has closed this gap by accelerating training and prediction speeds by orders of magnitude.
MATLAB makes computer vision with deep learning much more accessible. The combination of an easy-to-use application and programming environment, a complete library of standard computer vision and machine learning algorithms, and tightly integrated support for CUDA-enabled GPUs makes MATLAB an ideal platform for designing and prototyping computer vision solutions.
Getting Started with OpenACC
This week NVIDIA has released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC Compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free renewable license to the PGI C,C++ and Fortran compilers with support for OpenACC.

OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.
When I program with and teach OpenACC I like to use a 4 step cycle to progressively accelerate the code.
- Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
- Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
- Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU.
- Optimize – The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes gains can be bigger.
It’s really important after each of these steps to verify that the program still produces correct results, as it’s much easier to debug a mistake due to one small change to the code rather than waiting to the end after making many changes. This process is a cycle, so once you have an important hotspot running well with OpenACC, go back to the beginning and find the next place to express parallelism.
In this post I’ll review steps 2, 3, and 4 using a simple benchmark code, but first, let’s install the NVIDIA OpenACC Toolkit. If you already have an OpenACC compiler installed on your machine, you can skip this step.
Installing the OpenACC Toolkit
You can download the OpenACC Toolkit from http://www.nvidia.com/openacc by clicking the download link. In order to install it on your machine, you will need to be running a 64-bit Linux OS and have a CUDA-capable GPU installed. Once you’ve downloaded the package, untar it into a temporary directory and run the install
script. When asked whether to install the CUDA Toolkit Components, be sure to answer yes, as they are required by the OpenACC compiler. If you’ve never installed the CUDA Toolkit on your machine, you will need to take one additional step: installing the NVIDIA CUDA driver. This driver is included in the installation directory, which defaults to /opt/pgi
. Within that directory, go to the linux86-64/2015/OpenACC/driver
directory and use bash ./*.run
to install the CUDA driver or head to https://developer.nvidia.com/cuda-downloads to install the CUDA Toolkit and driver via your Linux package manager. This will walk you through installing the driver. If you have any trouble with this step, please post a question on the NVIDIA forums.
Before moving forward, be sure to take a look at the documentation included in the Toolkit installation directory under the linux86-64/2015/OpenACC/doc
directory.
Jacobi Iteration

The benchmark I will use is a simple C code that solves the 2D Laplace equation with the iterative Jacobi solver. Iterative methods are a common technique to approximate the solution of elliptic PDEs, like the Laplace equation, within an allowable tolerance. In the example I’ll perform a simple stencil calculation where each point calculates its value as the mean of its neighbors’ values. The Jacobi iteration continues until either the maximum change in value between two iterations drops below the tolerance level or it exceeds the maximum number of iterations. For the sake of consistent comparison all the results I show are for 1000 complete iterations. Here’s the main Jacobi iteration loop.
while ( error > tol && iter < iter_max ) { error = 0.0; for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error); iter++; }
The outer loop iterates for 1000 iterations or until the results converge to within an acceptable tolerance (whichever comes first). I’ll refer to this loop as the convergence loop. The two j
, i
loop nests within the convergence loop do the main calculation. The first loop calculates the new values for each point in the matrix as the mean of the neighboring values. The second loop nest copies these values into the A
array for the next iteration. Note that the code calculates an error
value for each element, which is how much the value changed between iterations, and finds the maximum value of error
to determine whether the solution has converged.
Express Parallelism
In this code it’s clear that the convergence loop cannot be parallelized because of the dependence of each iteration on the results of the previous. This is known as a data dependency. The two inner loop nests over i
and j
, however, can be parallelized, as each iteration writes its own unique value and does not read from the values calculated in other iterations. These loops can be executed forward, backward, in a random order, or all simultaneously and the results will be the same (modulo floating point error). Therefore, the two loop nests are candidates for acceleration. I’ll use the OpenACC kernels
directive to accelerate them. The kernels
directive tells the compiler to analyze the code in the specified region to look for parallel loops. In the following code, I’ve placed a kernels
region within the convergence loop, around the two loop nests that perform the calculation and value swapping, since this is where the compiler is most likely to find parallelism.
while ( error > tol && iter < iter_max ) { error = 0.0; #pragma acc kernels { for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(A[j][i] - Anew[j][i])); } } for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } } if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error); iter++; }
Having inserted an OpenACC directive I now want to build the code. For this I’ll use the PGI C/C++ compiler, pgcc
, which is included in the OpenACC Toolkit. I want the compiler to generate code specific to NVIDIA GPUs, so I add the -ta=tesla
compiler option (ta
means target accelerator and tesla
targets NVIDIA Tesla GPUs). I also use the optional -Minfo=accel
compiler flag to tell the compiler to provide feedback about the code it generates. Below is the output from this command.
$ pgcc -acc -ta=tesla -Minfo=accel laplace2d-kernels.c main: 56, Generating copyout(Anew[1:4094][1:4094]) Generating copyin(A[:][:]) Generating copyout(A[1:4094][1:4094]) Generating Tesla code 58, Loop is parallelizable 60, Loop is parallelizable Accelerator kernel generated 58, #pragma acc loop gang /* blockIdx.y */ 60, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 64, Max reduction generated for error 68, Loop is parallelizable 70, Loop is parallelizable Accelerator kernel generated 68, #pragma acc loop gang /* blockIdx.y */ 70, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Notice from the compiler output that the compiler identified two regions of code for generating an accelerator kernel. (A kernel is simply a function that is run in parallel on the accelerator / GPU). In this case, the loop nests starting at lines 58 and 68 have been identified as safe to parallelize and the compiler has generated kernels for these loops. The compiler also analyzed which arrays are used in the calculation and generated code to move A
and Anew
into GPU memory. The compiler even detected that it needs to perform a max reduction on the error
variable.
A reduction is an operation that takes many input values and combines them into a single result using a binary operator. In this c+ase, each loop iteration calculates an error
value, but the iteration logic only needs one result: the maximum error. Reductions can be tricky in some parallel programming languages, but fortunately the OpenACC compiler handled the reduction for me. Now that I’ve built the code, I can expect it to run significantly faster on my GPU, right? Figure 3 shows the performance of this code running on multiple cores of an Intel Xeon E5-2698 CPU to performance on an NVIDIA Tesla K40 GPU. The CPU code was parallelized using OpenMP and compiled with PGI 15.5 and -fast
optimizations.

Figure 1 clearly shows a speedup from increasing the number of CPU threads—although not for more than 8 threads—but what about the GPU performance? The OpenACC code on the GPU slightly outperforms the original, single-threaded version, but not the multi-threaded version. I used the NVIDIA Visual Profiler to help understand what’s happening. Figure 4 shows the GPU timeline, zoomed in to just 2 iterations of the convergence loop. The profiler shows the kernels of the two loop nests in green and purple, but the run time of the kernels is overshadowed by the tan boxes that represent PCIe data copies (host to device and device to host). Now that I know the problem, I can fix it.

Express Data Locality
Why is PCIe data transfer time overwhelming the speed-up from GPU execution of the Jacobi iteration? Looking closely at the Visual Profiler timeline it appears that the code copies data to the GPU each time it enters the kernels
region, and copies the data back to the CPU each time it exits the kernels
region. Looking at the algorithm, however, I only really need the initial value of A
copied to the device at the beginning of the convergence loop and the final value copied out at the end. The data doesn’t change between iterations of the while
loop, so I can leave the arrays on the device. As the programmer, it’s easy for me to see this because I know the algorithm, but it’s very difficult for the compiler. Compilers can’t afford to produce wrong answers, so they’d rather copy the data too often than not copy it and produce incorrect results. In fact, the OpenACC specification requires the compiler to default to copying arrays to and from the device at every kernels
region when the programmer doesn’t specify how to handle them. Fortunately, OpenACC provides a way to tell the compiler when data really needs to move. The OpenACC data
construct specifies data movement or residency options for a region of code, using clauses that inform the compiler explicitly how to handle arrays, as shown in the following code.
#pragma acc data copy(A) create(Anew) while ( error > tol && iter < iter_max ) { error = 0.0; #pragma acc kernels { for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(A[j][i] - Anew[j][i])); } } for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } } if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error); iter++; }
I added a data
region around the while
loop, since this is where I can reuse the data on the GPU. I used the copy
clause to tell the compiler that it should copy the A array to and from the device as it enters and exits the region, respectively. Since the Anew
array is only used within the convergence loop, I tell the compiler to create
temporary space on the device, since I don’t care about the initial or final values of that array. There are other data
clauses that I don’t use in this example:
copyin
initializes the value of the array but does not copy the results back;copyout
allocates uninitialized space on the device and copies the final values back; andpresent
tells the compiler to assume that the array was moved or allocated on the device somewhere else in the program.
This single-line data clause makes a big difference to performance. Figure 5 shows that with data locality properly expressed the code runs six times faster than a full CPU socket with almost no data motion.

Figure 6 shows the NVIDIA Visual Profiler timeline, which shows that there’s no data motion between convergence iterations, just as expected.

The third step in the OpenACC APOD process is to further optimize the loops, but it turns out that the PGI compiler already does a great job optimizing this code and there’s not much more that we can do. If you’d like to see more about optimizing your OpenACC code, however, you should check out my Advanced OpenACC Programming session from GTC.
Optimize
Given the simplicity of this code, the compiler already does a pretty good job generating fast code, but there’s still a little room for optimization. After some experimentation I found that by using the loop tile
clause I can reduce the runtime by a few percent. The loop
directive is a way to give the compiler additional information about the following loop, such as informing the compiler that all loop iterations are data independent, or guiding the compiler’s optimization within the loop.
It turns out that there’s a lot of data reuse between loop iterations in the i
and j
dimensions of A
, so we’ll use the tile directive to tell the compiler it should operate on tiles of the array, rather than parallelizing strictly over columns. If I wanted to do this manually, I’d have to add two more loops to the calculation to break the work up into chunks (as one might when optimizing for CPU cache blocking), but the tile
clause tells the compiler to do it automatically. On my machine, a 32×4 tile gives the best performance, reducing the runtime by an additional 6%. Below is the resulting code. Notice that I added a device_type(nvidia)
clause to my loop
directives, which informs the compiler to only apply this optimization on NVIDIA GPUs, where I’ve confirmed that it’s beneficial. Using the device_type
clause helps keep my code portable to other devices.
#pragma acc data copy(A) create(Anew) while ( error > tol && iter < iter_max ) { error = 0.0; #pragma acc kernels { #pragma acc loop tile(32,4) device_type(nvidia) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(A[j][i] - Anew[j][i])); } } #pragma acc loop tile(32,4) device_type(nvidia) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } } if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error); iter++; }
Performance Portability from GPUs to CPUs with OpenACC
OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.
And starting today, with the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs. The new PGI Fortran, C and C++ compilers for the first time allow OpenACC-enabled source code to be compiled for parallel execution on either a multicore CPU or a GPU accelerator. This capability provides tremendous flexibility for programmers, enabling applications to take advantage of multiple system architectures with a single version of the source code.
“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”
As the chart above shows, performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups.
Key benefits of running OpenACC on multicore CPUs include:
- Effective utilization of all cores of a multicore CPU or multi-socket server for parallel execution
- Common programming model across CPUs and GPUs in Fortran, C, and C++
- Rapid exploitation of existing multicore parallelism in a program using the KERNELS directive, which enables incremental optimization for parallel execution
- Scalable performance across multicore CPUs and GPUs
PGI’s compiler roadmap, shown below, includes plans to support all of the compute processors that are likely to be viable building blocks for Exascale systems.
How to Compile OpenACC Applications for Multicore CPUs
Passing the flag -ta=multicore
on the PGI compiler (pgcc
, pgc++
or pgfortran
) command line tells the compiler to generate parallel multicore code for OpenACC compute regions, instead of the default of generating parallel GPU kernels. The parallel multicore code will execute in much the same fashion as if you had used OpenMP omp parallel
directives instead of OpenACC compute regions.
Adding -Minfo
or -Minfo=accel
will enable compiler feedback messages, giving details about the parallel code generated, as in the following.
ninvr: 59, Loop is parallelizable Generating Multicore code 59, #pragma acc loop gang pinvr: 90, Loop is parallelizable Generating Multicore code 90, #pragma acc loop gang
You can control how many threads the program will use to run the parallel compute regions with the environment variable ACC_NUM_CORES
. The default is to use all available cores on the system. For Linux targets, the runtime will launch as many threads as physical cores (not hyper-threaded logical cores). OpenACC gang
-parallel loops run in parallel across the threads. If you have an OpenACC parallel construct with a num_gangs(200)
clause, the runtime will take the minimum of the num_gangs
argument and the number of cores on the system, and launch that many threads. That avoids the problem of launching hundreds or thousands of gangs, which makes sense on a GPU but which would overload a multicore CPU.
Single Programming Model, Portable High Performance
The goal of OpenACC is to have a single programming model that allows developers to write a single program that runs with high performance in parallel across a wide range of target systems. For the last few years, PGI has been developing and delivering OpenACC compilers targeting NVIDIA Tesla and AMD Radeon GPUs, but performance portability requires being able to run the same program with high performance in parallel on non-GPU targets, and in particular on multicore and manycore CPUs. So, the first reason to use OpenACC with -ta=multicore
is if you have an application that you want to use on systems with GPUs, and on other systems without GPUs but with multicore CPUs. This allows you to develop your program once, without having to include compile-time conditionals (ifdefs
) or special modules for each target with the increased development and maintenance cost.
Even if you are only interested in GPU-accelerated targets, you can do parallel OpenACC code development and testing on your multicore laptop or workstation without a GPU. This can separate algorithm development from GPU performance tuning. Debugging is often easier on the host than with a heterogeneous binary with both host and GPU code.
Working Through an Example: Please do Try This at Home!
To demonstrate the performance shown in the chart above, you can download the version of miniGhost used to generate the performance numbers from the PGI website.
To build the OpenMP version for execution on multicore, issue the following commands.
% make build_omp … mg_stencil_3d7pt: 197, Parallel region activated 200, Parallel loop activated with static block schedule 202, Generated 4 alternate versions of the loop Generated vector sse code for the loop Generated 5 prefetch instructions for the loop 213, Barrier 216, Parallel loop activated with static block schedule 218, Mem copy idiom, loop replaced by call to __c_mcopy8 224, Barrier Parallel region terminated … % export MP_BIND=yes; make NCPUS=32 run_omp env OMP_NUM_THREADS=32 time sh -x ./miniGhost.run ./miniGhost.omp >& miniGhost.omp.log grep elapsed miniGhost.omp.log 8527.57user 5.96system 4:27.43elapsed 3190%CPU (0avgtext+0avgdata 6650048maxresident)k
This example is using the PGI OpenMP compiler, but the OpenMP time in the chart above uses the Intel OpenMP compiler. You’ll see about the same execution time using either of these two OpenMP compilers.
To build the OpenACC version for multicore using PGI, issue the following commands.
% make build_multicore … mg_stencil_3d7pt: 219, Loop is parallelizable Generating Multicore code 219, !$acc loop gang 220, Loop is parallelizable 221, Loop is parallelizable 232, Loop is parallelizable Generating Multicore code 232, !$acc loop gang 233, Loop is parallelizable 234, Loop is parallelizable … % export MP_BIND=yes; make NCPUS=32 run_multicore env ACC_NUM_CORES=32 time sh -x ./miniGhost.run ./miniGhost.multi >& miniGhost.multi.log grep elapsed miniGhost.multi.log 8006.06user 4.88system 4:14.04elapsed 3153%CPU (0avgtext+0avgdata 6652288maxresident)k
Finally, to build the OpenACC version for execution on an NVIDIA GPU using PGI, issue the following commands.
% make build_tesla … mg_stencil_3d7pt: 216, Generating present(work(:,:,:),grid(:,:,:)) 219, Loop is parallelizable 220, Loop is parallelizable 221, Loop is parallelizable Accelerator kernel generated Generating Tesla code 220, !$acc loop gang, vector(2) ! blockidx%y threadidx%y 221, !$acc loop gang, vector(64) ! blockidx%x threadidx%x 232, Loop is parallelizable 233, Loop is parallelizable 234, Loop is parallelizable Accelerator kernel generated Generating Tesla code 233, !$acc loop gang ! blockidx%y 234, !$acc loop gang, vector(128) ! blockidx%x threadidx%x … % make DEV_NUM=0 run_tesla env ACC_DEVICE_NUM=0 time ./miniGhost.run ./miniGhost.tesla >& miniGhost.tesla.log grep elapsed miniGhost.tesla.log 122.25user 30.12system 2:32.61elapsed 99%CPU (0avgtext+0avgdata 7542656maxresident)k
OpenACC Data Clauses on Multicore
In the OpenACC execution model, the multicore CPU is treated like an accelerator device that shares memory with the initial host thread. With a shared-memory device, most of the OpenACC data clauses (copy
, copyin
, copyout
, create
) are ignored, and the accelerator device (the parallel multicore) uses the same data as the initial host thread. Similarly, update
directives and most OpenACC data API routines will not generate data allocation or movement. Other data clauses are still honored, such as private
and reduction
, which may require some dynamic memory allocation and data movement, but no more than the corresponding OpenMP data clauses.
When using OpenACC with a GPU, data gets copied from the system memory to device memory (and back). The user is responsible for keeping the two copies of data coherent, as needed. When using OpenACC on a multicore, there is only one copy of the data, so there is no coherence problem. However, the GPU OpenACC program can produce different results than a multicore OpenACC program if the program depends on the parallel compute regions updating a different copy of the data than the sequential initial host thread regions.
#pragma acc data create(a[0:n]) present(x[0:n],b[0:n]) { // following loop executed on device #pragma acc parallel loop for(i=0;i<n;++i) a[i] = b[i]; // following loop executed on host for(i=0;i<n;++i) a[i] = c[i]; // following loop executed on device #pragma acc parallel loop for(i=0;i<n;++i) x[i] = a[i]; ... }
On a GPU, the above code fragment allocates a copy of the array a on the device. It then fills in the device copy and the host copy with different values. The last loop will get the values from the device copy of a
, so it’s equivalent to x[i]=b[i]
; When compiled for a multicore, the first two loops are both executed on the CPU, the first with all multicore threads and the second with a single thread. Both loops update the same copy of a
, and the last loop will be equivalent to x[i]=c[i]
.
Requirements and Limitations
PGI compilers on Linux, Windows, and Mac OS X support OpenACC for multicore. It works with any supported PGI target, including targets for which GPUs are not supported. This feature will work with any valid PGI license.
There are a few limitations in this release, which will be removed in future releases. In this release, the collapse
clause is ignored, so only the outer loop is parallelized. The worker
level of parallelism is ignored; PGI is still exploring how best to generate parallel code that includes gang
, worker
and vector
parallelism. Also, no optimization or tuning of the loop code is done. For instance, when compiling for a GPU, the compiler will reorder loops to optimize array strides for the parallelism profile. None of this is implemented in the multicore target in this release. Finally, the vector
level of parallelism is not being used to generate SIMD code in this release. PGI expects application performance will improve as these limitations are relaxed.
Conclusion
The PGI 15.10 release allows you to generate multicore CPU code from your OpenACC programs, enabling truly performance-portable parallel programs across CPUs and GPUs.
Register to download a free trial of the PGI 15.10 compilers and check it out for yourself. If you’re new to OpenACC, you can register for a free online OpenACC training course. To get started developing with OpenACC, try the NVIDIA OpenACC Toolkit, and read this introductory Parallel Forall post on OpenACC. A detailed article on OpenACC for CPUs with PGI 15.10 will be included in an upcoming PGInsider Newsletter from PGI.
-
Resume Full name Sayed Ahmadreza Razian Nationality Iran Age 36 (Sep 1982) Website ahmadrezarazian.ir Email ...
-
Shangul Mangul HabeAngur Shangul Mangul HabeAngur (City of Goats) is a game for child (4-8 years). they learn how be useful in the city and respect to people. Persian n...
-
معرفی نام و نام خانوادگی سید احمدرضا رضیان محل اقامت ایران - اصفهان سن 33 (متولد 1361) پست الکترونیکی ahmadrezarazian@gmail.com درجات علمی...
-
Tianchi-The Purchase and Redemption Forecasts 2015 Special Prize – Tianchi Golden Competition (2015) “The Purchase and Redemption Forecasts” in Big data (Alibaba Group) Among 4868 teams. Introd...
-
Drowning Detection by Image Processing In this research, I design an algorithm for image processing of a swimmer in pool. This algorithm diagnostics the swimmer status. Every time graph sho...
-
Nokte – نکته نرم افزار کاربردی نکته نسخه 1.0.8 (رایگان) نرم افزار نکته جهت یادداشت برداری سریع در میزکار ویندوز با قابلیت ذخیره سازی خودکار با پنل ساده و کم ح...
-
Tianchi-Brick and Mortar Store Recommendation with Budget Constraints Ranked 5th – Tianchi Competition (2016) “Brick and Mortar Store Recommendation with Budget Constraints” (IJCAI Socinf 2016-New York,USA)(Alibaba Group...
-
1st National Conference on Computer Games-Challenges and Opportunities 2016 According to the public relations and information center of the presidency vice presidency for science and technology affairs, the University of Isfah...
-
3rd International Conference on The Persian Gulf Oceanography 2016 Persian Gulf and Hormuz strait is one of important world geographical areas because of large oil mines and oil transportation,so it has strategic and...
-
2nd Symposium on psychological disorders in children and adolescents 2016 2nd Symposium on psychological disorders in children and adolescents 2016 Faculty of Nursing and Midwifery – University of Isfahan – 2 Aug 2016 - Ass...
-
Optimizing raytracing algorithm using CUDA Abstract Now, there are many codes to generate images using raytracing algorithm, which can run on CPU or GPU in single or multi-thread methods. In t...
-
My City This game is a city simulation in 3d view. Gamer must progress the city and create building for people. This game is simular the Simcity.
- AMD Ryzen Downcore Control AMD Ryzen 7 processors comes with a nice feature: the downcore control. This feature allows to enable / disabl...
- Detecting and Labeling Diseases in Chest X-Rays with Deep Learning Researchers from the National Institutes of Health in Bethesda, Maryland are using NVIDIA GPUs and deep learni...
- کودا – CUDA کودا به انگلیسی (CUDA) که مخفف عبارت انگلیسی Compute Unified Device Architecture است یک سکوی پردازش موازی و مد...
- Fallout 4 Patch 1.3 Adds NVIDIA HBAO+ and FleX-Powered Weapon Debris Fallout 4 launched last November to record player numbers, swiftly becoming the most popular third-party game...
- Deep Learning for Computer Vision with MATLAB and cuDNN Deep learning is becoming ubiquitous. With recent advancements in deep learning algorithms and GPU technology...
- Automatic Colorization Automatic Colorization of Grayscale Images Researchers from the Toyota Technological Institute at Chicago and University of Chicago developed a fully aut...
- Diagnosing Cancer with Deep Learning and GPUs Using GPU-accelerated deep learning, researchers at The Chinese University of Hong Kong pushed the boundaries...
- NVIDIA TITAN Xp vs TITAN X NVIDIA has more or less silently launched a new high end graphics card around 10 days ago. Here are some pictu...
- Back to Dinosaur Island Back to Dinosaur Island takes advantage of 15 years of CRYENGINE development to show users the sheer endless p...
- Assisting Farmers with Artificial Intelligence With our planet getting warmer and warmer, and carbon dioxide levels steadily creeping up, companies are using...
- IBM Watson Chief Technology Officer Rob High to Speak at GPU Technology Conference Highlighting the key role GPUs will play in creating systems that understand data in human-like ways, Rob High...
- Unity – What’s new in Unity 5.3.3 The Unity 5.3.3 public release brings you a few improvements and a large number of fixes. Read the release not...
- Real-Time Pedestrian Detection using Cascades of Deep Neural Networks Google Research presents a new real-time approach to object detection that exploits the efficiency o...
- Draft the Ultimate Team in FIFA 16 FUT Draft showcases some excellent mechanics as you attempt to win four matches in a row with your hand-picked...
- Nvidia releases driver for MGS5 and Mad MaxHappy Metal Gear Solid V: The Phantom Pain day! Or …
- Microsoft promises to improve the Windows StoreYou can buy Rise of the Tomb Raider from Steam, …
- NVIDIA TITAN Xp vs TITAN XNVIDIA has more or less silently launched a new high …
- Albert Einstein-04The fear of death is the most unjustified of all …
- Intel Graphics Driver v4501 for WindowsA new set of graphics driver is available for Intel …
- Download Crytek’s “Back to Dinosaur Island 2” Virtual Reality Demo for FreePlayers are invited to interact with another world in Crytek’s …
- Share Your Science: Artificial Intelligent Robot for ChildrenYi-Jian Wu, Founder & CEO of Yuanqu Tech in China, …
- Monash University Upgrades MASSIVE GPU-Accelerated SupercomputerTo accelerate biomedical research, Australia’s Monash University boosted its research …
- It’s happening: ‘Pepper’ robot gains emotional intelligenceLast week we weighed in on the rise of robotica …
- Performance Portability from GPUs to CPUs with OpenACCOpenACC gives scientists and researchers a simple and powerful way to …