Personal Profile

CUDA

Share Your Science: Real-Time Facial Reenactment of YouTube Videos

Matthias Niessner of Stanford University shares how his team of researchers is using TITAN X GPUs and CUDA to manipulate YouTube videos with real-time facial reenactment that works with any commodity webcam.

The project, called ‘Face2Face’, captures the facial expressions of both the source and target videos using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, the approach re-renders the synthesized target face on top of the corresponding video stream so that it blends seamlessly with the real-world illumination.

For more details, read the research paper ‘Face2Face: Real-time Face Capture and Reenactment of RGB Videos’.

Share your GPU-accelerated science with us at http://nvda.ly/Vpjxr and with the world on #ShareYourScience.

Watch more scientists and researchers share how accelerated computing is benefiting their work at http://nvda.ly/X7WpH

Share Your Science: Pushing the Limits of Computational Photography

Daniel Ambrosi, Artist and Photographer, is using NVIDIA GPUs in the Amazon cloud and CUDA to create giant 2D-stitched HDR panoramas called “Dreamscapes.” Ambrosi applies a modified version of Google’s DeepDream neural net visualization code to his original panoramic landscape images to create truly one-of-a-kind pieces of art.

For more information visit http://www.danielambrosi.com/Dreamscapes.

Share your GPU-accelerated science with us at http://nvda.ly/Vpjxr and with the world on #ShareYourScience.

Watch more scientists and researchers share how accelerated computing is benefiting their work at http://nvda.ly/X7WpH

6 Can’t Miss Experiences at the GPU Technology Conference

Explore the future of artificial intelligence and deep learning, experience virtual reality, and see what the future holds for self-driving cars at the GPU Technology Conference in Silicon Valley, April 4-7.

In addition to keynotes by notable speakers including NVIDIA CEO Jen-Hsun Huang, Toyota Research Institute CEO Gill Pratt, and IBM Watson CTO Rob High, as well as over 500 talks, tutorials and scientific posters, here are the top six must-see things at GTC this year:

  1. AI Playground: Interact with hands-on deep learning demos from universities, start-ups and well-known companies like Baidu, Twitter and Yahoo.
  2. VR Village: New this year, experience the latest advances in immersive virtual reality from a wide range of industries, including gaming, media & entertainment, manufacturing, medicine and science.
  3. Emerging Companies Summit Pavilion: Over 90 start-ups will showcase how they are using GPUs to solve some of the world’s most complex challenges. Twelve of the participants will be vying for $100,000 at the Early Stage Challenge.
  4. Hands-on Labs: Take one of the 26 intensive instructor-led labs, which range from 90 to 180 minutes and cover a wide range of topics, from the comfort of your own laptop. Attendees looking for more training can grab a seat in the self-paced labs area or try them from anywhere for free, using promo code GTC16_EARLYBIRD to receive free credits on NVIDIA’s cloud-based learning platform.
  5. Share Your Science: This is a great opportunity to share how you are doing amazing work with GPUs. NVIDIA will be video interviewing developers, researchers and scientists and then amplifying your stories to the broader community. Fill out this short form to be considered for an interview.
  6. Face2Face Demo: A team of researchers from Stanford, the Max Planck Institute for Informatics and the University of Erlangen-Nuremberg are using TITAN X GPUs and CUDA to manipulate YouTube videos with real-time facial reenactment that works with any commodity webcam. You can experience the magic of the Face2Face demo in person at the NVIDIA booth.

Register by April 2 to save up to $300, and see what the future holds at the GPU Technology Conference.

CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and supported on the company’s graphics processing units (GPUs). CUDA allows software developers to use a CUDA-enabled GPU for general-purpose processing, an approach known as GPGPU. CUDA gives developers direct access to the GPU’s memory and instruction set.

The CUDA platform is designed to work with programming languages such as C, C++ and Fortran. This accessibility makes it easier for specialists to use GPU resources, in contrast to earlier APIs such as Direct3D and OpenGL, which required advanced skills in graphics programming. CUDA also supports frameworks such as OpenACC and OpenCL.

Background

As a specialized processor, the GPU addresses the demands of real-time, high-quality 3D graphics, a computationally intensive workload. By 2012, GPUs had evolved into powerful many-core systems capable of manipulating large blocks of data. For algorithms that process blocks of data in parallel, this design is often more effective than a general-purpose CPU. Examples include:

  • The push-relabel maximum flow algorithm
  • Fast sort algorithms for large lists
  • The two-dimensional fast wavelet transform
  • Molecular dynamics simulations

Programming capabilities

CUDA is accessible to developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to industry-standard languages including C, C++ and Fortran. C/C++ programmers use ‘CUDA C/C++’, compiled with nvcc, NVIDIA’s LLVM-based C/C++ compiler. Fortran programmers can use ‘CUDA Fortran’, compiled with the PGI CUDA Fortran compiler from The Portland Group. In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including:

  • The Khronos Group’s OpenCL
  • Microsoft’s DirectCompute
  • OpenGL compute shaders
  • C++ AMP

Third-party wrappers are also available for languages such as Perl, Python, R, Fortran, Java, Ruby, Haskell, MATLAB, IDL and Lua, and Mathematica supports CUDA natively.

In the computer games industry, GPUs are used not only for graphics rendering but also for in-game physics calculations (physical effects such as smoke, fire, debris and fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields.

CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was released publicly on 15 February 2007 for Microsoft Windows and Linux. Mac OS X support was added in version 2.0, which superseded the beta released on 14 February 2008. CUDA works with all NVIDIA GPUs from the G8x series onwards, including GeForce, Quadro and Tesla products, and with most standard operating systems. NVIDIA states that programs developed for the G8x series will also work without modification on all future generations of graphics cards, due to binary compatibility.

Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

  • Scattered reads: code can read from arbitrary addresses in memory.
  • Unified virtual memory (CUDA 4.0 and above)
  • Unified memory (CUDA 6.0 and above)
  • Shared memory: CUDA exposes a fast region of memory that can be shared among threads. It can be used as a user-managed cache, enabling higher bandwidth than is possible with texture lookups (see the sketch after this list).
  • Faster downloads to and readbacks from the GPU
  • Full support for integer and bitwise operations, including integer texture lookups
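
To illustrate the shared-memory point above, here is a minimal sketch (not part of the original article) of a kernel that stages a tile of input, plus a one-element halo on each side, in fast __shared__ memory before computing a three-point moving average. The block size of 256 is an assumption of this example.

// Minimal sketch: shared memory as a user-managed cache for a 1D
// three-point moving average. Launch with 256 threads per block.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];               // block size + two halo cells
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                    // +1 leaves room for left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;       // stage this thread's element
    if (threadIdx.x == 0)                         // first thread loads left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)            // last thread loads right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                              // tile is now complete

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

Each element is fetched from global memory once but read three times from the much faster shared memory, which is exactly the user-managed-cache pattern described above.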

How Virtual Reality Is Making the Leap into Everyday Life

Headlines about virtual reality often focus on how it’s upending the world of gaming. But VR is also revolutionizing fields across everyday life — areas like medicine, architecture, education, product design and retailing.

A great example is Audi’s new virtual showroom, where you can explore each of their models in vivid detail through VR. The German automaker is using NVIDIA Quadro GPUs to craft a virtual showroom that lets you build custom configurations of any Audi model, and experience them in a number of environments.

Surgical Theater LLC is launching a new division focused on using their VR technology in surgery, such as for brain tumor procedures. They’re utilizing multiple NVIDIA GPUs and NVIDIA SLI technology to increase resolution and responsiveness. Surgeons, as a result, can “fly through” a patient’s anatomy prior to surgery.

And if the recent Hollywood hit “The Martian” got you pondering what it’s really like on the red planet, 20th Century Fox can help. They and partners are debuting The Martian VR Experience, an interactive, immersive VR adventure. It lets you fly onto the surface of Mars, steer at zero gravity through space, drive a rover and experience other key scenes from the film in a 360-degree VR environment. The CES demonstrations are powered by NVIDIA GPUs, delivering extremely high frame rates at high fidelity for the maximum visual experience.


These — and countless other initiatives — are coming onto the market now because of the confluence of new VR headsets and NVIDIA’s leading-edge graphics innovation.

GPUs are at the heart of VR, which demands refresh rates of up to 90 times a second for each eye. A truly immersive experience requires as much as 7x the processing power needed to display a game on a typical monitor. NVIDIA’s latest Maxwell architecture GPUs are optimized for VR performance with ultra-low latency, fast performance and new rendering capabilities specifically for VR.

Great software capabilities are also required for VR. And a number of companies are creating breakthrough new applications using NVIDIA CUDA, our parallel computing platform, and our DesignWorks VR software development kit. These innovators are merging the real world and the virtual world to create new experiences for consumers, educators, scientists and designers.

Jaunt partners with world-class creatives to produce and distribute premium cinematic VR content across a wide variety of experiences — ranging from narrative storytelling to music, travel/adventure, sports, documentary and more. Recently Jaunt has partnered with news organizations such as ABC News, SkyNews and Ryot to bring immersive 360-degree virtual reality journalism to consumers. Jaunt processes their content in the cloud using CUDA and NVIDIA GPUs, which is essential to their ability to scale and put VR content in the hands of people wherever they are.

Nurulize uses NVIDIA GPUs to process data from lidar scans to create highly detailed scene reconstructions used to bring real-world environments into the virtual world. Their Atom View technology allows vast point cloud data from the industry’s leading scanners to be imported and viewed without manual re-work or time-consuming post-processing, allowing viewing within minutes rather than weeks or months.

And 8i is launching its new 8i Portal VR player at CES. The company runs CUDA-optimized implementations of its proprietary algorithms to maximize content throughput in creating stunning volumetric 3D videos of real people, and uses NVIDIA GPUs for high-performance VR playback.

In the areas of education and edutainment, Realities.io is providing virtual access to historical sites and inaccessible areas. Their larger-than-life journeys are created from photogrammetry, photos, videos, interactive elements and massive amounts of data. CUDA-based software from CapturingReality handles the processing and NVIDIA GPUs are used for visualization.

And in the area of product design, Ford is making VR central to their design process. With its Ford Immersive VR Environment (FiVE) capability — driven by two top-of-the-line NVIDIA Quadro M6000 cards — the automaker can evaluate vehicle prototypes in real time, in full scale and in context. It brings energy, emotion and accuracy to its immersive 3D visual environments.

New Features in CUDA 7.5

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details.

16-bit Floating Point (FP16) Data

CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic, adding new half and half2 datatypes and intrinsic functions for operating on them. 16-bit “half-precision” floating point types are useful in applications that can process larger datasets or gain performance by choosing to store and operate on lower-precision data. Some large neural network models, for example, may be constrained by available GPU memory; and some signal processing kernels (such as FFTs) are bound by memory bandwidth.

Many applications can benefit by storing data in half precision, and processing it in 32-bit (single) precision. At GTC 2015 in March, NVIDIA CEO Jen-Hsun Huang announced that future Pascal architecture GPUs will include full support for such “mixed precision” computation, with FP16 (half) computation at higher throughput than FP32 (single) or FP64 (double).

With CUDA 7.5, applications can benefit by storing up to 2x larger models in GPU memory. Applications that are bottlenecked by memory bandwidth may get up to 2x speedup. And applications on Tegra X1 GPUs bottlenecked by FP32 computation may benefit from 2x faster computation on half2 data.

CUDA 7.5 provides three main FP16 features:

  1. A new header, cuda_fp16.h, defines the half and half2 datatypes and the __half2float() and __float2half() functions for conversion to and from FP32 types, respectively.
  2. A new cublasSgemmEx() routine performs mixed-precision matrix-matrix multiplications using FP16 data (among other formats) as inputs, while still executing all computation in full 32-bit precision. This allows multiplication of 2x larger matrices on the GPU.
  3. For current users of Drive PX with Tegra X1 GPUs (and on future GPUs such as Pascal), cuda_fp16.h also defines intrinsics for 16-bit computation and comparison (see the sketch after this list). cuBLAS also includes a cublasHgemm() (half-precision computation matrix-matrix multiply) routine for these GPUs.
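
As a hedged illustration of features 1 and 3, the following sketch converts packed FP16 values to FP32, computes in single precision, and writes the results back as FP16. The kernel name and launch configuration are assumptions of the example; the conversion helpers (__low2float(), __high2float(), __floats2half2_rn()) come from cuda_fp16.h.

#include <cuda_fp16.h>

// Sketch: store in FP16 (half2), compute in FP32. Each half2 packs two
// 16-bit values into one 32-bit word, halving storage and bandwidth.
__global__ void scale_fp16(half2 *data, int n2, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        half2 v = data[i];
        float lo = __low2float(v);                   // unpack lower half to FP32
        float hi = __high2float(v);                  // unpack upper half to FP32
        data[i] = __floats2half2_rn(lo * s, hi * s); // compute in FP32, repack
    }
}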

NVIDIA GPUs implement the IEEE 754 floating point standard (2008), which defines half-precision numbers as follows (see Figure 1).

  • Sign: 1 bit
  • Exponent width: 5 bits
  • Significand precision: 11 bits (10 explicitly stored)

The range of half-precision numbers is approximately 5.96 \times 10^{-8} \ldots 6.55 \times 10^4. half2 structures store two half values in the space of a single 32-bit word, as the bottom of Figure 1 shows.

Figure 1: 16-bit half-precision data formats. Top: single `half` value. Bottom: `half2` vector representation.

New cuSPARSE Routines Accelerate Natural Language Processing

The cuSPARSE library now supports the cusparse{S,D,C,Z}gemvi() routine, which multiplies a dense matrix by a sparse vector, using the following equation.

\mathbf{y} = \alpha op(\mathbf{A}) \mathbf{x} + \beta \mathbf{y},

where \mathbf{A} is a dense matrix, \mathbf{x} is a sparse input vector, \mathbf{y} is a dense output vector, and op() is either a no-op, transpose, or conjugate transpose. For example:

\left[ \begin{array}{c} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \end{array} \right] = \alpha \left[ \begin{array}{ccccc} \mathbf{A}_{11} & \mathbf{A}_{12} & \mathbf{A}_{13} & \mathbf{A}_{14} & \mathbf{A}_{15} \\ \mathbf{A}_{21} & \mathbf{A}_{22} & \mathbf{A}_{23} & \mathbf{A}_{24} & \mathbf{A}_{25} \\ \mathbf{A}_{31} & \mathbf{A}_{32} & \mathbf{A}_{33} & \mathbf{A}_{34} & \mathbf{A}_{35} \end{array} \right] \left[ \begin{array}{c} - \\ 2 \\ - \\ - \\ 1 \end{array} \right] + \beta \left[ \begin{array}{c} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \end{array} \right]

This type of computation is useful in machine learning and natural language processing applications. Suppose I’m processing English language documents, so I start with a dictionary, which assigns a unique index to every word in the English language. If the dictionary has N entries, then any document can be represented with a Bag of Words (BoW): an N-dimensional vector in which each entry is the number of occurrences of the corresponding dictionary word in the document.
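
To make the BoW representation concrete, here is a small host-side sketch (the function name bag_of_words and the use of std::map are illustrative choices, not from the post) that turns a document into the sparse index/value pairs a sparse-vector routine consumes:

#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Build a sparse bag-of-words vector (index/value pairs) for a document,
// given a dictionary mapping each word to a unique index.
std::pair<std::vector<int>, std::vector<float>>
bag_of_words(const std::map<std::string, int> &dict, const std::string &doc)
{
    std::map<int, float> counts;                 // dictionary index -> count
    std::istringstream words(doc);
    std::string w;
    while (words >> w) {
        auto it = dict.find(w);
        if (it != dict.end()) counts[it->second] += 1.0f;
    }
    std::vector<int> idx;
    std::vector<float> val;
    for (const auto &kv : counts) {              // map iterates in index order
        idx.push_back(kv.first);
        val.push_back(kv.second);
    }
    return {idx, val};                           // zero-based sparse vector
}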

In natural language processing and machine translation, it’s useful to compute a vector representation of words, where the vectors have O(300) dimensions (rather than a raw BoW representation which may have hundreds of thousands of dimensions, due to the size of the language dictionary). A good example of this approach is the word2vec algorithm, which maps natural language words into a semantically meaningful vector space. In word2vec, similar words map to similar locations in the vector space, which aids reasoning about word relationships, pattern recognition, and model generation.

Mapping a sentence or document represented as a BoW into the lower-dimensional word vector space requires multiplying a dense matrix by a sparse vector, where each row of the matrix is the embedding vector of a dictionary word and the sparse vector is the BoW vector of the sentence or document.

The new cusparse{S,D,C,Z}gemvi() routine in CUDA 7.5 makes it easier for developers of these complex applications to achieve high performance with GPUs. cuSPARSE routines are tuned for top performance on NVIDIA GPUs, so users don’t need to be experts in GPU performance.
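
Below is a hedged sketch of how the single-precision variant might be called, following the signature documented for the cuSPARSE library; the helper name bow_project and the column-major leading dimension lda = m are assumptions of this example.

#include <cuda_runtime.h>
#include <cusparse.h>

// y = alpha * A * x + beta * y, with dense A (m x n, column-major),
// sparse x given as nnz (value, index) pairs, and dense y.
// All array arguments are device pointers.
void bow_project(cusparseHandle_t handle,
                 int m, int n, const float *dA,
                 int nnz, const float *dxVal, const int *dxInd,
                 float *dy)
{
    const float alpha = 1.0f, beta = 0.0f;
    int bufSize = 0;
    void *dBuf = nullptr;

    // Query and allocate the scratch buffer the routine requires.
    cusparseSgemvi_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                              m, n, nnz, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    cusparseSgemvi(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, &alpha, dA, m, nnz, dxVal, dxInd,
                   &beta, dy, CUSPARSE_INDEX_BASE_ZERO, dBuf);

    cudaFree(dBuf);
}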

To learn more about related techniques in machine translation, check out the recent post Introduction to Neural Machine Translation.

Pinpoint Performance Bottlenecks with Instruction-Level Profiling

One of the biggest challenges in optimizing code is determining where in the application to put optimization effort for the greatest impact. NVIDIA has been improving profiling tools with every release of CUDA, adding more focused introspection and smarter guided analysis. CUDA 7.5 further improves the power of the NVIDIA Visual Profiler (and Nsight Eclipse Edition) by enabling true instruction-level profiling on Maxwell GM200 and later GPUs. This lets you quickly identify the specific lines of source code causing performance bottlenecks in GPU code, making it easier to apply advanced performance optimizations.

Before CUDA 7.5, the NVIDIA Visual Profiler supported kernel-level profiling: for each kernel, the profiler could tell you the amount of time spent, the relative importance as a fraction of total run time, and key statistics and limiters. For example, Figure 2 shows a kernel-level analysis indicating that the kernel in question is possibly limited by instruction latencies.

Figure 2: Before CUDA 7.5, the NVIDIA Visual Profiler supported only kernel-level profiling, showing performance and key statistics and limiters for each kernel invocation.

CUDA 6 added support for more detailed profiling, correlating lines of code with the number of instructions executed by those lines, as Figure 3 shows. But the lines with the highest instruction counts do not necessarily take the longest time. In the example, these lines from a reduction are not taking as long as the true hotspot, which has longer stalls due to memory dependencies.

Figure 3: CUDA 6 added support for detailed profiling, showing the correspondence between source lines and assembly code, and the number of instructions executed for each source line.

Per-kernel statistics and instruction counts are very useful information, but getting to the root of performance problems in complex kernels could still be difficult. When profiling, you want to know exactly which lines are taking the most execution time. With CUDA 7.5, the profiler uses program counter sampling to find and show specific “hot spot” lines of code where the kernel is spending most of its time, as Figure 4 shows.

Figure 4: New in CUDA 7.5, instruction-level profiling pinpoints specific lines of code that are hotspots.

Not only does the profiler show hotspot lines, but it shows potential reasons for the hotspot, based on the state of warps executing the lines. In this case, the hotspot is due to synchronization and memory latency, and the assembly view shows that the kernel is stalling on local memory loads (LDL) and __syncthreads(). Knowing this, the kernel developer can optimize the kernel to keep data in registers. Figure 5 shows the results after code tuning, where the kernel time has improved by about 2.5x.

Figure 5: By using instruction-level profiling, the developer was able to optimize the kernel performance, achieving a 2.5x kernel speedup.

Experimental Feature: GPU Lambdas

CUDA 7 introduced support for C++11, the latest version of the C++ language standard. Lambda expressions are one of the most important new features in C++11. Lambda expressions provide concise syntax for defining anonymous functions (and closures) that can be defined in line with their use, can be passed as arguments, and can capture variables.

C++11 lambdas are handy when you have a simple computation that you want to use as an operator in a generic algorithm, like the thrust::count_if() algorithm that I used in a past blog post. The following code from that post uses Thrust to count the frequency of ‘x’, ‘y’, ‘z’, and ‘w’ characters in a text. But before CUDA 7.5, this could only be done with host-side lambdas, meaning this code couldn’t execute on the GPU.

#include <initializer_list>
#include <thrust/count.h>            // thrust::count_if
#include <thrust/execution_policy.h> // thrust::host execution policy

void xyzw_frequency_thrust_host(int *count, char *text, int n)
{
  using namespace thrust;

  // Host-side lambda: runs on the CPU with Thrust's host backend.
  *count = count_if(host, text, text+n, [](char c) {
    for (const auto x : { 'x','y','z','w' })
      if (c == x) return true;
    return false;
  });
}

CUDA 7.5 introduces an experimental feature: GPU lambdas. GPU lambdas are anonymous device function objects that you can define in host code, by annotating them with a __device__ specifier. Here is the xyzw_frequency function modified to run on the GPU. The code indicates the GPU lambda with the __device__ specifier before the parameter list.

#include <initializer_list>
#include <thrust/count.h>            // thrust::count_if
#include <thrust/execution_policy.h> // thrust::device execution policy

void xyzw_frequency_thrust_device(int *count, char *text, int n)
{
  using namespace thrust;

  // GPU lambda: the __device__ annotation lets Thrust run it on the GPU.
  // count and text must point to device-accessible memory.
  *count = count_if(device, text, text+n, [] __device__ (char c) {
    for (const auto x : { 'x','y','z','w' })
      if (c == x) return true;
    return false;
  });
}

Parallel For Programming

GPU lambdas enable a “parallel-for” style of programming that lets you write parallel computations in-line with the code that invokes them—just like you would with a for loop. The following SAXPY example shows how for_each() lets you write parallel code for a GPU in a style very similar to a simple for loop. Using Thrust in this way ensures you get great performance on the GPU, as well as performance portability to CPUs: the same code can be compiled and run for multi-threaded execution on CPUs using Thrust’s OpenMP or TBB backends.

#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

void saxpy(float *x, float *y, float a, int N) {
    using namespace thrust;
    // Iterate over indices [0, N) in parallel on the device;
    // x and y must be device pointers.
    auto r = make_counting_iterator(0);
    for_each(device, r, r+N, [=] __device__ (int i) {
        y[i] = a * x[i] + y[i];
    });
}

GPU lambdas are an experimental feature in CUDA 7.5. To use them, you need to enable the feature by passing the flag --expt-extended-lambda to nvcc on the compiler command line. As an experimental feature, GPU lambda functionality is subject to change in future releases, and there are some limitations to how they can be used. See the CUDA C++ Programming Guide for full details. I’ll write more about GPU lambdas in a future blog post.

Windows Remote Desktop

With CUDA 7.5, you can now run Windows CUDA applications remotely via Windows Remote Desktop. This means that even without a CUDA-capable GPU in your Windows laptop, you can still run GPU-accelerated applications remotely on a Windows server or desktop PC. CUDA applications can also now be run as services on Windows.

These Windows capabilities are supported on all NVIDIA GPU products.

LOP3

A new LOP3 instruction is added to PTX assembly, supporting a range of 3-operand logic operations, such as A & B & C, A & B & ~C, A & B | C, etc. This functionality, supported on Compute Capability 5.0 and higher GPUs, can save instructions when performing complex logic operations on multiple inputs. See section 8.7.7.6 of the PTX ISA specification included with the CUDA Toolkit version 7.5.
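
As a hedged illustration (this snippet is not from the toolkit documentation), LOP3 can be reached from CUDA C++ via inline PTX. The 8-bit immediate is a truth table over the conventional operand constants A = 0xF0, B = 0xCC, C = 0xAA, so d = a & b & c uses 0xF0 & 0xCC & 0xAA = 0x80:

// Fuse a three-input AND into a single LOP3 instruction (sm_50+).
__device__ unsigned int and3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int d;
    asm("lop3.b32 %0, %1, %2, %3, 0x80;"   // 0x80 encodes a & b & c
        : "=r"(d)
        : "r"(a), "r"(b), "r"(c));
    return d;
}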

More improvements

  • 64-bit API for cuFFT
  • n-dimensional Euclidean norm floating-point math functions
  • Bayer CFA to RGB conversion functions in NPP
  • Faster double-precision square-roots (sqrt)
  • Programming examples for the cuSOLVER library
  • Nsight Eclipse Edition supports the POWER platform

Platform Support

The CUDA 7.5 release notes include a full list of supported platforms; here are some notable changes.

  • Added: Ubuntu 15.04, Windows 10, and (upcoming) OS X 10.11
  • Added: host compiler support for Clang 3.5 and 3.6 on Linux.
  • Removed: Ubuntu 12.04 LTS on (32-bit) x86, cuda-gdb native debugging on Mac OS X
  • Deprecated: legacy (environment variable-based) command-line profiler. Use the more capable nvprof command-line profiler instead.

Download the CUDA 7.5 Release Candidate Today!

CUDA Toolkit 7.5 is now available for download. If you are not already a member of the free NVIDIA developer program, signing up is easy.

To learn more about the features in CUDA 7.5, register for the webinar “CUDA Toolkit 7.5 Features Overview” and put it on your calendar for September 22.

A Defining Moment for Heterogeneous Computing

The streets of downtown Austin, just cleared of music festival attendees and auto racing fans, are now filled with enthusiasts of a different sort. This year the city is host to SC15, the largest event for supercomputing systems and software, and AMD is on site to meet with customers and technology partners.

 


The hardware is here, of course, including industry-leading AMD FirePro™ graphics and the upcoming AMD Opteron™ A1100 64-bit ARM® processor. However, the big story for AMD at the show this year is the “Boltzmann Initiative”, delivering new software tools to take advantage of the processing power of our products, including those on the future roadmap, like the new “Zen” x86 CPU core coming next year.  Ludwig Boltzmann was a theoretical physicist and mathematician who developed critical formulas for predicting the behavior of different forms of matter. Today, these calculations are central to work done by the scientific and engineering communities we are targeting with these tools.

First though, just a quick review of what ties this story together: Heterogeneous Computing. The Heterogeneous System Architecture (HSA) Foundation was created in 2012, with AMD as a founding member, to make it dramatically easier to program heterogeneous computing systems. Heterogeneous computing takes advantage of CPUs, GPUs, and other accelerators such as DSPs, along with other programmable and fixed-function devices, to help increase performance and efficiency with the goal of reduced energy use. The GPU in particular is a critical component, since general-purpose computing on a GPU (GPGPU) makes large performance gains achievable for certain applications through parallel execution. However, while effectively harnessing the GPU for computing has become easier, AMD is taking a huge leap forward today with the announcement of the Boltzmann Initiative and its three key new tools for developers.

The first innovation is our new, heterogeneous compute compiler (HCC) for C++ programming. Over the last several years, it’s been possible to program for GPU compute through the use of OpenCL™, an open industry standard language, or the proprietary CUDA language. Both provide a general-purpose model for data parallelism as well as low-level access to hardware. And while both are significant improvements in both ease and functionality compared to previous methods, they still require unique programming skills. This is a problem because the potential for leveraging the GPU is so great and so diverse. Applications ranging from 3D medical imaging to facial recognition, from climate analysis to human genome mapping can all benefit, to name a few.

Ultimately, for heterogeneous computing to become a mainstream reality, these technologies will need to become accessible to a majority of the programmers in the world through more familiar languages such as C++. By creating a logical model where heterogeneous processors fully share system resources such as memory, HSA promises a standard programming model that allows developers to write code that can run seamlessly on whatever processor block is best able to execute it. The idea of matching the right workload to the right processor is compelling and being embraced by many hardware and software companies. The new AMD C++ compiler makes that idea a whole lot easier to execute.

Second is our new Linux® driver. While the Windows® operating system is fantastic and supports billions of consumer client devices and commercial servers, Linux is highly popular in technical and scientific communities where collaboration on application development is the traditional model to maximize performance. By making an all new Linux driver available, AMD is helping expand the developer base for heterogeneous computing even further. Important benefits for the programmer of this new, headless Linux driver include low latency compute dispatch, peer-to-peer GPU support, Remote Direct Memory Access (RDMA) from InfiniBand™ interconnects directly to GPU memory, and Large Single Memory Allocation support. Combined with the new C++ compiler, the Linux driver is a powerful addition to the Boltzmann Initiative.

Finally, applications already developed in CUDA can now be ported to C++. This is achieved using the new Heterogeneous-computing Interface for Programmers (HIP) tool, which ports CUDA runtime APIs into C++ code. AMD testing shows that in many cases 90 percent or more of CUDA code can be automatically converted into C++ by HIP. The remainder requires manual programming, but this should take a matter of days, not months as before. Once ported, the application can run on a variety of underlying hardware, and enhancements can be made directly through C++. The overall effect is greater platform flexibility and reduced development time and cost.
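
As a rough, hedged illustration of what HIP-ported host code looks like (illustrative only, not AMD sample code), the CUDA runtime calls map nearly one-to-one while device code remains ordinary C++:

#include <hip/hip_runtime.h>

// After porting, each CUDA runtime call becomes its HIP equivalent.
void copy_to_gpu_and_back(float *host_x, int n)
{
    float *dx;
    size_t bytes = n * sizeof(float);

    hipMalloc(&dx, bytes);                                // was cudaMalloc
    hipMemcpy(dx, host_x, bytes, hipMemcpyHostToDevice);  // was cudaMemcpy
    // ... kernel launches are translated by the HIP tool as well ...
    hipMemcpy(host_x, dx, bytes, hipMemcpyDeviceToHost);
    hipFree(dx);                                          // was cudaFree
}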

The availability of the new C++ compiler, Linux driver and HIP tool means that heterogeneous computing will be available to many more software developers, substantially increasing the pool of programmers. That’s a tremendous amount of brain power that can now create applications that more readily take advantage of the underlying hardware. It also means many more applications can take advantage of parallelism, when applicable, enabling better performance and greater energy efficiency. I encourage you to stop by booth #727 at the Austin Convention Center this week to learn more!

GPU Accelerated Computing with C and C++

With the CUDA Toolkit from NVIDIA, you can accelerate your C or C++ code by moving the computationally intensive portions of your code to an NVIDIA GPU.  In addition to providing drop-in library acceleration, you are able to efficiently access the massive parallel power of a GPU with a few new syntactic elements and calling functions from the CUDA Runtime API.

The CUDA Toolkit from NVIDIA is free and includes:

  • Visual and command-line debugger
  • Visual and command-line GPU profiler
  • Many GPU optimized libraries
  • The CUDA C/C++ compiler
  • GPU management tools
  • Lots of other features

Getting Started:

  1. Make sure you have an understanding of what CUDA is.
    • Read through the Introduction to CUDA C/C++ series on Mark Harris’ Parallel Forall blog.
  2. Try CUDA by taking a self-paced lab on nvidia.qwiklab.com. These labs require only a supported web browser and a network that allows WebSockets. Click here to verify that your network and system support WebSockets: in the “Web Sockets (Port 80)” section, all check marks should be green.
  3. Download and install the CUDA Toolkit.
    • You can watch a quick how-to video for Windows showing this process:

    • Also see Getting Started Guides for Windows, Mac, and Linux.
  4. See how to quickly write your first CUDA C program by watching the following video (a minimal example is also sketched below):
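
For reference, a minimal first CUDA C/C++ program might look like the following sketch (illustrative only; it uses unified memory via cudaMallocManaged, available since CUDA 6):

#include <stdio.h>
#include <cuda_runtime.h>

// Element-wise vector addition on the GPU.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    // Unified memory: pointers usable from both CPU and GPU.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();                 // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);             // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}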

Learning CUDA:

  1. Take the easily digestible, high-quality, and free Udacity Intro to Parallel Programming course which uses CUDA as the parallel programming platform of choice.
  2. Visit docs.nvidia.com for CUDA C/C++ documentation.
  3. Work through hands-on examples:
  4. Look through the code samples that come installed with the CUDA Toolkit.
  5. If you are working in C++, you should definitely check out the Thrust parallel template library.
  6. Browse and ask questions on stackoverflow.com or NVIDIA’s DevTalk forum.
  7. Learn more by:
  8. Look at the following for more advanced hands-on examples:

So, now you’re ready to deploy your application?
You can register today for free access to NVIDIA Tesla K40 GPUs.
Develop your code on the fastest accelerator in the world. Try a Tesla K40 GPU and accelerate your development.

Availability

The CUDA Toolkit is a free download from NVIDIA and is supported on Windows, Mac, and most standard Linux distributions.

  • Starting with CUDA 5.5, CUDA also supports the ARM architecture
  • For the host-side code in your application, the nvcc compiler will use your default host compiler.

NVIDIA Deep Learning SDK Now Available

The NVIDIA Deep Learning SDK brings high-performance GPU acceleration to widely used deep learning frameworks such as Caffe, TensorFlow, Theano, and Torch. This powerful suite of tools and libraries lets data scientists design and deploy deep learning applications.

Following the Beta release a few months ago, the production release is now available with:

  • cuDNN 4 – Accelerate training and inference with batch normalization, tiled FFT, and NVIDIA Maxwell optimizations.
  • DIGITS 3 – Get support for Torch and pre-defined AlexNet and GoogLeNet models.

Figure: cuDNN 4 speedup vs. CPU-only (batch size 1), AlexNet + Caffe, forward-pass convolutional layers, GeForce TITAN X vs. Core i7-4930, Ubuntu 14.04 LTS.

Download now >>

GPUs Help Measure Rising Sea Levels in Real-Time


Sea levels have traditionally been measured by marks on land – but the problem with this approach is that parts of the earth’s crust move too.

A group of researchers from Chalmers University of Technology in Sweden is using GPS receivers along the coastline in combination with reflections of GPS signals that bounce off the water’s surface. NVIDIA GPUs then crunch those data signals to compute the water level in real time.

The researchers are using the cuFFT library, alongside NVIDIA Tesla and GeForce GPUs, to process the nearly 800 megabits per second of data streaming from the reflectometry systems.
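
The post does not include the team's code, but as a minimal sketch of the cuFFT building block involved, batched 1D complex-to-complex FFTs look like the following (the helper name fft_batch and the in-place transform are assumptions of this example):

#include <cufft.h>

// Run `batch` forward FFTs of length n, in place, on device data.
void fft_batch(cufftComplex *d_sig, int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);          // plan batched FFTs
    cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD);  // in-place forward FFT
    cufftDestroy(plan);
}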

Schematics of the data flow for a software-defined radio GNSS-R solution. Direct (RHCP) and reflected (LHCP) signals are received, A/D converted and sent to a host PC, where a Tesla K40 GPU handles signal processing.

“Without the use of GPUs, we would not have been able to process all our signals in real-time,” said Thomas Hobiger, a researcher on the project.

This work has placed the team among the top five finalists for NVIDIA’s 2016 Global Impact Award, which awards a $150,000 grant to researchers doing groundbreaking work that addresses social, humanitarian and environmental problems.



About me

My name is Sayed Ahmadreza Razian, and I hold a master’s degree in Artificial Intelligence.
Click here for my CV/resume page.

My research interests include image processing, machine vision, virtual reality, machine learning, data mining, and monitoring systems, and I intend to pursue a PhD in one of these fields.

Click here to view the introduction and resume page.

My Scientific expertise
  • Image processing
  • Machine vision
  • Machine learning
  • Pattern recognition
  • Data mining - Big Data
  • CUDA Programming
  • Game and Virtual reality

Download Nokte for free


Coming Soon....

Greatest hits

Imagination is more important than knowledge.

Albert Einstein

You are what you believe yourself to be.

Paulo Coelho

Gravitation is not responsible for people falling in love.

Albert Einstein

The fear of death is the most unjustified of all fears, for there’s no risk of accident for someone who’s dead.

Albert Einstein

Waiting hurts. Forgetting hurts. But not knowing which decision to take can sometimes be the most painful.

Paulo Coelho

Anyone who has never made a mistake has never tried anything new.

Albert Einstein

It’s the possibility of having a dream come true that makes life interesting.

Paulo Coelho

One day you will wake up and there won’t be any more time to do the things you’ve always wanted. Do it now.

Paulo Coelho

