This report is sourced from the team in our National Security Portfolio’s Machine Learning Center of Excellence.

Recently, in the process of creating prototypes focused on innovative uses of technologies for our clients, we utilized NVIDIA’s CuPy library for GPU-accelerated fractal generation and searching.

We used a feature of CuPy that allows you to write raw CUDA C++ code in a Python project without needing to create a C++ project or interact directly with a C++ compiler. This gives users access to much of the power of CUDA C++ without needing to be experts in CUDA or C++, lowering the technical bar to be able to implement this type of innovative technology. Fractals served as a good example of a custom computation that can be GPU accelerated with CuPy with minimal C++ experience.

In the context of federal agencies’ work, this type of innovative computation can help speed up existing workflows using GPUs, particularly with large datasets. By making our processes more efficient, we can then ask more questions about our data, get more value out of it, and ultimately push the boundaries of what is possible.


Any data scientist who works in Python most likely works with NumPy, a versatile math library underpinning most data science workflows in Python. However, NumPy is only capable of utilizing your CPU, even if a GPU is available. This is where CuPy comes in. CuPy is a library that strives to be a drop-in replacement for NumPy, transparently replacing most of your computations with GPU-accelerated versions while still utilizing much of the same syntax. In many cases you can simply replace NumPy function calls with identically-named CuPy calls and it just works. This is extremely powerful on its own, even if that was all that CuPy could do. The chart below from the CuPy website shows significant speedups (100x or more) for a variety of common operations compared with CPU-based methods.

Chart showing significant speedups (100x or more) for a variety of common operations compared with CPU-based methods.Source:

However, this report is about a lesser-known feature of CuPy: custom CUDA kernels. This feature is useful when you have a calculation that is highly parallelizable (i.e., GPU-friendly) but is difficult to implement efficiently with the standard NumPy or CuPy methods. With custom CUDA kernels, we can write short snippets of C++ code as raw Python strings, and CuPy takes care of compiling and pushing the code to the GPU. With this, users can get the power and efficiency of CUDA C++ without the friction.


We’ve always been interested in fractals: infinitely complex geometric patterns that are frequently the result of surprisingly simple mathematical equations. One particular class of fractals called Julia Sets are created from a huge quantity of simple computations, making it an ideal target for GPU acceleration. We decided to write a Julia Set generator in CuPy, which achieves more than 1,000x speedup over CPU versions and uses the speed to search for visually appealing fractals.

You don’t need to understand the minutiae of fractal generation to appreciate the fascinating visuals. Click here to view an animation of the closely-related Mandelbrot Set from Wikipedia, demonstrating that no matter how far you zoom in on the fractal, you find more, infinitely-recursing fractal structures.

The most important thing to know is that Julia Set fractals are generated using “complex” numbers which have both a “real” and “imaginary” part. Typically, this would be drawn on a two-dimensional grid called the “complex plane,” with the real part along the x-axis and the imaginary part along the y-axis. An example is ‘z = 2 + 7i’, where 2 is the real part and 7i is the imaginary part. This value would be represented at (2, 7) on the complex plane. This is why we frequently reference a “value” but draw it on a 2D grid.

Julia Set generation on CPU and GPU

To generate an image of a Julia Set we need to do the following:

  • Pick a complex c-value for the equation z ← z2 + c. Each c-value that is chosen will result in a different fractal.
    • For example, a popular c-value that produces a visually pleasing fractal is ‘c = -0.8 + 0.16i’
  • Make a grid of points (z-values) on the complex plane matching your desired image size and run each one repeatedly through the equation z ← z2 + c
  • For each point, record how many iterations until its length exceeds 2.0 (after which it gets exponentially larger). This is called its “escape” time.
  • Color each pixel of your image by the “escape” time of its associated z-value

Here are some sample fractals with different c-values, generated identically by both CPU and GPU code in the notebook:

Four sample fractals with different c-values:  c-value = -0.80 + 0.16i c-value = -0.20 + 0.40i c-value = -0.40 + 0.60i c-value = 0.40 + 0.10i

You can see that there is a huge diversity of Julia Sets driven by the choice of c-value; some are interesting and visually pleasing, others less so. Below, we have copied the code for both the CPU and GPU implementations to be compared side-by-side, along with the timings for each. The code itself is nearly identical, but the CuPy implementation is orders of magnitude faster.Chart comparing native Python implementation with CuPy implementation. Measurement in fractals per second.

There is a range of speeds for the GPU version depending on whether each computed fractal is transferred off the GPU when it is done or kept on the GPU for further processing. For our interactive fractal viewer, we will need to transfer each one back to our CPU for immediate display, so we’ll get the lower speed. But for our automated fractal search, the fractal will stay on the GPU for further processing, so we’ll get the higher speed. If we batched the results or only needed some kind of summary statistics, we would likely get a speed somewhere between those values.

We also created and measured a pure C++ version of the fractal generator. Python is not known for speed in this kind of computation, and if this was a real speed-sensitive application we would probably resort to C++ code or another high-performance language. The C++ speedup is profound but is still dramatically slower than the GPU version.

Method Speed (fractals/sec)

Python/NumPy (CPU)

0.13 FPS
C++ (CPU) 10.11 FPS
CuPy (GPU), w/ transfer to host

2,142.71 FPS

CuPy (GPU), no transfers

24,020.43 FPS

At a minimum, CuPy is giving us a 200x speedup, but it could be much higher depending on which language is running the CPU version and what you need transferred back to the host. We’ll see later how we can implement a GPU pipeline that leverages the results directly on the GPU without any host transfers to get the higher speed.

Finding visually appealing fractals

With such impressive speeds, we can do some efficient fractal searching to find visually appealing fractals. Here we show that each fractal was the result of picking a different c-value, which can be represented as points on the complex plane. To find more fractals, we just need to keep trying new c-values.

Image showing four fractals with arrows to different points of the chart of sample c-values

We used the “mpl-interactions” library with JupyterLab to link the position of the mouse to different c-values. Every time the mouse moves on the image, it sends a new c-value to the GPU, which generates the new fractal and sends it back to JupyterLab for display.

We need 30-60 frames per second for a smooth visual experience, which is easy for our GPU version that runs above 2,000 frames per second. There is overhead due to the fact that we use Jupyter remotely, which adds latency for c-values to be sent over the network, and fractals to be sent back, but it’s still a smooth experience - click here to view.

Automatic fractal searching using CuPy FFTs

What if we want a fractal that is objectively interesting? We have enough speed to automate a full search of c-values, but it is not entirely clear how to evaluate whether a fractal is “interesting.”

We decided to try spectral analysis using Fast Fourier Transforms (FFTs) to score each fractal based on its frequency distribution. Fractals with more high-frequency components have finer structures and are probably more visually pleasing. Even better, CuPy has an efficient FFT implementation which expects the input image to already be on the GPU. This is a case where we can get the faster generation speed (from the previous timing table) because each generated fractal is simply passed to the next GPU computation without any copies or transfers back to the host/CPU. For each fractal, we only need to send the real and imaginary parts of the c-value, and it only returns a single score. There are a few oversimplifications in that statement, but we are still talking about less than 100 bytes total being transferred, even though each fractal is more than 2 megabytes. It doesn’t get much more efficient than that!

Diagram with two boxes – 1) Host (CPU) containing the elements of “c-value (a + bi)” and “computed score) and 2) Device (GPU) containing the elements of “generate fractal,” “compute FFT,” and “calculate frequency score.” Lines are showing a circular relationship.Simplified diagram of our fractal search pipeline on GPU

We tested this process with two fractals that we know should be at opposite ends of the “interestingness” spectrum. The fractal images are on the top, and the spectrogram (produced using CuPy’s FFT) is on the bottom. Brightness in the spectrogram near the middle indicates low-frequency content, and brightness along the edges is high-frequency content:

Four boxes showing the two fractals generated and their corresponding spectrogram

It doesn’t exactly pop out at you, but there is a noticeable difference between two spectrograms. The “bad” fractal has a bit more density in the middle than the good one. The scores are from a simple equation that penalizes low-frequencies and rewards high-frequencies.

We applied this scoring process to a 256x256 grid of c-values on the complex plane - 65,536 total fractals scored. It took 27 seconds, which is approximately 2,400 fractal evaluations per second, each of which involves fractal generation, and 768x768 FFT, and application of the scoring algorithm. Here is the result:

Chart with a green fractal on a dark blue background

It turns out that the “good fractal map” is itself another fractal! And not just any fractal, but the closely-related Mandelbrot Set. Click here to view a high-resolution version of the Mandelbrot Set from Wikipedia.

We weren’t sure what to expect when we did this experiment but were still surprised by the result. In hindsight, there is a somewhat straightforward explanation for this if you are familiar with the underlying math, but we’ll leave that as an exercise for the ambitious reader.

The Mandelbrot Set showing up was a pleasant distraction, but we should remember we actually ran this exercise for the purpose of finding interesting Julia Sets. Below is the Julia Set generated by the c-value with the maximum intensity in our brightness plot:

Chart with yellow, spindly fractal on blue background of concentric circles

This exercise was a fun way to learn and demonstrate the benefits and simplicity of using CuPy. Even the most straightforward uses of CuPy replacing NumPy calls can give data scientists a magnitude of speedup for common operations used in workflows every day. We demonstrated that even more advanced features of CuPy can be both approachable and powerful, which could lead to significant improvements in data processing for our federal clients.

Shauna Revay, Ph.D.

Senior Manager – Accenture Federal Services, Machine Learning

Subscription Center
Subscribe to Accenture's Federal Viewpoints Blog Subscribe to Accenture's Federal Viewpoints Blog