Massively Parallel Computing in Cosmology with JAX

Wassim Kabalan

François Lanusse, Alexandre Boucaud, Josquin Errard

Goals for This Presentation

  • Understand the Basics of Parallelism: Learn how parallelism works and its importance for high-performance computing.

  • Know When (and When Not) to Parallelize: Discover when parallelizing your code pays off, when it is better avoided, and what the benefits and limitations are.

  • Scale Code Using JAX: Explore techniques to scale your computations using JAX for large-scale tasks.

  • Hands-On Tutorials: Apply the concepts discussed with interactive code examples and tutorials.

Background on Parallel Computing with GPUs

How GPUs Work


Massive Thread Count

  • GPUs are designed to run thousands of threads concurrently.
  • Each core can handle many data elements simultaneously.


The main bottleneck is memory throughput

  • Computation is often only a fraction of total processing time.
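
A quick way to see the memory bottleneck in practice: the rough micro-benchmark below (a sketch; the array size and the two kernels are illustrative) times a memory-bound elementwise kernel against a compute-bound matrix multiply on whatever accelerator JAX finds.

import time
import jax
import jax.numpy as jnp

# An elementwise op does only a few FLOPs per element it reads and writes,
# so it is limited by memory bandwidth; a matmul reuses each element many
# times, so it is limited by compute.
x = jnp.ones((4096, 4096), dtype=jnp.float32)

elementwise = jax.jit(lambda a: a * 2.0 + 1.0)   # memory-bound
matmul = jax.jit(lambda a: a @ a)                # compute-bound

for name, fn in [("elementwise", elementwise), ("matmul", matmul)]:
    fn(x).block_until_ready()                    # compile and warm up
    t0 = time.perf_counter()
    fn(x).block_until_ready()
    print(f"{name}: {time.perf_counter() - t0:.4f} s")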

Optimizing Throughput with Multiple GPUs:

  • Using multiple GPUs increases overall data throughput, enhancing performance and reducing idle time.

GPU threads

Single GPU throughput

Saturated GPU

Multi-GPU throughput

Types of Parallelism

Data Parallelism

  • Simple Parallelism: Each device processes a different subset of data independently.
  • Data Parallelism with Collective Communication:
    • Devices process data in parallel but periodically share results (e.g., for gradient averaging in training).

Task Parallelism

  • Each device handles a different part of the computation.
  • The computation itself is divided between devices.
  • Generally more complex than data parallelism (a minimal sketch follows below).
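
Since task parallelism is not revisited later in these slides, here is a minimal, illustrative sketch (the two-stage split and function names are hypothetical) of pinning two pipeline stages to two different devices with jax.device_put; it assumes at least two devices are visible.

import jax
import jax.numpy as jnp

devices = jax.devices()

def stage_a(x):
    return jnp.sin(x) ** 2        # first part of the computation

def stage_b(y):
    return jnp.cumsum(y)          # second part of the computation

# Commit the input to device 0; jit then runs stage_a there.
x = jax.device_put(jnp.linspace(0.0, 1.0, 1024), devices[0])
y = jax.jit(stage_a)(x)

# Explicitly move the intermediate result to the next device for stage_b.
y = jax.device_put(y, devices[1 % len(devices)])
z = jax.jit(stage_b)(y)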

Simple Data Parallelism

Data Parallelism with Communication

Task Parallelism

When Should You Use Parallelism?


Simple cases

  • Data Parallelism (Simple)
    • If your pipeline resembles simple data parallelism, then parallelism is a good idea.
  • Data Parallelism with Simple Collectives
    • Simple collectives (e.g., gradient averaging) can be easily expressed in JAX, allowing devices to share intermediate results.

Complex cases

  • Non-splittable Input (e.g., N-body Simulation Fields) ⚠️
    • When the input is not easily batchable, like a field in an N-body simulation.
  • Task Parallelism ⚠️

    • Useful for long sequential cosmological pipelines where each device handles a unique task in the sequence.
    • More common in training complex models (e.g., LLMs like Gemini or ChatGPT).

When NOT to Use Parallelism

To Keep in Mind

  • Data Fits on a Single GPU
    • If everything already fits and runs fast enough on one device, extra GPUs mostly add communication overhead.
  • Need for Complex Collectives
    • Additional GPUs can add complexity and may not yield enough performance improvement.
  • Task Parallel Model
    • Changing the pipeline or adapting to new devices often requires significant rewrites.

Underutilized GPU

Consider scaling to multiple GPUs if:

  • You have a working single-GPU prototype whose runtime is a real bottleneck, and reducing it would have a significant impact on your results.
    • Using multiple GPUs can significantly decrease execution time.
  • OR you have non-splittable input (e.g., fields in a cosmological simulation) that is crucial for your results.

How to Measure Scaling for Parallel Codes


Strong Scaling

  • Keep the total problem size fixed and add GPUs to reduce the runtime.

Assesses how runtime drops as more GPUs are applied to a fixed dataset. Danger Zone⚠️: the region where the speedup flattens, indicating the distributed code is not scaling efficiently.

Weak Scaling

  • Grow the data size together with the number of GPUs, keeping the workload per GPU fixed.

Tests whether runtime stays roughly constant as the problem and the machine grow together. Danger Zone⚠️: the region where runtime climbs, suggesting underlying scaling issues (e.g., communication overhead) in the code itself.
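
For reference, the standard way to quantify both regimes (textbook definitions, not specific to JAX): writing \(T(N)\) for the runtime on \(N\) GPUs,

\[
S(N) = \frac{T(1)}{T(N)}, \qquad
E_{\mathrm{strong}}(N) = \frac{T(1)}{N\,T(N)}, \qquad
E_{\mathrm{weak}}(N) = \frac{T(1)}{T(N)} \quad \text{(per-GPU problem size held fixed)},
\]

with efficiencies close to 1 indicating good scaling and a steady drop marking the danger zone.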

Environmental Impact of High-Performance Computing

Perlmutter Supercomputer (NERSC)

  • Location: NERSC, Berkeley Lab, California, USA
  • Compute Power: ~170 PFlops
  • GPUs: 7,208 NVIDIA A100 GPUs
  • Power Draw: ~ 3-4 MW


Jean Zay Supercomputer (IDRIS)

  • Location: IDRIS, France
  • Compute Power: ~126 PFlops (FP64), 2.88 EFlops (BF/FP16)
  • GPUs: 3,704 GPUs, including V100, A100, and H100
  • Power Draw: ~1.4 MW on average (as of September, without full H100 usage), leveraging France’s renewable energy grid.

Perlmutter Supercomputer

Jean Zay Supercomputer

Environmental Benefits of Efficient Parallel Computing


  • Higher throughput moves computations closer to peak FLOPS.
  • Operating near peak FLOPS ensures more effective use of computational resources.
  • More computations are achieved per unit of energy, improving energy efficiency.
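
As a rough illustration using the Perlmutter figures quoted above (peak numbers, so this is a lower bound on the energy per useful operation), taking roughly 3.5 MW and 170 PFlops:

\[
\frac{\sim 3.5\ \text{MW}}{\sim 170\ \text{PFlops}} \approx 20\ \text{pJ per floating-point operation at peak.}
\]

Real workloads reach only a fraction of peak, so the same megawatts buy fewer useful operations; pushing throughput closer to peak therefore directly reduces the energy spent per scientific result.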


How to Scale in JAX

Why JAX for Distributed Computing?

  • Distributed Computing Isn’t New:
    • Tools like MPI and OpenMP are used extensively.
    • ML frameworks like TensorFlow and PyTorch offer distributed training.
    • So do libraries such as DiffEqFlux.jl, Horovod, and Ray.
  • Familiar and Accessible API:
    • JAX offers a NumPy-like API that is both accessible and intuitive.
    • Python users can leverage parallelism without needing in-depth knowledge of low-level parallel frameworks like MPI.

Key Points

  • Pythonic Scalability: JAX lets you write scalable, Pythonic code that is compiled by XLA for performance.
  • Automatic Differentiation: JAX offers a trivial way to write differentiable distributed code.
  • The same code runs on anything from a laptop to a multi-node supercomputer (see the sketch below).
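
A minimal sketch of those last two points (the toy loss function is illustrative): the same jitted, differentiable code runs unchanged whether x lives on a CPU, a single GPU, or is sharded across many devices.

import jax
import jax.numpy as jnp

def loss(theta, x):
    return jnp.mean((jnp.sin(theta * x) - x) ** 2)

# Compile and differentiate with two composable transformations.
grad_loss = jax.jit(jax.grad(loss))

x = jnp.linspace(-1.0, 1.0, 1024)   # could equally be a sharded array
print(grad_loss(0.3, x))            # identical call on a laptop or a cluster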

Expressing Parallelism in JAX (Simple parallelism)

Example: computing a Gaussian from data points

import jax
import jax.numpy as jnp
from jax.debug import visualize_array_sharding


def gaussian(x, mean, variance):
  coefficient = 1.0 / jnp.sqrt(2 * jnp.pi * variance)
  exponent = -((x - mean) ** 2) / (2 * variance)
  return coefficient * jnp.exp(exponent)

mean = 0.0
variance = 1.0
x = jnp.linspace(-5, 5, 128)
result = gaussian(x, mean, variance)
visualize_array_sharding(x)
visualize_array_sharding(result)


[Output of visualize_array_sharding: both x and result live entirely on GPU 0]

Expressing Parallelism in JAX (Simple parallelism)

Example: computing a Gaussian from data points

assert jax.device_count() == 8

from jax.sharding import PartitionSpec as P, NamedSharding


def gaussian(x, mean, variance):
  coefficient = 1.0 / jnp.sqrt(2 * jnp.pi * variance)
  exponent = -((x - mean) ** 2) / (2 * variance)
  return coefficient * jnp.exp(exponent)

mesh = jax.make_mesh((8,), ('x',))
sharding = NamedSharding(mesh, P('x'))

mean = 0.0
variance = 1.0
x = jnp.linspace(-5, 5, 128)
x = jax.device_put(x, sharding)
result = gaussian(x, mean, variance)
visualize_array_sharding(x)
visualize_array_sharding(result)


[Output of visualize_array_sharding: x and result are each split evenly across GPUs 0-7]
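
A small follow-up check (a sketch reusing the names from the snippet above): wrapping gaussian in jax.jit requires no code change; XLA compiles one SPMD program in which each GPU computes only its own shard, and the output keeps the input's sharding.

jitted_gaussian = jax.jit(gaussian)
result = jitted_gaussian(x, mean, variance)   # x is sharded over the 8 GPUs
print(result.sharding)                        # same NamedSharding, spec P('x')
visualize_array_sharding(result)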
                                                                        

Expressing Parallelism in JAX (Using collectives)

Example of SGD with Gradient averaging (from Jean-Eric’s tutorial)



@jax.jit  
def gradient_descent_step(p, xi, yi, lr=0.1):
  gradients = jax.grad(loss_fun)(p, xi, yi)
  return p - lr * gradients

def minimzer(loss_fun, x_data, y_data, par_init, method, verbose=True):
  ...
# Example usage
par_mini_GD = minimzer(
  loss_fun, 
  x_data=xin, 
  y_data=yin, 
  par_init=jnp.array([0., 0.5]), 
  method=partial(gradient_descent_step, lr=0.5), 
  verbose=True
)

Expressing Parallelism in JAX (Using collectives)

Example of SGD with Gradient averaging (from Jean-Eric’s tutorial)

from jax.experimental.shard_map import shard_map

@jax.jit
@partial(shard_map, mesh=mesh,
         in_specs=(P(), P('x'), P('x')),   # parameters replicated, data sharded over 'x'
         out_specs=P())                    # the updated parameters are replicated
def gradient_descent_step(p, xi, yi, lr=0.1):
    per_device_gradients = jax.grad(loss_fun)(p, xi, yi)                 # gradient on the local shard
    avg_gradients = jax.lax.pmean(per_device_gradients, axis_name='x')   # average over all devices
    return p - lr * avg_gradients

def minimzer(loss_fun, x_data, y_data, par_init, method, verbose=True):
    ...

# Example usage: shard the training data before calling the minimizer
xin = jax.device_put(xin, sharding)
yin = jax.device_put(yin, sharding)
par_mini_GD = minimzer(
    loss_fun,
    x_data=xin,
    y_data=yin,
    par_init=jnp.array([0., 0.5]),
    method=partial(gradient_descent_step, lr=0.5),
    verbose=True
)

JAX Collective Operations for Parallel Computing

Overview of the collective operations in jax.lax

  • lax.pmean: computes the mean of arrays across devices. Useful for averaging gradients in distributed training.
  • lax.ppermute: permutes data across devices in a specified order. Very useful in cosmological simulations (e.g., halo exchange).
  • lax.all_to_all: exchanges data between devices in a controlled manner. Useful for custom data-exchange patterns in distributed computing.
  • lax.pmax / lax.pmin: computes the element-wise maximum/minimum across devices, e.g., to find the global extremum of a distributed dataset.
  • lax.psum: sums arrays across devices. Commonly used for aggregating gradients or other values in distributed settings.
  • lax.all_gather: gathers the shards from all devices so that every device ends up with the full array.
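
As a concrete illustration of lax.ppermute (a toy sketch, not the jaxDecomp implementation), the snippet below performs a 1D "halo exchange": inside shard_map, every device sends the last element of its shard to its right-hand neighbour in a ring. It assumes the global array length is divisible by the number of devices and is best run on several devices.

import jax
import jax.numpy as jnp
from functools import partial
from jax import lax
from jax.sharding import PartitionSpec as P, NamedSharding
from jax.experimental.shard_map import shard_map

n_dev = jax.device_count()
mesh = jax.make_mesh((n_dev,), ('x',))

@partial(shard_map, mesh=mesh, in_specs=P('x'), out_specs=P('x'))
def receive_left_halo(field):
    # Ring permutation: device i sends to device (i + 1) % n_dev.
    halo = lax.ppermute(field[-1:], axis_name='x',
                        perm=[(i, (i + 1) % n_dev) for i in range(n_dev)])
    # Prepend the neighbour's edge and drop our last element to keep the shard size.
    return jnp.concatenate([halo, field[:-1]])

field = jax.device_put(jnp.arange(16.0), NamedSharding(mesh, P('x')))
shifted = jax.jit(receive_left_halo)(field)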

Towards Infinite Scalability with JAX

A Node vs a Supercomputer

Differences in Scale

  • Single GPU:
    • Maximum memory: 80 GB
  • Single Node (Octocore):
    • Maximum memory: 640 GB
    • Contains multiple GPUs (e.g., 8 A100 GPUs) connected via high-speed interconnects.
  • Multi-Node Cluster:
    • Effectively unlimited memory 🎉 (it grows with the number of nodes)
    • Connects multiple nodes, allowing scaling across potentially thousands of GPUs.

Multi-Node scalability with Jean Zay

  • Up to 30TB of memory using all 48 nodes of Jean Zay
  • Enough to run a 15-billion-particle simulation.
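
A quick consistency check using the per-node figures quoted above (8 GPUs with 80 GB each, i.e., 640 GB per node):

\[
48 \ \text{nodes} \times 8 \ \text{GPUs} \times 80\ \text{GB} \approx 30\ \text{TB}.
\]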

@credit: NVIDIA

@credit: servethehome.com

Scaling JAX on a Single GPU vs. Multi-Host Setup

Single GPU Code

x = jnp.linspace(-5, 5, 128)
mean = 0.0
variance = 1.0
result = gaussian(x, mean, variance)


Multi-GPU Code

mesh = jax.make_mesh((8,), ('x',))
sharding = NamedSharding(mesh, P('x'))
x = jnp.linspace(-5, 5, 128)
x = jax.device_put(x, sharding)
mean = 0.0
variance = 1.0
result = gaussian(x, mean, variance)


Multi-Host Code

Scaling JAX on a Single GPU vs. Multi-Host Setup

A single JAX process driving all the GPUs of a node


Requesting a slurm job

$ salloc --gres=gpu:8 --ntasks-per-node=1 --nodes=1


multi-host-jax.py
import jax

mesh = jax.make_mesh((8,), ('x',))
sharding = NamedSharding(mesh, P('x'))

def gaussian(x, mean, variance):
    ...
mean = 0.0
variance = 1.0
x = jnp.linspace(-5, 5, 128)
x = jax.device_put(x, sharding)
result = gaussian(x, mean, variance)
visualize_array_sharding(x)
visualize_array_sharding(result)


Running with srun

$ srun python multi-host-jax.py

Scaling JAX on a Single GPU vs. Multi-Host Setup

A JAX process per GPU


Requesting a slurm job

$ salloc --gres=gpu:8 --ntasks-per-node=8 --nodes=2


multi-host-jax.py
import jax
jax.distributed.initialize()
mesh = jax.make_mesh((16,), ('x',))
sharding = NamedSharding(mesh, P('x'))

def gaussian(x, mean, variance):
    ...
mean = 0.0
variance = 1.0
x = jnp.linspace(-5, 5, 128)
x = jax.device_put(x, sharding) ❌ # DOES NOT WORK
result = gaussian(x, mean, variance)
visualize_array_sharding(x)
visualize_array_sharding(result)


Running with srun

$ srun -n 16 python multi-host-jax.py


CAUTION ⚠️

  • jax.device_put does not work in multi-host setups: a process cannot place data on devices that belong to other hosts.
  • Creating a jax.numpy array (e.g., with jnp.linspace) does not behave as it does on a single node: each process gets its own local copy instead of one globally sharded array.

Loading Data in JAX in a Multi-Host Setup

A JAX process per GPU


import jax
jax.distributed.initialize()

assert jax.device_count() == 16

x = jnp.linspace(-5, 5, 128)
visualize_array_sharding(x)


[Output of visualize_array_sharding from the 16 processes: each process reports a full copy of x on its own single GPU (GPU 0, GPU 1, …, GPU 15) rather than one array sharded across all devices]

Loading Data in JAX in a Multi-Host Setup

A JAX process per GPU


multi-host-jax.py
import jax
import jax.numpy as jnp
import numpy as np
jax.distributed.initialize()

mesh = jax.make_mesh((16,), ('x',))
sharding = NamedSharding(mesh, P('x'))

def distributed_linspace(start, stop, num):
    # Each process builds only the shard(s) it owns: `indx` is the slice of
    # the global array assigned to one of its local devices.
    def local_linspace(indx):
        return np.linspace(start, stop, num)[indx]
    return jax.make_array_from_callback(shape=(num,), sharding=sharding,
                                        data_callback=local_linspace)

x = distributed_linspace(-5, 5, 128)
if jax.process_index() == 0:
  visualize_array_sharding(x)


[Output of visualize_array_sharding: x is split evenly across GPUs 0-15]

Loading Data in JAX in a Multi-Host Setup

A JAX process per GPU


multi-host-jax.py
import jax
import jax.numpy as jnp
import numpy as np
jax.distributed.initialize()

mesh = jax.make_mesh((16,), ('x',))
sharding = NamedSharding(mesh, P('x'))

def distributed_linspace(start, stop, num):
    # Each process builds only the shard(s) it owns: `indx` is the slice of
    # the global array assigned to one of its local devices.
    def local_linspace(indx):
        return np.linspace(start, stop, num)[indx]
    return jax.make_array_from_callback(shape=(num,), sharding=sharding,
                                        data_callback=local_linspace)

x = distributed_linspace(-5, 5, 128)
if jax.process_index() == 0:
  visualize_array_sharding(x)
mean = 0.0
variance = 1.0
result = gaussian(x, mean, variance)
if jax.process_index() == 0:
  visualize_array_sharding(result)


[Output of visualize_array_sharding: x and result are each split evenly across GPUs 0-15]
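
If a (small) sharded result has to be inspected or saved from one place, jax.experimental.multihost_utils can gather it back to host memory; a minimal sketch reusing the names above (only sensible for arrays that fit on a single host):

from jax.experimental import multihost_utils

# Returns host-local NumPy data gathered from all processes; the exact
# stacking/concatenation behaviour is controlled by its `tiled` argument.
full_result = multihost_utils.process_allgather(result)
if jax.process_index() == 0:
    print(full_result.shape)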
                                                                                

Multi-node packages in JAX for Cosmology

Forward Modeling in Cosmology

Weak Lensing Model

  • Prediction:
    • A simulator generates observations from initial conditions and cosmological parameters.
  • Inference:
    • The simulated results are compared with actual observations.
    • Optimal initial conditions and parameters are inferred to closely match the observed data.

Scaling Challenges

  • Software: differentiable simulators such as JaxPM and PMWD already exist.
  • Resolution today: these simulators currently scale to about 130 million particles \(512^3\).
  • Ideal resolution: billion-particle simulations \(1024^3\) and above are needed for high accuracy.
  • (See Hugo’s and Justine’s talks for more details.)
  • We therefore need to scale to multiple GPUs and nodes to reach the required resolution; the rough memory estimate below shows why a single GPU is not enough.
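
A rough float32 estimate (order of magnitude only) makes the gap concrete:

\[
1024^3 \times 4\ \text{B} \approx 4.3\ \text{GB per mesh field}, \qquad
1024^3 \times 3 \times 4\ \text{B} \approx 12.9\ \text{GB for particle positions alone},
\]

and a particle-mesh step keeps several such arrays alive at once (positions, velocities, density, potential, forces, FFT buffers), so the working set quickly exceeds the 80 GB of a single GPU.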

Forward Modeling (Prediction)

Forward Modeling (Inference)

jaxDecomp: Components for Distributed Particle-Mesh Simulations

Key Features

  • Distributed 3D FFT
    • Essential for force calculations in large-scale simulations.
  • Halo Exchange for Boundary Conditions
    • Manages boundary conditions or particles leaving the simulation domain.
  • Fully Differentiable
    • Can be used with differentiable simulations.
  • Multi-Node Support
    • Works seamlessly across multiple nodes.
  • Supports different sharding strategies

  • Open-source and available on PyPI

Performance benchmarks of PFFT3D



Strong Scaling

Weak Scaling

Halo exchange in distributed simulations

[Figures: the initial field, its four distributed slices before and after the halo exchange, and the resulting LPT field]

JaxPM 2.0: Distributed Particle-Mesh Simulation


Box size: 1 Gpc/h
Resolution: \(1024^3\)
Number of particles: 1 billion
Number of snapshots: 10
Halo size: 128
Number of GPUs used: 32
Time taken: 45 s







Key Features of JaxPM

  • Multi-Node Performance: Optimized for efficient scaling across nodes.
  • High Resolution: Capable of handling billions of particles for accurate simulations.
  • Differentiable: Compatible with JAX’s automatic differentiation (HMC, NUTS compatible).
  • Open Source: available on GitHub.

Conclusion

Conclusion: Enabling Scalable Cosmology with Distributed JAX

Distributed JAX: A Game-Changer for Cosmology

  • The future is bright for JAX in cosmology 🎉🎉!!

  • JAX has transformed the landscape for scientific computing, enabling large-scale, distributed workflows in a Pythonic environment.

  • Recent advancements (JAX 0.4.3x+) make it straightforward to scale computations across multiple GPUs and nodes.

  • Key Advantages

    • Simplicity: JAX makes it easier than ever to write high-performance code, allowing researchers to focus on science rather than infrastructure.
    • Differentiability: JAX allows seamless differentiation of code running across hundreds of GPUs, enabling advanced inference techniques.
  • The Future Ahead

    • Scaling Inference Models with Distributed jaxPM: By integrating the new distributed jaxPM into existing cosmological inference models, we can achieve unprecedented levels of detail and complexity.
    • Paving the way to fully leverage large-scale survey data for deeper insights into the universe.

Tutorials and Exercises

https://github.com/ASKabalan/Tutorials/blob/main/Cophy2024/Exercises/01_MultiDevice_With_JAX.ipynb

Extra slides

Using shard_map for Advanced Parallelism in JAX

Why shard_map instead of pmap?

  • Limitations of pmap :
    • pmap is effective for simple data parallelism but lacks flexibility in more complex cases.
    • Nested Parallelism: pmap does not handle nested parallelism well.
    • Data Layout Control: pmap does not offer fine-grained control over data layout.
  • Advantages of shard_map:
    • Greater Flexibility: shard_map allows custom parallelism patterns and fine control over data sharding.
    • Nested Parallelism Support: Suitable for complex workloads that require hierarchical parallelism.
    • Direct Device Control: Allows fine-grained control over data distribution and parallel operations.

(Figure: the JAX documentation explaining the weaknesses of pmap)

Example: Nested Parallelism with shard_map


import jax
import jax.numpy as jnp
from functools import partial
from jax import lax
from jax.sharding import PartitionSpec as P, NamedSharding
from jax.experimental.shard_map import shard_map

mesh = jax.make_mesh((2, 2), ('x', 'y'))
sharding = NamedSharding(mesh, P('x', 'y'))
data = jnp.arange(16).reshape(4, 4)
sharded_data = lax.with_sharding_constraint(data, sharding)

# Nested pmap: device placement along each mesh axis must be managed by hand.
@partial(jax.pmap, axis_name='x', devices=mesh.devices[0])
@partial(jax.pmap, axis_name='y', devices=mesh.devices[1])
def sum_and_avg_nested_pmap(x):
    sum_across_x = lax.psum(x, axis_name='x')
    avg_across_y = lax.pmean(sum_across_x, axis_name='y')
    return avg_across_y

# Two separate pmaps: requires manual reshaping between the two parallel axes.
def sum_and_avg_pmap(x):
    sum_across_x = jax.pmap(lambda a: lax.psum(a, axis_name='x'),
                            axis_name='x',
                            devices=mesh.devices[0])(x.reshape(2, 2, 4))
    avg_across_y = jax.pmap(lambda a: lax.pmean(a, axis_name='y'),
                            axis_name='y',
                            devices=mesh.devices[1])(sum_across_x.reshape(2, 4, 2))
    return avg_across_y.reshape(4, 4)

# shard_map: the 2D mesh and the collectives express the same computation directly.
@partial(shard_map, mesh=mesh, in_specs=(P('x', 'y'),), out_specs=P('x'))
def sum_and_avg_shardmap(x):
    sum_across_x = lax.psum(x, axis_name='x')
    avg_across_y = lax.pmean(sum_across_x, axis_name='y')
    return avg_across_y
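
For comparison, the shard_map version is simply called on the sharded array and can be jitted directly; a short usage sketch (the expected output shape follows from out_specs: per-device blocks are concatenated along 'x' and must be replicated along 'y', which the pmean guarantees):

out = jax.jit(sum_and_avg_shardmap)(sharded_data)
print(out.shape)   # (4, 2): per-device (2, 2) blocks assembled according to P('x')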


Motivation: Cosmology in the Exascale Era


Upcoming Surveys and Massive Data in Cosmology

  • Massive Data Volume: LSST will generate 20 TB of raw data per night over 10 years, totaling 60 PB.
  • Catalog Size: The processed LSST catalog database will reach 15 PB.

Cosmological Models and Pipelines

  • Cosmological simulations and forward modeling can easily reach multiple terabytes in size.
  • We need to scale up cosmological pipelines to handle these data volumes effectively.