SambaFlow Developer Guide

Copyright © 2020-2023 by SambaNova Systems, Inc. All contents are subject to a licensing agreement with SambaNova Systems, Inc. Any disclosure, reproduction, distribution, reverse engineering, or any other use made without the advance written permission of SambaNova Systems, Inc. is unauthorized and strictly prohibited. All rights of ownership and enforcement are reserved.

SambaFlow software release notes

Release 1.23

Release 1.23 includes improved OS support, changes to application locations, and renaming of components. Please review the following updates carefully to ensure compatibility with your environment.

Supported OS versions

  • Red Hat: Starting with this release, SambaFlow supports Red Hat (8.8).

  • Ubuntu: The version for Ubuntu 22.04.x remains unchanged.

Package/Application location change

To align with Linux best practices, third-party applications have been relocated from their previous locations (/opt/ or /usr/local/) to a standardized directory (/opt/sambanova/). This change avoids conflicts with pre-installed customer packages and provides controlled versions that are compatible with the SambaNova software stack.

  • New location: /opt/sambanova/

  • Previous locations: /opt/ and /usr/local/

If you have custom scripts or configurations that point to the old paths, update those references to the new directory.

Renamed applications

The following applications have been renamed, and their old names are deprecated starting with release 1.23.

  • sambaflow-apps-datascale-image-segmentation is now sambaflow-apps-datascale-vision-segmentation

  • sambaflow-apps-datascale-image-segmentation-3d is now sambaflow-apps-datascale-vision-segmentation-3d

  • sambaflow-apps-datascale-image-vit is now sambaflow-apps-datascale-vision-vit

Deprecated components

The following packages and application names are deprecated.

Packages

The following package is deprecated and has been removed starting with release 1.23:

  • sambaflow-apps-datascale-image-object-detection

Application names

The following application names are deprecated as part of the renaming process (see above).

  • sambaflow-apps-datascale-image-segmentation

  • sambaflow-apps-datascale-image-segmentation-3d

  • sambaflow-apps-datascale-image-vit

Release 1.22

Release 1.22 includes improved OS support.

Supported OS versions

  • Ubuntu: Starting with this release, SambaFlow supports Ubuntu 22.04.x.

    As part of the upgrade to support Ubuntu 22.04, we’re changing an environment variable. You don’t have to do anything for the change to take effect.
  • Red Hat: The version for Red Hat (8.5) remains unchanged.

Release 1.21

Release 1.21 includes internal code changes that support our first release of SambaNova Model Zoo, available in this public GitHub repository.

This first official release of SambaNova Model Zoo is currently in Beta. SambaNova customers can download a container image (Devbox) that includes the SambaFlow compiler, other SambaNova libraries, and all prerequisite software.

  • Existing SambaNova customers can contact their Customer Support representative to access the Devbox.

  • If you’re new to SambaNova and interested in trying out Model Zoo, contact us at help@sambanovasystems.com to get started!

See the Model Zoo Release Notes for details.

Release 1.20

Release 1.20 was an internal release. No user-visible changes were made in that release.

Release 1.19

New features

This release focuses primarily on performance improvements, along with some other features that are not yet visible to customers.

Cached compilation mode (experimental)

Our experimental cached compilation mode can speed up compilation times of large models. In this mode, the compiler maintains a cache of previously compiled sections of a model, so that subsequent compilations can use the cached sections instead of recompiling them.

  1. To enable cached compilation mode, set the SN_PEF_CACHE environment variable to the path of a folder.

  2. The compiler will then populate a cache at that location (and create the folder if it doesn’t exist). The content of the cache is an internal detail subject to change.

Cached compilation mode supports the development flow of a single user who makes frequent changes to a model. You cannot share the cache with other users. For the best use of the cache, make small changes (instead of extensive changes) followed by a compile. By limiting the scope of a change, you increase the likelihood that more sections can be pulled precompiled from the cache because they did not change.
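
For example, a single developer iterating on a model might point the cache at a folder in their home directory before compiling. The cache path and model script shown here are illustrative; use your own paths and model:

$ export SN_PEF_CACHE=$HOME/.sn_pef_cache
$ python mymodel.py compile --pef-name="mymodel"

Subsequent compiles of the same model can then reuse any unchanged sections from that folder instead of recompiling them.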

API updates

  • In argmax(), the default value of keepdim (bool) has changed from True to False. keepdim indicates whether Samba retains the reduced dimension (dim) in the output tensor; the dim is no longer retained by default. See the sketch after this list.

  • The groupby() operator was added in this release.
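
As a quick illustration of the keepdim semantics, here is a sketch using the equivalent PyTorch call (the Samba operator follows the same shape convention; exact call sites in your model may differ):

import torch

x = torch.randn(4, 10)   # e.g. 4 samples with 10 class scores each

# keepdim=False (the new default): the reduced dimension is dropped
idx = torch.argmax(x, dim=1)                      # shape: (4,)

# keepdim=True (the old default): the reduced dimension is kept with size 1
idx_kept = torch.argmax(x, dim=1, keepdim=True)   # shape: (4, 1)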

Release 1.18

In Release 1.18, most of SambaFlow was migrated from /usr/local to /opt/sambanova. Add /opt/sambanova/bin to your PATH.
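
For example, you might add the following to your shell startup file (the path shown is the new install location):

$ export PATH=/opt/sambanova/bin:$PATH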

New features

  • Tensor parallel support (Beta). Tensor parallel mode uses multiple RDUs for inference and training. Tensor parallel speeds up runtime performance and ensures that large models, which might exceed the memory limit of a single socket, still run. See How to use tensor parallel mode (Beta) for details.

  • Multigraph support. The new multigraph feature supports partitioning a model into individual graphs so you can run each graph separately. See Use multigraph to partition models.

  • UE Replay (Beta). Some updates to Uncorrectable Error replay (Beta) make the feature easier to use.

  • Mixed precision (Beta). Mixed precision combines the use of different numerical formats (such as FP32 and BF16) to reduce memory footprint and speed up large neural network workloads. See Mixed precision support for details and examples.

Compiler and performance improvements

  • New and renamed heuristics. This release includes improvements to heuristics for use with o1.

    • SAFE_GEMM, DEFAULT_GEMM (new), AGGRESSIVE_GEMM. Applicable only to patterns that are dominated by a single large matrix multiply (GEMM) operation.

    • MHA (renamed in 1.18). For use with a multi-headed attention block. Renamed from GPT3_MHA.

    • SDPA (new in 1.18). For use with PyTorch SDPA operations.

  • The compiler’s new deduplication feature can reduce compile time and improve model performance. The feature is currently limited to a single RDU. The feature is on by default. Contact Customer Support if you see a need to turn it off.

  • This release includes an improved algorithm for mapping compute graphs onto RDU resources. The enhanced algorithm, which is on by default:

    • Accelerates the optimization process, resulting in shorter compile times

    • Reduces on-RDU congestion when running models, providing performance improvements.

New operators

This release includes several new PyTorch operators. See Functional Operators.

Supported datatypes for each new operator are still being validated; more information will be made available at a later date. If you have specific questions about supported datatypes, contact SambaNova Support.

  • Arithmetic operators

    • abs()

    • mul()

    • relu()

    • rsqrt()

    • scale()

    • sigmoid()

    • silu()

  • Parallel patterns operators

    • sn_gather()

    • sn_imm()

    • sn_iteridx()

    • sn_reduce()

    • sn_scatter()

    • sn_select()

    • sn_zipmapreduce()

  • Tensor operators

    • ct_attention()

    • sn_identity()

    • to()

    • type_as()

  • Other operators

    • multi_head_attention()

    • layer_norm()

Documentation improvements

Release 1.17

New compiler features

  • Released o0 and o1 compiler optimization modes (previously in Beta). See Compiler optimization modes.

  • (Beta) Added support for operator fusion rule yaml files and heuristics for use in conjunction with the o1 compiler option.

    • SambaNova will make a limited set of fusion rule yaml files available that direct the compiler, resulting in a more highly optimized PEF for certain families of models (e.g. LLM). See Operator fusion rule yaml syntax.

    • Users can make changes to the yaml file to achieve more efficient compiler behavior.

  • (Beta) Added support for preset scheduling heuristics to improve fused operators' performance in o1 compiler mode. Users cannot edit the heuristics in this release. See Operator fusion heuristics.

Other new features and improvements

  • Introduced beta version of the uncorrectable error replay (UE replay) feature, which attempts to automatically recover and continue a training run if the run encounters a UE. See Uncorrectable Error replay (Beta).

  • For improved performance, changed ENABLE_LINEAR_GRAD_ACCUM_STOC to default to 1 instead of 0. As a result, stochastic rounding is turned on for mixed-precision general matrix multiply (GEMM) by default. If you want to return to the previous default, contact SambaNova Support.

  • Enhanced PyTorch operator support

    • silu: FP32 (experimental support)

    • gelu: FP32 (experimental support)

    • tanh: FP32 (experimental support)

    • For mul, full support for BF16 and FP32 had been omitted from the documentation by mistake. It has now been added.

Performance improvements

  • Enabled compile-time device-program control scheduling for Bloom 176B and GPT13B LLM models for NLP inference.

Supported versions

  • PyTorch: 1.10.2+cpu

  • Python: 3.8 (changed in release 1.17.3 and later)

Documentation improvements

Some documentation updates that are not release dependent became available in the SambaFlow 1.16 documentation after that version was released. Here is the complete list of release-dependent and release-agnostic documentation.

API Reference improvements

Changes and additions to the SambaFlow API reference:

  • Added documentation for samba.random

  • Added documentation for samba.from_torch_model

  • Added documentation for samba.utils.trace_graph

  • Added documentation for samba.optim

  • Fixes to some supported data types in Functional Operators

  • Small fixes for samba.session documentation

  • Fixed some broken links

Release 1.16 (2023-07-14)

New features and other improvements

  • Introduced new compiler modes -o0 and -o1 (Beta), which allow users to fine-tune compiler performance.

  • Change to compiler --help behavior. The --help option now returns a limited number of fully supported options. Calling compile with --help --debug returns a longer list of options, some of them experimental.

Performance improvements

  • Various optimizations in this release help improve model performance and reduce compile times especially for NLP models.

Documentation improvements

  • Updated API Reference includes documentation for supported PyTorch operators

    API Reference documentation always opens in a new tab (or window). To return to the main doc set, click the previous tab (or window).
  • New SambaNova messages and logs doc page explains which messages you can safely ignore, where to find which logging information, and which errors you might be able to resolve yourself.

  • New SambaFlow compiler overview doc page gives an overview of the compiler stack and discusses some compiler arguments, including the new o0, o1, etc. options.

  • New Compiler argument reference doc page is a reference to frequently used compiler arguments and includes a discussion of the new arguments.

  • New Use sntilestat for performance analysis doc page explains how to use the sntilestat tool for performance analysis and includes examples of visualizing sntilestat CSV output in a spreadsheet.

Obsolete components and APIs

The grad_of_outputs parameter in samba.session.run was deprecated in release 1.15 and has been removed. Use SambaTensor::sn_grad to set an output tensor’s gradients instead.

Release 1.15 (2023-03-30)

Deprecated components and APIs

The grad_of_outputs parameter in samba.session.run is deprecated and will be removed in release 1.16. Use SambaTensor::sn_grad to set an output tensor’s gradients instead.

Release 1.14 (2023-01-10)

Deprecated components and APIs

The following APIs have been renamed. The old names are deprecated.

  • Renamed samba.from_torch to samba.from_torch_tensor

  • Renamed samba.from_torch_ to samba.from_torch_model_

Release 1.13 (2022-11-03)

New features and other improvements

  • New features

    • Added option to sntilestat to skip idle tiles.

    • Enhanced multi-processing support for SambaNova Runtime APIs.

    • Enhanced host profiling information and detailed timeline view in SambaTune.

    • Enhanced snprof and added more robust fault reporting in snstat.

  • Performance improvements

    • Faster SambaFlow context creation.

    • More efficient CPU usage.

    • Better performance for scaleout operations.

  • Software

    • Updated PEF to version 2.5.0.

      Recompile all models with this release due to the PEF version change.
    • Version 2 of the SambaFlow compiler scheduler, specified with the option --mac-v2, is now the default. The --mac-v1 option is still supported but must be specified explicitly.

Deprecated components

  • venv: The venv shared generic package is deprecated and has been replaced by model-specific venv packages. The generic package will be removed from future releases.

  • UnoSecInf: The UnoSecInf inference performance test, which is based on section-by-section mapping, is deprecated starting in Release 1.13. Starting in Release 1.14, this performance test will no longer be available.

    The uno_full.py model is not deprecated.

Release 1.12.7 (2022-07-30)

New features

  • Added SambaTune: a tool that supports profiling application performance.

  • Improved Scale-out performance through parallel reduce.

  • Enhanced RDU reset support with VM.

Supported components and versions

Operating Systems
  • Red Hat Enterprise Linux 8.5

  • Ubuntu Linux 20.04 LTS

Software
  • Updated PEF to version 2.0.0. Models must be recompiled to be used with this release due to the PEF version change.

  • Version 2 of the SambaFlow compiler scheduler, specified with the option --mac-v2, is now the default. The --mac-v1 option will continue to be supported but must be specified explicitly.

Deprecated components
  • The global virtual environment under /opt/sambaflow/venv is deprecated and will be removed in version 1.13. It will be replaced by individual virtual environments for each model.

Hello SambaFlow! Compile and run a model

Welcome! In this tutorial, you learn how to compile and run a logreg.py example model. This Hello SambaFlow! example uses the classic machine learning problem of recognizing the hand-written digits in the MNIST dataset.

In this tutorial you:

  1. Ensure that your environment is ready to compile and run models.

  2. Compile the model to run on the RDU architecture. Compilation generates a PEF file.

  3. Do a training run of the model, passing in the generated PEF file.

We discuss the code for this model in Examine logreg model code.

Prepare your environment

To prepare your environment, you ensure that the SambaFlow package is installed.

Check your SambaFlow installation

You must have the sambaflow package installed to run this example and any of the tutorial examples.

  1. To check if the package is installed, run this command:

    • For Ubuntu Linux

      $ dpkg -s sambaflow
    • For Red Hat Enterprise Linux

      $ rpm -qi sambaflow
  2. Examine the output and verify that the SambaFlow version that you are running matches the documentation you are using.

  3. If you see a message that sambaflow is not installed, contact your system administrator.

Download the model code

Before you start, clone the SambaNova/tutorials GitHub repository, as instructed in the README.
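
For reference, cloning into a location of your choice might look like the following; follow the repository README for the exact steps for your release:

$ git clone https://github.com/sambanova/tutorials.git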

After a SambaFlow upgrade, you might have to do a git pull again if your model no longer works.

Create your own directory

SambaNova recommends that you create your own directory inside your home directory for the tutorial code:

  1. Log in to your SambaNova environment.

  2. Create a directory for the tutorials, and a subdirectory for logreg.

    $ mkdir $HOME/tutorials
    $ mkdir $HOME/tutorials/logreg

Compile and run your first model

To compile and run your first model, you check supported options, prepare data, and then run scripts to compile and run logreg.

Look at supported options

Each example and each model has its own set of supported options, so it’s important to list them explicitly.

To see all arguments for the logreg model, change to the directory you created earlier and look at the --help output:

$ cd $HOME/tutorials/logreg
$ python logreg.py --help

The output looks similar to the following, and shows that you can compile and run this model.

usage: logreg.py [-h] {compile,run,test,measure-performance} ...

positional arguments:
  {compile,run,test,measure-performance}
                        different modes of operation

optional arguments:
  -h, --help            show this help message and exit
The test and measure-performance options are primarily used internally or when working with SambaNova Support.

You can drill down and run each command with --help to see options at that level. For example, run the following command to see options for run:

$ python logreg.py run --help
In most cases, using the defaults for the optional arguments is best. In Useful arguments for logreg.py we list a few commonly used arguments.

Prepare data

This tutorial downloads train and test datasets from the internet, so there’s no separate step for preparing data.

If your system does not have access to the internet, download the data to a system that has access and make the files available. See Download model data (Optional).

Compile logreg

When you compile the model, the compiler generates a PEF file that is suitable for running on the RDU architecture. You later pass in that file when you do a training run.

  1. Start in the tutorials/logreg directory that you created in Create your own directory.

    $ cd $HOME/tutorials/logreg
  2. Run the compilation step, passing in the name of the PEF file to be generated. You will later pass in that file when you do a training run.

    $ python logreg.py compile --pef-name="logreg"
  3. The compiler runs the model and displays progress messages and warnings on screen.

    • You can safely ignore all info and warning messages.

    • If a message says warning samba, it might indicate a problem with your code.

    • For some background, see SambaNova messages and logs.

  4. When the command returns to the prompt, look for this output, shown toward the end:

    • Compilation succeeded for partition_X_X shows you that compilation succeeded.

    • Logs are generated in … shows where the log files are located.

  5. Verify that the PEF file was generated:

    $ ls -lh ./out/logreg/logreg.pef

    The generated PEF file contains all information that the system needs to do a training run of the model.

Start a logreg training run

When you do a training run, the application uploads the PEF file onto the chip and trains the model with the specified dataset. This example uses the MNIST dataset. The example code downloads the data set automatically.

If your system is disconnected from the Internet you have to manually download the dataset to a system with Internet access and copy the dataset to the system you are running the models on. See Download model data (Optional).
  1. Start a training run of the model with the PEF file that you generated. Use -e to specify the number of epochs (default is 1).

    $ python logreg.py run --num-epochs 2 --pef=out/logreg/logreg.pef

    Even one epoch would be enough to train this simple model, but we use --num-epochs to see if loss decreases in the second run. The run command:

    • Downloads the model data.

    • Returns output that includes the following:

      2023-01-25T15:14:06 : [INFO][LIB][1421606]: sn_create_session: PEF File: out/logreg/logreg.pef
      Log ID initialized to: [snuser1][python][1421606] at /var/log/sambaflow/runtime/sn.log
      Epoch [1/2], Step [10000/60000], Loss: 0.4634
      Epoch [1/2], Step [20000/60000], Loss: 0.4085
      Epoch [1/2], Step [30000/60000], Loss: 0.3860
      Epoch [1/2], Step [40000/60000], Loss: 0.3702
      Epoch [1/2], Step [50000/60000], Loss: 0.3633
      Epoch [1/2], Step [60000/60000], Loss: 0.3555
      Test Accuracy: 91.54  Loss: 0.3012
      Epoch [2/2], Step [10000/60000], Loss: 0.2861
      Epoch [2/2], Step [20000/60000], Loss: 0.3065
      Epoch [2/2], Step [30000/60000], Loss: 0.3080
      Epoch [2/2], Step [40000/60000], Loss: 0.3084
      Epoch [2/2], Step [50000/60000], Loss: 0.3076
      Epoch [2/2], Step [60000/60000], Loss: 0.3061
      Test Accuracy: 91.54  Loss: 0.3001

Congratulations! You have run your first model on the SambaNova system! The output shows that the training run succeeded: the loss is low and decreases over time, and test accuracy is above 91%.

Useful arguments for logreg.py

Each of the example model commands has several arguments. In most cases, the default gives good results.

Arguments for compile

For a list of compile arguments for use with logreg.py, run this command:

$ python $HOME/tutorials/logreg/logreg.py compile --help

The command returns a full list of arguments. Here are some useful arguments:

  • --pef-name — Name of the output file, which has the information for running the model on RDU.

  • --n-chips, --num-tiles — Number of chips you want to use (from 1 to 8) and the number of tiles on the chip (1, 2, or 4). Default is 1 chip (4 tiles).

  • --num-features — Number of input features (for this model the default is 784)

  • --num-classes — Number of output labels (for this model the default is 10)

Arguments for run

For a list of run arguments for use with logreg.py, run this command:

$ python $HOME/tutorials/logreg/logreg.py run --help

The command returns a full list of arguments. Here are some important arguments:

  • -p PEF The only required argument. A PEF file that was the output from a compile.

  • -b BATCH_SIZE, --batch-size BATCH_SIZE — How many samples to put in one batch.

  • -e, --num-epochs — How many epochs to run with the model.

  • --num-features, --num-classes — Input features and output classes for the model.

  • --lr — Learning rate parameter. Decimal fraction between 0 and 1.
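
For example, a training run that combines several of these arguments might look like the following; the values are illustrative, and the command assumes you run from the directory that contains the generated out/logreg/logreg.pef file:

$ python $HOME/tutorials/logreg/logreg.py run --pef=out/logreg/logreg.pef -e 2 -b 64 --lr 0.001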

Learn more!

Download model data (Optional)

Only users without internet access perform this task. By default, the application code downloads model data.

If you run the example on a system that is not connected to the internet, you have to download the model data from a connected system and copy the data to the system where you want to run the model.

  1. On a connected system run:

    $ mkdir -p /tmp/data/MNIST/raw
    $ cd /tmp/data/MNIST/raw
    $ wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
    $ wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
    $ wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
    $ wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
  2. Copy the four .gz files to the DataScale system and place them in the directory /tmp/data/MNIST/raw.

  3. When you later use the compile and run commands, add the --data-folder=/tmp/data argument.
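
    For example, assuming you compile and run from the tutorials/logreg directory as described above, the commands might look like this:

    $ python logreg.py compile --pef-name="logreg" --data-folder=/tmp/data
    $ python logreg.py run --pef=out/logreg/logreg.pef --data-folder=/tmp/data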

Architecture and workflows

The SambaFlow™ software stack runs on DataScale® hardware. You can run models on this software stack in several ways:

  • To get started, use the tutorials available from the SambaNova tutorials repo. You can also examine, compile, and run models included in /opt/sambaflow/apps on your DataScale host.

  • To progress, use one of the models available in the SambaNova modelzoo repo. You run Model Zoo models in a DevBox container that includes all prerequisite software. The model source code, which has been customized for RDU, is available in a public GitHub repo, which also includes example apps. You can compile a model, and can then run inference (text generation) and fine-tune the model with custom data. To fine-tune, you download a checkpoint for the same model (Hugging Face format), prepare your dataset, and compile and train the model.

In this doc page, you learn about the different components of the software stack, the compile/train and compile/generate cycles, and the command-line arguments.

SambaNova Stack

It’s useful to understand the different components of the SambaNova hardware and software stack and how they interact with each other. For example, SambaFlow developers might find it useful to investigate what’s going on in the SambaNova Runtime component.

SambaFlow in the software stack
  1. SambaNova Reconfigurable Dataflow Unit™ (RDU) is a processor that provides native dataflow processing. It has a tiled architecture that consists of a network of reconfigurable functional units. See the white paper SambaNova Accelerated Computing with a Reconfigurable Dataflow Architecture.

  2. SambaNova Systems DataScale is a complete rack-level computing system. Each DataScale system configuration consists of one or more DataScale nodes, integrated networking, and a management infrastructure in a standards-compliant data center rack.

  3. SambaNova Runtime. The SambaNova Runtime component loads code and data onto the RDUs and manages the return of result data. System administrators can perform configuration, fault management, troubleshooting, etc. See the SambaNova Runtime documentation for details.

  4. SambaFlow Python SDK. The SambaFlow Python SDK serves as our frontend for compiling and running models on SambaNova hardware.

  5. SambaFlow models. We offer models in several places.

    • Starter models are included on the SambaNova host at /opt/sambaflow/apps/ and are targeted towards the new user.

    • SambaFlow models in our sambanova/tutorials GitHub repo allow you to examine the Python code and then perform compilation, training, and inference runs. See SambaFlow tutorials.

    • Model Zoo, available from the public modelzoo repo, includes model source code that’s been customized to work well on RDU, and scripts for running the models. You run the models in a Devbox container. You can use Model Zoo in conjunction with checkpoints in Hugging Face format (and a corresponding config.json) to fine-tune the model with your data.

Workflows

When you develop for RDU hardware, you start with model code, optional checkpoints, and fine-tuning data. You end with a trained model that might, for example, respond to prompt data.

Data preparation workflow

It’s possible to start by training from scratch with large data sets, but it usually makes sense to work with an existing checkpoint and fine-tune that checkpoint with your own data. In both cases, your data needs to be in a format that SambaFlow can work with.

Data preparation is a much-discussed topic in AI. This doc page doesn’t go into data preparation details but focuses on the format of the data, not its content.
Figure 1. Data preparation workflow. Both custom data and training data are passed through the data prep tool.

SambaNova expects that you pass in your data as HDF5 files. Our public generative_data_prep repo includes scripts and documentation for converting plain text or jsonline format files to HDF5 format.

Training workflow

The goal of the training workflow is to generate a checkpoint that you can then use in a generation workflow.

  • First you compile for training and generate a PEF file. The PEF file defines the dataflow graph for the SambaNova hardware.

  • Then you run training with the data of your choice and pass in the PEF file.

With the small tutorial models, you can complete a training run. For the much larger Model Zoo models, we recommend that you download a checkpoint from Hugging Face and run training with your model source, the checkpoint, and your custom data instead.
Figure 2. Training workflow (includes compilation)

For the training workflow, you proceed as follows:

  1. Complete data preparation, discussed in Data preparation workflow

  2. Perform compilation by running the compilation command, passing in the model code and arguments.

    • If you’re using the container-based Model Zoo solution, you specify most arguments in a YAML config file and specify only a few on the command line.

    • If you’re using one of the tutorial examples, you specify all arguments on the command line. See Compiler argument reference.

      The output of the compilation step is a PEF file, which defines how the model is run on SambaNova hardware. You cannot edit a PEF file.

  3. Perform training by running the training command, passing in the PEF file, the configuration info for training, and the prepared data files.

    • If you’re using the container-based Model Zoo solution, you specify most arguments in a YAML config file and specify only a few on the command line. Because Model Zoo models are large models, you typically pass in a checkpoint that you download from Hugging Face. Training a model from scratch without a checkpoint takes a long time. See the examples README for a walkthrough example with Llama2.

    • If you’re using one of the tutorial examples, you specify all arguments on the command line. Run the training command with --help to see all arguments.

      As part of a training run, information about loss and accuracy is logged to stdout.

  4. Examine the loss and accuracy information, which is logged to stdout as part of a training run.

    • If you’re running a Model Zoo model, observe the output. If you see that loss is decreasing, your PEF is valid. You can then perform fine tuning with your own data and a Hugging Face checkpoint.

    • If you’re running a tutorial example, you can run training to completion. The output is a checkpoint file that you can use for fine tuning.

Fine-tuning workflow

The most common use case for SambaNova customers is to start with an existing model, generate a PEF file, fine-tune the model with custom data, and then use the model for generative inference. The workflow looks like this:

Figure 3. Fine-tuning workflow

During fine tuning, you create a version of the model that’s been trained with your organization’s data.

  1. Before you start fine tuning, ensure that you have the custom data in the correct format, as discussed in Data preparation workflow.

  2. Start a training run and pass in:

    • A PEF file that was the output of compilation.

    • A checkpoint. For a Model Zoo model, use a publicly available Hugging Face checkpoint. For a tutorial model, you can use a checkpoint generated by a training run.

    • The configuration parameters, either on the command line or, for Model Zoo models, in a configuration file.

  3. The output of the fine-tuning run is a model checkpoint that has been fine-tuned with your custom data. You can then pass in that checkpoint when you run inference (generation).

Inference (generation) run

The final step is running generative inference. In this step, you send prompt data to the trained model and the model responds to the prompt or performs summarization, identification, etc. The actual tasks your model can perform depend on the model itself.

Figure 4. Inference workflow
  1. In most cases, you first compile an inference PEF. Compilation for inference consists only of the forward pass.

  2. You run inference, passing in a checkpoint and prompt data or other input.

  3. You can experiment with inference parameters, such as temperature, topk, etc. Some parameters require a recompile, others do not.

  4. When you’re satisfied with the results, you can deploy the tested model.

See Run and verify inference for an example discussion.

Command-line arguments

SambaNova initially used a workflow where users passed all configuration arguments on the command line. With our more recent Model Zoo initiative, we are using a Pydantic and Hydra infrastructure. Most arguments are specified in a YAML configuration file. A few arguments, such as the name of the PEF file, are specified on the command line.

Command-line arguments for Model Zoo models

For Model Zoo models, we have greatly simplified argument management for the developer and argument usage for the person who uses the model. A combination of Pydantic and Hydra makes this possible.

  • Model developers are encouraged to become familiar with the Pydantic and Hydra infrastructure to reap the benefits of streamlined argument management.

  • Model users will notice that the scripts for running the model have a different syntax than before. For example, here’s how you might invoke compilation for text generation:

/opt/sambanova/bin/python $HOME/modelzoo/modelzoo/examples/text_generation/run_rdu.py \
command=compile \
model.cache_dir=/opt/sambanova/modelbox/checkpoints \
model.batch_size=1 \
+samba_compile.target_runtime_version=1.3.7 \
+samba_compile.output_folder=/opt/out \
+samba_compile.pef_name=llama2_7b_infer

Command-line arguments for other models

For tutorial models and for models at /opt/sambaflow, each model supports different command-line arguments. To see all arguments, run the model with the task (compile or run) and --help, for example, app.py compile --help.

Compilation
  • All models support the shared arguments that are documented in Arguments to compile.

  • You can generate a PEF for inference by specifying the --inference compile flag. By default, we compile for training, which includes a forward, backward, and optimization pass. Inference compilation is only a forward pass. See the example command after this list.

  • All models support a shared set of experimental arguments, usually used during debugging when working with SambaNova Support. To include these arguments in the help output, run app.py compile --debug --help.

  • Additionally, each model has a set of model-specific arguments that are defined in the app code.
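
For example, a compile for inference might look like the following sketch. Here app.py and the PEF name are placeholders for your model’s script and output name; run compile --help for your model to confirm the flags it supports.

$ python app.py compile --inference --pef-name="my_model_infer"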

Training
  • You call the model with run. You must provide a PEF file.

  • Similar to compile, all models support a set of shared run arguments and an additional set of model-specific arguments.

Inference
  • You call the model with run --inference. You must provide a PEF file.

  • Most arguments to run are supported when you run inference.

Transition to DataScale SN30

The DataScale SN30 system offers significantly improved performance over the DataScale SN10 system. Because the system is different, you likely have to recompile and retrain a model that was compiled on an SN10 system. This topic gives some guidance.

A PEF built on SN10 is not expected to run unmodified on an SN30.

General RDU difference information

Here are the RDU differences between SN10 and SN30.

  • RDUs per node: SN10 has 8 RDUs; SN30 has 8 RDUs.

  • Tiles per RDU: SN10 has 4 tiles per RDU; SN30 has 8 tiles per RDU.

  • Default compile: On SN10, a default compile yields a PEF that uses 1 RDU (4 tiles). On SN30, a default compile yields a PEF that uses 1 RDU (8 tiles).

  • Default run: On SN10, you run 1 copy of the model on 1 RDU by default. On SN30, you run 2 copies of the model by default, one on each "half" of the RDU (using tensor parallel execution).

Compiler impacts

RDU differences mean that the compiler optimizes the PEF file differently. Here’s what you need to know:

  • On both SN10 and SN30 you can explicitly specify the number of tiles with --num-tiles, for example, --num-tiles=4. See the example command after this list.

    • If you compile with --num-tiles=4 on an SN10 system, you can run 8 instances of data-parallel on a node.

    • If you compile with --num-tiles 4 on an SN30 system, you can run 16 instances of data-parallel on a node.

  • If you specify --num-chips=1 on SN10 or SN30 you get 4 tiles.

  • Because SN30 uses tensor parallel, both compile and run operations require that the batch size be an even number. The results are reduced using data parallel in the PEF. This is the default; it is equivalent to --tensor-parallel=batch.

  • It is not unusual to need to use different human decision files and different compiler configuration files when migrating your model from SN10 to SN30.
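
For example, a compile that explicitly requests 4 tiles might look like the following sketch. The script and PEF name are placeholders for your model; run compile --help for your model to confirm the flags it supports.

$ python app.py compile --num-tiles=4 --pef-name="my_model_4tile"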

PEF information

For information on resource requirements for your PEF, for example, how many tiles are required, use the /opt/sambaflow/slurm/python/slurmfeeder utility.

Examine logreg model code

SambaNova provides several tutorials. You learn how to compile and train a simple logreg model in Hello SambaFlow! Compile and run a model. This doc page examines the Python code and data you use to run logreg.

Our logreg model uses:

In this tutorial you learn what’s inside the Python code.

What you’ll learn

This tutorial explores these topics:

  • Typical imports

  • Components of main()

  • Model definition, including input arguments, compilation, and training.

Files

All tutorial code files are in our tutorials GitHub repo at https://github.com/sambanova/tutorials/tree/main. This doc page includes collapsible code snippets for each code component we discuss.

Data

The tutorial uses the classic MNIST dataset, which includes a training set of 60,000 examples, and a test set of 10,000 examples.

  • By default, the code downloads the dataset files as part of the training run.

  • In environments that don’t have access to the internet, you can explicitly download the dataset. See Download model data (Optional).

Code files

The code for this tutorial is in a single code file, logreg.py.

Imports

Our model imports several Python modules. Here’s the Python code, followed by an explanation of each import.

Imports
import argparse
import sys
from typing import Tuple

import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision

import sambaflow.samba.utils as utils
from sambaflow import samba
from sambaflow.samba.utils.argparser import (parse_app_args,
                                             parse_yaml_to_args)
from sambaflow.samba.utils.dataset.mnist import dataset_transform
from sambaflow.samba.utils.pef_utils import get_pefmeta
  • sambaflow.samba is the set of SambaFlow modules.

  • sambaflow.samba.utils contains all the utilities, such as tracing etc.

  • parse_app_args is our built-in argument parsing support for each supported execution mode (more details below).

  • dataset_transform is a utility function to transform the data.

  • get_pefmeta saves the model’s metadata in the resulting executable file (PEF file).

It all starts with main()

The workflows for SambaNova models are outlined in Workflows. The intermediate tutorial includes both training and inference.

The main() function includes the functions to perform compilation and training, and also does some preparation.

  • utils.set_seed(): Set a random seed for reproducibility while we’re in the development phase of our tutorial.

  • parse_app_args(): Collect the arguments coming from add_common_args() and add_run_args(). When users run the model, they can specify predefined arguments that are handled by the compiler (e.g. o0) and the SambaFlow framework, as well as application-specific arguments. See Define input arguments.

  • samba.randn(), samba.randint(): Create random input and output tensors for compilation. See the API Reference.

  • samba.from_torch_model_(): Set up the model to use the SambaFlow framework. The function, which also converts a PyTorch model to a Samba model, performs some initialization and related tasks. We pass in model, a class we create to represent the model. See Define the model.

  • samba.optim.SGD(): Define the optimizer we’ll use for training the model. The SambaFlow framework supports AdamW and SGD out of the box. You can also specify a different optimizer. See the API Reference.

  • compile(): If the user specified compile on the command line, call samba.session.compile(). See Compile the model.

  • run(): If the user specified run on the command line, perform training, testing, or inference, based on other arguments that are passed in. See Train the model.

Define the model

The model definition specifies the layers in the model and the number of features in each layer.

Here’s the Python code:

LogReg class
class LogReg(nn.Module):
    """ Define the model architecture

    Define the model architecture i.e. the layers in the model and the
    number of features in each layer.

    Args:
        nlin_layer (ivar): Linear layer
        criterion (ivar): Cross Entropy loss layer
    """

    def __init__(self, num_features: int, num_classes: int, bias: bool):
        """ Initialization function for this class

        Args:
            num_features (int):  Number of input features for the model
            num_classes (int): Number of output labels the model classifies inputs into
            bias (bool): Whether the linear layer learns an additive bias
        """
        super().__init__()
        self.num_features = num_features
        self.num_classes = num_classes

        # Linear layer for predicting target class of inputs
        self.lin_layer = nn.Linear(in_features=num_features, out_features=num_classes, bias=bias)

        # Cross Entropy layer for loss computation
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """ Forward pass of the model for the given inputs.

        The forward pass predicts the class labels for the inputs
        and computes the loss between the correct and predicted class labels.

        Args:
            inputs (torch.Tensor):  Input samples in the dataset
            targets (torch.Tensor): correct labels for the inputs

        Returns:
            Tuple[torch.Tensor, torch.Tensor]:The loss and predicted classes of the inputs
        """

        out = self.lin_layer(inputs)
        loss = self.criterion(out, targets)

        return loss, out

Two functions are defined for the class:

  • __init__(), the initialization function, uses the num_features, num_classes, and bias values that are specified for this model, and also specifies the linear layer and cross entropy layer.

  • forward(), which is used by train(), predicts class labels and computes the loss between the correct and predicted labels.

In main() we’ll then convert the model from a PyTorch model to a SambaFlow model by calling samba.from_torch_model_().

Define input arguments

The add_args function defines parameters for use with this model. These are all arguments that are typically used with an ML model.

add_args function
def add_args(parser: argparse.ArgumentParser) -> None:
    """ Add model-specific arguments.

    By default, the compiler and the SambaFlow framework support a set of arguments to compile() and run().
    The argument parser supports adding application-specific arguments.

    Args:
        parser (argparse.ArgumentParser): SambaNova argument parser.
    """

    parser.add_argument('--lr', type=float, default=0.0015, help="Learning rate for training")
    parser.add_argument('--momentum', type=float, default=0.0, help="Momentum value for training")
    parser.add_argument('--weight-decay', type=float, default=3e-4, help="Weight decay for training")
    parser.add_argument('--num-epochs', '-e', type=int, default=1)
    parser.add_argument('--num-steps', type=int, default=-1)
    parser.add_argument('--num-features', type=int, default=784)
    parser.add_argument('--num-classes', type=int, default=10)
    parser.add_argument('--yaml-config', default=None, type=str, help='YAML file used with launch_app.py')
    parser.add_argument('--data-dir',
                        '--data-folder',
                        type=str,
                        default='mnist_data',
                        help="The folder to download the MNIST dataset to.")
    parser.add_argument('--bias', action='store_true', help='Linear layer will learn an additive bias')

Users of the model can then specify these arguments on the command line to set model parameters.

  • --num-epochs or -e specifies the number of epochs to run the training loop.

  • --num-features specifies the embedding dimension of the input data.

  • --num-classes is the number of different classes in our classification problem. For the MNIST example, the number of different classes is ten for digits from 0 to 9.

  • --data-folder specifies the download location for the MNIST data.

Data preparation

Data preparation is pretty standard, and familiar to those who’ve worked with PyTorch datasets. The prepare_dataloader() function defines and then returns both the train and the test dataset.

prepare_dataloader() function
def prepare_dataloader(args: argparse.Namespace) -> Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]:

    # Get the train & test data (images and labels) from the MNIST dataset
    train_dataset = torchvision.datasets.MNIST(root=f'{args.data_dir}',
                                               train=True,
                                               transform=dataset_transform(vars(args)),
                                               download=True)
    test_dataset = torchvision.datasets.MNIST(root=f'{args.data_dir}',
                                              train=False,
                                              transform=dataset_transform(vars(args)))

    # Get the train & test data loaders (input pipeline)
    train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset, batch_size=args.batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(
        dataset=test_dataset, batch_size=args.batch_size, shuffle=False)
    return train_loader, test_loader

Compile the model

For model compilation, we use the samba.session.compile function, passing some arguments including the optimizer.

Calling samba.session.compile()
if args.command == "compile":
        #  Compile the model to generate a PEF (Plasticine Executable Format) binary
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='logreg_torch',
                              app_dir=utils.get_file_dir(__file__),
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))

Train the model

The train() function defines the training logic. It is similar to a typical PyTorch training loop.

  • The outer loop iterates over the number of epochs provided by the --num-epochs argument.

  • The inner loop iterates over the training data.

Let’s look at the annotated code first, and then explore some details.

train() function
def train(args: argparse.Namespace, model: nn.Module, output_tensors:
            Tuple[samba.SambaTensor]) -> None:

    # Get data loaders for training and test data
    train_loader, test_loader = prepare_dataloader(args)

    # Total training steps (iterations) per epoch
    total_step = len(train_loader)

    hyperparam_dict = { "lr": args.lr,
                        "momentum": args.momentum,
                        "weight_decay": args.weight_decay}

    # Train and test for specified number of epochs
    for epoch in range(args.num_epochs):
        avg_loss = 0

        # Train the model for all samples in the train data loader
        for i, (images, labels) in enumerate(train_loader):
            global_step = epoch * total_step + i
            if args.num_steps > 0 and global_step >= args.num_steps:
                print('Maximum num of steps reached. ')
                return None

            sn_images = samba.from_torch_tensor(images, name='image', batch_dim=0)
            sn_labels = samba.from_torch_tensor(labels, name='label', batch_dim=0)

            loss, outputs = samba.session.run(input_tensors=[sn_images, sn_labels],
                                              output_tensors=output_tensors,
                                              hyperparam_dict=hyperparam_dict,
                                              data_parallel=args.data_parallel,
                                              reduce_on_rdu=args.reduce_on_rdu)

            # Sync the loss and outputs with host memory
            loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
            avg_loss += loss.mean()

            # Print loss per 10,000th sample in every epoch
            if (i + 1) % 10000 == 0 and args.local_rank <= 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1,
                    args.num_epochs, i + 1, total_step, avg_loss / (i + 1)))

        # Check the accuracy of the trained model for all samples in the test data loader
        # Sync the model parameters with host memory
        samba.session.to_cpu(model)
        test_acc = 0.0
        with torch.no_grad():
            correct = 0
            total = 0
            total_loss = 0
            for images, labels in test_loader:
                loss, outputs = model(images, labels)
                loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
                total_loss += loss.mean()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum()

            test_acc = 100.0 * correct / total

            if args.local_rank <= 0:
                print(f'Test Accuracy: {test_acc:.2f} Loss: {total_loss.item() / len(test_loader):.4f}')


        # if args.acc_test:
           # assert args.num_epochs == 1, "Accuracy test only supported for 1 epoch"
           # assert test_acc > 91.0 and test_acc < 92.0, "Test accuracy not within specified bounds."

Here’s some detail on the code fragments.

  • The function from_torch_tensor creates SambaFlow tensors (SambaTensor) from PyTorch tensors. This function is similar to the torch.from_numpy function in PyTorch, which creates a PyTorch tensor from a NumPy array. (The function samba.to_torch creates a PyTorch tensor from a SambaTensor.)

  • When we run the model on the device, we call the samba.session.run function:

    loss, outputs = samba.session.run(input_tensors =
            [sn_images, sn_labels],
            output_tensors=output_tensors,
            hyperparam_dict=hyperparam_dict,
            data_parallel=args.data_parallel,
            reduce_on_rdu=args.reduce_on_rdu)
  • To collect data about loss and output and print those data, we convert back from SambaTensors to PyTorch tensors in loss, outputs = samba.to_torch(loss), samba.to_torch(outputs).

Main function

The main function runs in different modes depending on the command-line input. The two main execution modes are compile and run.

Here’s how compiling and running a SambaFlow model works:

  • You compile the model with the compile command. As part of compilation, our code generates random SambaTensors (ipt and tgt) and passes them to the compiler.

  • After compile has produced a PEF file, you can do a training run, passing in the PEF file name as a parameter.

Hello SambaFlow! Compile and run a model explains how to compile and run this model.

main() function
def main(argv):
    """
    :param argv: Command line arguments (`compile`, `test` or `run`)
    """
    args = parse_app_args(argv=argv,
                          common_parser_fn=add_args,
                          run_parser_fn=add_run_args)

    # when it is not distributed mode, local rank is -1.
    args.local_rank = dist.get_rank() if dist.is_initialized() else -1

    # Create random input and output data for testing
    ipt = samba.randn(args.batch_size,
                      args.num_features,
                      name='image',
                      batch_dim=0,
                      named_dims=('B', 'F')).bfloat16().float()
    tgt = samba.randint(args.num_classes, (args.batch_size, ),
                        name='label',
                        batch_dim=0,
                        named_dims=('B', ))

    ipt.host_memory = False
    tgt.host_memory = False

    # Instantiate the model
    model = LogReg(args.num_features, args.num_classes, args.bias)

    # Sync model parameters with RDU memory
    samba.from_torch_model_(model)

    # Annotate parameters if weight normalization is on
    if args.weight_norm:
        utils.weight_norm_(model.lin_layer)

    inputs = (ipt, tgt)

    # Instantiate an optimizer if the model will be trained
    if args.inference:
        optimizer = None
    else:
        # We use the SGD optimizer to update the weights of the model
        optimizer = samba.optim.SGD(model.parameters(),
                                    lr=args.lr,
                                    momentum=args.momentum,
                                    weight_decay=args.weight_decay)

    if args.command == "compile":
        #  Compile the model to generate a PEF (Plasticine Executable Format) binary
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='logreg_torch',
                              app_dir=utils.get_file_dir(__file__),
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))

    elif args.command in ["test", "run"]:
        # Trace the compiled graph to initialize the model weights and input/output tensors
        # for execution on the RDU.
        # The PEF required for tracing is the binary generated during compilation
        # Mapping refers to how the model layers are arranged in a pipeline for execution.
        # Valid options: 'spatial' or 'section'
        utils.trace_graph(model,
                          inputs,
                          optimizer,
                          pef=args.pef,
                          mapping=args.mapping)

        if args.command == "test":
            # Test the model's functional correctness. This tests if the result of execution
            # on the RDU is comparable to that on a CPU. CPU run results are used as reference.
            # Note that this test is different from testing model fit during training.
            # Given the same initial weights and inputs, this tests if the graph execution
            # on RDU generates outputs that are comparable to those generated on a CPU.
            outputs = model.output_tensors
            test(args, model, inputs, outputs)

        elif args.command == "run":

            # Train the model on RDU. This is where the model will be trained
            # i.e. weights will be learned to fit the input dataset
            train(args, model, model.output_tensors)


if __name__ == '__main__':
    main(sys.argv[1:])

For discussion of a main() function that’s very similar to the function above, see Tie the pieces together with main().

Learn more!

Convert a simple model to SambaFlow

Many SambaNova customers convert an existing model that they built in PyTorch to SambaFlow. This doc page uses a simple example to illustrate what is essential for the conversion and discusses some best practices. You’ll see that much of your code remains unchanged and that SambaFlow doesn’t usually require you to reformat your data.

This tutorial is about model conversion. For background on data preparation, see our public GitHub repository.

In this tutorial, you:

The example model

Convolutional Neural Networks (CNNs) are a popular model type in the Visual AI space. Our example model is a CNN that performs image classification on the MNIST dataset. It consists of four layers:

  • 2 Convolutional layers, each containing a:

    • Conv2D

    • ReLU

    • MaxPool2D

  • 2 Fully-connected linear layers

Included or external loss function

This conversion example presents two example solutions:

  • The solution in Examine functions and changes includes the model’s loss function as part of the model definition.

    • This approach results in performance enhancements because loss computation happens on RDU.

    • In the example, the loss function is included in the forward() function.

  • The solution in Examine model code with external loss function includes code for a loss function that is external to the model.

    • This solution uses a host CPU to compute the loss and gradients for backpropagation.

    • Use this approach if your model’s loss function isn’t currently supported by SambaFlow or if you are using a custom loss function.

Original and converted model code

This tutorial explains code modifications using a simple 2-layer Convolutional Neural Network example. We picked this example because it’s simple and compiles quickly.

  • You can download the original code from this repo: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/conv_net_py_torch.py.

  • The revised code is available below.

    Included loss function
    import sambaflow
    import sambaflow.samba as samba
    import sambaflow.samba.optim as optim
    import sambaflow.samba.utils as utils
    from sambaflow.samba.utils.common import common_app_driver
    from sambaflow.samba.utils.argparser import parse_app_args
    from sambaflow.samba.sambaloader import SambaLoader
    
    import sys
    import argparse
    from typing import Tuple
    
    import numpy as np
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    
    class ConvNet(nn.Module):
        """
        Instantiate a 4-layer CNN for MNIST Image Classification.
    
        In SambaFlow, it is possible to include a loss function as part of a model's definition and put it in
        the forward method to be computed.
    
        Typical SambaFlow usage example:
    
        model = ConvNet()
        samba.from_torch_model_(model)
        optimizer = ...
        inputs = ...
        if args.command == "run":
            utils.trace_graph(model, inputs, optimizer, pef=args.pef, mapping=args.mapping)
            train(args, model)
        """
    
        def __init__(self):
    
            super(ConvNet, self).__init__()
            self.layer1 = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )
            self.layer2 = nn.Sequential(
                nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )
            self.drop_out = nn.Dropout()
            self.fc1 = nn.Linear(7 * 7 * 64, 1000)
            self.fc2 = nn.Linear(1000, 10)
            self.criterion = nn.CrossEntropyLoss() # Add loss function to model
    
        def forward(self, x: torch.Tensor, labels: torch.Tensor):
            out = self.layer1(x)
            out = self.layer2(out)
            out = out.reshape(out.size(0), -1)
            out = self.drop_out(out)
            out = self.fc1(out)
            out = self.fc2(out)
            loss = self.criterion(out, labels)     # Compute loss
            return loss, out
    
    def add_user_args(parser: argparse.ArgumentParser) -> None:
        """
        Add user-defined arguments.
    
        Args:
            parser (argparse.ArgumentParser): SambaFlow argument parser
        """
    
        parser.add_argument(
            "-bs",
            type=int,
            default=100,
            metavar="N",
            help="input batch size for training (default: 100)",
        )
        parser.add_argument(
            "--num-epochs",
            type=int,
            default=6,
            metavar="N",
            help="number of epochs to train (default: 6)",
        )
        parser.add_argument(
            "--num-classes",
            type=int,
            default=10,
            metavar="N",
            help="number of classes in dataset (default: 10)",
        )
        parser.add_argument(
            "--learning-rate",
            type=float,
            default=0.001,
            metavar="LR",
            help="learning rate (default: 0.001)",
        )
        parser.add_argument(
            "--data-path",
            type=str,
            default="data",
            help="Download location for MNIST data",
        )
        parser.add_argument(
            "--model-path", type=str, default="model", help="Save location for model"
        )
    
    def get_inputs(args: argparse.Namespace) -> Tuple[samba.SambaTensor]:
        """
        Generates random SambaTensors in the same shape as MNIST image  and label tensors.
    
        In order to properly compile a PEF and trace the model graph, SambaFlow requires a SambaTensor that
        is the same shape as the input Torch Tensors, allowing the graph to be optimally mapped onto an RDU.
    
        Args:
            args (argparse.Namespace): User- and system-defined command line arguments
    
        Returns:
            A tuple of SambaTensors with random values in the same shape as MNIST image and label tensors.
        """
    
        dummy_image = (
            samba.randn(args.bs, 1, 28, 28, name="image", batch_dim=0),
            samba.randint(args.num_classes, (args.bs,), name="label", batch_dim=0),
        )
    
        return dummy_image
    
    def prepare_dataloader(args: argparse.Namespace) -> Tuple[sambaflow.samba.sambaloader.SambaLoader, sambaflow.samba.sambaloader.SambaLoader]:
        """
        Transforms MNIST input to tensors and creates training/test dataloaders.
    
        Downloads the MNIST dataset (if necessary); splits the data into training and test sets; transforms the
        data to tensors; then creates Torch DataLoaders over those sets.  Torch DataLoaders are wrapped in
        SambaLoaders.
    
        Args:
            args (argparse.Namespace): User- and system-defined command line arguments
    
        Returns:
            A tuple of SambaLoaders over the training and test sets.
        """
    
        # Transform the raw MNIST data into PyTorch Tensors, which will be converted to SambaTensors
        transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
            ]
        )
    
        # Get the train & test data (images and labels) from the MNIST dataset
        train_dataset = datasets.MNIST(
            root=args.data_path,
            train=True,
            transform=transform,
            download=True,
        )
        test_dataset = datasets.MNIST(root=args.data_path, train=False, transform=transform)
    
        # Set up the train & test data loaders (input pipeline)
        train_loader = DataLoader(
            dataset=train_dataset, batch_size=args.bs, shuffle=True
        )
        test_loader = DataLoader(
            dataset=test_dataset, batch_size=args.bs, shuffle=False
        )
    
        # Create SambaLoaders
        sn_train_loader = SambaLoader(train_loader, ["image", "label"])
        sn_test_loader = SambaLoader(test_loader, ["image", "label"])
    
        return sn_train_loader, sn_test_loader
    
    def train(args: argparse.Namespace, model: nn.Module) -> None:
        """
        Trains the model.
    
        Prepares and loads the data, then runs the training loop with the hyperparameters specified
        by the input arguments.  Calculates loss and accuracy over the course of training.
    
        Args:
            args (argparse.Namespace): User- and system-defined command line arguments
            model (nn.Module): ConvNet model
        """
    
        sn_train_loader, _ = prepare_dataloader(args)
        hyperparam_dict = {"lr": args.learning_rate}
    
        total_step = len(sn_train_loader)
        loss_list = []
        acc_list = []
    
        for epoch in range(args.num_epochs):
            for i, (images, labels) in enumerate(sn_train_loader):
    
                # Run the model on RDU: forward -> loss/gradients -> backward/optimizer
                loss, outputs = samba.session.run(
                    input_tensors=(images, labels),
                    output_tensors=model.output_tensors,
                    hyperparam_dict=hyperparam_dict
                )
    
                # Convert SambaTensors back to Torch Tensors to calculate accuracy
                loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
                loss_list.append(loss.tolist())
    
                # Track the accuracy
                total = labels.size(0)
                _, predicted = torch.max(outputs.data, 1)
                correct = (predicted == labels).sum().item()
                acc_list.append(correct / total)
    
                if (i + 1) % 100 == 0:
                    print(
                        "Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%".format(
                            epoch + 1,
                            args.num_epochs,
                            i + 1,
                            total_step,
                            torch.mean(loss),
                            (correct / total) * 100,
                        )
                    )
    
    def main(argv):
    
        args = parse_app_args(argv=argv, common_parser_fn=add_user_args)
    
        # Create the CNN model
        model = ConvNet()
    
        # Convert model to SambaFlow (SambaTensors)
        samba.from_torch_model_(model)
    
        # Create optimizer
        # Note that SambaFlow currently supports AdamW, not Adam, as an optimizer
        optimizer = samba.optim.AdamW(model.parameters(), lr=args.learning_rate)
    
        # Normally, we'd define a loss function here, but with SambaFlow, it can be defined
        # as part of the model, which we have done in this case
    
        # Create dummy SambaTensor for graph tracing
        inputs = get_inputs(args)
    
        # The common_app_driver() handles model compilation and various other tasks, e.g.,
        # measure-performance.  Running, or training, a model must be explicitly carried out
        if args.command == "run":
            utils.trace_graph(model, inputs, optimizer, pef=args.pef, mapping=args.mapping)
            train(args, model)
        else:
            common_app_driver(args=args,
                            model=model,
                            inputs=inputs,
                            optim=optimizer,
                            name=model.__class__.__name__,
                            init_output_grads=not args.inference,
                            app_dir=utils.get_file_dir(__file__))
    
    if __name__ == '__main__':
        main(sys.argv[1:])
    Custom loss function
    import sambaflow
    import sambaflow.samba as samba
    import sambaflow.samba.optim as optim
    import sambaflow.samba.utils as utils
    from sambaflow.samba.utils.common import common_app_driver
    from sambaflow.samba.utils.argparser import parse_app_args
    from sambaflow.samba.sambaloader import SambaLoader
    
    import sys
    import argparse
    from typing import (Tuple, Callable)
    
    import numpy as np
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    
    class ConvNetCustomLoss(nn.Module):
        """
        Instantiate a 4-layer CNN for MNIST Image Classification.
    
        In SambaFlow, while it is possible to include a loss function in the model definition, it
        is not done here as an example of how to compute loss on the host.
    
        Typical SambaFlow usage example:
    
    model = ConvNetCustomLoss()
    samba.from_torch_model_(model)
        optimizer = ...
        inputs = ...
        if args.command == "run":
            utils.trace_graph(model, inputs, optimizer, pef=args.pef, mapping=args.mapping)
            train(args, model)
        """
    
        def __init__(self):
    
            super(ConvNetCustomLoss, self).__init__()
            self.layer1 = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )
            self.layer2 = nn.Sequential(
                nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )
            self.drop_out = nn.Dropout()
            self.fc1 = nn.Linear(7 * 7 * 64, 1000)
            self.fc2 = nn.Linear(1000, 10)
    
        def forward(self, x: torch.Tensor):
            # Since loss isn't part of the model, we don't pass a label to forward()
            out = self.layer1(x)
            out = self.layer2(out)
            out = out.reshape(out.size(0), -1)
            out = self.drop_out(out)
            out = self.fc1(out)
            out = self.fc2(out)
            return out
    
    def add_user_args(parser: argparse.ArgumentParser) -> None:
        """
        Add user-defined arguments.
    
        Args:
            parser (argparse.ArgumentParser): SambaFlow argument parser
        """
    
        parser.add_argument(
            "-bs",
            type=int,
            default=100,
            metavar="N",
            help="input batch size for training (default: 100)",
        )
        parser.add_argument(
            "--num-epochs",
            type=int,
            default=6,
            metavar="N",
            help="number of epochs to train (default: 6)",
        )
        parser.add_argument(
            "--num-classes",
            type=int,
            default=10,
            metavar="N",
            help="number of classes in dataset (default: 10)",
        )
        parser.add_argument(
            "--learning-rate",
            type=float,
            default=0.001,
            metavar="LR",
            help="learning rate (default: 0.001)",
        )
        parser.add_argument(
            "--data-path",
            type=str,
            default="data",
            help="Download location for MNIST data",
        )
        parser.add_argument(
            "--model-path", type=str, default="model", help="Save location for model"
        )
    
    def get_inputs(args: argparse.Namespace) -> Tuple[samba.SambaTensor]:
        """
        Generates random SambaTensors in the same shape as MNIST image tensors.
    
        In order to properly compile a PEF and trace the model graph, SambaFlow requires a SambaTensor that
        is the same shape as the input Torch Tensors, allowing the graph to be optimally mapped onto an RDU.
    
        Args:
            args (argparse.Namespace): User- and system-defined command line arguments
    
        Returns:
            A SambaTensor with random values in the same shape as MNIST image tensors.
        """
    
        # Loss is computed on the host, so a dummy SambaTensor is only needed for the MNIST images
        return samba.randn(args.bs, 1, 28, 28, name="image", batch_dim=0),
    
    def prepare_dataloader(args: argparse.Namespace) -> Tuple[sambaflow.samba.sambaloader.SambaLoader, ...]:
        """
        Transforms MNIST input to tensors and creates training/test dataloaders.
    
        Downloads the MNIST dataset (if necessary); splits the data into training and test sets; transforms the
        data to tensors; then creates Torch DataLoaders over those sets.  Torch DataLoaders are wrapped in
        SambaLoaders.
    
        Input:
            args: User- and system-defined command line arguments
    
        Returns:
            A tuple of SambaLoaders over the training and test sets.
        """
    
        # Transform the raw MNIST data into PyTorch Tensors, which will be converted to SambaTensors
        transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
            ]
        )
    
        # Get the train & test data (images and labels) from the MNIST dataset
        train_dataset = datasets.MNIST(
            root=args.data_path,
            train=True,
            transform=transform,
            download=True,
        )
        test_dataset = datasets.MNIST(root=args.data_path, train=False, transform=transform)
    
        # Set up the train & test data loaders (input pipeline)
        train_loader = DataLoader(
            dataset=train_dataset, batch_size=args.bs, shuffle=True
        )
        test_loader = DataLoader(
            dataset=test_dataset, batch_size=args.bs, shuffle=False
        )
    
        # Create SambaLoaders
        # function_hook allows us to specify which tensor(s) should be passed along to the model
        #  -> The hook must return a list containing the same number of tensors as specified in the list of names
        #  -> Any other tensors will be filtered out, so if you need those, then...
        # return_original_batch allows us to retain the original input tensors for later processing, e.g., computing loss
        #  -> It causes the SambaLoader to also return a list of the original input tensors
        sn_train_loader = SambaLoader(dataloader=train_loader, names=["image"], function_hook=lambda t: [t[0]], return_original_batch=True)
        sn_test_loader = SambaLoader(dataloader=test_loader, names=["image"], function_hook=lambda t: [t[0]], return_original_batch=True)
    
        return sn_train_loader, sn_test_loader
    
    def train(args: argparse.Namespace, model: nn.Module, criterion: Callable) -> None:
        """
        Trains the model.
    
        Prepares and loads the data, then runs the training loop with the hyperparameters specified
        by the input arguments with a given loss function.  Calculates loss and accuracy over the course of training.
    
        Args:
            args (argparse.Namespace): User- and system-defined command line arguments
            model (nn.Module): ConvNet model
            criterion (Callable): Loss function
        """
    
        sn_train_loader, sn_test_loader = prepare_dataloader(args)
        hyperparam_dict = {"lr": args.learning_rate}
    
        total_step = len(sn_train_loader)
        loss_list = []
        acc_list = []
    
        for epoch in range(args.num_epochs):
            for i, (images, original_batch) in enumerate(sn_train_loader):
    
                # The label tensor is the second element of the original batch
                labels = original_batch[1]
    
                # Run only the forward pass on RDU and note the section_types argument
                # The first element of the returned tuple contains the raw outputs of forward()
                outputs = samba.session.run(
                    input_tensors=(images,),
                    output_tensors=model.output_tensors,
                    hyperparam_dict=hyperparam_dict,
                    section_types=["FWD"]
                )[0]
    
                # Convert SambaTensors back to Torch Tensors to carry out loss calculation
                # on the host CPU.  Be sure to set the requires_grad attribute for PyTorch.
                outputs = samba.to_torch(outputs)
                outputs.requires_grad = True
    
                # Compute loss on host CPU and store it for later tracking
                loss = criterion(outputs, labels)
    
                # Compute gradients on CPU
                loss.backward()
                loss_list.append(loss.tolist())
    
                # Run the backward pass and optimizer step on RDU and note the grad_of_outputs
                # and section_types arguments
                samba.session.run(
                    input_tensors=(images,),
                    output_tensors=model.output_tensors,
                    hyperparam_dict=hyperparam_dict,
                    grad_of_outputs=[samba.from_torch_tensor(outputs.grad)], # Bring the grads back from CPU to RDU
                    section_types=["BCKWD", "OPT"])
    
                # Compute and track the accuracy
                total = labels.size(0)
                _, predicted = torch.max(outputs.data, 1)
                correct = (predicted == labels).sum().item()
                acc_list.append(correct / total)
    
                if (i + 1) % 100 == 0:
                    print(
                        "Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%".format(
                            epoch + 1,
                            args.num_epochs,
                            i + 1,
                            total_step,
                            torch.mean(loss),
                            (correct / total) * 100,
                        )
                    )
    
    def main(argv):
    
        args = parse_app_args(argv=argv, common_parser_fn=add_user_args)
    
        # Create the CNN model
        model = ConvNetCustomLoss()
    
        # Convert model to SambaFlow (SambaTensors)
        samba.from_torch_model_(model)
    
        # Create optimizer
        # Note that SambaFlow currently supports AdamW, not Adam, as an optimizer
        optimizer = samba.optim.AdamW(model.parameters(), lr=args.learning_rate)
    
        ###################################################################
        # Define loss function here to be used in the forward pass on CPU #
        ###################################################################
        criterion = nn.CrossEntropyLoss()
    
        # Create dummy SambaTensor for graph tracing
        inputs = get_inputs(args)
    
        # The common_app_driver() handles model compilation and various other tasks, e.g.,
        # measure-performance.  Running, or training, a model must be explicitly carried out
        if args.command == "run":
            utils.trace_graph(model, inputs, optimizer, init_output_grads=not args.inference, pef=args.pef, mapping=args.mapping)
            train(args, model, criterion)
        else:
            common_app_driver(args=args,
                            model=model,
                            inputs=inputs,
                            optim=optimizer,
                            name=model.__class__.__name__,
                            init_output_grads=not args.inference,
                            app_dir=utils.get_file_dir(__file__))
    
    if __name__ == '__main__':
        main(sys.argv[1:])

Planning questions

To make the conversion process more straightforward, consider these planning questions.

  1. Where are my data loaders?

    All models need data and one of the easiest ways to feed in that data is with a PyTorch DataLoader. The output tensors that come from the DataLoader need to be converted into SambaTensors. See Prepare data loader.

  2. What shape are my input tensors?

    When you compile a SambaFlow model, the compute graph of your model is physically mapped onto an RDU. To perform this mapping, SambaFlow needs to know the shape of the input tensors. See Generate tensors.

  3. Where is my model defined?

    A useful feature of SambaFlow is that a loss function can be included in the definition and forward section of a model. The loss computation can then be mapped directly onto an RDU, greatly enhancing performance. See Define the model.

  4. Where is my model instantiated?

    The model must be explicitly converted to SambaFlow. Fortunately, only a single SambaFlow method needs to be used to do that. See Tie it all together with main().

  5. Where is my loss function defined and what is it?

    A loss function can be part of a model’s definition. If your model uses a PyTorch loss function that SambaFlow supports, you can move the function into the model, as in Define the model. If your model doesn’t use a supported loss function, you can compute the loss externally on the host CPU. See Examine model code with external loss function.

  6. Where is my optimizer defined and what is it?

    Unlike loss functions, optimizers can’t be added directly to a model’s definition in SambaFlow. Instead, the optimizer is passed into SambaFlow during compilation and training. See Tie it all together with main().

Compile and run the model

To compile a model, you always use the following syntax:

$ python <model>.py compile --pef-name <pef_name>

Assuming you’ve saved the example code as cnn_conversion.py, run the following command.

$ python cnn_conversion.py compile --pef-name cnn_conversion.pef

To run the model, you pass in the PEF file that was generated during compilation. The syntax is:

$ python <model>.py run --pef <pef_name>

For this example, run the following command:

$ python cnn_conversion.py run --pef cnn_conversion.pef

Model conversion tips and tricks

This section offers some tips and tricks for model conversion.

  • Torch DataLoaders. If the length of the last batch is not exactly equal to your PEF batch size, for example, if the last batch has 28 samples and your PEF batch size is 32, the application fails with a PEF mismatch error. Set the DataLoader parameter drop_last=True to avoid that problem, as shown in the sketch after this list.

  • Data Visualization. SambaNova recommends that you don’t do data visualization directly on a SambaNova system.
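
For reference, here is a minimal sketch in plain PyTorch of the drop_last setting mentioned above. The batch size and data path are placeholders; match the batch size to the -b value the PEF was compiled with.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder values for illustration; use the batch size the PEF was compiled with
batch_size = 32
data_path = "data"

train_dataset = datasets.MNIST(
    root=data_path,
    train=True,
    transform=transforms.ToTensor(),
    download=True,
)

# drop_last=True discards the final partial batch (for example, 28 samples when
# the batch size is 32) so that every batch matches the compiled PEF batch size
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
)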

Learn more!

Run language example applications

In this tutorial, you learn how to run an example language application on a SambaNova system and how to use application parameters.

BERT model overview

BERT (Bidirectional Encoder Representations from Transformers) is a machine learning model based on Transformers that was developed by Google in 2018.

The original BERT implementation has two models:

  • BERT Base: 12 encoders, 12 bidirectional self-attention heads, 110 million parameters, 768 dimension size

  • BERT Large: 24 encoders, 16 bidirectional self-attention heads, 340 million parameters, 1024 dimension size

For more information about BERT, including an illustration, see the original paper.

We use the Transformers library from Hugging Face to run BERT models on a DataScale system.

Our scripts include modifications of the original scripts that ensure that the model can run on SambaNova RDU chips.

The commands below are used to run BERT models on a DataScale system.

Prepare your environment

To prepare your environment, you:

  • Check your SambaFlow installation.

  • Make a copy of the tutorial files.

  • Download the data files from the internet.

Check your SambaFlow installation

You must have the sambaflow package installed to run this example and any of the tutorial examples.

  1. To check if the package is installed, run this command:

    • For Ubuntu Linux

      $ dpkg -s sambaflow
    • For Red Hat Enterprise Linux

      $ rpm -qi sambaflow
  2. Examine the output and verify that the SambaFlow version that you are running matches the documentation you are using.

  3. If you see a message that sambaflow is not installed, contact your system administrator.

Create a copy of SambaFlow tutorials

If you haven’t done so already, create your own copy of all tutorial applications so you can experiment:

  1. Copy the content of /opt/sambaflow/apps to a directory inside your home directory. For example:

    $ mkdir $HOME/sambaflow-apps
    $ cp -r /opt/sambaflow/apps/* $HOME/sambaflow-apps
If you copied the contents of /opt/sambaflow/apps in an earlier release, make a new copy.

Prepare the dataset

Before you compile and run the model, you have to download the dataset. Follow these steps.

  1. Create a subdirectory for the BERT datasets in your home directory. In this example, we use $HOME/datasets/bert.

    $ mkdir -p $HOME/datasets/bert
  2. Set the DATADIR environment variable to point to this location.

    $ export DATADIR=$HOME/datasets/bert
  3. Download the datasets, which are part of SQuAD (Stanford Question Answering Dataset). SQuAD is a popular dataset used for training and evaluating question answering models, particularly those that leverage pre-trained language models like BERT.

    $ wget -P $DATADIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
    $ wget -P $DATADIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Compile for training

The following set of commands performs some setup and then compiles the model for training. The arguments to compile fine-tune the output. Many of the arguments are specific to transformers_hook.py.

The output of compilation is a PEF file, transformers_hook.pef, which we pass in when we start a training run in the next step.

$ export OUTDIR=$HOME/app-test
$ export DATADIR=$HOME/data/bert
$ mkdir -p $DATADIR
$ python $HOME/sambaflow-apps/nlp/transformers_on_rdu/transformers_hook.py compile \
  --tokenizer_name bert-large-uncased \
  --model_name_or_path bert-large-uncased \
  --do_eval \
  --do_lower_case \
  --data_dir $DATADIR \
  --max_seq_length 384 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  -b 32 \
  --output_dir=${OUTDIR}/hf_output_squad_compile \
  --overwrite_output_dir \
  --seed 1206287 \
  --task_name squad \
  --module_name squad \
  --mac-human-decision $HOME/sambaflow-apps/nlp/transformers_on_rdu/human_decisions/compiler_configs/faster_compile.json \
  --mac-v2 \
  --cache_dir ${OUTDIR}/squad_cache \
  --pef transformers_hook \
  --output-folder=${OUTDIR}

Initiate a training run

To initiate a training run, we have to call run and pass in parameters including the PEF file, transformers_hook.pef.

$ export OUTDIR=$HOME/app-test
$ export DATADIR=$HOME/data/bert
$ python $HOME/sambaflow-apps/nlp/transformers_on_rdu/transformers_hook.py run \
  --model_name_or_path bert-large-uncased \
  --tokenizer_name bert-large-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $DATADIR \
  -p ${OUTDIR}/transformers_hook/transformers_hook.pef \
  --max_seq_length 384 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  -b 32 \
  --output_dir=${OUTDIR}/hf_output_squad_run  \
  --overwrite_output_dir \
  --seed 1206287 \
  --task_name squad \
  --module_name squad \
  --learning_rate 3e-05  \
  --eval_steps 1000 \
  --num_train_epochs 0.2 \
  --cache_dir ${OUTDIR}/squad_cache

Compile for inference

Compiling for inference is separate from compiling for a training run, and we name the PEF file transformers_hook_inf.

$ export OUTDIR=$HOME/app-test
$ export DATADIR=$HOME/data/bert
$ python $HOME/sambaflow-apps/nlp/transformers_on_rdu/transformers_hook.py compile \
  --inference \
  --model_name_or_path bert-large-uncased \
  --tokenizer_name bert-large-uncased \
  --do_eval \
  --do_lower_case \
  --data_dir $DATADIR \
  --max_seq_length 384 \
  --per_device_eval_batch_size 32 \
  -b 32 \
  --output_dir=${OUTDIR}/hf_output_squad_inference_compile \
  --overwrite_output_dir \
  --seed 1206287 \
  --task_name squad \
  --module_name squad \
  --mac-human-decision $HOME/sambaflow-apps/nlp/transformers_on_rdu/human_decisions/compiler_configs/faster_compile.json \
  --mac-v2 \
  --cache_dir ${OUTDIR}/squad_cache \
  --pef transformers_hook_inf \
  --output-folder=${OUTDIR}

Run for inference

To run for inference, we specify --inference as an argument to run and specify the PEF file that we compiled for inference (transformers_hook_inf).

Note that after the training run, the model checkpoint becomes available in the checkpoint-500 directory, so we specify that directory when we run inference.

$ export OUTDIR=$HOME/app-test
$ export DATADIR=$HOME/data/bert
$ python $HOME/sambaflow-apps/nlp/transformers_on_rdu/transformers_hook.py run \
  --inference \
  --model_name_or_path ${OUTDIR}/hf_output_squad_run/checkpoint-500 \
  --do_eval \
  --do_lower_case \
  --data_dir $DATADIR \
  -p ${OUTDIR}/transformers_hook_inf/transformers_hook_inf.pef \
  --max_seq_length 384 \
  --per_device_eval_batch_size 32 \
  -b 32 \
  --output_dir=${OUTDIR}/hf_output_squad_inference \
  --overwrite_output_dir \
  --seed 1206287 \
  --task_name squad \
  --module_name squad \
  --learning_rate 3e-05  \
  --eval_steps 6000 \
  --tokenizer_name bert-large-uncased \
  --per_device_train_batch_size 32 \
  --cache_dir ${OUTDIR}/squad_cache

Using LayerNorm instead of BatchNorm

The SambaNova hardware architecture takes full advantage of pipeline parallelism, and pipelining on the batch dimension of a tensor often produces the best performance. BatchNorm is a popular way to improve training, but using it with this kind of parallelization requires resynchronization after each normalization operation: samples must be batched before they can be processed at each stage and passed to the next.

To avoid this need for synchronization, SambaNova recommends using LayerNorm in most settings.

The structure of your model and data affects whether BatchNorm or LayerNorm gives you the best results. Both approaches are supported.

You can read more about various normalization methods at https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8.

Customize the model code

This tutorial uses a simple CNN model from a popular PyTorch tutorial. The model solves the classic machine learning problem of recognizing hand-written digits from the MNIST dataset. It’s a simple 2-layer CNN that uses Conv2d, BatchNorm2d, and MaxPool2d layers, followed by a fully connected output layer.

Original code (BatchNorm)

Here is the original model code:

# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

Revised code (LayerNorm)

As an alternative, you can revise the code to use LayerNorm. The following code fragment replaces BatchNorm2d with LayerNorm and includes a few other changes. Here’s the revised code with comments below.

class ConvNet(nn.Module):
    def __init__(self, num_classes=10, input_shape=[28, 28]): (1)
        super(ConvNet, self).__init__()

        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.LayerNorm([16] + input_shape),   (2)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))

        input_shape = [input_shape[0] // 2, input_shape[1] // 2]  (3)
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.LayerNorm([32] + input_shape),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))

        self.fc = nn.Linear(7*7*32, num_classes)
        self.criterion = nn.CrossEntropyLoss()  (4)

    def forward(self, x, labels):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        loss = self.criterion(out, labels)
        return loss, out  (5)
1 LayerNorm requires the normalized shape (as specified in the PyTorch documentation). We derive it from the input shape, so we pass input_shape as a constructor argument.
2 Prepend the number of features that BatchNorm2d() used (the channel count) to input_shape and pass the resulting list to LayerNorm().

The normalized shape for LayerNorm is [C, H, W], so you need to provide the H and W dimensions in addition to the channel count.

3 For the second layer, reduce the input_shape. Because the previous MaxPool2d call uses stride=2, divide the input_shape by 2. If you use a different stride in your model, divide input_shape by that value.
4 We recommend adding the loss function to the model’s class so that it is calculated on the RDU. Some loss functions might not be supported yet in SambaFlow; in that case, you can calculate the loss on the host.
5 The forward() function returns both the loss and the output from inside the model.
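
To make the normalized-shape argument concrete, here is a small stand-alone PyTorch sketch (the batch size of 4 is just illustrative) that shows how the [C, H, W] shape passed to LayerNorm tracks the Conv2d output channels and is halved after each stride-2 MaxPool2d:

import torch
import torch.nn as nn

# Illustrative MNIST-sized input: batch of 4, 1 channel, 28x28 pixels
x = torch.randn(4, 1, 28, 28)
input_shape = [28, 28]

# Layer 1: Conv2d with padding=2 keeps the 28x28 spatial size and outputs 16
# channels, so LayerNorm normalizes over [16, 28, 28]
conv1 = nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2)
norm1 = nn.LayerNorm([16] + input_shape)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
out = pool(torch.relu(norm1(conv1(x))))
print(out.shape)  # torch.Size([4, 16, 14, 14])

# Layer 2: the stride-2 MaxPool2d halved the spatial size, so divide input_shape by 2
input_shape = [input_shape[0] // 2, input_shape[1] // 2]  # [14, 14]
conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2)
norm2 = nn.LayerNorm([32] + input_shape)
out = pool(torch.relu(norm2(conv2(out))))
print(out.shape)  # torch.Size([4, 32, 7, 7])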

Compile the Model

Before you can run the model on an RDU, you have to compile it with the samba.session.compile() function. You pass your model to that function and it produces a PEF file, a binary file that is later loaded onto the RDU.

Here’s how you use the compile() function.

    args = parse_app_args(dev_mode=True,
                          common_parser_fn=add_common_args)
    utils.set_seed(256)
    model = ConvNet(args.num_classes)
    samba.from_torch_(model)

    inputs = get_inputs(args)

    optimizer = samba.optim.AdamW(model.parameters(), lr=args.lr) if not args.inference else None
    if args.command == "compile":
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='cnn_mnist',
                              app_dir=utils.get_file_dir(__file__),
                              squeeze_bs_dim=True,
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))

Prepare the datasets

Data preparation is different in the original tutorial and in our revision for LayerNorm.

Original data preparation (BatchNorm)

This is how dataset preparation is done in the original tutorial:

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

Revised data preparation (LayerNorm)

To use the loaders with LayerNorm in SambaFlow you have to add one parameter to the DataLoader calls: drop_last=True.

The parameter ensures that every batch has the same batch size. Read more here.

Here is the code with the additional parameter:

    # MNIST dataset
    train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                            train=True,
                                            transform=transforms.ToTensor(),
                                            download=True)

    test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                            train=False,
                                            transform=transforms.ToTensor())

    # Data loader
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=args.batch_size,
                                               shuffle=True,
                                               num_workers=7,
                                               drop_last=True)

    test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                              batch_size=args.batch_size,
                                              shuffle=False,
                                              num_workers=7,
                                              drop_last=True)

Train the model

To train the model on an RDU, you have to convert the input tensors from native PyTorch Tensors to SambaTensors.

Original training code (BatchNorm)

Here’s the original code used to train the model:

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

Revised training code (LayerNorm)

And here is the revised code, which converts the inputs to SambaTensors and converts the outputs back to PyTorch Tensors (discussed below).

    # Train the model
    total_step = len(train_loader)
    hyperparam_dict = {"lr": args.lr}
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = samba.from_torch(images, name='image', batch_dim=0)  (1)
            labels = samba.from_torch(labels, name='label', batch_dim=0)

            loss, outputs = samba.session.run(input_tensors=[images, labels],
                                              output_tensors=model.output_tensors,
                                              hyperparam_dict=hyperparam_dict,
                                              data_parallel=args.data_parallel,
                                              reduce_on_rdu=args.reduce_on_rdu)
            loss, outputs = samba.to_torch(loss), samba.to_torch(outputs) (2)

            if (i+1) % 100 == 0:
                print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                    .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
1 Convert images and labels to SambaTensors from PyTorch Tensors.
2 Convert the output and loss SambaTensors back to PyTorch Tensors.

When using an RDU, all three steps are computed in a single samba.session.run() call: forward, backward, and optimizer.

Test the model

The code for testing the model on the RDU is similar to the training code.

Because we only need to run the forward section, we can pass that to the runtime as a parameter. Here’s the code:

    # Test the model
    model.eval()  # eval mode (batchnorm uses moving mean/variance instead of mini-batch mean/variance)
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = samba.from_torch(images, name='image', batch_dim=0)
            labels = samba.from_torch(labels, name='label', batch_dim=0)

            loss, outputs = samba.session.run(input_tensors=[images, labels],
                                              section_types = ["fwd"],    (1)
                                              output_tensors=model.output_tensors,
                                              data_parallel=args.data_parallel,
                                              reduce_on_rdu=args.reduce_on_rdu)
            outputs = samba.to_torch(outputs)
            labels = samba.to_torch(labels)

            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

Main function

The main function can use command-line arguments such as compile and run.

  • In the first pass, you use the compile command to compile the model and produce a PEF file that is later loaded onto the RDU. See (1) below.

  • In the second pass, you use the run command and specify the PEF file that you want to use with the model. See (2) below.

Here is the code for the main() function:

def main():
    args = parse_app_args(dev_mode=True,
                          common_parser_fn=add_common_args)
    utils.set_seed(256)
    model = ConvNet(args.num_classes)
    samba.from_torch_(model)

    inputs = get_inputs(args)

    optimizer = samba.optim.AdamW(model.parameters(), lr=args.lr) if not args.inference else None
    if args.command == "compile": (1)
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='cnn_mnist',
                              app_dir=utils.get_file_dir(__file__),
                              squeeze_bs_dim=True,
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))

    elif args.command == "run": (2)
        #Run compiled model
        utils.trace_graph(model, inputs, optimizer, pef=args.pef, mapping=args.mapping)
        train(args, model, optimizer)


if __name__ == '__main__':
    main()

You can find the full example here.

How to use data parallel mode

Internally, SambaFlow supports several types of model parallelization. Model parallelization makes running the model (training, fine-tuning, etc) more efficient. You have control over parallelization in these ways:

  • Data parallel mode. You can compile and run the model in data parallel mode, discussed here. Some applications run much faster in data parallel mode.

  • Tensor parallel mode. You can instruct the compiler to perform tensor parallel optimization, which also results in running the model more efficiently. See How to use tensor parallel mode (Beta).

What is data parallel?

Data-parallelism is a method where the dataset is split into several parts, and each part is processed at the same time by different replicas of the application. These replicas run on separate computing resources, each working on a different chunk of the data.

Each replica also divides its chunk into smaller mini-batches for processing. When running in data-parallel mode, the system automatically launches multiple copies of the model across the available resources and splits the data among them, so the copies work in parallel. This makes efficient use of hardware without requiring the user to manage the details.

  • 4-way data parallel

    • The dataset is split into four parts.

    • Each part is processed simultaneously by four different processing units.

    • Each processing unit handles its own unique portion of the data in parallel.

    • 4-way means up to 4 ways of splitting data (includes 1, 2, and 4 ways).

  • 8-way data parallel

    • The dataset is split into eight parts.

    • Each part is processed simultaneously by eight different processing units.

    • Each processing unit handles its own unique portion of the data in parallel.

    • 8-way means up to 8 ways of splitting data (includes 1, 2, 4, and 8 ways).

  • 16-way data parallel

    • The dataset is split into sixteen parts.

    • Each part is processed simultaneously by sixteen different processing units.

    • Each processing unit handles its own unique portion of the data in parallel.

    • 16-way means up to 16 ways of splitting data (includes 1, 2, 4, 8, and 16 ways).

See Resources for links to technical papers and other background information.

Data parallel mode only has an effect during training. You see benefits if latency for a single training iteration exceeds about 100ms. For very low latency graphs, you don’t see much gain and might even see deterioration.

How does SambaFlow use data parallel operations?

SambaFlow™ has built-in support for data parallel operations. Many ML applications can take advantage of this mode for improved performance.

Internally, SambaFlow makes use of:

  • The MPICH MPI library.

  • The torch.distributed framework. All functionality provided by torch.distributed is available in a SambaFlow data parallel application. A key feature of torch.distributed that SambaFlow relies on is the ability to create a distributed data sampler to feed data to the replicas.

    MPICH and torch.distributed launch application replicas and handle basic communication between those replicas.

  • A custom communications library (the Collective Communication Library, or CCL). The CCL library provides low-level features that enable acceleration of gradient tensor syncing between replicas.

    Gradient tensors held in RDU memory can be shared directly; they don’t have to use host CPU and memory. The ALL-GATHER and ALL-REDUCE functions that operate on the gradient tensors are also parallelized. The CCL enables this acceleration across RDUs in the same node, as well as between RDUs in different nodes.

You can leverage this functionality by making a few modifications to the application.

Modify an application to work in data parallel mode

To support data parallel mode, make these changes:

  1. Ensure that the application makes use of the PyTorch torch.utils.data.distributed framework.

  2. Consider sharding data among replicas, for example, by using DistributedSampler(); see the PyTorch documentation for details. A sketch follows this list.

    The MPI framework that SambaNova uses for data parallel operations is supported as part of PyTorch. The framework allows for easy distribution of dataset samples to all replicas.

  3. Use the PyTorch DataLoader() (recommended).

  4. Modify the samba.session.run() call to pass through the --data-parallel and --reduce-on-rdu arguments when they are supplied on the command line.
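
As an illustration of steps 2 and 3 above, here is a minimal sketch that shards MNIST data across replicas with a DistributedSampler and a standard DataLoader. It uses only standard PyTorch APIs; the batch size and data path are placeholders, and it assumes the application was launched with --data-parallel so that SambaFlow has already initialized the torch.distributed process group.

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

# Placeholder values for illustration
batch_size = 32
data_path = "data"

train_dataset = datasets.MNIST(
    root=data_path,
    train=True,
    transform=transforms.ToTensor(),
    download=True,
)

# The process group already exists in a SambaFlow data parallel run, so the
# sampler can read the number of replicas and this replica's rank from it
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
    shuffle=True,
)

# Each replica iterates over its own shard; drop_last keeps batch sizes fixed
# so they match the batch size the PEF was compiled with
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    sampler=train_sampler,
    drop_last=True,
)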

Here’s a sample code snippet for the samba.session.run() call. X_mb and y_mb are the mini-batch input and label tensors; the data_parallel and reduce_on_rdu values come from arguments passed on the command line. See Run a data parallel application.

loss, pred = samba.session.run(input_tensors=[X_mb, y_mb],
                                output_tensors=model.output_tensors,
                                hyperparam_dict=hyperparam_dict,
                                data_parallel=args.data_parallel,
                                reduce_on_rdu=args.reduce_on_rdu)

Compile a data parallel application

To compile an application for data parallel operations, pass these arguments to the SambaFlow compile command:

--data-parallel

Causes the compiler to generate data needed for data parallel execution.

-ws

Short for world size. This argument defines the minimum number of application replicas to be launched when the model is trained in data parallel mode. For compilation, always set the value to 2.

A compilation command for data parallel mode looks like this:

python model.py compile --data-parallel -ws 2 --pef-name=$PEF_NAME

Run a data parallel application

Data parallel applications are launched via the standard MPI command mpirun or another MPI-compliant launcher such as Slurm or Kubernetes with appropriate plugins. The MPI launcher creates a separate process for each application replica and enables each process to later communicate with the others. The mpirun command takes arguments specific to data parallel operations, as well as the standard SambaFlow run command.

-np

Number of replicas to be launched. The number must be at least as large as the -ws argument specified during compilation but it can be larger, usually up to the total number of available RDUs. When mpirun is called with a specified number of replicas, that number is passed to the distributed sampler to divide the dataset.

--data-parallel

Enables data parallel execution at runtime. SambaFlow will automatically handle gradient synchronization during the session.run() call.

--reduce-on-rdu

Enables direct syncing of gradient tensors between RDUs and their associated device memories using CCL (rather than syncing via the host), greatly improving performance. If this optional argument is not specified, gradient tensor syncing happens via the host.

SambaNova recommends that you enable --reduce-on-rdu in conjunction with --data-parallel.

--host <host-name-list>

Allows you to specify the host(s) on which replicas should be launched. If --host is omitted, mpirun launches all processes and replicas on the local machine. With --host (which takes a comma-separated list of host names), mpirun logs in to those hosts and launches the specified number of processes on those machines, evenly divided across them. You don’t even need to be logged in to one of the hosts you’ve specified.

A command to run training in data parallel mode might look like this:

$ /opt/mpich-3.4.3/bin/mpirun -np $X --host $HOST_NAMES python parallel.py run --data-parallel --reduce-on-rdu --pef $PEF_NAME

Data parallel best practices

Follow these best practices when running applications in data parallel mode:

  • Ensure that each replica can access training data and PEF. On a networked volume, each SambaNova node must have access to the training data and the compiled PEF.

  • In data parallel mode, an application runs concurrently on multiple RDUs, so certain actions, such as writing output, are repeated by each replica:

    • mpirun merges the stdout of all replicas, making it easy to see all output at once or redirect to a file.

    • To avoid the merge to stdout, you can designate one of the replicas (or “ranks” in MPI parlance) to perform the logging.

    • Use torch.distributed.get_rank() to get a unique identifier for each process in the MPI process group, and use that identifier to create unique filenames or paths for each replica (see the sketch after this list).

  • Use torch.distributed.barrier() for synchronization between replicas (beyond the automatic gradient sync carried out by SambaFlow).

  • Parameters that are common to all replicas can be specified on the command line (see Run a data parallel application), in a common file, or broadcast via torch.distributed.broadcast().

  • An application can use any torch.distributed call with the exception of torch.distributed.init_process_group(). SambaFlow calls torch.distributed.init_process_group() automatically when you pass in the --data-parallel flag.
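
Here is a minimal sketch of the rank-based logging and synchronization practices above. It assumes the application was launched with --data-parallel so that the process group is already initialized; the log file name is a placeholder.

import torch.distributed as dist

# SambaFlow initializes the process group when --data-parallel is passed,
# so the rank and world size are available here
rank = dist.get_rank()
world_size = dist.get_world_size()

# Give each replica its own log file so output isn't interleaved on stdout
log_path = f"train_rank{rank}.log"  # placeholder file name
with open(log_path, "a") as log_file:
    log_file.write(f"Replica {rank} of {world_size} starting training\n")

# Optionally, let only one rank print progress to the console
if rank == 0:
    print(f"Running {world_size} replicas; rank 0 prints console output")

# Synchronize replicas at a known point (beyond SambaFlow's automatic gradient sync)
dist.barrier()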

Data parallel example command

Here’s an example of a command you might use to run the transformers_hook.py example in data parallel mode.

/opt/mpich-3.4.3/bin/mpirun -np 4  python transformers_hook.py run
    --config_name modules/configs/sweep_configs/.json
    --tokenizer_name bert-large-uncased
    --module_name mlm_ns
    --task_name mlm_ns
    --max_seq_length 512 -b 32
    --output_dir=hf_output
    --overwrite_output_dir
    --do_train
    --per_device_train_batch_size 32
    --input_dir 
    --cache 
    --max_predictions_per_seq 20
    --save_steps -1
    --warmup_steps 0
    --logging_steps 1
    --weight_decay 0.01
    --learning_rate 0.00035
    --non_split_head
    --dense_adam
    --adam_beta2 0.98
    --max_grad_norm_clip 1.0
    --validate_stat_perf
    --validate_tying_plus_embed_train
    --skip_checkpoint -p MY_PEF.pef
    --max_steps 10
    --steps_this_run 10
    --data-parallel
    --reduce-on-rdu

See Run a data parallel application for a list of arguments that are specific to data parallel mode. The other arguments are application specific. Run the application itself with --help for some information.

Data parallel mode and tensor parallel mode

You can theoretically use both data parallel and tensor parallel mode with the same compile and run cycle but note these points:

  • Tensor parallel batch mode can support only RDUs in a single node (which means up to 8 RDUs). Data parallel mode can run across multiple nodes (no limit on the number of RDUs).

  • With tensor parallel batch mode, the number of RDUs is set during compilation and cannot change at runtime. With data parallel mode, the number of chips can be assigned at runtime, so one PEF can run on the number of RDUs you specify.

  • The two modes might treat batch size differently. For example, assume the model uses batch size 8 (see the sketch after this list).

    • If you compile using tensor parallel batch mode on 4 RDUs, then the compiled PEF includes instructions for 4 RDUs, each with batch size 2.

    • In data parallel mode, the PEF instructions target a single RDU with batch size 8. If you deploy this PEF to run on 4 RDUs in data parallel mode, then you need to have data input with batch size 32 (4 times 8).
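
To make the batch-size arithmetic explicit, here is a small sketch using the values from the example above (a model compiled with batch size 8, deployed on 4 RDUs); the variable names are just for illustration.

# Values from the example above (for illustration only)
compiled_batch_size = 8   # batch size the model was compiled with
num_rdus = 4              # RDUs used at runtime

# Tensor parallel batch mode: the compiled batch is split across the RDUs,
# so each RDU handles compiled_batch_size / num_rdus samples per step
tp_per_rdu = compiled_batch_size // num_rdus   # 2
tp_per_step = compiled_batch_size              # 8 samples per step in total

# Data parallel mode: each replica runs the full compiled batch, so the input
# pipeline must supply num_rdus * compiled_batch_size samples per step
dp_per_rdu = compiled_batch_size               # 8
dp_per_step = num_rdus * compiled_batch_size   # 32 samples per step in total

print(f"Tensor parallel: {tp_per_rdu} samples per RDU, {tp_per_step} per step")
print(f"Data parallel:   {dp_per_rdu} samples per RDU, {dp_per_step} per step")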

Resources

Our engineering team suggests the following resources for background information: