AMD Instinct MI200 Accelerator IFWI Update Tool, amdfwflash, User Guide
V 1.2
============================================================================

Introduction
============================================================================
This document provides step-by-step instructions for updating the IFWI
(Integrated Firmware Image) using the AMD FW Flash tool, amdfwflash, on AMD
Instinct MI200 Server Platforms.

Version 1.1 of the tool ships with three maintenance update versions of the
IFWI, mu1 (maintenance update 1), mu2 (maintenance update 2), and mu3
(maintenance update 3), plus the GA version. mu3 is the latest IFWI, and the
tool updates the IFWI to mu3 by default.

The tool can also update your IFWI to a specific level. For example, if you
need to move your MI200 platform from GA to mu1 or mu2, you can do so. The
steps are documented below.

Note
============================================================================
The AMD FW Flash tool, amdfwflash, is not intended to be used in a Virtual
Machine/Guest OS environment.

WARNING:
Using the amdfwflash tool in a Virtual Machine/Guest OS may result in
undefined behavior and is an unsupported configuration.
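
One way to check for a virtualized environment before running the tool is
systemd-detect-virt, which prints "none" when no virtual machine is detected.
The following is a minimal sketch, assuming a systemd-based host. The helper
names virt_ok and detect_virt are hypothetical and are not part of amdfwflash.

```shell
#!/bin/sh
# virt_ok TYPE: succeed only when TYPE (as printed by systemd-detect-virt)
# indicates that no virtual machine was detected.
# Hypothetical helper, not part of amdfwflash.
virt_ok() {
    [ "$1" = "none" ]
}

# detect_virt: print the detected VM type, or "unknown" when
# systemd-detect-virt is not available on this host.
detect_virt() {
    if command -v systemd-detect-virt >/dev/null 2>&1; then
        # The command exits non-zero when it prints "none", so its exit
        # status is ignored and only the printed value is used.
        systemd-detect-virt --vm 2>/dev/null || true
    else
        echo unknown
    fi
}
```

Example use, stopping before any flash operation when the check fails:
    virt_ok "$(detect_virt)" || { echo "VM or unknown host, stopping" >&2; exit 1; }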


Before You Start
============================================================================

* Identify the server whose AMD Instinct MI200 Accelerator(s) need an IFWI
update or whose GPU needs to be replaced
* Ensure that you have the correct login credentials to the server. You
must have 'sudo' or 'root' permissions on the server to execute the
amdfwflash tool to update the firmware
* Ensure that you have access to the BMC/IPMI to access the system
console
* Ensure network access to the AMD FW Flash tool repository
(repo.radeon.com)
* Ensure all applications are closed before running the tool and make
sure no OS updates are pending in the background. Notify users of
the server about the maintenance window for the IFWI update

NOTE: It is strongly recommended that the IFWI update tool be run from the
system console and not over the network, so that a network outage or lost
connection cannot interrupt the flash update process.
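
The privilege requirement above can be checked before starting maintenance.
The following is a minimal sketch. The helper names is_root_uid and
can_escalate are hypothetical, and the sketch assumes sudo is the escalation
mechanism in use on the server.

```shell
#!/bin/sh
# is_root_uid UID: succeed when UID is numeric and equal to 0 (root).
is_root_uid() {
    [ "$1" -eq 0 ] 2>/dev/null
}

# can_escalate: succeed when the current user is root, or when sudo accepts
# the user's credentials. "sudo -v" may prompt for a password and fails for
# users who are not allowed to use sudo.
can_escalate() {
    is_root_uid "$(id -u)" || sudo -v
}
```

Example use:
    can_escalate || { echo "root or sudo access is required" >&2; exit 1; }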

1. Introduction
============================================================================
To update IFWI on AMD Instinct MI200 Accelerator(s), or when replacing
AMD Instinct MI200 Accelerator(s) in a server, first configure the
system for IFWI maintenance. Once the system is configured for
firmware maintenance, run the amdfwflash command to update or
roll back the IFWI to the desired version.

2. Configure System for IFWI Maintenance or AMD Instinct MI200 Replacement
============================================================================
Download and Install the AMD FW Flash Tool from repo.radeon.com Repository
1. Log in to the server with MI200 GPUs that need IFWI to be updated
   $ ssh user@mi200_server

2. Set up the AMD FW Flash Tool Package repository
* Set up Ubuntu OS apt repo:
  Step 1:
    wget -q -O - https://repo.radeon.com/fwupdator/amdfw.gpg.key | sudo apt-key add -

  Step 2:
    echo 'deb [arch=amd64] https://repo.radeon.com/fwupdator/amdfwflash/1.2/deb/ ubuntu main' | sudo tee /etc/apt/sources.list.d/amdfwflash.list

* Set up RHEL 8 or RHEL 9 yum repo:

echo -e '[amdfwflash]\nname=amdfwflash\nenabled=1\nautorefresh=0\ngpgkey=https://repo.radeon.com/fwupdator/amdfw.gpg.key\nbaseurl=https://repo.radeon.com/fwupdator/amdfwflash/1.2/rpm\ngpgcheck=1' | sudo tee /etc/yum.repos.d/amdfwflash.repo

* Set up SLES 15 SP3 or SP4 zypper repo:

echo -e '[amdfwflash]\nenabled=1\nautorefresh=0\ngpgkey=https://repo.radeon.com/fwupdator/amdfw.gpg.key\nbaseurl=https://repo.radeon.com/fwupdator/amdfwflash/1.2/rpm\ntype=rpm-md\ngpgcheck=1' | sudo tee /etc/zypp/repos.d/amdfwflash.repo

3. Update the package repository
* Ubuntu OS
    sudo apt update

  To verify, search for the `amdfwflash` package:
    sudo apt search amdfwflash

* RHEL 8 or RHEL 9
    sudo yum update

  To verify, search for the `amdfwflash` package:
    sudo yum search amdfwflash

* SLES 15 SP3 or SP4
    sudo zypper update

  To verify, search for the `amdfwflash` package:
    sudo zypper search amdfwflash

4. Install the AMD FW Flash Tool, amdfwflash, package
* Ubuntu OS
    sudo apt install amdfwflash

* RHEL 8 or RHEL 9
    sudo yum install amdfwflash

* SLES 15 SP3 or SP4
    sudo zypper install amdfwflash

5. Verify the installation
* Ubuntu OS
    dpkg -l | grep amdfwflash

* RHEL 8 or RHEL 9
    rpm -qa | grep amdfwflash

* SLES 15 SP3 or SP4
    rpm -qa | grep amdfwflash
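
Steps 2 through 5 above branch on the distribution. The following is a
minimal sketch of selecting the package manager from the ID field of
/etc/os-release. The function names are hypothetical, and the ID values
(ubuntu, rhel, sles) are the values these distributions are expected to
report; confirm them on your own system before relying on the mapping.

```shell
#!/bin/sh
# os_release_id [FILE]: print the ID field of an os-release file
# (defaults to /etc/os-release), with any double quotes removed.
os_release_id() {
    sed -n 's/^ID=//p' "${1:-/etc/os-release}" | tr -d '"'
}

# pkg_manager_for ID: print the package manager this guide uses for the
# given os-release ID; fail for distributions the guide does not list.
pkg_manager_for() {
    case "$1" in
        ubuntu) echo apt ;;
        rhel)   echo yum ;;
        sles)   echo zypper ;;
        *)      return 1 ;;
    esac
}
```

Example use:
    mgr=$(pkg_manager_for "$(os_release_id)") && sudo "$mgr" install amdfwflash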

Configure System to Blacklist amdgpu Driver
6. Add the amdgpu driver to the blacklist
* Ubuntu OS
  Step 1:
    echo blacklist amdgpu | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf

  Step 2:
    sudo update-initramfs -u

* RHEL 8 or RHEL 9 (use grubby [2])
  Step 1:
    echo blacklist amdgpu | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf

  Step 2:
    sudo grubby --args "amdgpu.blacklist=1 rd.driver.blacklist=amdgpu" --update-kernel ALL

  For CentOS, use modprobe.blacklist instead of rd.driver.blacklist boot
  option in the grubby command above.

* SLES 15 SP3 or SLES 15 SP4 (blacklist amdgpu, add iomem=relaxed param)
  Step 1:
    echo blacklist amdgpu | sudo tee /etc/modprobe.d/60-blacklist-amdgpu.conf

  Step 2: Add "iomem=relaxed" to the kernel boot parameters in the /etc/default/grub file:
    sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="iomem=relaxed /' /etc/default/grub

  Step 3: Update boot parameters and rebuild the initramfs images to blacklist the amdgpu driver:
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    sudo /usr/bin/dracut --force --regenerate-all

7. Reboot the server for IFWI Maintenance Update or power off to replace GPUs
    sudo reboot
OR
    sudo poweroff

  NOTE: If replacing AMD Instinct MI200 Accelerator in the system, power
  off the system.

  Refer to the section "3. Procedure to Update and Verify AMD Instinct MI200
  IFWI Version" for steps to update or roll back IFWI to the desired
  version.
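
The sed command in the SLES steps above adds "iomem=relaxed" every time it
runs, so running it twice leaves the parameter in the file twice. The
following is a minimal sketch of an idempotent variant, together with a check
that a modprobe.d file really contains the blacklist entry. The function
names are hypothetical, and the sketch assumes GNU sed (for the -i option).

```shell
#!/bin/sh
# has_blacklist FILE: succeed when FILE contains a "blacklist amdgpu" line.
has_blacklist() {
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+amdgpu[[:space:]]*$' "$1" 2>/dev/null
}

# add_iomem_relaxed FILE: add iomem=relaxed to GRUB_CMDLINE_LINUX_DEFAULT in
# FILE unless it is already present, so repeated runs do not add duplicates.
add_iomem_relaxed() {
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*iomem=relaxed' "$1"; then
        return 0
    fi
    sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="iomem=relaxed /' "$1"
}
```

Example use (editing /etc/default/grub requires root, for example from a
root shell):
    add_iomem_relaxed /etc/default/grub
    has_blacklist /etc/modprobe.d/60-blacklist-amdgpu.conf || echo "blacklist entry missing"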

3. Procedure to Update and Verify AMD Instinct MI200 IFWI Version
============================================================================

After completing the steps in "Configure System for IFWI Maintenance or AMD
Instinct MI200 Replacement", please follow the steps documented below to
update or roll back AMD Instinct MI200 IFWI to the desired version.

3.1 Steps to Update IFWI to the MI200 IFWI Maintenance Update Version
1. Log in to the BMC/IPMI interface of the server identified for IFWI update

2. Open the Remote Console / Virtual Console to the server

3. Log in to the server (NOTE: You must have sudo or root permissions to
execute the amdfwflash tool to update IFWI on MI200 GPUs)

4. Verify that amdgpu driver is not loaded
    lsmod | grep amdgpu
  NOTE: If the output of the above command shows amdgpu listed, then stop
  and do not proceed. Check the OS settings to ensure amdgpu driver is
  blacklisted correctly, and reboot the system. Repeat step 4.

5. Run the amdfwflash utility to list the GPU devices
    sudo /opt/amdfwflash/sbin/amdfwflash --list-devices

  NOTE: The output should list all the GPU devices in the system. If the output
  does not list all the GPU devices, please contact AMD Customer Support.

6. Run the amdfwflash utility to update the IFWI of all GPUs in the system
to the MI200 IFWI Maintenance Update 3 (latest) version
    sudo /opt/amdfwflash/sbin/amdfwflash --update-ifwi
OR
    sudo /opt/amdfwflash/sbin/amdfwflash --update-ifwi mu3

7. To update the IFWI of all GPUs in the system to the MI200 IFWI Maintenance
Update 1 version, do the following:
    sudo /opt/amdfwflash/sbin/amdfwflash --update-ifwi mu1

8. To update the IFWI of all GPUs in the system to the MI200 IFWI Maintenance
Update 2 version, do the following:
    sudo /opt/amdfwflash/sbin/amdfwflash --update-ifwi mu2

9. Capture the console output and system log to a file

10. The amdfwflash tool saves a copy of the old IFWI images under /tmp before
updating the IFWI. Archive the generated IFWI images from /tmp folder for
later reference
    tar cvf ifwi-backup.tar /tmp/amdfwflash/ifwi/backup

11. Remove the amdgpu driver from the blacklist
* Ubuntu OS
    sudo rm /etc/modprobe.d/blacklist-amdgpu.conf

* RHEL 8 or RHEL 9 [2]
  Step 1:
    sudo rm /etc/modprobe.d/blacklist-amdgpu.conf

  Step 2:
    sudo grubby --remove-args "amdgpu.blacklist=1 rd.driver.blacklist=amdgpu" --update-kernel ALL

  For CentOS, use modprobe.blacklist instead of rd.driver.blacklist boot
  option in the grubby command above.

* SLES 15 SP3 or SLES 15 SP4
  Step 1:
    sudo rm /etc/modprobe.d/60-blacklist-amdgpu.conf

  Step 2: Remove "iomem=relaxed" from the kernel boot parameters in /etc/default/grub file:
    sudo sed -i 's/iomem=relaxed //' /etc/default/grub

  Step 3: Update boot parameters and rebuild the initramfs images to remove the blacklist:
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    sudo /usr/bin/dracut --force --regenerate-all

12. Reboot the system (A/C power cycle is recommended) for the IFWI update
to take effect
    sudo reboot
OR
    sudo ipmitool power cycle

13. Go to Section "4 Verify AMD Instinct MI200 IFWI Versions" to complete
the IFWI update. After successful verification of the IFWI update, the server
can resume normal operation.
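
Steps 4 and 9 above (confirming that amdgpu is not loaded, and capturing the
console output) can be scripted. The following is a minimal sketch with
hypothetical helper names. amdgpu_loaded matches the module name exactly,
which avoids false matches that a plain grep could produce, and run_logged
keeps the exit status of the command even though its output is piped through
tee.

```shell
#!/bin/sh
# amdgpu_loaded [FILE]: succeed when a module named exactly "amdgpu" appears
# in FILE, which uses the /proc/modules layout (module name in column one).
amdgpu_loaded() {
    awk '$1 == "amdgpu" { found = 1 } END { exit !found }' "${1:-/proc/modules}"
}

# run_logged LOGFILE CMD [ARGS...]: run CMD, append its stdout and stderr to
# LOGFILE while still showing them on the console, and return CMD's status.
run_logged() {
    log=$1
    shift
    status_file=$(mktemp)
    # The left side of the pipe runs in a subshell, so the exit status is
    # handed back through a temporary file.
    { rc=0; "$@" 2>&1 || rc=$?; echo "$rc" > "$status_file"; } | tee -a "$log"
    status=$(cat "$status_file")
    rm -f "$status_file"
    return "$status"
}
```

Example use, with a log file name chosen only for illustration:
    amdgpu_loaded && { echo "amdgpu is loaded, stopping" >&2; exit 1; }
    run_logged ifwi-update.log sudo /opt/amdfwflash/sbin/amdfwflash --update-ifwi mu3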

3.2 Steps to Roll Back IFWI to the GA Version
1. Log in to the BMC/IPMI interface of the server identified for IFWI rollback

2. Open the Remote Console / Virtual Console to the server

3. Log in to the server (NOTE: You must have sudo or root permissions to
execute the amdfwflash tool to roll back IFWI on MI200 GPUs)

4. Verify that amdgpu driver is not loaded
    lsmod | grep amdgpu
  NOTE: If the output of the above command shows amdgpu listed, then stop
  and do not proceed. Check the OS settings to ensure amdgpu driver is
  blacklisted correctly, and reboot the system. Repeat step 4.

5. Run the amdfwflash utility to list the GPU devices
    sudo /opt/amdfwflash/sbin/amdfwflash --list-devices

  NOTE: The output should list all the GPU devices in the system. If the output
  does not list all the GPU devices, please contact AMD Customer Support.

6. To roll back the IFWI of all GPUs to the GA version, run:
    sudo /opt/amdfwflash/sbin/amdfwflash --rollback-ifwi

7. To roll back the IFWI of all GPUs from the mu2 version to the mu1
version, run:
    sudo /opt/amdfwflash/sbin/amdfwflash --rollback-ifwi mu1

8. To roll back the IFWI of all GPUs from the mu3 version to the mu2
version, run:
    sudo /opt/amdfwflash/sbin/amdfwflash --rollback-ifwi mu2

9. Capture the console output and system log to a file

10. The amdfwflash tool saves a copy of the old IFWI images under /tmp before
updating the IFWI. Archive the generated IFWI images from the /tmp folder for
later reference
    tar cvf ifwi-backup.tar /tmp/amdfwflash/ifwi/backup

11. Remove the amdgpu driver from the blacklist
* Ubuntu OS
    sudo rm /etc/modprobe.d/blacklist-amdgpu.conf

* RHEL 8 or RHEL 9 [2]
  Step 1:
    sudo rm /etc/modprobe.d/blacklist-amdgpu.conf

  Step 2:
    sudo grubby --remove-args "amdgpu.blacklist=1 rd.driver.blacklist=amdgpu" --update-kernel ALL

* SLES 15 SP3 or SLES 15 SP4
  Step 1:
    sudo rm /etc/modprobe.d/60-blacklist-amdgpu.conf

  Step 2: Remove "iomem=relaxed" from the kernel boot parameters in /etc/default/grub file:
    sudo sed -i 's/iomem=relaxed //' /etc/default/grub

  Step 3: Update boot parameters and rebuild the initramfs images to remove the blacklist:
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    sudo /usr/bin/dracut --force --regenerate-all

12. Reboot the system (A/C power cycle is recommended) for the IFWI update
to take effect
    sudo reboot
OR
    sudo ipmitool power cycle

13. Go to Section "4 Verify AMD Instinct MI200 IFWI Versions" to complete
the IFWI update. After successful verification of the IFWI update, the server
can resume normal operation.
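
Sections 3.1 and 3.2 use a different option depending on whether the target
level is newer or older than the current one. The following is a minimal
sketch of that choice, using the level names from this guide (ga, mu1, mu2,
mu3). The function names are hypothetical, and the sketch assumes that
--update-ifwi is for moving to a newer level and --rollback-ifwi for moving
to an older one, as the section titles suggest; whether the tool itself
enforces this is not stated here.

```shell
#!/bin/sh
# ifwi_rank LEVEL: print a sortable rank for an IFWI level named in this
# guide (ga < mu1 < mu2 < mu3); fail for anything else.
ifwi_rank() {
    case "$1" in
        ga)  echo 0 ;;
        mu1) echo 1 ;;
        mu2) echo 2 ;;
        mu3) echo 3 ;;
        *)   return 1 ;;
    esac
}

# ifwi_flag CURRENT TARGET: print the amdfwflash option matching the
# direction of the change, or "none" when the two levels are equal.
ifwi_flag() {
    cur=$(ifwi_rank "$1") || return 1
    tgt=$(ifwi_rank "$2") || return 1
    if [ "$tgt" -gt "$cur" ]; then
        echo "--update-ifwi"
    elif [ "$tgt" -lt "$cur" ]; then
        echo "--rollback-ifwi"
    else
        echo none
    fi
}
```

Example use:
    ifwi_flag mu3 mu2
Note that for a rollback to GA, this guide passes --rollback-ifwi without a
level argument.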

4 Verify AMD Instinct MI200 IFWI Versions
============================================================================
1. Log in to the system.

2. Run the amdfwflash utility to list the GPU devices
    sudo /opt/amdfwflash/sbin/amdfwflash --list-devices

  NOTE: The output should list all the GPU devices in the system. If the output
  does not list all the GPU devices, please contact AMD Customer Support.

3. If ROCm software is installed, run the rocm-smi --showhw command to
display the IFWI version under the VBIOS column
    /opt/rocm/bin/rocm-smi --showhw

  NOTE: If your environment has the amdgpu driver blacklisted under normal
  operation, run the following command to load the driver before running
  rocm-smi
    sudo modprobe amdgpu

4. Verify that all the MI200 GPUs have been updated to the same IFWI version

NOTE: If there are any errors in the console output, please contact AMD
Customer Support. After successful verification of the IFWI update, the server
can resume normal operation.
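
The check in step 4 above (all GPUs at the same IFWI version) can be
automated once the version strings have been collected, one per line. The
following is a minimal sketch with a hypothetical helper name. How the
version strings are extracted from the rocm-smi output is deliberately left
out, because the column layout is not described in this guide.

```shell
#!/bin/sh
# all_same_version: read one version string per line on stdin and succeed
# only when at least one non-empty line was read and all of them are equal.
all_same_version() {
    [ "$(sort -u | grep -c .)" -eq 1 ]
}
```

Example use, where versions.txt is a file you have filled with one VBIOS
version per GPU:
    all_same_version < versions.txt || echo "IFWI versions differ between GPUs"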

5 Uninstallation of AMD FW Flash Tool amdfwflash Package
============================================================================
Uninstall the amdfwflash package
* Ubuntu OS
    sudo apt remove amdfwflash

* RHEL 8 or RHEL 9
    sudo yum remove amdfwflash

* SLES 15 SP3 or SP4
    sudo zypper rm amdfwflash

6 Procedure to Replace MI200 GPU (RMA)
============================================================================
The IFWI version of all AMD Instinct MI200 Accelerators in a system must be
at the same version for the proper operation of the system. When replacing
AMD Instinct MI200 Accelerator(s) in a system, the system must be configured
for AMD Instinct MI200 Replacement. Please see the section "Configure System
for IFWI Maintenance or AMD Instinct MI200 Replacement" for steps on how to
configure the system. Once the system has been configured for AMD Instinct
MI200 Replacement, power off the system and replace the AMD Instinct MI200
Accelerator according to the assembly instruction manual (Need a reference).
After replacing the AMD Instinct MI200 Accelerator, power on the system and
follow the steps in "Procedure to Update and Verify AMD Instinct MI200 IFWI
Version" to update or roll back IFWI on all AMD Instinct MI200 Accelerators
to the desired version.

7 Additional Support
============================================================================
For any additional questions or support requests, please contact AMD Customer
Support.

8 References
============================================================================
1. https://documentation.suse.com/sles/15-SP4/html/SLES-all/cha-mod.html
2. https://access.redhat.com/solutions/41278
