Breakout - GPU HPC at the Faculty of Electrical Engineering
The machine breakout.hs-augsburg.de is a GPU server for machine learning that is installed in the computing center (Rechenzentrum, RZ).
- SuperMicro SYS-7048GR-TR
- Mainboard Supermicro X10DRG-Q
- 2 x Intel Xeon Broadwell E5-2690v4 (2.6 GHz, 14 Core)
- 128 GB DDR4 RAM at 2400 MHz
- 2 x 480 GB SATA3 SSD Intel DC S3500
- 1 x 400 GB PCIe NVME SSD P3500
- 1 x 12 TB Western Digital DC HC520 SATA3 (new as of 12/2021)
- Intel X540-T2 10GBase-T Ethernet network adapter
- 4 x NVIDIA GeForce GTX 1080 with GP104 Pascal, 2560 cores, 8 GB RAM
- Debian Buster
- NVIDIA driver 470.82.01
- Kernel 4.19.0-18
- Cuda 11.4
- Tensorflow, Torch
- Docker 20.10.11, nvidia-docker2 2.8.0-1
Usage notes
All members of the university can log in to the breakout via ssh with their computing center (RZ) account. Access via the standard ssh port 22 is only possible from within the intranet (if necessary via VPN). In addition, port 2222 is open for ssh from outside as well. This description assumes a terminal on macOS. Log in via ssh:
ssh -p 2222 <rzaccount>@breakout.hs-augsburg.de
The home directory that belongs to the account is then mounted on the breakout.
Graphics
X Forwarding
Simple X applications can be started via X forwarding. This requires an X server on the client machine (here: the Mac). The option “-Y” enables X forwarding for the ssh session.
MacBook: ssh -Y -p 2222 <rzaccount>@breakout.hs-augsburg.de
Then you can start the X example program “xlogo” on the breakout.
breakout: xlogo
VirtualGL und TurboVNC
However, X forwarding is poorly suited to 3D acceleration and to slow internet connections. Therefore TurboVNC and VirtualGL are also installed on the breakout. The breakout is configured for the method shown in Figure 5 “In-Process GLX Forking with an X Proxy” of the background description at http://www.virtualgl.org/About/Background. For this, the standard X server runs on the breakout and provides the 3D acceleration. The user then additionally starts the X proxy server “XVnc” and LXDE. This vncserver acts like a “remote desktop”, i.e. only image data is sent from the server to the client. The vncserver provides the vnc data on port 5900 + n, where n is the display number of the running vncserver. However, the breakout is configured such that this port cannot be reached from outside. Therefore ssh must be started with port forwarding. Which port has to be forwarded is only known after the vncserver has started.
A VNC client must be installed on the client machine. Since the vncserver installed on the breakout is the one from TurboVNC, I recommend the TurboVNC client. See http://www.turbovnc.org
First log in to the breakout from the client (here: MacBook)
MacBook: ssh -p 2222 fritz@breakout.hs-augsburg.de
Then start the vncserver on the breakout. It looks like this:
fritz@breakout:~$ vncserver
Desktop 'TurboVNC: breakout:1 (fritz)' started on display breakout:1
Starting applications specified in /home/fritz/.vnc/xstartup.turbovnc
Log file is /home/fritz/.vnc/breakout:1.log
fritz@breakout:~$
Here “breakout:1” was chosen dynamically as the display. The vnc client therefore has to connect to port 5901. This port 5901 is forwarded from the breakout to the local machine through an ssh session. So now open a second ssh session with port forwarding for port 5901.
MacBook: ssh -p 2222 -L 5901:localhost:5901 fritz@breakout.hs-augsburg.de
The vnc data is now available on port 5901 of the client machine. The TurboVNC client must therefore connect to “localhost:5901”.
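The port arithmetic (5900 + n) can be wrapped in a small helper. The following is only a sketch: the function names vnc_port and vnc_tunnel_cmd are made up for this illustration and are not installed on the breakout. The second function only prints the ssh command, it does not run it.

```shell
# Hypothetical helpers (not part of TurboVNC or the breakout installation).

# Print the TCP port that belongs to a VNC display such as "breakout:1".
vnc_port() {
    local display="${1##*:}"    # keep the part after the last colon -> "1"
    display="${display%%.*}"    # drop an optional screen suffix such as ".0"
    echo $((5900 + display))
}

# Print (not run) the ssh command that forwards this port to the client.
# $1 = account name, $2 = display as reported by vncserver
vnc_tunnel_cmd() {
    local port
    port="$(vnc_port "$2")"
    echo "ssh -p 2222 -L ${port}:localhost:${port} ${1}@breakout.hs-augsburg.de"
}
```

For the session shown above, vnc_tunnel_cmd fritz breakout:1 prints the same command as the second ssh session.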
To use OpenGL acceleration, an application must be started with vglrun. This can be tested with
breakout: vglrun glxgears
Rotating gears should appear.
Cuda
NVIDIA Cuda is installed on the breakout. To use the Cuda compiler, the following lines must be added to the file <HOME>/.profile:
# Add the CUDA compiler
PATH="$PATH:/usr/local/cuda/bin"
Then log out and log in again.
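A slightly more defensive variant of this .profile entry is sketched below. It appends the directory only if it exists and is not yet part of PATH, so that nested login shells do not accumulate duplicates. The helper name path_append is an assumption made for this example.

```shell
# Hypothetical helper for ~/.profile: append a directory to PATH once.
path_append() {
    [ -d "$1" ] || return 0            # ignore directories that do not exist
    case ":$PATH:" in
        *":$1:"*) ;;                   # already listed, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Add the CUDA compiler
path_append /usr/local/cuda/bin
export PATH
```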
Graphics cards
Four graphics cards are installed in the breakout.
nvidia-smi - query the state of the cards
The state of the graphics cards can be checked with nvidia-smi:
beckmanf@breakout:~$ nvidia-smi
Wed Dec 26 08:13:26 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
| 36%   54C    P0    42W / 180W |     10MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 27%   34C    P8     7W / 180W |     10MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:83:00.0 Off |                  N/A |
| 27%   37C    P8     7W / 180W |     10MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 00000000:84:00.0 Off |                  N/A |
| 90%   76C    P2   173W / 180W |   7323MiB /  8119MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       903      G   /usr/bin/X                                     5MiB |
|    1       903      G   /usr/bin/X                                     5MiB |
|    2       903      G   /usr/bin/X                                     5MiB |
|    3       903      G   /usr/bin/X                                     5MiB |
|    3     14538      C   python                                      7311MiB |
+-----------------------------------------------------------------------------+
In the example above you can see:
- There are four GeForce GTX 1080 graphics cards
- Graphics card “3” is currently in use: the fan runs at 90% and the temperature is 76 °C
- The process “python” with process ID 14538 runs on card 3. The memory of this card is almost full at 7323 MiB.
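Before starting a job it can help to pick a card that is currently idle. The sketch below assumes the query mode of nvidia-smi (--query-gpu together with --format=csv,noheader,nounits) and selects the card with the least memory in use. The function name least_used_gpu is hypothetical. The function only parses text from standard input, so it can be tried without a GPU.

```shell
# Hypothetical helper: read lines of the form "index, memory.used" and
# print the index of the GPU with the least memory in use.
least_used_gpu() {
    awk -F', *' '
        NR == 1 || $2 + 0 < min { min = $2 + 0; idx = $1 }
        END { if (NR > 0) print idx }
    '
}

# Intended use on the breakout (not executed here):
#   gpu="$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | least_used_gpu)"
#   CUDA_VISIBLE_DEVICES="$gpu" <your program>

# Memory usage taken from the nvidia-smi example above
printf '0, 10\n1, 10\n2, 10\n3, 7323\n' | least_used_gpu
```

With the values from the example this prints 0, the first of the three idle cards. Ties are resolved in favour of the lowest index.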
Running long jobs
tmux - Keep a session running even when you logout
With tmux you can keep a session running even when you logout. You can later login again and the session is still there. Create a new session:
tmux new-session -s fredo
Now you can start a program. You can leave the tmux session (and the program) running by typing CTRL-b d. This detaches you from the tmux session. You can then log out from your ssh session while everything keeps running on the breakout. Later you can log in to the breakout via ssh again and reattach to tmux with
tmux attach-session -t fredo
You should see the output from your running program.
kerberos - keep your file system alive
When you log in to the breakout with your RZ account, your home directory is mounted on the breakout from the RZ file server via nfs. When you log out, your home directory is unmounted after 5 minutes unless you still have a job running. If a job is still running, e.g. in tmux or in the background, your home directory remains mounted.
If you leave a job running for more than about 10 hours, you get errors when you try to access files in your home directory. The reason is that mounting requires authentication, which is done via the kerberos service. When you log in to the breakout with your password, you automatically receive a kerberos ticket derived from the login credentials. The automounter of your home directory requires this ticket: without a kerberos ticket the nfs server does not allow access to your files. The pytorch example from the section Running the imagenet training takes about 5 days. After approximately 10 hours of runtime I receive the following bus error message
Epoch: [12][4980/5005] Time 0.523 (0.524) Data 0.000 (0.034) Loss 2.5527 (2.5143) Acc@1 44.922 (44.781) Acc@5 69.922 (69.733)
Epoch: [12][4990/5005] Time 0.525 (0.524) Data 0.000 (0.034) Loss 2.7477 (2.5144) Acc@1 44.141 (44.778) Acc@5 66.016 (69.732)
Epoch: [12][5000/5005] Time 0.520 (0.524) Data 0.000 (0.034) Loss 2.3334 (2.5144) Acc@1 46.094 (44.776) Acc@5 70.312 (69.730)
Test: [0/196] Time 3.587 (3.587) Loss 1.6937 (1.6937) Acc@1 58.203 (58.203) Acc@5 86.328 (86.328)
Test: [10/196] Time 0.159 (0.814) Loss 2.3972 (2.0702) Acc@1 39.062 (51.598) Acc@5 75.391 (77.131)
...
Test: [170/196] Time 2.123 (0.635) Loss 1.9238 (2.3964) Acc@1 46.094 (45.463) Acc@5 81.641 (72.149)
Test: [180/196] Time 0.159 (0.630) Loss 2.1114 (2.4070) Acc@1 44.531 (45.254) Acc@5 78.125 (71.996)
Test: [190/196] Time 1.742 (0.633) Loss 1.7933 (2.3935) Acc@1 53.516 (45.492) Acc@5 87.891 (72.215)
 * Acc@1 45.864 Acc@5 72.442
Traceback (most recent call last):
  File "main.py", line 398, in <module>
  File "main.py", line 113, in main
...
  File "/rz2home/beckmanf/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 141, in _with_file_like
PermissionError: [Errno 13] Permission denied: 'checkpoint.pth.tar'
Bus-Zugriffsfehler
beckmanf@breakout:~/pytorch/examples/imagenet$
The reason for this bus error is that the pytorch program tries to write the file “checkpoint.pth.tar” to the home directory, but the home directory can no longer be accessed because the kerberos ticket has expired.
You can check the status of your current kerberos ticket with “klist”.
beckmanf@breakout:~$ klist
Ticket cache: FILE:/tmp/krb5cc_12487_ssddef
Default principal: beckmanf@RZ.HS-AUGSBURG.DE
Valid starting       Expires              Service principal
27.12.2018 08:28:43  27.12.2018 18:28:43  krbtgt/RZ.HS-AUGSBURG.DE@RZ.HS-AUGSBURG.DE
        renew until 28.12.2018 08:28:37
The kerberos ticket lifetime is 10h and the renew time is 24h. So after 18:28:43 you cannot access your home directory anymore. You can apply for a new ticket with longer lifetime and a longer renew time with “kinit”.
beckmanf@breakout:~$ kinit -l 2d -r 7d
Password for beckmanf@RZ.HS-AUGSBURG.DE:
In the example above you apply for a ticket lifetime of 2 days and a renew time of 7 days. You can check the result with klist again.
beckmanf@breakout:~$ klist
Ticket cache: FILE:/tmp/krb5cc_12487_ssddef
Default principal: beckmanf@RZ.HS-AUGSBURG.DE
Valid starting       Expires              Service principal
27.12.2018 08:30:09  27.12.2018 18:30:09  krbtgt/RZ.HS-AUGSBURG.DE@RZ.HS-AUGSBURG.DE
        renew until 03.01.2019 08:30:05
The kerberos ticket lifetime is still only 10h but the renew time is now seven days.
Renew a kerberos ticket
To get a new kerberos ticket you have to provide your password. However, you can renew an existing ticket and extend its lifetime without a password until the maximum renew time expires. You must hold a valid, non-expired ticket when you start the renewal. In the example above you would have to renew before 18:30:09. You renew with “kinit -R”, which does not ask for a password.
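The manual renewal can also be repeated in a loop. This is only a sketch of the idea: the function name renew_until_expired and the default interval of one hour are assumptions. It relies on “klist -s”, which prints nothing and reports through its exit status whether a valid ticket exists.

```shell
# Hypothetical helper: renew the current ticket at a fixed interval until
# it can no longer be renewed. $1 = seconds between renewals (default 3600).
renew_until_expired() {
    local interval="${1:-3600}"
    while klist -s; do          # exit status 0 while a valid ticket exists
        kinit -R || break       # stop once the renew time has expired
        sleep "$interval"
    done
}
```

The interval has to stay well below the ticket lifetime (10h in the example above). The loop only helps as long as the ticket cache itself still exists.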
Start a job with automatic kerberos ticket renew
You can do the ticket renew process automatically. When you start a job with “krenew”, then your existing kerberos ticket will be copied to a new ticket cache location and the renew process is automatically done until the renew time expires or the job is done. The ticket cache is copied because the kerberos cache that you received at login (here: /tmp/krb5cc_12487_ssddef) will be deleted at logout. To start the example from pytorch imagenet training, this would be done like this:
krenew python -- main.py --gpu=2 -a resnet18 /fast/imagenet
If you do this inside a tmux session, you can detach and log out. The job will run for up to seven days. When you log in later you can check the status of the job's kerberos ticket again with klist. You have to provide the filename of the job's ticket cache.
klist /tmp/krb5cc_12487_ftXjk0
In my example the new cache name from krenew was /tmp/krb5cc_12487_ftXjk0.
Login via Public Key Authentication
When you log in via Public Key Authentication, you do not receive a new kerberos ticket. If no valid kerberos ticket exists, “$HOME/.ssh/authorized_keys” cannot be read, so ssh falls back to the default password login and you receive a new kerberos ticket. If the public key login did succeed, “klist” will not show any kerberos ticket, because the ticket in use belongs to some other login session. However, you can still run “kinit” and receive a new kerberos ticket. It will be stored in the default kerberos ticket cache location at “/tmp/krb5cc_<uid>”.
PyTorch
I installed PyTorch via miniconda in my home directory. Anaconda/Miniconda is an installation method for python tools. The installation of miniconda is described here. I used the 64 Bit version for python 3.7. The download is here. So I did:
cd
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda update conda
The conda files are installed in your home directory under $HOME/miniconda3. You have to add the path to the conda binaries to your PATH variable by adding this section
if [ -d "$HOME/miniconda3" ]; then
    export PATH=$HOME/miniconda3/bin:$PATH
fi
to your .profile file in your home directory. Then you have to log out and log in again. Now the conda program should be available. Check with:
beckmanf@breakout:~$ which conda
/rz2home/beckmanf/miniconda3/bin/conda
Now you can update the conda installations with:
conda update conda
The installation of PyTorch is done via
conda install pytorch torchvision -c pytorch
Running the CIFAR-10 tutorial via jupyter notebook
I did the CIFAR-10 classifier tutorial via a jupyter notebook. Jupyter notebook is a web frontend that lets you execute the python code from a web browser. To set up the jupyter framework I installed
conda install notebook
cd
mkdir -p pytorch/cifar10
cd pytorch/cifar10
beckmanf@breakout:~/pytorch/cifar10$ jupyter notebook --no-browser
[I 11:59:55.306 NotebookApp] The port 8888 is already in use, trying another port.
[I 11:59:55.405 NotebookApp] Serving notebooks from local directory: /rz2home/beckmanf/pytorch/cifar10
[I 11:59:55.405 NotebookApp] 0 active kernels
[I 11:59:55.405 NotebookApp] The Jupyter Notebook is running at:
[I 11:59:55.405 NotebookApp] http://localhost:8889/?token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a
[I 11:59:55.405 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 11:59:55.405 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time, to login with a token:
        http://localhost:8889/?token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a
In this example the jupyter web server is at port number 8889 on the breakout. The breakout is configured such that this port can NOT be reached from outside. Therefore you have to tunnel this port via ssh to your client machine. So do the following on your client with your account name.
FriedrichsMacBook:~ fritz$ ssh -p 2222 -L 8889:localhost:8889 beckmanf@breakout.hs-augsburg.de
Now you can open the jupyter notebook via a local webbrowser on your client machine. The url is the one which was given above including the token.
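Since jupyter picks the next free port on its own, the port for the tunnel has to be read from the URL that it prints. A small sketch (the function name jupyter_port is made up for this example) that extracts the port with plain shell parameter expansion and prints the matching ssh command:

```shell
# Hypothetical helper: print the port of a URL such as
# "http://localhost:8889/?token=...".
jupyter_port() {
    local rest="${1#*://}"        # strip the scheme   -> localhost:8889/?token=...
    local hostport="${rest%%/*}"  # strip the path     -> localhost:8889
    echo "${hostport##*:}"        # keep the port only -> 8889
}

# URL copied from the jupyter output above
port="$(jupyter_port 'http://localhost:8889/?token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a')"
echo "ssh -p 2222 -L ${port}:localhost:${port} <rzaccount>@breakout.hs-augsburg.de"
```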
Running the imagenet training
The imagenet-12 dataset is a set of 1.3 million images which are hand labeled and categorized in 1000 categories. The data is available on the breakout at /fast/imagenet. The training is done with the pytorch examples. Install the pytorch examples from the git repository:
cd
cd pytorch
git clone https://github.com/pytorch/examples.git
cd examples
cd imagenet
Now you can run the pytorch imagenet training with
python main.py --gpu=2 -a resnet18 /fast/imagenet
The training takes about 5 days on the breakout. Refer to Running long jobs to see how you can run such long jobs on the breakout.
Civil engineers - Photoscan
The photoscan software is installed under /opt/photoscan-pro. To run the software via the graphical user interface start the gui session via vncserver as described above. Then open a terminal and start photoscan via:
Start the Software
vglrun /opt/photoscan-pro/photoscan.sh
License Activation
The software is currently installed with root as owner. Therefore only root can update the software and the license. To update the license, do:
sudo /opt/photoscan-pro/photoscan.sh --activate EGKKS-KRNPU-LRMLE-RJDTS-GE4SK
Torch
All Debian packages required for installing Torch are installed on the breakout. Torch itself is not installed via the Debian package system but directly from git into the home directory. The example checks out a version that is known to have worked. The step install-deps.sh is skipped because it installs packages with sudo. A normal user lacks the sudo rights to install these packages, and they are already installed on the breakout anyway.
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch
git checkout efb9226e924d69513eea28f5f701cb5f5ca
TORCH_LUA_VERSION=LUA52 ./install.sh
source "$HOME/torch/install/bin/torch-activate"
Now add to .profile
# NVidia cuDNN library
if [ -f "/home/fritz/cuda/cudnn/cuda/lib64/libcudnn.so.6" ]; then
    export CUDNN_PATH="/home/fritz/cuda/cudnn/cuda/lib64/libcudnn.so.6"
fi
# Torch environment settings
if [ -f "$HOME/torch/install/bin/torch-activate" ]; then
    source "$HOME/torch/install/bin/torch-activate"
fi
As an example you can try http://torch.ch/blog/2015/07/30/cifar.html, which classifies 50000 images from the CIFAR-10 benchmark.
cd
git clone https://github.com/szagoruyko/cifar.torch.git
cd cifar.torch
OMP_NUM_THREADS=2 th -i provider.lua
# Opens torch shell - inside th:
provider = Provider()
provider:normalize()
torch.save('provider.t7',provider)
exit
# Now back on shell
CUDA_VISIBLE_DEVICES=0 th train.lua --model=vgg_bn_drop -s logs/vgg
The previous training uses the cuda compiled torch neural network models. NVidia provides specially crafted cuDNN models which are faster. To use these models:
CUDA_VISIBLE_DEVICES=0 th train.lua --model=vgg_bn_drop --backend=cudnn -s logs/cudnn
The network can also be trained without cuda/gpu support:
OMP_NUM_THREADS=16 th train.lua --model=vgg_bn_drop --type=float -s logs/cpu
Docker
Docker lets you run additional software packages without changing the base installation. Prerequisite:
- Your account must be a member of the group “docker”
Check whether you are a member of the group docker with
groups
If you are not a member of the group docker, the following actions will not work. Please note that actions under Docker are security relevant. By mounting directories with the -v option, files on the host that are owned by root can also be modified.
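The group check can also be scripted, for example at the top of a job script. A sketch, assuming the POSIX command “id -nG” to list the group names of the current user. The function name in_group is hypothetical.

```shell
# Hypothetical helper: succeed if group $1 occurs in the space-separated
# list $2 (default: the groups of the current user as printed by id -nG).
in_group() {
    local list="${2:-$(id -nG)}"
    case " $list " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

if in_group docker; then
    echo "docker group: ok"
else
    echo "not a member of the docker group"
fi
```

Padding the list with spaces makes sure that only whole group names match, so a group called “dockerx” is not mistaken for “docker”.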
Simple test
see: https://docs.docker.com/engine/getstarted/step_one/#step-3-verify-your-installation
docker run hello-world
NVidia Digits
see: https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS
nvidia-docker run --name digits -d -P nvidia/digits
- Option -d will run the docker image as daemon.
- Option -P will assign the used ports inside docker to random ports on the host.
To check which ports are assigned and which containers are running:
docker ps
In my example it looks like this:
fritz@breakout:~/docker$ docker ps
CONTAINER ID   IMAGE           COMMAND              CREATED          STATUS         PORTS                     NAMES
f9942fca476a   nvidia/digits   "python -m digits"   32 minutes ago   Up 3 seconds   0.0.0.0:32772->5000/tcp   digits
fritz@breakout:~/docker$
The section “PORTS” shows that port 5000 from the docker container is mapped to port 32772 on the host. Now you can run a web browser with “http://breakout.hs-augsburg.de:32772” and see the NVidia Digits web interface.
To stop NVidia Digits run
docker stop digits
docker rm digits
Tensorflow
With Python 2
Tensorflow version 1.4 supports Cuda 8.0 while all following versions require Cuda 9. The supported tensorflow version on this machine is 1.4. The recommended way to install tensorflow is “virtualenv”.
https://www.tensorflow.org/versions/r1.4/install/
Change your .profile and add the following
# nvidia cuDNN library
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/home/fritz/cuda/cudnn/cuda/lib64:$LD_LIBRARY_PATH"
to make the cuda and cudnn library accessible. Logout and login. Tensorflow 1.4 requires cuda 8.0 and cudnn 6.0. This machine uses python 2.7.
Install tensorflow:
virtualenv --system-site-packages ~/tensorflow
source ~/tensorflow/bin/activate
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp27-none-linux_x86_64.whl
Then validate that the installation worked.
With Python 3
Alternatively, you can also use Tensorflow with Python 3 on the server. Similar to the python2 version described above, only TensorFlow 1.4 is supported, but cuDNN 7.0 is used. Just add the following code to your ~/.profile
if [ -d "/fast/usr/bin" ] ; then
    PATH="/fast/usr/bin:$PATH"
fi
if [ -d "/fast/usr/local/cuda-8.0/lib64" ] ; then
    export LD_LIBRARY_PATH="/fast/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"
fi
Once you have reconnected to the server, you are ready to use python3 with TensorFlow.
Deskproto
The Deskproto CAM software for milling is installed and can be started with the GUI. Please start the graphical desktop manager via TurboVNC as described in the TurboVNC chapter and launch deskproto from within the desktop manager.
First Time Setup
The first run of Deskproto requires two setup steps. First run Deskproto from your home directory.
cd
/opt/deskproto/DeskProto71.AppImage
Select your language, Scaling and choose any machine. We will overwrite that in the next step. Once Deskproto has started, close it. Starting Deskproto for the first time will create two directories
~/.local/share/'Delft Spline Systems'/Deskproto
~/.config/'Delft Spline Systems'
which contain drivers, help pages etc. We have the StepFour XPERT 1000s mill in the lab and use Hufschmied cutters. We have added those cutters and the 1000s to the driver directory /opt/deskproto/Drivers. I have made a setup file which configures our mill and this additional driver directory. To use it, copy the setup file to your local configuration directory.
cd
cp /opt/deskproto/DeskProto.conf ~/.config/'Delft Spline Systems'
Startup of Deskproto
After you have overwritten the configuration file, you can start Deskproto. Due to a bug the file access to your nfs mounted home directory is slow. Any file dialog will take quite a while (maybe 2 minutes) to display files in your home directory. You can redefine the HOME variable for deskproto and start it.
cd
HOME=/fast /opt/deskproto/DeskProto71.AppImage