NixOS configuration for HPC cluster https://docs.hpc.informatik.hs-fulda.de/
# Infrastructure Deployment

The whole cluster infrastructure is built using [NixOS](https://nixos.org/).
The configuration repository is hosted at {{ config.repo_url }} and is deployed using [colmena](https://github.com/zhaofengli/colmena).
## Building the configuration

To build the configuration, a system with [Nix](https://nix.dev/install-nix) installed is required.
To activate the environment, run `nix develop` inside the configuration folder.
This fetches all required build dependencies and makes them available in the environment.
To build the whole configuration, run:

```
colmena build --verbose --show-trace
```

*Go grab a coffee, this can take a while.*
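When a change affects only a few machines, building just those nodes is usually faster than building the whole cluster. The following is a minimal sketch that wraps colmena's `--on` node filter in a shell function. The function name `build_nodes` and the node names in the usage comment are placeholders, not names taken from the configuration repository.

```shell
# Hypothetical helper: build only the given nodes.
# Usage: build_nodes node01 node02   (node names are placeholders)
build_nodes() {
  if [ "$#" -eq 0 ]; then
    echo "usage: build_nodes <node> [<node>...]" >&2
    return 1
  fi
  # Join all arguments with commas, the list format accepted by --on.
  nodes=$(IFS=,; printf '%s' "$*")
  colmena build --on "$nodes" --show-trace
}
```

Like the full build, this is meant to be run inside the `nix develop` environment so that `colmena` is available.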
## Deploying

> Note: Deployment requires SSH access as the `root` user to all machines.

To deploy configuration changes or updates to the cluster, run the following command:

```
colmena apply switch
```
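Before switching the whole cluster, it can help to try a change on a single node first. The sketch below assumes colmena's `dry-activate` goal and its `--on` node filter. `preview_node` and `deploy_node` are hypothetical helper names, and the node name passed to them is a placeholder.

```shell
# Hypothetical helpers for rolling out a change to one node at a time.

# Build and copy the configuration, then report what activation would do
# on the node without actually activating it.
preview_node() {
  colmena apply --on "$1" dry-activate
}

# Build, copy, and activate the configuration on that node only.
deploy_node() {
  colmena apply --on "$1" switch
}
```

A typical sequence would be `preview_node node01`, a review of the output, and then `deploy_node node01`.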
### Using the manager as an SSH jump host

SSH access to the nodes is limited.
Therefore, the manager system can be used as a jump host.
To do so, add the following lines to your local `~/.ssh/config` file (before the `Host *` entry):

```
Host 10.32.47.1??
  IdentitiesOnly yes
  ProxyJump root@10.32.47.10
```
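For a one-off connection without touching `~/.ssh/config`, the same jump can be requested on the command line with OpenSSH's `-J` option. This is a sketch only: `node_ssh` is a hypothetical helper name, the address in the usage comment is just an example that matches the `Host` pattern, and logging in as `root` on the node is an assumption based on the deployment note.

```shell
# Hypothetical helper: open a shell on a node through the manager.
# The manager address is the one used in the ProxyJump line.
# Usage: node_ssh 10.32.47.101   (example address)
node_ssh() {
  ssh -J root@10.32.47.10 "root@$1"
}
```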
## Updating

To update all systems, run the following command in the configuration repository:

```
nix flake update
```

This updates all dependencies, including the NixOS operating system.
After the update, the changed configuration (with the updated dependencies) must be [deployed](#deploying).
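The update and the deployment can be chained so that a broken build is noticed before any machine is touched. The following is a sketch under that assumption, not the repository's own tooling. `update_cluster` is a hypothetical name, and the function only combines the commands already shown in this document.

```shell
# Hypothetical helper: update the flake inputs, verify that everything still
# builds, and only then deploy. Stops at the first failing step.
update_cluster() {
  nix flake update || return 1
  colmena build --show-trace || return 1
  colmena apply switch
}
```

`nix flake update` rewrites `flake.lock`, so looking at the changes to that file before deploying shows which inputs moved.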
## Gather node information

The configuration repository relies on some information gathered from the machines themselves.
After bootstrapping a machine, this information needs to be collected from the machine into the configuration repository.
To gather this data, run the following command:

```
./gather.sh
```
## Secret management

The config repository contains several secrets, which are secured by [sops](https://github.com/getsops/sops) and the corresponding [Nix integration](https://github.com/Mic92/sops-nix).
To edit a secrets file, run the following command:

```
sops <path/to/secrets/file>
```

This requires the fingerprint of the editor's PGP key to be part of the `adminKeys` list in `sops.nix`.
Altering the list requires one of the existing members to [update the keys](#update-keys).
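If `sops` refuses to decrypt a file, a quick first check is whether your own key fingerprint appears in `sops.nix` at all. The sketch below is a hypothetical helper, `check_admin_key`, and rests on two assumptions: the relevant key is the first secret key in the local GnuPG keyring, and `sops.nix` stores fingerprints as 40 hexadecimal characters without spaces. It only searches the file as text and does not evaluate the `adminKeys` list.

```shell
# Hypothetical helper: check whether the first secret key in the local GnuPG
# keyring is mentioned in sops.nix (plain text search, case-insensitive).
# Usage: check_admin_key [path/to/sops.nix]
check_admin_key() {
  file="${1:-sops.nix}"
  # In --with-colons output, "fpr" records carry the fingerprint in field 10.
  fpr=$(gpg --list-secret-keys --with-colons |
    awk -F: '$1 == "fpr" { print $10; exit }')
  if [ -z "$fpr" ]; then
    echo "no secret key found in the GnuPG keyring" >&2
    return 2
  fi
  if grep -qiF -- "$fpr" "$file"; then
    echo "$fpr is listed in $file"
  else
    echo "$fpr is NOT listed in $file" >&2
    return 1
  fi
}
```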
### Update keys

Whenever a key changes, either the SSH key of a machine or the PGP key of an administrator, the secret files need to be updated.
To do so, run the following command:

```
find . -type f \( -name "secrets.yaml" -o -path "*/secrets/*" \) -exec sops updatekeys {} \;
```
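Before re-encrypting anything, it can be useful to see which files the `find` expression selects. The sketch below uses the same name and path tests but only prints the matches instead of calling `sops`. `list_secret_files` is a hypothetical helper name.

```shell
# Hypothetical helper: print the files that the updatekeys command would
# process, without invoking sops. Defaults to the current directory.
# Usage: list_secret_files [directory]
list_secret_files() {
  find "${1:-.}" -type f \( -name "secrets.yaml" -o -path "*/secrets/*" \) -print
}
```

Running it from the root of the configuration repository gives a dry-run view of the update.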
## Bootstrapping a node

Compute nodes can be bootstrapped using PXE boot.
The manager provides a touchless boot image that automatically installs the node with the current deployment.
Booting the node from PXE (network boot) is enough to start the bootstrapping process.
After bootstrapping a node, make sure to [gather the node data](#gather-node-information) and [update the secret keys](#update-keys).