Fix cloud data dir
* VMs on the cloud might not have enough space on all partitions. Add a workaround which should cover most cases
* Use the branch name and commit hash to version report directories
* Fix parsing error when temperature is not available in nvidia-smi outputs
* Export MILABENCH_* env vars to remote
satyaog committed Apr 10, 2024
1 parent 5267334 commit e8d783a
Showing 10 changed files with 173 additions and 23 deletions.
31 changes: 31 additions & 0 deletions config/cloud-multinodes-system.yaml
@@ -0,0 +1,31 @@
+system:
+  # Nodes list
+  nodes:
+    # Alias used to reference the node
+    - name: manager
+      # Use 1.1.1.1 as an ip placeholder
+      ip: 1.1.1.1
+      # Use this node as the master node or not
+      main: true
+      # User to use in remote milabench operations
+      user: user
+
+    - name: node1
+      ip: 1.1.1.1
+      main: false
+      user: username
+
+  # Cloud instances profiles
+  cloud_profiles:
+    azure__a100:
+      username: ubuntu
+      size: Standard_NC24ads_A100_v4
+      location: eastus2
+    azure__a100_x2:
+      username: ubuntu
+      size: Standard_NC48ads_A100_v4
+      location: eastus2
+    azure__a10_x2:
+      username: ubuntu
+      size: Standard_NV72ads_A10_v5
+      location: eastus2
8 changes: 8 additions & 0 deletions config/cloud-system.yaml
@@ -16,3 +16,11 @@ system:
       username: ubuntu
       size: Standard_NC24ads_A100_v4
       location: eastus2
+    azure__a100_x2:
+      username: ubuntu
+      size: Standard_NC48ads_A100_v4
+      location: eastus2
+    azure__a10_x2:
+      username: ubuntu
+      size: Standard_NV72ads_A10_v5
+      location: eastus2
3 changes: 2 additions & 1 deletion config/examples/cloud-multinodes-system.yaml
@@ -17,7 +17,8 @@ system:

   # Cloud instances profiles
   cloud_profiles:
-    # The cloud platform to use in the form of {PLATFORM}__{PROFILE_NAME}
+    # The cloud platform to use in the form of {PLATFORM} or
+    # {PLATFORM}__{PROFILE_NAME}
     azure:
       # covalent-azure-plugin args
       username: ubuntu
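The profile key convention in the comment above (`{PLATFORM}` or `{PLATFORM}__{PROFILE_NAME}`) can be illustrated with a minimal sketch. It mirrors the `run_on.split("__")` logic that `manage_cloud` uses in this commit; the helper name `split_profile_key` is hypothetical and not part of milabench.

```python
def split_profile_key(run_on: str) -> tuple[str, str]:
    """Split a cloud profile key into (platform, profile_name).

    Hypothetical helper that mirrors the split done in manage_cloud.
    The profile name is an empty string when the key is a bare platform.
    """
    platform, *profile = run_on.split("__")
    return platform, (profile[0] if profile else "")


print(split_profile_key("azure"))           # ('azure', '')
print(split_profile_key("azure__a100_x2"))  # ('azure', 'a100_x2')
```

Only the first segment after the platform is kept, so a key with more than one `__` separator would silently lose its trailing parts.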
76 changes: 76 additions & 0 deletions docs/usage.rst
@@ -69,3 +69,79 @@ The following command will print out a report of the tests that ran, the metrics
     milabench report --runs $MILABENCH_BASE/runs/some_specific_run --html report.html

 The report will also print out a score based on a weighting of the metrics, as defined in the file ``$MILABENCH_CONFIG`` points to.
+
+
+Use milabench on the cloud
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+Setup Terraform and a free Azure account
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. | Install the Azure CLI (it does not need to be in the same environment as
+     milabench)
+   | ``pip install azure-cli``
+#. Set up a free account on
+   `azure.microsoft.com <https://azure.microsoft.com/en-us/free/>`_
+#. Follow the instructions in the
+   `azurerm documentation <https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/service_principal_client_secret#creating-a-service-principal-using-the-azure-cli>`_
+   to generate an ``ARM_CLIENT_ID`` as well as an ``ARM_CLIENT_SECRET``. If you
+   don't have the permissions to create / assign a role to a service principal,
+   you can skip the ``az ad sp create-for-rbac`` command and work directly with
+   your ``ARM_TENANT_ID`` and ``ARM_SUBSCRIPTION_ID``
+#. `Install Terraform <https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli>`_
+#. Configure the ``azurerm`` Terraform provider by
+   `exporting the environment variables <https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/service_principal_client_secret#configuring-the-service-principal-in-terraform>`_
+
+
+Create a cloud system configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Add a ``cloud_profiles`` section to the ``system`` configuration which lists the
+supported cloud profiles.
+
+.. note::
+
+   Nodes that should be created on the cloud must use the ``1.1.1.1`` ip
+   address placeholder. Other ip addresses are used as-is and no cloud
+   instance is created for those nodes.
+
+.. note::
+
+   A cloud profile entry needs to start with a covalent plugin (e.g. ``azure``). To
+   define multiple profiles on the same cloud platform, use the form
+   ``{PLATFORM}__{PROFILE_NAME}`` (e.g. ``azure__profile``). All cloud profile
+   attributes are passed as-is as arguments to the target covalent plugin.
+
+.. code-block:: yaml
+
+   system:
+     nodes:
+       - name: manager
+         # Use 1.1.1.1 as an ip placeholder
+         ip: 1.1.1.1
+         main: true
+         user: <username>
+       - name: node1
+         ip: 1.1.1.1
+         main: false
+         user: <username>
+     # Cloud instances profiles
+     cloud_profiles:
+       # The cloud platform to use in the form of {PLATFORM} or
+       # {PLATFORM}__{PROFILE_NAME}
+       azure__free:
+         # covalent-azure-plugin args
+         username: ubuntu
+         size: Standard_B2ats_v2
+         location: eastus2
+
+Run milabench on the cloud
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. | Initialize the cloud instances
+   | ``milabench cloud --system {{SYSTEM_CONFIG.YAML}} --setup --run-on {{PROFILE}} >{{SYSTEM_CLOUD_CONFIG.YAML}}``
+#. | Prepare, install and run milabench
+   | ``milabench [prepare|install|run] --system {{SYSTEM_CLOUD_CONFIG.YAML}}``
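As a rough illustration of the ip placeholder rule documented above, the sketch below separates the nodes that would get a cloud instance from the nodes that are used as-is. `PLACEHOLDER_IP` and `nodes_to_provision` are hypothetical names; milabench performs this check inline in `manage_cloud`, and the example addresses are made up.

```python
PLACEHOLDER_IP = "1.1.1.1"  # placeholder value taken from the documentation above


def nodes_to_provision(system: dict) -> list[str]:
    """Return the names of nodes that still carry the placeholder ip."""
    return [n["name"] for n in system["nodes"] if n["ip"] == PLACEHOLDER_IP]


# Example system configuration, already loaded from YAML into a dict.
system = {
    "nodes": [
        {"name": "manager", "ip": "1.1.1.1", "main": True, "user": "ubuntu"},
        {"name": "node1", "ip": "10.0.0.5", "main": False, "user": "ubuntu"},
    ]
}
print(nodes_to_provision(system))  # ['manager']
```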
21 changes: 19 additions & 2 deletions milabench/cli/cloud.py
@@ -7,8 +7,9 @@
 from omegaconf import OmegaConf
 import yaml

-from ..common import get_multipack
+from milabench.fs import XPath

+from ..common import get_multipack

 _SETUP = "setup"
 _TEARDOWN = "teardown"
@@ -25,8 +26,12 @@ def _flatten_cli_args(**kwargs):
     )


+def _or_sudo(cmd:str):
+    return f"( {cmd} || sudo {cmd} )"
+
+
 def manage_cloud(pack, run_on, action="setup"):
-    assert run_on in pack.config["system"]["cloud_profiles"]
+    assert run_on in pack.config["system"]["cloud_profiles"], f"{run_on} cloud profile not found in {list(pack.config['system']['cloud_profiles'].keys())}"

     key_map = {
         "hostname":(lambda v: ("ip",v)),
@@ -38,6 +43,9 @@ def manage_cloud(pack, run_on, action="setup"):
     run_on, *profile = run_on.split("__")
     profile = profile[0] if profile else ""

+    remote_base = XPath("/data") / pack.dirs.base.name
+    local_base = pack.dirs.base.absolute().parent
+
     nodes = iter(enumerate(pack.config["system"]["nodes"]))
     for i, n in nodes:
         if n["ip"] != "1.1.1.1":
@@ -66,6 +74,15 @@ def manage_cloud(pack, run_on, action="setup"):
             f"--{action}",
             *_flatten_cli_args(**plan_params)
         ]
+        if action == _SETUP:
+            cmd += [
+                "--",
+                "bash", "-c",
+                _or_sudo(f"mkdir -p '{local_base.parent}'") +
+                " && " + _or_sudo(f"chmod a+rwX '{local_base.parent}'") +
+                f" && mkdir -p '{remote_base}'"
+                f" && ln -sfT '{remote_base}' '{local_base}'"
+            ]
         p = subprocess.Popen(
             cmd,
             stdout=subprocess.PIPE,
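The setup command assembled in the hunk above can be read more easily as a standalone sketch. The version below keeps the same four shell steps (create the parent directory, open its permissions, create the data directory, symlink it into place) but uses `shlex.quote` instead of the hand-written single quotes. `or_sudo` follows `_or_sudo` from the commit; `link_data_dir_command` and the example paths are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def or_sudo(cmd: str) -> str:
    """Run cmd as the current user and retry it with sudo if it fails."""
    return f"( {cmd} || sudo {cmd} )"


def link_data_dir_command(link: PurePosixPath, target: PurePosixPath) -> str:
    """Build one shell line that makes `link` a symlink to `target`.

    Hypothetical helper: shlex.quote is an assumption added here so that
    paths containing spaces or quotes stay a single shell word.
    """
    parent = shlex.quote(str(link.parent))
    quoted_target = shlex.quote(str(target))
    quoted_link = shlex.quote(str(link))
    return " && ".join(
        [
            or_sudo(f"mkdir -p {parent}"),
            or_sudo(f"chmod a+rwX {parent}"),
            f"mkdir -p {quoted_target}",
            # -T treats the link name as a file, -f replaces an existing link
            f"ln -sfT {quoted_target} {quoted_link}",
        ]
    )


print(link_data_dir_command(PurePosixPath("/home/ubuntu/milabench"), PurePosixPath("/data/base")))
```

The resulting string is meant to be passed as the single argument of `bash -c`, as in the commit.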
2 changes: 1 addition & 1 deletion milabench/cli/covalent/__main__.py
@@ -113,7 +113,7 @@ def lattice(argv=(), deps_bash = None):
     if argv:
         dispatch_id = ct.dispatch(lattice, disable_run=False)(argv, deps_bash=deps_bash)
         result = ct.get_result(dispatch_id=dispatch_id, wait=True)
-        return_code, stdout, _ = result.result if result.result is not None else (1, "", "")
+        return_code, _, _ = result.result if result.result is not None else (1, "", "")

     if return_code == 0 and args.setup:
         _executor:ct.executor.BaseExecutor = executor_cls(
2 changes: 1 addition & 1 deletion milabench/cli/covalent/requirements.txt
@@ -1,3 +1,3 @@
-covalent
+covalent==0.232
 covalent-ec2-plugin @ git+https://github.com/satyaog/covalent-ec2-plugin.git@feature/milabench
 covalent-azure-plugin @ git+https://github.com/satyaog/covalent-azure-plugin.git@feature/milabench
28 changes: 14 additions & 14 deletions milabench/common.py
@@ -363,26 +363,25 @@ def _push_reports(reports_repo, runs):
                 device = _meta["cpu"]["brand"].replace(" ", "_")
                 break

-        tag = ([
-            t.name
-            for t in _repo.tags
-            if meta[0]["milabench"]["tag"].startswith(t.name)
-        ] or [meta[0]["milabench"]["tag"]])[0]
-        reports_dir = XPath(reports_repo.working_tree_dir) / tag
+        build = "-".join([_repo.active_branch.name.replace(os.path.sep, "_"), next(_repo.iter_commits()).hexsha])
+        reports_dir = XPath(reports_repo.working_tree_dir) / build

         run = XPath(run)
         try:
             run.copy(reports_dir / device / run.name)
         except FileExistsError:
             pass

-        device_reports.setdefault((device, tag), set())
-        device_reports[(device, tag)].update(
+        for _f in (reports_dir / device / run.name).glob("*.stderr"):
+            _f.unlink()
+
+        device_reports.setdefault((device, build), set())
+        device_reports[(device, build)].update(
             (reports_dir / device).glob("*/")
         )

-    for (device, tag), reports in device_reports.items():
-        reports_dir = XPath(reports_repo.working_tree_dir) / tag
+    for (device, build), reports in device_reports.items():
+        reports_dir = XPath(reports_repo.working_tree_dir) / build
         reports = _read_reports(*reports)
         reports = _filter_reports(*reports.values())
         summary = make_summary(reports)
@@ -404,9 +403,10 @@ def _push_reports(reports_repo, runs):
                 "--left-text", device,
                 "--right-text", text,
                 "--right-color", _SVG_COLORS[text],
-                "--whole-link", str(reports_url / tag / device)
+                "--whole-link", str(reports_url / build / device)
             ],
-            capture_output=True
+            capture_output=True,
+            check=True
         )
         if result.returncode == 0:
             (reports_dir / device / "badge.svg").write_text(result.stdout.decode("utf8"))
@@ -418,8 +418,8 @@ def _push_reports(reports_repo, runs):

     for cmd, _kwargs in (
         (["git", "pull"], {"check": True}),
-        (["git", "add", tag], {"check": True}),
-        (["git", "commit", "-m", tag], {"check": False}),
+        (["git", "add", build], {"check": True}),
+        (["git", "commit", "-m", build], {"check": False}),
         (["git", "push"], {"check": True})
     ):
         subprocess.run(
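The `build` name that replaces `tag` in this file is the branch name joined to the commit hash, with path separators in the branch name replaced so the result is a single directory name. A minimal sketch of that naming rule, using plain strings instead of a GitPython repository (the helper name `build_name` is hypothetical):

```python
import os


def build_name(branch: str, commit_sha: str) -> str:
    """Join a branch name and a commit hash into one directory name.

    Hypothetical helper mirroring the expression used in _push_reports.
    """
    return "-".join([branch.replace(os.path.sep, "_"), commit_sha])


# On POSIX systems os.path.sep is "/", so "feature/cloud" becomes "feature_cloud".
print(build_name("feature/cloud", "e8d783a"))
```

Git branch names use `/` on every platform, so replacing a literal `/` rather than `os.path.sep` might be the more portable choice; that is an observation about the sketch, not something the commit does.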
13 changes: 12 additions & 1 deletion milabench/log.py
@@ -333,6 +333,16 @@ def on_end(self, entry, data, row):
         self.refresh()


+_NO_DEFAULT_FLAG=("__NO_DEFAULT__",)
+def _parse_int(value, default=_NO_DEFAULT_FLAG):
+    try:
+        return int(value)
+    except TypeError:
+        if default is not _NO_DEFAULT_FLAG:
+            return default
+        raise
+
+
 class LongDashFormatter(DashFormatter):
     def make_table(self):
         table = Table.grid(padding=(0, 3, 0, 0))
@@ -375,7 +385,8 @@ def on_data(self, entry, data, row):
             for gpuid, data in gpudata.items():
                 load = int(data.get("load", 0) * 100)
                 currm, totalm = data.get("memory", [0, 0])
-                temp = int(data.get("temperature", 0))
+                # "temperature" is sometimes reported as None for some GPUs? A10?
+                temp = _parse_int(data.get("temperature", 0), 0)
                 row[f"gpu:{gpuid}"] = (
                     f"{load}% load | {currm:.0f}/{totalm:.0f} MB | {temp}C"
                 )
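The `_parse_int` helper added in this file uses a sentinel so that `None` stays a legal default value. The sketch below shows the same pattern in isolation. It also catches `ValueError`, which is an assumption added here: the commit only catches `TypeError` (the `int(None)` case), so a string such as `"N/A"` would still raise there.

```python
_NO_DEFAULT = object()  # sentinel: tells "no default given" apart from default=None


def parse_int(value, default=_NO_DEFAULT):
    """Convert value to int, falling back to default when conversion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError: value is None. ValueError: value is a string like "N/A".
        # Handling ValueError is an assumption of this sketch, not the commit.
        if default is not _NO_DEFAULT:
            return default
        raise


print(parse_int("42"))      # 42
print(parse_int(None, 0))   # 0
print(parse_int("N/A", 0))  # 0
```

Without a default the original exception propagates, so callers that want strict parsing keep it.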
12 changes: 9 additions & 3 deletions milabench/remote.py
@@ -20,6 +20,14 @@
 INSTALL_FOLDER = str(ROOT_FOLDER)


+def milabench_env() -> list:
+    return [
+        f"{envvar}={os.environ[envvar]}"
+        for envvar in os.environ
+        if envvar.split("_")[0] == "MILABENCH" and os.environ[envvar]
+    ]
+
+
 def scp(node, folder, dest=None) -> list:
     """Copy a folder from local node to remote node"""
     host = node["ip"]
@@ -185,9 +193,7 @@ def milabench_remote_command(pack, *command, run_for="worker") -> ListCommand:
                 CmdCommand(
                     worker_pack(pack, worker),
                     "cd", f"{INSTALL_FOLDER}", "&&",
-                    f"MILABENCH_BASE={os.environ.get('MILABENCH_BASE', '')}",
-                    f"MILABENCH_CONFIG={os.environ.get('MILABENCH_CONFIG', '')}",
-                    f"MILABENCH_SYSTEM={os.environ.get('MILABENCH_SYSTEM', '')}",
+                    *milabench_env(),
                     "milabench", *command
                 ),
                 host=host,
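The behaviour of `milabench_env` is easier to check when the environment is passed in explicitly. The sketch below keeps the same filter (first `_`-separated segment equals `MILABENCH`, value non-empty) and adds an optional `environ` parameter, which is a hypothetical addition for testability; the function in the commit always reads `os.environ`.

```python
import os
from typing import Mapping, Optional


def milabench_env(environ: Optional[Mapping[str, str]] = None) -> list:
    """Return NAME=value strings for every non-empty MILABENCH* variable."""
    environ = os.environ if environ is None else environ
    return [
        f"{name}={value}"
        for name, value in environ.items()
        if name.split("_")[0] == "MILABENCH" and value
    ]


fake_env = {
    "MILABENCH_BASE": "/data/base",
    "MILABENCH_CONFIG": "",  # empty values are skipped
    "HOME": "/home/ubuntu",  # unrelated variables are skipped
}
print(milabench_env(fake_env))  # ['MILABENCH_BASE=/data/base']
```

Compared with the three hard-coded variables it replaces, the filter no longer forwards empty assignments such as `MILABENCH_SYSTEM=`. Values containing spaces would still need quoting before being placed on a remote shell command line.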
