DLab is an essential toolset for analytics. It is a self-service web console used to create and manage exploratory environments. It allows teams to spin up analytical environments with best-of-breed open-source tools with a single click. Once established, the environment can be managed by the analytical team itself through a simple, easy-to-use web interface.
The following diagram demonstrates the high-level logical architecture.
The diagram shows the main components of DLab, a self-service tool for deploying infrastructure and interacting with it. The purpose of each component is described below.
Self-Service is a service that provides a RESTful user API with a Web User Interface for data scientists. It interacts closely with the Provisioning Service and the Database. Self-Service delegates all user requests to the Provisioning Service. After executing a request from Self-Service, the Provisioning Service returns a response describing the action performed on the particular resource, and Self-Service saves this response to the Database. Each time Self-Service receives a request about the status of provisioned infrastructure resources, it loads the status from the Database and propagates it to the Web UI.
Billing is a module that loads the billing report for the environment into the database. It can run as part of Self-Service or as a separate process.
The Provisioning Service is a RESTful service that provides APIs for provisioning the user’s infrastructure. It receives a request from Self-Service, then forms and sends a command to Docker to execute the requested action. Docker executes the command and generates a response.json file. The Provisioning Service analyzes response.json and responds to the initial request from Self-Service with status-related information about the instance.
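As a rough illustration of the last step, the sketch below reads a response file and pulls out a status. The `status` key and the function name are assumptions made for this example only; this section does not document the actual schema of response.json.

```python
import json
from pathlib import Path


def read_provisioning_response(path):
    """Load a Docker response file and return (status, payload).

    NOTE: "status" is a hypothetical field name; the real response.json
    schema is not described in this document.
    """
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    # Fall back to "unknown" when the assumed key is absent.
    status = payload.get("status", "unknown")
    return status, payload
```

A caller could pass the returned status back to Self-Service and keep the full payload for logging.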
The Security Service is a RESTful service that provides an authorization API for Self-Service and the Provisioning Service via LDAP.
Docker is an infrastructure-provisioning module based on the Docker service; it provides low-level actions for infrastructure management.
The Database stores the description of user infrastructure, user settings and service information.
The following diagrams demonstrate the high-level physical architecture of DLab in AWS and Azure.
- Self-service node (SSN)
- Edge node
- Notebook node (Jupyter, Rstudio, etc.)
- Data engine cluster
- Data engine cluster as a service provided with Cloud
Creating the self-service node is the first step in deploying DLab. The SSN is the main server, with the following pre-installed services:
- DLab Web UI – the web user interface for managing and deploying all components of DLab. It is accessible at the following URL: http[s]://SSN_Public_IP_or_Public_DNS
- MongoDB – a database that contains part of DLab’s configuration, descriptions of users’ exploratory environments and user preferences.
- Docker – used for building DLab Docker containers, which are used for provisioning other components.
- Jenkins – an alternative to the Web UI. It is accessible at the following link: http[s]://SSN_Public_IP_or_Public_DNS/jenkins
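The two addresses in the list above follow the same pattern, which a small helper can make explicit. This is only a sketch: the function name and the `use_https` flag are illustrative and not part of DLab.

```python
def ssn_urls(host, use_https=True):
    """Return the Web UI and Jenkins URLs for an SSN public IP or DNS name."""
    scheme = "https" if use_https else "http"
    base = "{}://{}".format(scheme, host)
    return {"web_ui": base, "jenkins": base + "/jenkins"}
```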
An Elastic (static) IP address is assigned to the SSN node, so you are free to stop and start it, and the SSN node's IP address won’t change.
Setting up the Edge node is the first step a user is asked to perform after logging into DLab. This node is used as a proxy server and SSH gateway for the user. Through the Edge node, users can access a Notebook via HTTP and SSH. The Edge node has a Squid HTTP web proxy pre-installed.
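To make the SSH gateway role concrete, the sketch below builds an `ssh` command line that reaches a Notebook node through the Edge node using OpenSSH's `ProxyCommand` option with `-W`. The user name, key path and host addresses are placeholders, and the assumption that one key and one OS user work on both hosts is made for illustration only.

```python
import shlex


def ssh_via_edge(key_file, user, edge_host, notebook_host):
    """Return an ssh argv that reaches a Notebook node through the Edge node.

    Assumes (hypothetically) that the same key and OS user are valid on
    both the Edge node and the Notebook node.
    """
    # -W forwards stdin/stdout to the destination; ssh expands %h and %p
    # to the destination host and port.
    proxy = "ssh -i {} -W %h:%p {}@{}".format(
        shlex.quote(key_file), user, edge_host
    )
    return [
        "ssh",
        "-i", key_file,
        "-o", "ProxyCommand=" + proxy,
        "{}@{}".format(user, notebook_host),
    ]
```

The returned list can be passed to `subprocess.run` as is; nothing is executed by the helper itself.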
The next step is setting up a Notebook node (or a Notebook server). It is a server with pre-installed applications and libraries for data processing, data cleaning and transformations, numerical simulations, statistical modeling, machine learning, etc. The following analytical tools are currently supported in DLab and can be installed on a Notebook node:
- Jupyter
- RStudio
- Zeppelin
- TensorFlow + Jupyter
- Deep Learning + Jupyter
Apache Spark is also installed for each of the analytical tools above.
After deploying a Notebook node, the user can create one of the following clusters for it:
- Data engine - Spark standalone cluster
- Data engine service - cloud managed cluster platform (EMR for AWS or Dataproc for GCP) that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.
Adding a cluster is not mandatory and is only needed when additional computational resources are required for job execution.
DLab’s SSN node main directory structure is as follows:
/opt
 └───dlab
     ├───conf
     ├───sources
     ├───template
     ├───tmp
     │   └───result
     └───webapp
- conf – contains configuration for DLab Web UI and back-end services;
- sources – contains all Docker/Python scripts, templates and files for provisioning;
- template – Docker templates;
- tmp – temporary directory of DLab;
- tmp/result – temporary directory for Docker’s response files;
- webapp – contains all .jar files for DLab Web UI and back-end services.
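The layout above can be expressed as a small path table, which is handy when scripting against an SSN node. This is a sketch only; the names `DLAB_HOME` and `dlab_dirs` are illustrative and do not come from the DLab code base.

```python
from pathlib import PurePosixPath

# Root of the DLab installation on the SSN node, as shown in the tree above.
DLAB_HOME = PurePosixPath("/opt/dlab")


def dlab_dirs(home=DLAB_HOME):
    """Map each documented directory to its absolute path."""
    return {
        "conf": home / "conf",
        "sources": home / "sources",
        "template": home / "template",
        "tmp": home / "tmp",
        "result": home / "tmp" / "result",  # Docker response files
        "webapp": home / "webapp",
    }
```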
The log directory structure of the SSN node is as follows:
/var
 └───opt
     └───dlab
         └───log
             ├───dataengine
             ├───dataengine-service
             ├───edge
             ├───notebook
             └───ssn
These directories contain the log files for each template and for DLab back-end services.
- ssn – contains logs of back-end services;
    - provisioning.log – Provisioning Service log file;
    - security.log – Security Service log file;
    - selfservice.log – Self-Service log file;
- edge, notebook, dataengine, dataengine-service – contain logs of Python scripts.
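For scripts that need to locate a back-end log, the file names listed above can be resolved as follows. The helper and its short service keys are assumptions for this sketch; only the directory and file names come from the listing.

```python
from pathlib import PurePosixPath

LOG_ROOT = PurePosixPath("/var/opt/dlab/log")

# Back-end service log files stored under the "ssn" directory.
SSN_LOGS = {
    "provisioning": "provisioning.log",
    "security": "security.log",
    "selfservice": "selfservice.log",
}


def service_log_path(service):
    """Return the log file path for one of the back-end services."""
    if service not in SSN_LOGS:
        raise ValueError("unknown service: {!r}".format(service))
    return LOG_ROOT / "ssn" / SSN_LOGS[service]
```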
Deployment of DLab starts with creating the Self-Service (SSN) node. DLab can be deployed in AWS, Azure and Google Cloud. The prerequisites differ for each cloud provider.
Prerequisites (AWS):
- SSH key for EC2 instances. This key can be created through the Amazon Console.
- IAM user
- AWS access key ID and secret access key
- The following permissions should be assigned to the IAM user:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "iam:ListRoles",
                "iam:CreateRole",
                "iam:CreateInstanceProfile",
                "iam:PutRolePolicy",
                "iam:AddRoleToInstanceProfile",
                "iam:PassRole",
                "iam:GetInstanceProfile"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "ec2:DescribeImages",
                "ec2:CreateTags",
                "ec2:DescribeRouteTables",
                "ec2:CreateRouteTable",
                "ec2:AssociateRouteTable",
                "ec2:DescribeVpcEndpoints",
                "ec2:CreateVpcEndpoint",
                "ec2:ModifyVpcEndpoint",
                "ec2:DescribeInstances",
                "ec2:RunInstances"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:CreateBucket",
                "s3:PutBucketTagging",
                "s3:GetBucketTagging"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
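Before deploying, it can be useful to confirm that a policy document grants every action listed above. The sketch below is a simple offline check and makes several simplifying assumptions: it only looks at `Allow` statements, compares action names literally (no wildcard expansion) and ignores the `Resource` field. The function name is illustrative.

```python
import json

# Actions taken from the policy shown above.
REQUIRED_ACTIONS = {
    "iam:ListRoles", "iam:CreateRole", "iam:CreateInstanceProfile",
    "iam:PutRolePolicy", "iam:AddRoleToInstanceProfile", "iam:PassRole",
    "iam:GetInstanceProfile",
    "ec2:DescribeImages", "ec2:CreateTags", "ec2:DescribeRouteTables",
    "ec2:CreateRouteTable", "ec2:AssociateRouteTable",
    "ec2:DescribeVpcEndpoints", "ec2:CreateVpcEndpoint",
    "ec2:ModifyVpcEndpoint", "ec2:DescribeInstances", "ec2:RunInstances",
    "s3:ListAllMyBuckets", "s3:CreateBucket", "s3:PutBucketTagging",
    "s3:GetBucketTagging",
}


def missing_actions(policy_json):
    """Return the required actions that no Allow statement lists literally."""
    policy = json.loads(policy_json)
    allowed = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM permits a single string here
            actions = [actions]
        allowed.update(actions)
    return sorted(REQUIRED_ACTIONS - allowed)
```

An empty result means every listed action appears in the policy; it does not prove that the policy is actually attached to the IAM user.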
To build the SSN node, execute the following steps:
- Clone the Git repository and make sure that all prerequisites are installed.
- Go to the dlab directory.
- Execute the following script:
/usr/bin/python infrastructure-provisioning/scripts/deploy_dlab.py --conf_service_base_name dlab_test --aws_access_key XXXXXXX --aws_secret_access_key XXXXXXXXXX --aws_region us-west-2 --conf_os_family debian --conf_cloud_provider aws --aws_vpc_id vpc-xxxxx --aws_subnet_id subnet-xxxxx --aws_security_groups_ids sg-xxxxx,sg-xxxx --key_path /root/ --conf_key_name Test --conf_tag_resource_id dlab --aws_account_id xxxxxxxx --aws_billing_bucket billing_bucket --aws_report_path /billing/directory/ --action create
This Python script builds the front-end and back-end parts of DLab, creates the SSN Docker image and runs a Docker container to create the SSN node.
List of parameters for SSN node deployment:
Parameter | Description/Value |
---|---|
conf_service_base_name | Any infrastructure value (should be unique if multiple SSNs have been deployed before) |
aws_access_key | AWS user access key |
aws_secret_access_key | AWS user secret access key |
aws_region | AWS region |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_cloud_provider | Name of the cloud provider supported by DLab (AWS) |
aws_vpc_id | ID of the Virtual Private Cloud (VPC) |
aws_subnet_id | ID of the public subnet |
aws_security_groups_ids | One or more IDs of AWS Security Groups, which will be assigned to SSN node |
key_path | Path to admin key (without key name) |
conf_key_name | Name of the uploaded SSH key file (without “.pem” extension) |
conf_tag_resource_id | The name of tag for billing reports |
aws_account_id | The ID of the Amazon account |
aws_billing_bucket | The name of S3 bucket where billing reports will be placed |
aws_report_path | The path to billing reports directory in S3 bucket. This parameter isn't required when billing reports are placed in the root of S3 bucket. |
action | In case of SSN node creation, this parameter should be set to “create” |
Note: If the following parameters are not specified, they will be created automatically:
- aws_vpc_id
- aws_subnet_id
- aws_security_groups_ids
Note: If billing will not be used, the following parameters are not required:
- aws_account_id
- aws_billing_bucket
- aws_report_path
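The deployment command and the optional parameters noted above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the function name is illustrative, and the convention of passing `None` to skip an optional parameter is an assumption of this example, not a feature of deploy_dlab.py.

```python
SCRIPT = "infrastructure-provisioning/scripts/deploy_dlab.py"


def build_deploy_command(params, action="create", python="/usr/bin/python"):
    """Build the deploy_dlab.py argument list from a parameter mapping.

    Parameters whose value is None are left out, which matches the notes
    above for values that are created automatically or not required.
    """
    argv = [python, SCRIPT]
    for name, value in params.items():
        if value is None:
            continue  # optional parameter omitted
        argv += ["--" + name, str(value)]
    argv += ["--action", action]
    return argv
```

The result can be handed to `subprocess.run(argv, check=True)` from the dlab directory once real values replace the placeholders.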
After SSN node deployment, the following AWS resources will be created:
- SSN EC2 instance
- Elastic IP for SSN instance
- IAM role and EC2 Instance Profile for SSN
- Security Group for SSN node (if one was specified, the script will attach the provided one)
- VPC, Subnet (if they have not been specified) for SSN and EDGE nodes
- S3 bucket – its name will be <service_base_name>-ssn-bucket. This bucket will contain necessary dependencies and configuration files for Notebook nodes (such as .jar files, YARN configuration, etc.)
- S3 bucket for collaboration between DLab users. Its name will be <service_base_name>-shared-bucket
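The two bucket names follow directly from the service base name. The helper below is a sketch of that naming pattern only; it applies no normalization, so whether the deployment script adjusts characters such as underscores (which S3 bucket names do not allow) is not covered here.

```python
def ssn_bucket_names(service_base_name):
    """Return the documented SSN and shared bucket names for a base name."""
    return {
        "ssn": "{}-ssn-bucket".format(service_base_name),
        "shared": "{}-shared-bucket".format(service_base_name),
    }
```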
Prerequisites (Azure):
- IAM user with Contributor permissions.
- Service principal and JSON-based auth file containing clientId, clientSecret and tenantId.
Note: The following permissions should be assigned to the service principal:
- Windows Azure Active Directory
- Microsoft Graph
- Windows Azure Service Management API
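A quick sanity check of the auth file can catch a missing field before deployment starts. The sketch below only verifies that the three keys named in the prerequisites are present and non-empty; it does not contact Azure, and the function name is illustrative.

```python
import json

# Keys named in the prerequisites above.
REQUIRED_KEYS = ("clientId", "clientSecret", "tenantId")


def check_auth_file(text):
    """Return the required keys that are missing or empty in the JSON text."""
    data = json.loads(text)
    return [key for key in REQUIRED_KEYS if not data.get(key)]
```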
To build the SSN node, execute the following steps:
- Clone the Git repository and make sure that all prerequisites are installed.
- Go to the dlab directory.
- To have working billing functionality, review the Billing configuration note and use the proper parameters for SSN node deployment.
- To use Data Lake Store, review the Azure Data Lake usage pre-requisites note and use the proper parameters for SSN node deployment.
- Execute the following deploy_dlab.py script:
/usr/bin/python infrastructure-provisioning/scripts/deploy_dlab.py --conf_service_base_name dlab_test --azure_region westus2 --conf_os_family debian --conf_cloud_provider azure --azure_vpc_name vpc-test --azure_subnet_name subnet-test --azure_security_group_name sg-test1,sg-test2 --key_path /root/ --conf_key_name Test --azure_auth_path /dir/file.json --action create
This Python script builds the front-end and back-end parts of DLab, creates the SSN Docker image and runs a Docker container to create the SSN node.
List of parameters for SSN node deployment:
Parameter | Description/Value |
---|---|
conf_service_base_name | Any infrastructure value (should be unique if multiple SSNs have been deployed before) |
azure_resource_group_name | Resource group name (could be the same as the service base name) |
azure_region | Azure region |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_cloud_provider | Name of the cloud provider supported by DLab (Azure) |
azure_vpc_name | Name of the Virtual Network (VN) |
azure_subnet_name | Name of the Azure subnet |
azure_security_groups_name | One or more names of Azure Security Groups, which will be assigned to SSN node |
azure_ssn_instance_size | Instance size of SSN instance in Azure |
key_path | Path to admin key (without key name) |
conf_key_name | Name of the uploaded SSH key file (without “.pem” extension) |
azure_auth_path | Full path to auth json file |
azure_offer_number | Azure offer id number |
azure_currency | Currency that is used for billing information (e.g. USD) |
azure_locale | Locale that is used for billing information (e.g. en-US) |
azure_region_info | Region info that is used for billing information (e.g. US) |
azure_datalake_enable | Support of Azure Data Lake (true/false) |
azure_oauth2_enabled | Defines whether the Azure OAuth2 authentication mechanism is enabled (true/false) |
azure_validate_permission_scope | Defines whether DLab verifies the user's permission to the configured resource (scope) during login with OAuth2 (true/false). If Data Lake is enabled, the default scope is the Data Lake Store Account; otherwise it is the Resource Group where DLab is deployed. A user who has no role in the scope is forbidden to log in |
azure_application_id | Azure application ID that is used to log in users in DLab |
azure_ad_group_id | ID of the group in Active Directory whose members have full access to the shared folder in Azure Data Lake Store |
action | In case of SSN node creation, this parameter should be set to “create” |
Note: If the following parameters are not specified, they will be created automatically:
- azure_vpc_name
- azure_subnet_name
- azure_security_groups_name
Note: Billing configuration:
To find azure_offer_number, open the Azure Portal, go to Subscriptions and open yours, then click Overview; the value is shown under the Offer ID property.
Please see the RateCard API for more details about azure_offer_number, azure_currency, azure_locale and azure_region_info. These DLab deploy properties correspond to RateCard API request parameters.
Note: Azure Data Lake usage pre-requisites:
1. Configure an application in the Azure portal and grant the proper permissions to it.
    - Open the Azure Active Directory tab, then App registrations, and click New application registration.
    - Fill in the UI form with the following parameters: Name - the name of the new application; Application type - select Native; Sign-on URL - any valid URL, as it will be updated later.
    - Grant the proper permissions to the application. Select the application you just created in the App registration view, click Required permissions, then Add -> Select an API. In the search field, type MicrosoftAzureQueryService and press Select, then check the box Have full access to the Azure Data Lake service and save the changes. Repeat the same actions for the Windows Azure Active Directory API (available under Required permissions -> Add -> Select an API) and the box Sign in and read user profile.
    - Get the Application ID from the application properties; it will be used as azure_application_id for the deploy_dlab.py script.
2. Using the Data Lake resource presumes a shared folder where all users can write or read any data. To manage access to this folder, create a group in Active Directory or use an existing one. All users in this group will have RW access to the shared folder. Pass the ID (in Active Directory) of the group as the azure_ad_group_id parameter to the deploy_dlab.py script.
3. After executing the deploy_dlab.py script, go to the application created in step 1 and change the Redirect URIs value to https://SSN_HOSTNAME/, where SSN_HOSTNAME is the SSN node hostname.
After SSN node deployment, the following Azure resources will be created:
- Resource group where all DLab resources will be provisioned
- SSN Virtual machine
- Static public IP address for SSN virtual machine
- Network interface for SSN node
- Security Group for SSN node (if one was specified, the script will attach the provided one)
- Virtual network and Subnet (if they have not been specified) for SSN and EDGE nodes
- Storage account and blob container for necessary further dependencies and configuration files for Notebook nodes (such as .jar files, YARN configuration, etc.)
- Storage account and blob container for collaboration between DLab users
- If support of Data Lake is enabled: Data Lake and shared directory will be created
Prerequisites (GCP):
- IAM user
- Service account and JSON auth file for it. To get the JSON auth file, a key should be created for the service account through the Google Cloud console.
To build the SSN node, execute the following steps:
- Clone the Git repository and make sure that all prerequisites are installed.
- Go to the dlab directory.
- Execute the following script:
/usr/bin/python infrastructure-provisioning/scripts/deploy_dlab.py --conf_service_base_name dlab --gcp_region us-west1 --gcp_zone us-west1-a --conf_os_family debian --conf_cloud_provider gcp --key_path /key/path/ --conf_key_name key_name --gcp_ssn_instance_size n1-standard-1 --gcp_project_id project_id --gcp_service_account_path /path/to/auth/file.json --action create
This Python script builds the front-end and back-end parts of DLab, creates the SSN Docker image and runs a Docker container to create the SSN node.
List of parameters for SSN node deployment:
Parameter | Description/Value |
---|---|
conf_service_base_name | Any infrastructure value (should be unique if multiple SSNs have been deployed before) |
gcp_region | GCP region |
gcp_zone | GCP zone |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_cloud_provider | Name of the cloud provider supported by DLab (GCP) |
gcp_vpc_name | Name of the Virtual Network (VN) |
gcp_subnet_name | Name of the GCP subnet |
gcp_firewall_name | One or more names of GCP Security Groups, which will be assigned to SSN node |
key_path | Path to admin key (without key name) |
conf_key_name | Name of the uploaded SSH key file (without “.pem” extension) |
gcp_service_account_path | Full path to auth json file |
gcp_ssn_instance_size | Instance size of SSN instance in GCP |
gcp_project_id | ID of GCP project |
action | In case of SSN node creation, this parameter should be set to “create” |
Note: If you are going to use a Dataproc cluster, be aware that Dataproc has limited availability in GCP regions. See Cloud Dataproc availability by Region in GCP.
After SSN node deployment, the following GCP resources will be created:
- SSN VM instance
- External IP address for SSN instance
- IAM role and Service account for SSN
- Security Groups for SSN node (if one was specified, the script will attach the provided one)
- VPC, Subnet (if they have not been specified) for SSN and EDGE nodes
- Bucket – its name will be <service_base_name>-ssn-bucket. This bucket will contain necessary dependencies and configuration files for Notebook nodes (such as .jar files, YARN configuration, etc.)
- Bucket for collaboration between DLab users. Its name will be <service_base_name>-shared-bucket
Terminating the SSN node also removes all nodes and components related to it; in other words, terminating the Self-Service node terminates all of DLab’s infrastructure. Example of a command for terminating a DLab environment:
/usr/bin/python infrastructure-provisioning/scripts/deploy_dlab.py --conf_service_base_name dlab-test --aws_access_key XXXXXXX --aws_secret_access_key XXXXXXXX --aws_region us-west-2 --key_path /root/ --conf_key_name Test --conf_os_family debian --conf_cloud_provider aws --action terminate
List of parameters for SSN node termination:
Parameter | Description/Value |
---|---|
conf_service_base_name | Unique infrastructure value |
aws_access_key | AWS user access key |
aws_secret_access_key | AWS user secret access key |
aws_region | AWS region |
key_path | Path to admin key (without key name) |
conf_key_name | Name of the uploaded SSH key file (without “.pem” extension) |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_cloud_provider | Name of the cloud provider, which is supported by DLab (AWS) |
action | terminate |
/usr/bin/python infrastructure-provisioning/scripts/deploy_dlab.py --conf_service_base_name dlab-test --azure_vpc_name vpc-test --azure_resource_group_name resource-group-test --azure_region westus2 --key_path /root/ --conf_key_name Test --conf_os_family debian --conf_cloud_provider azure --azure_auth_path /dir/file.json --action terminate
List of parameters for SSN node termination:
Parameter | Description/Value |
---|---|
conf_service_base_name | Unique infrastructure value |
azure_region | Azure region |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_cloud_provider | Name of the cloud provider, which is supported by DLab (Azure) |
azure_vpc_name | Name of the Virtual Network (VN) |
key_path | Path to admin key (without key name) |
conf_key_name | Name of the uploaded SSH key file (without “.pem” extension) |
azure_auth_path | Full path to auth json file |
action | terminate |
/usr/bin/python infrastructure-provisioning/scripts/deploy_dlab.py --gcp_project_id project_id --conf_service_base_name dlab --gcp_region us-west1 --gcp_zone us-west1-a --key_path /root/ --conf_key_name key_name --conf_os_family debian --conf_cloud_provider gcp --gcp_service_account_path /path/to/auth/file.json --action terminate
List of parameters for SSN node termination:
Parameter | Description/Value |
---|---|
conf_service_base_name | Any infrastructure value (should be unique if multiple SSNs have been deployed before) |
gcp_region | GCP region |
gcp_zone | GCP zone |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_cloud_provider | Name of the cloud provider, which is supported by DLab (GCP) |
key_path | Path to admin key (without key name) |
conf_key_name | Name of the uploaded SSH key file (without “.pem” extension) |
gcp_service_account_path | Full path to auth json file |
gcp_project_id | ID of GCP project |
action | In case of SSN node termination, this parameter should be set to “terminate” |
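
The three termination tables above differ only in their provider-specific rows, so a simple lookup can flag a missing parameter before the script is invoked. This sketch mirrors the tables as written; the helper name and the split into common and provider-specific sets are choices made for this example, not part of DLab.

```python
# Parameters shared by all three termination tables above.
COMMON = {
    "conf_service_base_name", "key_path", "conf_key_name",
    "conf_os_family", "conf_cloud_provider",
}

# Provider-specific rows from each table ("action" is always "terminate").
TERMINATE_PARAMS = {
    "aws": COMMON | {"aws_access_key", "aws_secret_access_key", "aws_region"},
    "azure": COMMON | {"azure_region", "azure_vpc_name", "azure_auth_path"},
    "gcp": COMMON | {
        "gcp_region", "gcp_zone", "gcp_service_account_path", "gcp_project_id",
    },
}


def missing_terminate_params(provider, supplied):
    """Return the table parameters for `provider` that were not supplied."""
    if provider not in TERMINATE_PARAMS:
        raise ValueError("unsupported provider: {!r}".format(provider))
    return sorted(TERMINATE_PARAMS[provider] - set(supplied))
```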
A Gateway node (or Edge node) is an instance (virtual machine) provisioned in a public subnet. It serves as the entry point for accessing the user’s personal analytical environment. It is created by an end user, whose public key is uploaded there. A DLab user can access application resources such as notebook servers and data engine clusters only via the Edge node. The Edge node is also used to set up a SOCKS proxy for accessing notebook servers via Web UI and SSH. An Elastic (static) IP address is assigned to the Edge node. If the Edge node instance is removed by mistake, it can be re-created and its IP address won’t change.
To create an Edge node using the DLab Web UI, log in and click the “Upload” button. Choose the user’s SSH public key, then click the “Create” button. The Edge node will be deployed and the corresponding instance (virtual machine) will be started.
The following AWS resources will be created:
- Edge EC2 instance
- Elastic IP address for Edge EC2 instance
- User's S3 bucket
- Security Group for user's Edge instance
- Security Group for all further user's Notebook instances
- Security Groups for all further user's master nodes of data engine cluster
- Security Groups for all further user's slave nodes of data engine cluster
- IAM Roles and Instance Profiles for user's Edge instance
- IAM Roles and Instance Profiles for all further user's Notebook instances
- User private subnet. All further nodes (Notebooks, EMR clusters) will be provisioned in a different subnet than the SSN.
List of parameters for Edge node creation:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Name of the user |
aws_vpc_id | ID of AWS VPC where infrastructure is being deployed |
aws_region | AWS region where infrastructure was deployed |
aws_security_groups_ids | One or more IDs of the SSN instance security group |
aws_subnet_id | ID of the AWS public subnet where Edge will be deployed |
aws_private_subnet_prefix | Prefix of the private subnet |
conf_tag_resource_id | The name of tag for billing reports |
action | create |
The following Azure resources will be created:
- Edge virtual machine
- Static public IP address for Edge virtual machine
- Network interface for Edge node
- Security Group for user's Edge instance
- Security Group for all further user's Notebook instances
- Security Groups for all further user's master nodes of data engine cluster
- Security Groups for all further user's slave nodes of data engine cluster
- User's private subnet. All further nodes (Notebooks, data engine clusters) will be provisioned in different subnet than SSN.
- User's storage account and blob container
List of parameters for Edge node creation:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Name of the user |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
azure_region | Azure region where infrastructure was deployed |
azure_vpc_name | Name of Azure Virtual network where all infrastructure is being deployed |
azure_subnet_name | Name of the Azure public subnet where Edge will be deployed |
action | create |
The following GCP resources will be created:
- Edge VM instance
- External static IP address for Edge VM instance
- Security Group for user's Edge instance
- Security Group for all further user's Notebook instances
- Security Groups for all further user's master nodes of data engine cluster
- Security Groups for all further user's slave nodes of data engine cluster
- User's private subnet. All further nodes (Notebooks, data engine clusters) will be provisioned in different subnet than SSN.
- User's bucket
List of parameters for Edge node creation:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Name of the user |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
gcp_vpc_name | Name of the GCP Virtual network where all infrastructure is being deployed |
gcp_subnet_name | Name of the GCP public subnet where Edge will be deployed |
gcp_project_id | ID of GCP project |
action | create |
To start or stop the Edge node, click the button that looks like a cycle in the top right corner, then click the button in the “Action” field and choose the appropriate action from the drop-down menu.
List of parameters for Edge node starting/stopping:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
edge_user_name | Name of the user |
aws_region | AWS region where infrastructure was deployed |
action | start/stop |
List of parameters for Edge node starting:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
edge_user_name | Name of the user |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
azure_region | Azure region where infrastructure was deployed |
action | start |
List of parameters for Edge node stopping:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
edge_user_name | Name of the user |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
action | stop |
List of parameters for Edge node starting/stopping:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
edge_user_name | Name of the user |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
gcp_project_id | ID of GCP project |
action | start/stop |
If the Edge node was damaged or terminated manually, there is an option to re-create it.
To re-create a removed Edge node, click the status button next to the logged-in user’s name (top right corner of the screen). Then click the gear icon in the Actions column and choose “Recreate”.
List of parameters for Edge node recreation:
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Name of the user |
aws_vpc_id | ID of AWS VPC where infrastructure is being deployed |
aws_region | AWS region where infrastructure was deployed |
aws_security_groups_ids | ID of the SSN instance's AWS security group |
aws_subnet_id | ID of the AWS public subnet where Edge was deployed |
edge_elastic_ip | AWS Elastic IP address which was associated to Edge node |
conf_tag_resource_id | The name of tag for billing reports |
action | Create |
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Name of the user |
azure_vpc_name | Name of Azure Virtual network where all infrastructure is being deployed |
azure_region | Azure region where all infrastructure was deployed |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
azure_subnet_name | Name of the Azure public subnet where Edge was deployed |
action | Create |
Parameter | Description/Value |
---|---|
conf_resource | edge |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Name of the user |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
gcp_vpc_name | Name of the GCP Virtual network where all infrastructure is being deployed |
gcp_subnet_name | Name of the GCP public subnet where Edge will be deployed |
gcp_project_id | ID of GCP project |
action | create |
A Notebook node is an instance (virtual machine) with pre-installed analytical software, the required dependencies, and pre-configured kernels and interpreters. It is the main part of the personal analytical environment, which is set up by a data scientist. It can be created, stopped and terminated. To support a variety of analytical needs, a Notebook node can be provisioned on any instance shape the cloud supports in your region. From the analytical software pre-installed on a notebook node, end users can read and write data stored in buckets/containers.
To create a Notebook node, click the “Create new” button. Then, in the drop-down menu, choose the template type (jupyter/rstudio/zeppelin/tensor), enter the notebook name and choose the instance shape. After you click the “Create” button, the notebook node will be deployed and started.
List of parameters for Notebook node creation:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used previously when the Edge node was provisioned |
aws_notebook_instance_type | Value of the Notebook EC2 instance shape |
aws_region | AWS region where infrastructure was deployed |
aws_security_groups_ids | ID of the SSN instance's security group |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
conf_tag_resource_id | The name of tag for billing reports |
git_creds | User git credentials in JSON format |
action | Create |
Note: For the format of git_creds, see "Manage git credentials" below.
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
azure_notebook_instance_size | Value of the Notebook virtual machine shape |
azure_region | Azure region where infrastructure was deployed |
azure_vpc_name | Name of the Azure Virtual network where all infrastructure is being deployed |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
git_creds | User git credentials in JSON format |
action | Create |
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_os_family | Name of the Linux distribution family supported by DLab (debian/redhat) |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
gcp_vpc_name | Name of the GCP VPC network where all infrastructure is being deployed |
gcp_project_id | ID of GCP project |
gcp_notebook_instance_size | Value of the Notebook VM instance size |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
git_creds | User git credentials in JSON format |
action | Create |
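The three creation tables above share a common set of parameters and differ only in the cloud-specific ones. As a rough illustration, the shared part can be assembled and checked in Python. This is a minimal sketch, not DLab code: the function name `notebook_create_params` is hypothetical, how DLab actually passes these values to Docker is not shown here, and the lowercase `create` value is an assumption (the tables spell it both "Create" and "create").

```python
import json

# Values listed in the "application" and "conf_os_family" rows of the tables above.
APPLICATIONS = ("jupyter", "rstudio", "zeppelin", "tensor", "deeplearning")
OS_FAMILIES = ("debian", "redhat")


def notebook_create_params(service_base_name, key_name, edge_user_name,
                           os_family, application, git_creds=None,
                           **cloud_params):
    """Build the parameters common to all clouds, then add cloud-specific ones.

    Hypothetical helper: it only mirrors the parameter names from the tables.
    """
    if os_family not in OS_FAMILIES:
        raise ValueError(f"unsupported conf_os_family: {os_family}")
    if application not in APPLICATIONS:
        raise ValueError(f"unsupported application: {application}")
    params = {
        "conf_resource": "notebook",
        "conf_os_family": os_family,
        "conf_service_base_name": service_base_name,
        "conf_key_name": key_name,
        "edge_user_name": edge_user_name,
        "application": application,
        # git_creds is described as "JSON format", so it is serialized here.
        "git_creds": json.dumps(git_creds or []),
        # Assumption: lowercase action value.
        "action": "create",
    }
    # Cloud-specific keys such as aws_region or gcp_project_id.
    params.update(cloud_params)
    return params


if __name__ == "__main__":
    example = notebook_create_params(
        "dlab-test", "my-key", "test_user", "debian", "jupyter",
        aws_region="us-west-2", aws_notebook_instance_type="t2.medium",
    )
    print(json.dumps(example, indent=2))
```

The sample values (`dlab-test`, `my-key`, `us-west-2`, `t2.medium`) are placeholders chosen for the example only.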
To stop a Notebook node, click the “gear” button in the Actions column and choose “Stop” from the drop-down menu.
List of parameters for Notebook node stopping:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to stop |
aws_region | AWS region where infrastructure was deployed |
action | Stop |
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to stop |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
action | Stop |
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to stop |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
gcp_project_id | ID of GCP project |
action | Stop |
To start a Notebook node, click the gear button in the “Action” field and choose “Start” from the drop-down menu.
List of parameters for Notebook node start:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to start |
aws_region | AWS region where infrastructure was deployed |
git_creds | User git credentials in JSON format |
action | start |
Note: For the format of git_creds, see "Manage git credentials" below.
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to start |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
azure_region | Azure region where infrastructure was deployed |
git_creds | User git credentials in JSON format |
action | start |
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to start |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
gcp_project_id | ID of GCP project |
git_creds | User git credentials in JSON format |
action | start |
To terminate a Notebook node, click the gear button in the “Action” field and choose “Terminate” from the drop-down menu.
List of parameters for Notebook node termination:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to terminate |
aws_region | AWS region where infrastructure was deployed |
action | terminate |
Note: If terminate action is called, all connected data engine clusters will be removed.
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to terminate |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
action | terminate |
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance to terminate |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone where infrastructure was deployed |
gcp_project_id | ID of GCP project |
git_creds | User git credentials in JSON format |
action | terminate |
To list the available libraries (OS/Python2/Python3/R/Others) on a Notebook node, click the gear button in the “Action” field and choose “Manage libraries” from the drop-down menu.
List of parameters for Notebook node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
aws_region | AWS region where infrastructure was deployed |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_list |
Note: This operation will return a file with response [edge_user_name]_[application]_[request_id]_all_pkgs.json
Example of available libraries in response (type->library->version):
{
"os_pkg": {"htop": "2.0.1-1ubuntu1", "python-mysqldb": "1.3.7-1build2"},
"pip2": {"requests": "N/A", "configparser": "N/A"},
"pip3": {"configparser": "N/A"},
"r_pkg": {"rmarkdown": "1.5"},
"others": {"Keras": "N/A"}
}
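The response above is a nested mapping of library type to library name to version. A minimal sketch of reading it in Python follows; the function name `flatten_libraries` is hypothetical, and the sample text is copied from the example above rather than read from a real response file.

```python
import json

# Sample response copied from the example above.
SAMPLE_RESPONSE = """
{
  "os_pkg": {"htop": "2.0.1-1ubuntu1", "python-mysqldb": "1.3.7-1build2"},
  "pip2": {"requests": "N/A", "configparser": "N/A"},
  "pip3": {"configparser": "N/A"},
  "r_pkg": {"rmarkdown": "1.5"},
  "others": {"Keras": "N/A"}
}
"""


def flatten_libraries(response_text):
    """Turn type -> library -> version into sorted (type, library, version) tuples."""
    data = json.loads(response_text)
    rows = []
    for lib_type, libraries in data.items():
        for name, version in libraries.items():
            rows.append((lib_type, name, version))
    return sorted(rows)


if __name__ == "__main__":
    for lib_type, name, version in flatten_libraries(SAMPLE_RESPONSE):
        print(f"{lib_type:8} {name:16} {version}")
```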
List of parameters for Notebook node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
aws_region | AWS region where infrastructure was deployed |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
libs | List of additional libraries in JSON format with type (os_pkg/pip2/pip3/r_pkg/others) |
action | lib_install |
Example of the libs parameter:
{
...
"libs": [
{"group": "os_pkg", "name": "nmap"},
{"group": "os_pkg", "name": "htop"},
{"group": "pip2", "name": "requests"},
{"group": "pip3", "name": "configparser"},
{"group": "r_pkg", "name": "rmarkdown"},
{"group": "others", "name": "Keras"}
]
...
}
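The `libs` list shown above can be generated from a simpler mapping of group to library names. The following is a sketch only: the helper name `build_libs` is hypothetical, and the group names are taken from the `libs` row of the parameter table.

```python
import json

# Library groups named in the description of the "libs" parameter.
ALLOWED_GROUPS = ("os_pkg", "pip2", "pip3", "r_pkg", "others")


def build_libs(requested):
    """Convert {group: [names]} into the list-of-objects form shown in the example."""
    libs = []
    for group, names in requested.items():
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"unknown library group: {group}")
        for name in names:
            libs.append({"group": group, "name": name})
    return libs


if __name__ == "__main__":
    payload = {"libs": build_libs({"os_pkg": ["nmap", "htop"],
                                   "pip3": ["configparser"]})}
    print(json.dumps(payload, indent=2))
```

Rejecting unknown groups before the request is sent is a design choice of this sketch; the source does not say how DLab itself reacts to an unsupported group.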
List of parameters for Notebook node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_list |
List of parameters for Notebook node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
libs | List of additional libraries in JSON format with type (os_pkg/pip2/pip3/r_pkg/others) |
action | lib_install |
List of parameters for Notebook node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
gcp_project_id | ID of GCP project |
gcp_zone | GCP zone name |
action | lib_list |
List of parameters for Notebook node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
gcp_project_id | ID of GCP project |
gcp_zone | GCP zone name |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
libs | List of additional libraries in JSON format with type (os_pkg/pip2/pip3/r_pkg/others) |
action | lib_install |
To manage git credentials on a Notebook node, click the “Git credentials” button. In the menu that opens, you can add new credentials or edit existing ones.
List of parameters for Notebook node to manage git credentials:
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
aws_region | AWS region where infrastructure was deployed |
git_creds | User git credentials in JSON format |
action | git_creds |
Example of git_creds parameter:
[{
"username": "Test User",
"email": "[email protected]",
"hostname": "github.com",
"login": "testlogin",
"password": "testpassword"
}, ...]
Note: Fields "username" and "email" are used for commits (displays Author in git log).
Note: Leave "hostname" field empty to apply login/password by default for all services.
Note: You can also use personal access tokens instead of passwords.
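One way to read the note about the empty "hostname" field is as a fallback rule: an entry with a matching hostname wins, otherwise an entry with an empty hostname applies. The sketch below illustrates that reading only. The function `pick_credentials` and the sample values are hypothetical, and the source does not describe how DLab resolves credentials internally.

```python
# Hypothetical sample in the same shape as the git_creds example above.
GIT_CREDS = [
    {"username": "Test User", "email": "user@example.com",
     "hostname": "github.com", "login": "testlogin", "password": "testpassword"},
    {"username": "Test User", "email": "user@example.com",
     "hostname": "", "login": "defaultlogin", "password": "defaulttoken"},
]


def pick_credentials(git_creds, hostname):
    """Return the entry matching hostname, else the first entry with an empty hostname."""
    default = None
    for entry in git_creds:
        entry_host = entry.get("hostname", "")
        if entry_host == hostname:
            return entry
        if entry_host == "" and default is None:
            # Remember the first "default for all services" entry.
            default = entry
    return default


if __name__ == "__main__":
    print(pick_credentials(GIT_CREDS, "github.com")["login"])
    print(pick_credentials(GIT_CREDS, "gitlab.example.org")["login"])
```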
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
git_creds | User git credentials in JSON format |
action | git_creds |
Parameter | Description/Value |
---|---|
conf_resource | notebook |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
gcp_project_id | ID of GCP project |
gcp_region | GCP region name |
gcp_zone | GCP zone name |
notebook_instance_name | Name of the Notebook instance |
git_creds | User git credentials in JSON format |
action | git_creds |
Dataengine-service is a cluster provided by the cloud as a service (EMR on AWS). It can be created if more computational resources are needed to execute analytical algorithms and models triggered from analytical tools. Job execution is then scaled out to cluster mode, which increases performance and decreases execution time.
To create a dataengine-service cluster, click the “gear” button in the Actions column and click “Add computational resources”. Specify the dataengine-service version, fill in the dataengine-service name, and specify the number of instances and the instance shapes. Click the “Create” button.
List of parameters for dataengine-service cluster creation:
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
emr_timeout | Value of timeout for dataengine-service during build. |
emr_instance_count | Number of instances in the dataengine-service cluster |
emr_master_instance_type | Value for dataengine-service EC2 master instance shape |
emr_slave_instance_type | Value for dataengine-service EC2 slave instance shape |
emr_version | Available versions of dataengine-service (emr-5.2.0/emr-5.3.1/emr-5.6.0) |
notebook_instance_name | Name of the Notebook dataengine-service will be linked to |
edge_user_name | Value that was used when Edge was provisioned |
aws_region | AWS region where infrastructure was deployed |
conf_tag_resource_id | The name of tag for billing reports |
action | create |
Note: If “Spot instances” is enabled, dataengine-service Slave nodes will be created as EC2 Spot instances.
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
notebook_instance_name | Name of the Notebook dataengine-service will be linked to |
edge_user_name | Value that was used when Edge was provisioned |
gcp_subnet_name | Name of subnet |
dataproc_version | Version of Dataproc |
dataproc_master_count | Number of master nodes |
dataproc_slave_count | Number of slave nodes |
dataproc_preemptible_count | Number of preemptible nodes |
dataproc_master_instance_type | Size of master node |
dataproc_slave_instance_type | Size of slave node |
gcp_project_id | ID of GCP project |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone name |
conf_tag_resource_id | The name of tag for billing reports |
action | create |
To terminate a dataengine-service cluster, click the “x” button located in the “Computational resources” field.
List of parameters for dataengine-service cluster termination:
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
emr_cluster_name | Name of the dataengine-service to terminate |
notebook_instance_name | Name of the Notebook instance which dataengine-service is linked to |
aws_region | AWS region where infrastructure was deployed |
action | Terminate |
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance which dataengine-service is linked to |
gcp_project_id | ID of GCP project |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone name |
dataproc_cluster_name | Dataproc cluster name |
action | Terminate |
To list the available libraries (OS/Python2/Python3/R/Others) on Dataengine-service, click the gear button in the “Action” field and choose “Manage libraries” from the drop-down menu.
List of parameters for Dataengine-service node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
computational_id | Name of Dataengine-service |
edge_user_name | Value that was used when Edge was provisioned |
aws_region | AWS region where infrastructure was deployed |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_list |
Note: This operation will return a file with response [edge_user_name]_[application]_[request_id]_all_pkgs.json
Example of available libraries in response (type->library->version):
{
"os_pkg": {"htop": "2.0.1-1ubuntu1", "python-mysqldb": "1.3.7-1build2"},
"pip2": {"requests": "N/A", "configparser": "N/A"},
"pip3": {"configparser": "N/A"},
"r_pkg": {"rmarkdown": "1.5"},
"others": {"Keras": "N/A"}
}
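The name of the response file follows the pattern given in the note above. A small sketch of assembling it in Python is shown here; the function name and the sample argument values are hypothetical, and only the pattern itself comes from the text.

```python
def lib_list_response_filename(edge_user_name, application, request_id):
    """Build [edge_user_name]_[application]_[request_id]_all_pkgs.json."""
    return f"{edge_user_name}_{application}_{request_id}_all_pkgs.json"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(lib_list_response_filename("test_user", "jupyter", "1234"))
```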
List of parameters for Dataengine-service to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
computational_id | Name of Dataengine-service |
aws_region | AWS region where infrastructure was deployed |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
libs | List of additional libraries in JSON format with type (os_pkg/pip2/pip3/r_pkg/others) |
action | lib_install |
Example of the libs parameter:
{
...
"libs": [
{"group": "os_pkg", "name": "nmap"},
{"group": "os_pkg", "name": "htop"},
{"group": "pip2", "name": "requests"},
{"group": "pip3", "name": "configparser"},
{"group": "r_pkg", "name": "rmarkdown"},
{"group": "others", "name": "Keras"}
]
...
}
List of parameters for Dataengine-service node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
gcp_project_id | ID of GCP project |
gcp_region | GCP region name |
gcp_zone | GCP zone name |
action | lib_list |
List of parameters for Dataengine-service node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine-service |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
gcp_project_id | ID of GCP project |
gcp_region | GCP region name |
gcp_zone | GCP zone name |
action | lib_install |
Dataengine is a cluster based on the standalone Spark framework. It can be created if more computational resources are needed to execute analytical algorithms, but without the additional expense of a cloud-provided service.
To create a Spark standalone cluster, click the “gear” button in the Actions column and click “Add computational resources”. Specify the dataengine version, fill in the dataengine name, and specify the number of instances and the instance shapes. Click the “Create” button.
List of parameters for dataengine cluster creation:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
notebook_instance_name | Name of the Notebook dataengine will be linked to |
dataengine_instance_count | Number of nodes in cluster |
edge_user_name | Value that was used when Edge was provisioned |
aws_region | Amazon region where all infrastructure was deployed |
aws_dataengine_master_size | Size of master node |
aws_dataengine_slave_size | Size of slave node |
action | create |
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
notebook_instance_name | Name of the Notebook dataengine will be linked to |
dataengine_instance_count | Number of nodes in cluster |
edge_user_name | Value that was used when Edge was provisioned |
azure_vpc_name | Name of the Azure Virtual network where all infrastructure is being deployed |
azure_region | Azure region where all infrastructure was deployed |
azure_dataengine_master_size | Size of master node |
azure_dataengine_slave_size | Size of slave node |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
azure_subnet_name | Name of the Azure public subnet where Edge was deployed |
action | create |
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
conf_os_family | Name of the Linux distribution family supported by DLab (Debian/RedHat) |
notebook_instance_name | Name of the Notebook dataengine will be linked to |
gcp_vpc_name | GCP VPC name |
gcp_subnet_name | GCP subnet name |
dataengine_instance_count | Number of nodes in cluster |
gcp_dataengine_master_size | Size of master node |
gcp_dataengine_slave_size | Size of slave node |
gcp_project_id | ID of GCP project |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone name |
edge_user_name | Value that was used when Edge was provisioned |
action | create |
To terminate a dataengine cluster, click the “x” button located in the “Computational resources” field.
List of parameters for dataengine cluster termination:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance which dataengine is linked to |
computational_name | Name of cluster |
aws_region | AWS region where infrastructure was deployed |
action | Terminate |
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
computational_name | Name of cluster |
notebook_instance_name | Name of the Notebook instance which dataengine is linked to |
azure_region | Azure region where infrastructure was deployed |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
action | Terminate |
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
notebook_instance_name | Name of the Notebook instance which dataengine is linked to |
computational_name | Name of cluster |
gcp_project_id | ID of GCP project |
gcp_region | GCP region where infrastructure was deployed |
gcp_zone | GCP zone name |
action | Terminate |
To list the available libraries (OS/Python2/Python3/R/Others) on Dataengine, click the gear button in the “Action” field and choose “Manage libraries” from the drop-down menu.
List of parameters for Dataengine node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
computational_id | Name of cluster |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_list |
Note: This operation will return a file with response [edge_user_name]_[application]_[request_id]_all_pkgs.json
Example of available libraries in response (type->library->version):
{
"os_pkg": {"htop": "2.0.1-1ubuntu1", "python-mysqldb": "1.3.7-1build2"},
"pip2": {"requests": "N/A", "configparser": "N/A"},
"pip3": {"configparser": "N/A"},
"r_pkg": {"rmarkdown": "1.5"},
"others": {"Keras": "N/A"}
}
List of parameters for Dataengine node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
computational_id | Name of cluster |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_install |
Example of the libs parameter:
{
...
"libs": [
{"group": "os_pkg", "name": "nmap"},
{"group": "os_pkg", "name": "htop"},
{"group": "pip2", "name": "requests"},
{"group": "pip3", "name": "configparser"},
{"group": "r_pkg", "name": "rmarkdown"},
{"group": "others", "name": "Keras"}
]
...
}
List of parameters for Dataengine node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
computational_id | Name of cluster |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_list |
List of parameters for Dataengine node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
azure_resource_group_name | Name of the resource group where all DLab resources are being provisioned |
computational_id | Name of cluster |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
action | lib_install |
List of parameters for Dataengine node to get list of available libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
gcp_project_id | ID of GCP project |
gcp_zone | GCP zone name |
computational_id | Name of cluster |
action | lib_list |
List of parameters for Dataengine node to install additional libraries:
Parameter | Description/Value |
---|---|
conf_resource | dataengine |
conf_service_base_name | Unique infrastructure value, specified during SSN deployment |
conf_key_name | Name of the uploaded SSH key file (without ".pem") |
edge_user_name | Value that was used when Edge was provisioned |
application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
gcp_project_id | ID of GCP project |
gcp_zone | GCP zone name |
computational_id | Name of cluster |
action | lib_install |
DLab configuration files are located on the SSN node in the /opt/dlab/conf directory:
- ssn.yml – basic configuration for all Java services;
- provisioning.yml – Provisioning Service configuration file;
- security.yml – Security Service configuration file;
- self-service.yml – Self-Service configuration file.
All DLab services run as OS services and use the following syntax for starting and stopping:
sudo supervisorctl {start | stop | status} [all | provserv | secserv | ui]
- start – starting service or services;
- stop – stopping service or services;
- status – show status of service or services;
- all – execute command for all services, this option is default;
- provserv – execute command for Provisioning Service;
- secserv – execute command for Security Service;
- ui – execute command for Self-Service.
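The syntax above can be wrapped in a small helper when services have to be managed from a script. This is a sketch under the assumption that it runs on the SSN node with sudo rights; the function names are hypothetical, while the actions and service names are the ones listed above.

```python
import subprocess

ACTIONS = ("start", "stop", "status")
SERVICES = ("all", "provserv", "secserv", "ui")


def supervisor_command(action, service="all"):
    """Build the supervisorctl command line for a DLab service."""
    if action not in ACTIONS:
        raise ValueError("unknown action: " + action)
    if service not in SERVICES:
        raise ValueError("unknown service: " + service)
    return ["sudo", "supervisorctl", action, service]


def restart(service):
    """Stop and then start a service; only meaningful on an SSN node."""
    for action in ("stop", "start"):
        subprocess.check_call(supervisor_command(action, service))
```

Building the command separately from running it keeps the validation testable on a machine where supervisorctl is not installed.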
The DLab Self-Service listens on secure port 8443. This port is used for secure local communication with the Provisioning Service.
An Nginx proxy server also runs on the Self-Service node and proxies remote connections to local port 8443. By default, Nginx listens on both ports 80 and 443, so the Self-Service Web UI can be accessed over a non-secure connection (port 80) or a secure one (port 443).
When connecting through port 443, keep in mind that DLab uses a self-signed certificate out of the box; however, you are free to switch Nginx to your own domain-verified certificate.
To disable non-secure connections, do the following:
- in the /etc/nginx/conf.d/nginx_proxy.conf file, uncomment the rule that rewrites all requests from port 80 to port 443;
- reload/restart Nginx web server.
To use your own certificate please do the following:
- upload your certificate and key to Self-Service node;
- in the /etc/nginx/conf.d/nginx_proxy.conf file, specify the correct paths to your new ssl_certificate and ssl_certificate_key;
- reload/restart Nginx web server.
The billing module is implemented as a separate jar file and can run in the following modes:
- part of Self-Service;
- separate system process;
- manual loading or loading by an external scheduler.
The billing module runs as part of the Self-Service if billing was switched ON before SSN deployment. For details, please refer to the section Self-Service Node. Otherwise, you should configure the file billing.yml manually; the configuration file itself describes how to do this. Please also note that you should add the following entry to the collection in the Mongo database:
{
"_id": "conf_tag_resource_id",
"Value": "<CONF_TAG_RESOURCE_ID>"
}
After you have configured the billing, you can run it as a process of Self-Service. To do this, in the configuration file self-service.yml set the property BillingSchedulerEnabled to true and restart the Self-Service:
sudo supervisorctl stop ui
sudo supervisorctl start ui
To load the report manually or through an external scheduler, use one of the following commands:
java -jar /opt/dlab/webapp/lib/billing/billing-aws.x.y.jar --conf /opt/dlab/conf/billing.yml
or
java -cp /opt/dlab/webapp/lib/billing/billing-aws.x.y.jar com.epam.dlab.BillingTool --conf /opt/dlab/conf/billing.yml
To run billing as a process separate from the Self-Service, use the following command:
java -cp /opt/dlab/webapp/lib/billing/billing-aws.x.y.jar com.epam.dlab.BillingScheduler --conf /opt/dlab/conf/billing.yml
The billing module is implemented as a separate jar file and can run in the following modes:
- part of Self-Service;
- separate system process;
If you want to start billing module as a separate process use the following command:
java -jar /opt/dlab/webapp/lib/billing/billing-azure.x.y.jar /opt/dlab/conf/billing.yml
All DLab configuration files, keys, certificates, jars, database and logs can be saved to a backup file.
Scripts for backup and restore are located in dlab_path/tmp/. Default: /opt/dlab/tmp/
List of parameters for running a backup:
Parameter | Description/Value |
---|---|
--dlab_path | Path to DLab. Default: /opt/dlab/ |
--configs | Comma separated names of config files, like "security.yml", etc. Default: all |
--keys | Comma separated names of keys, like "user_name.pub". Default: all |
--certs | Comma separated names of SSL certificates and keys, like "dlab-selfsigned.crt", etc. Also available: skip. Default: all |
--jars | Comma separated names of jar application, like "self-service" (without .jar), etc. Also available: all. Default: skip |
--db | Mongo DB. Key without arguments. Default: disable |
--logs | All logs (include docker). Key without arguments. Default: disable |
List of parameters for running a restore:
Parameter | Description/Value |
---|---|
--dlab_path | Path to DLab. Default: /opt/dlab/ |
--configs | Comma separated names of config files, like "security.yml", etc. Default: all |
--keys | Comma separated names of keys, like "user_name.pub". Default: all |
--certs | Comma separated names of SSL certificates and keys, like "dlab-selfsigned.crt", etc. Also available: skip. Default: all |
--jars | Comma separated names of jar application, like "self-service" (without .jar), etc. Also available: all. Default: skip |
--db | Mongo DB. Key without arguments. Default: disable |
--file | Full or relative path to backup file or folder. Required field |
--force | Force mode. Without any questions. Key without arguments. Default: disable |
Note: You can type -h or --help for usage details.
Note: The restore process requires the services to be stopped.
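To make the defaults in the backup table easier to see, the options can be mirrored in an argparse definition. This is an illustrative sketch only: it is not the actual backup script, whose internals this section does not show, and the function names are hypothetical.

```python
import argparse


def build_backup_parser():
    """Hypothetical parser mirroring the documented backup parameters."""
    parser = argparse.ArgumentParser(description="Sketch of DLab backup options")
    parser.add_argument("--dlab_path", default="/opt/dlab/")
    parser.add_argument("--configs", default="all")
    parser.add_argument("--keys", default="all")
    parser.add_argument("--certs", default="all")
    parser.add_argument("--jars", default="skip")
    # "Key without arguments" options are modelled as boolean switches.
    parser.add_argument("--db", action="store_true")
    parser.add_argument("--logs", action="store_true")
    return parser


def split_names(value):
    """Split a comma separated value, keeping 'all' and 'skip' as keywords."""
    if value in ("all", "skip"):
        return value
    return [name.strip() for name in value.split(",") if name.strip()]
```

For example, parsing `["--configs", "security.yml,ssn.yml", "--db"]` enables the database backup, leaves the logs switch off, and keeps jars at their documented default of skip.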
Your own GitLab server can be deployed from the SSN node with a script located in dlab_path/tmp/gitlab. Default: /opt/dlab/tmp/gitlab
All initial configuration parameters are located in the gitlab.ini file.
Some parameters are already set during SSN provisioning.
GitLab uses the same LDAP server as DLab.
To deploy the GitLab server, set all needed parameters in gitlab.ini and run the script:
./gitlab_deploy.py --action [create/terminate]
Note: The terminate process uses node_name to find the instance.
Note: GitLab is not terminated as part of the termination of the whole environment.
If the dlab_path parameter of the dlab.ini configuration file was not changed, DLab uses the following default paths:
- /opt/dlab/ - main directory of DLab service
- /var/opt/dlab/log/ or /var/log/dlab/ - path to log files
To check logs of Docker containers run the following commands:
docker ps -a – to get list of containers which were executed.
...
a85d0d3c27aa docker.dlab-dataengine:latest "/root/entrypoint...." 2 hours ago Exited (0) 2 hours ago infallible_gallileo
6bc2afeb888e docker.dlab-jupyter:latest "/root/entrypoint...." 2 hours ago Exited (0) 2 hours ago practical_cori
51b71c5d4aa3 docker.dlab-zeppelin:latest "/root/entrypoint...." 2 hours ago Exited (0) 2 hours ago determined_knuth
...
docker logs <container_id> – to get log for particular Docker container.
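When many containers have been executed, it is convenient to list only the ones that exited with an error before calling docker logs. The sketch below assumes Docker is installed locally and uses the standard `--format` option of `docker ps`; the helper names are hypothetical.

```python
import subprocess

# Go template understood by `docker ps --format`; fields are tab separated.
PS_FORMAT = "{{.ID}}\t{{.Image}}\t{{.Status}}"


def parse_ps(output):
    """Parse tab separated `docker ps` output into (id, image, status) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def failed_containers(rows):
    """Keep containers that exited with a non-zero code."""
    return [row for row in rows
            if row[2].startswith("Exited") and not row[2].startswith("Exited (0)")]


def list_containers():
    """Run `docker ps -a` on the local host; requires Docker to be installed."""
    output = subprocess.check_output(["docker", "ps", "-a", "--format", PS_FORMAT])
    return parse_ps(output.decode("utf-8"))
```

The IDs returned by `failed_containers(list_containers())` can then be passed to docker logs one by one.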
To change Docker images on an existing environment, execute the following steps:
- SSH to the SSN instance
- Go to /opt/dlab/sources/
- Modify the needed files
- [ONLY FOR AZURE] Copy the service principal JSON file with credentials to base/azure_auth.json
- Rebuild the proper Docker images, using one or several commands (depending on which files you’ve changed):
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/base_Dockerfile -t docker.dlab-base .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/edge_Dockerfile -t docker.dlab-edge .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/jupyter_Dockerfile -t docker.dlab-jupyter .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/rstudio_Dockerfile -t docker.dlab-rstudio .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/zeppelin_Dockerfile -t docker.dlab-zeppelin .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/tensor_Dockerfile -t docker.dlab-tensor .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/deeplearning_Dockerfile -t docker.dlab-deeplearning .
docker build --build-arg OS=<os_family> --file general/files/<cloud_provider>/dataengine_Dockerfile -t docker.dlab-dataengine .
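Since all build commands follow the same pattern, they can be generated instead of typed. The following sketch only formats the command lines shown above and does not run them; the function names are hypothetical.

```python
def build_command(image, os_family, cloud_provider):
    """Return the docker build command for one DLab image, as documented above."""
    return (
        "docker build --build-arg OS={os} "
        "--file general/files/{cloud}/{image}_Dockerfile "
        "-t docker.dlab-{image} ."
    ).format(os=os_family, cloud=cloud_provider, image=image)


def build_commands(images, os_family, cloud_provider):
    """Order the commands so that `base` comes first.

    The base image is the common image for the other ones, so it is
    assumed here that it should be rebuilt before them.
    """
    ordered = sorted(set(images), key=lambda name: (name != "base", name))
    return [build_command(name, os_family, cloud_provider) for name in ordered]
```

For instance, `build_commands(["jupyter", "base"], "debian", "aws")` yields the base command followed by the jupyter one.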
DLab services can be run in development mode. This mode emulates real work and does not create any resources in the cloud provider environment.
dlab
├───infrastructure-provisioning
└───services
    ├───billing
    ├───common
    ├───provisioning-service
    ├───security-service
    ├───self-service
    └───settings
- infrastructure-provisioning – code of infrastructure-provisioning module;
- services – back-end services source code;
- billing – billing module for AWS cloud provider only;
- common – reusable code for all services;
- provisioning-service – Provisioning Service;
- security-service – Security Service;
- self-service – Self-Service and UI;
- settings – global settings that are stored in mongo database in development mode;
To start developing the front-end Web UI part of DLab, clone the Git repository and install the following packages:
- Git 1.7 or higher
- Maven 3.3 or higher
- Python 2.7
- Mongo DB 3.0 or higher
- Docker 1.12 - Infrastructure provisioning
- Java Development Kit 8 – Back-end
- Node.js 6.x & 7.x - WebUI
- Angular CLI v1.0.0-rc.1 or higher - WebUI
- TypeScript v2.0 or higher - WebUI
- Angular2 v2.4 – WebUI
- Development IDE (Eclipse or Intellij IDEA)
Common is a module that wraps a set of code reused across the services. Commonly reused functionality is as follows:
- Models
- REST client
- Mongo persistence DAO
- Security models and DAO
Self-Service provides REST-based APIs. It tightly interacts with the Provisioning Service and Security Service and delegates most of the user's requests to them for execution.
API class name | Supported actions | Description |
---|---|---|
BillingResource | Get billing invoice, Export billing invoice in CSV file | Provides billing information. |
ComputationalResource | Configuration limits, Create, Terminate | Used for computational resources management. |
EdgeResource | Start, Stop, Status | Manage EDGE node. |
ExploratoryResource | Create, Status, Start, Stop, Terminate | Used for exploratory environment management. |
GitCredsResource | Update credentials, Get credentials | Used for exploratory environment management. |
InfrastructureInfoResource | Get info of environment, Get status of environment | Used for obtaining statuses and additional information about provisioned resources. |
InfrastructureTemplatesResource | Get computation resources templates, Get exploratory environment templates | Used for getting exploratory/computational templates. |
KeyUploaderResource | Check key, Upload key, Recover | Used for Gateway/EDGE node public key upload and further storing of this information in Mongo DB. |
LibExploratoryResource | Lib groups, Lib list, Lib search, Lib install | User’s authentication. |
SecurityResource | Login, Authorize, Logout | User’s authentication. |
UserSettingsResource | Get settings, Save settings | User’s preferences. |
Some class names may have endings like Aws or Azure (e.g. ComputationalResourceAws, ComputationalResourceAzure, etc.). It means that the class is cloud-specific and exposes the API appropriate for that provider.
The Provisioning Service is the key REST-based service for management of cloud-specific or Docker-based environment resources such as computational, exploratory, edge, etc.
API class name | Supported actions | Description |
---|---|---|
ComputationalResource | Create, Terminate | Docker actions for computational resources management. |
DockerResource | Get Docker image, Run Docker image | Requests and describes Docker images and templates. |
EdgeResource | Create, Start, Stop | Provides Docker actions for EDGE node management. |
ExploratoryResource | Create, Start, Stop, Terminate | Provides Docker actions for working with exploratory environment management. |
GitExploratoryResource | Update git creds | Docker actions to provision git credentials to running notebooks. |
InfrastructureResource | Status | Docker action for obtaining status of DLab infrastructure instances. |
LibExploratoryResource | Lib list, Install lib | Docker actions to install libraries on notebooks. |
Some class names may have endings like Aws or Azure (e.g. ComputationalResourceAws, ComputationalResourceAzure, etc.). It means that the class is cloud-specific and exposes the API appropriate for that provider.
Security Service is a REST-based service for user authentication against LDAP, LDAP + AWS, or Azure OAuth2, depending on module configuration and cloud provider. The LDAP-only mode provides an authentication endpoint that verifies users against an LDAP instance. If you use the AWS cloud provider, LDAP + AWS authentication can be useful, as it combines LDAP authentication with a check that the user has a role in the AWS account.
DLab provides an OAuth2 (client credentials and authorization code flow) security authorization mechanism for Azure users. This kind of authentication is required when you are going to use Data Lake. If Data Lake is not enabled, you have two options: LDAP or OAuth2. If OAuth2 is in use, the Security Service validates the user's permissions against the configured permission scope (a resource in Azure). If Data Lake is enabled, the default permission scope (it can be configured manually after DLab is deployed) is the Data Lake Store account, so a user is allowed to log in only if he/she has a role in the scope of the Data Lake Store account resource. If Data Lake is disabled but Azure OAuth2 is in use, the default permission scope is the Resource Group where DLab is created, and only users who have a role in that resource group are allowed to log in.
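The rules for the default Azure permission scope can be summarised as a small decision function. This is only a restatement of the paragraph above in code; the function name and the returned labels are hypothetical and do not correspond to DLab identifiers.

```python
def default_permission_scope(data_lake_enabled, oauth2_enabled):
    """Return the default scope checked at login on Azure, or None for LDAP."""
    if data_lake_enabled:
        # Data Lake requires OAuth2; the scope is the Data Lake Store account.
        return "Data Lake Store account"
    if oauth2_enabled:
        # OAuth2 without Data Lake: the resource group where DLab is created.
        return "Resource Group"
    # Neither is in use: authentication goes through LDAP, no Azure scope applies.
    return None
```

Note that the scope returned here is only the default; as stated above, it can be reconfigured manually after deployment.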
Web UI sources are part of Self-Service.
Sources are located in dlab/services/self-service/src/main/resources/webapp
Main pages | Components and Services |
---|---|
Login page | LoginComponent applicationSecurityService handles http calls and stores authentication tokens on the client and attaches the token to authenticated calls; healthStatusService and appRoutingService check instances states and redirect to appropriate page. |
Home page (list of resources) | HomeComponent nested several main components like ResourcesGrid for notebooks data rendering and filtering, using custom MultiSelectDropdown component; multiple modal dialogs components used for new instances creation, displaying detailed info and actions confirmation. |
Health Status page | HealthStatusComponent HealthStatusGridComponent displays list of instances, their types, statuses, ID’s and uses healthStatusService for handling main actions. |
Help pages | Static pages that contain information and instructions on how to access Notebook Server and generate SSH key pair. Includes only NavbarComponent. |
Error page | Simple static page letting users know that opened page does not exist. Includes only NavbarComponent. |
Reporting page | ReportingComponent ReportingGridComponent displays billing detailed info with built-in filtering and DateRangePicker component for custom range filtering; uses BillingReportService for handling main actions and exports report data to .csv file. |
The development environment setup description assumes that the user has already installed Java 8 (JDK) and Maven 3 and has set the environment variables JAVA_HOME and M2_HOME. The description covers Mongo installation, Mongo user creation, loading initial data into Mongo, and Node.js installation.
- Download MongoDB from https://www.mongodb.com/download-center
- Install database based on MongoDB instructions
- Start DB server and create accounts
use admin
db.createUser(
{
user: "admin",
pwd: "<password>",
roles: [ { role: "dbAdminAnyDatabase", db: "admin" },
{ role: "userAdminAnyDatabase", db: "admin" },
{ role: "readWriteAnyDatabase", db: "admin" } ]
}
)
use <database_name>
db.createUser(
{
user: "admin",
pwd: "<password>",
roles: [ "dbAdmin", "userAdmin", "readWrite" ]
}
)
- Load collections from file dlab/services/settings/(aws|azure)/mongo_settings.json
mongoimport -u admin -p <password> -d <database_name> -c settings mongo_settings.json
- Load collections from file dlab/infrastructure-provisioning/src/ssn/files/mongo_roles.json
mongoimport -u admin -p <password> -d <database_name> --jsonArray -c roles mongo_roles.json
- Set option CLOUD_TYPE to aws/azure, DEV_MODE to true, mongo database name and password in configuration file dlab/infrastructure-provisioning/src/ssn/templates/ssn.yml
<#assign CLOUD_TYPE="aws">
...
<#assign DEV_MODE="true">
...
mongo:
database: <database_name>
password: <password>
- Add system environment variable DLAB_CONF_DIR=<dlab_root_folder>/dlab/infrastructure-provisioning/src/ssn/templates/ssn.yml or create two symlinks in dlab/services/provisioning-service and dlab/services/self-service folders for file dlab/infrastructure-provisioning/src/ssn/templates/ssn.yml.
Unix
ln -s ../../infrastructure-provisioning/src/ssn/templates/ssn.yml ssn.yml
Windows
mklink ssn.yml ..\\..\\infrastructure-provisioning\\src\\ssn\\templates\\ssn.yml
- For Unix system create two folders and grant permission for writing:
/var/opt/dlab/log/ssn
/opt/dlab/tmp/result
- Download Node.js from https://nodejs.org/en
- Install Node.js
- Make sure that the installation folder of Node.js has been added to the system environment variable PATH
- Install latest packages
npm install npm@latest -g
- Change folder to dlab/services/self-service/src/main/resources/webapp and install the dependencies from a package.json manifest
npm install
- Replace CLOUD_PROVIDER options with aws|azure in dictionary file
dlab/services/self-service/src/main/resources/webapp/src/dictionary/global.dictionary.ts
import { NAMING_CONVENTION } from './(aws|azure).dictionary';
export * from './(aws|azure).dictionary';
- Build web application
npm run build.prod
To enable an SSL connection, the web server should have a digital certificate. To create a server certificate, follow these steps:
- Create the keystore.
- Export the certificate from the keystore.
- Sign the certificate.
- Import the certificate into a truststore: a repository of certificates used for verifying the certificates. A truststore typically contains more than one certificate.
Below is the set of commands to create the certificate, depending on the OS.
Note that the last command has to be executed with administrative permissions.
keytool -genkeypair -alias dlab -keyalg RSA -storepass KEYSTORE_PASSWORD -keypass KEYSTORE_PASSWORD -keystore ~/keys/dlab.keystore.jks -keysize 2048 -dname "CN=localhost"
keytool -exportcert -alias dlab -storepass KEYSTORE_PASSWORD -file ~/keys/dlab.crt -keystore ~/keys/dlab.keystore.jks
sudo keytool -importcert -trustcacerts -alias dlab -file ~/keys/dlab.crt -noprompt -storepass changeit -keystore ${JRE_HOME}/lib/security/cacerts
Note that the last command has to be executed with administrative permissions. To achieve this, the command line (cmd) should be run with administrative permissions.
"%JRE_HOME%\bin\keytool" -genkeypair -alias dlab -keyalg RSA -storepass KEYSTORE_PASSWORD -keypass KEYSTORE_PASSWORD -keystore <DRIVE_LETTER>:\home\%USERNAME%\keys\dlab.keystore.jks -keysize 2048 -dname "CN=localhost"
"%JRE_HOME%\bin\keytool" -exportcert -alias dlab -storepass KEYSTORE_PASSWORD -file <DRIVE_LETTER>:\home\%USERNAME%\keys\dlab.crt -keystore <DRIVE_LETTER>:\home\%USERNAME%\keys\dlab.keystore.jks
"%JRE_HOME%\bin\keytool" -importcert -trustcacerts -alias dlab -file <DRIVE_LETTER>:\home\%USERNAME%\keys\dlab.crt -noprompt -storepass changeit -keystore "%JRE_HOME%\lib\security\cacerts"
Useful commands
"%JRE_HOME%\bin\keytool" -list -alias dlab -storepass changeit -keystore "%JRE_HOME%\lib\security\cacerts"
"%JRE_HOME%\bin\keytool" -delete -alias dlab -storepass changeit -keystore "%JRE_HOME%\lib\security\cacerts"
Here <DRIVE_LETTER> is the letter of the drive from which you run DLab.
It is possible to run Self-Service and Provisioning Service locally. All requests from Provisioning Service to Docker are mocked, and the instance creation status is persisted to Mongo (without any real impact on Docker or AWS). Security Service cannot be run on a local machine because of the complexity of mocking LDAP locally.
Both services, Self-Service and Provisioning Service, depend on the dlab/infrastructure-provisioning/src/ssn/templates/ssn.yml configuration file. Both services have a main function as their entry point: SelfServiceApplication for Self-Service and ProvisioningServiceApplication for Provisioning Service. The services can be started by running the main methods of these classes. Both main functions require two arguments:
- Run mode (“server”)
- Configuration file name (“self-service.yml” or “provisioning.yml” depending on the service). Both files are located in root service directory. These configuration files contain service settings and are ready to use.
The service start-up order matters. Since Self-Service depends on Provisioning Service, the latter should be started first and Self-Service afterwards. Services can be started using the “Run” functionality of a local IDE (Eclipse or IntelliJ IDEA).
The flow for running the application is as follows:
- Run provisioning-service passing 2 arguments: server, provisioning.yml
- Run self-service passing 2 arguments: server, self-service.yml
- Try to access self-service Web UI by https://localhost:8443
User: test
Password: <any>
The following list shows the common structure of scripts for deploying DLab:
dlab
└───infrastructure-provisioning
    └───src
        ├───base
        ├───dataengine
        ├───dataengine-service
        ├───deeplearning
        ├───edge
        ├───general
        ├───jupyter
        ├───rstudio
        ├───ssn
        ├───tensor
        └───zeppelin
Each directory except general contains Python scripts, Docker files, templates, files for appropriate Docker image.
- base – Main Docker image. It is a common/base image for other ones.
- edge – Docker image for Edge node.
- dataengine – Docker image for dataengine cluster.
- dataengine-service – Docker image for dataengine-service cluster.
- general – OS and CLOUD dependent common source.
- ssn – Docker image for Self-Service node (SSN).
- jupyter/rstudio/zeppelin/tensor/deeplearning – Docker images for Notebook nodes.
All Python scripts, Docker files and other files, which are located in these directories, are OS and CLOUD independent.
Scripts, functions and files that are OS or CLOUD dependent, or common to several templates, are located in the general directory.
general
├───api – all available API
├───conf – DLab configuration
├───files – OS/Cloud dependent files
├───lib – OS/Cloud dependent functions
├───scripts – OS/Cloud dependent Python scripts
└───templates – OS/Cloud dependent templates
These directories may contain differentiation by operating system (Debian/RedHat) or cloud provider (AWS).
Directories of templates (SSN, Edge etc.) contain only scripts, which are OS and CLOUD independent.
If script/function is OS or CLOUD dependent, it should be located in appropriate directory/library in general folder.
The following table describes the most commonly used scripts:
Script name/Path | Description |
---|---|
Dockerfile | Used for building Docker images and represents which Python scripts, templates and other files are needed. Required for each template. |
base/entrypoint.py | This file is executed by Docker. It is responsible for setting environment variables, which are passed from Docker and for executing appropriate actions (script in general/api/). |
base/scripts/*.py | Scripts, which are OS independent and are used in each template. |
general/api/*.py | API scripts, which execute appropriate function from fabfile.py. |
template_name/fabfile.py | Is the main file for template and contains all functions, which can be used as template actions. |
template_name/scripts/*.py | Python scripts, which are used for template. They are OS and CLOUD independent. |
general/lib/aws/*.py | Contains all functions related to AWS. |
general/lib/os/ | This directory is divided by type of OS. All OS dependent functions are located here. |
general/lib/os/fab.py | Contains OS independent functions used for multiple templates. |
general/scripts/ | Directory is divided by type of Cloud provider and OS. |
general/scripts/aws/*.py | Scripts, which are executed from fabfiles and AWS-specific. The first part of file name defines to which template this script is related to. For example: common_*.py – can be executed from more than one template. ssn_*.py – are used for SSN template. edge_*.py – are used for Edge template. |
general/scripts/os/*.py | Scripts, which are OS independent and can be executed from more than one template. |
Available Docker images and their actions:
Docker image | Actions |
---|---|
ssn | create, terminate |
edge | create, terminate, status, start, stop, recreate |
jupyter/rstudio/zeppelin/tensor/deeplearning | create, terminate, start, stop, configure, list_libs, install_libs, git_creds |
dataengine/dataengine-service | create, terminate |
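The table above can be expressed as a lookup structure, which is handy for rejecting an unsupported action before a container is started. This is a sketch built only from the table; the names `IMAGE_ACTIONS` and `is_supported` are hypothetical.

```python
NOTEBOOK_ACTIONS = {"create", "terminate", "start", "stop", "configure",
                    "list_libs", "install_libs", "git_creds"}

IMAGE_ACTIONS = {
    "ssn": {"create", "terminate"},
    "edge": {"create", "terminate", "status", "start", "stop", "recreate"},
    "dataengine": {"create", "terminate"},
    "dataengine-service": {"create", "terminate"},
}
# All notebook images share the same set of actions.
for notebook in ("jupyter", "rstudio", "zeppelin", "tensor", "deeplearning"):
    IMAGE_ACTIONS[notebook] = NOTEBOOK_ACTIONS


def is_supported(image, action):
    """Return True if the table lists `action` for `image`."""
    return action in IMAGE_ACTIONS.get(image, set())
```

With this structure, `is_supported("edge", "recreate")` is true, while `is_supported("ssn", "start")` is false because the SSN image only supports create and terminate.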
- Docker command for building images docker.dlab-base and docker.dlab-ssn:
sudo docker build --build-arg OS=debian --file general/files/aws/base_Dockerfile -t docker.dlab-base . ;
sudo docker build --build-arg OS=debian --file general/files/aws/ssn_Dockerfile -t docker.dlab-ssn . ;
Example of SSN Docker file:
FROM docker.dlab-base:latest
ARG OS
COPY ssn/ /root/
COPY general/scripts/aws/ssn_* /root/scripts/
COPY general/lib/os/${OS}/ssn_lib.py /usr/lib/python2.7/dlab/ssn_lib.py
COPY general/files/aws/ssn_policy.json /root/files/
COPY general/templates/aws/jenkins_jobs /root/templates/jenkins_jobs
RUN chmod a+x /root/fabfile.py; \
chmod a+x /root/scripts/*
RUN mkdir /project_tree
COPY . /project_tree
Using this Docker file, all required scripts and files will be copied to Docker container.
- Docker command for building SSN:
docker run -i -v /root/KEYNAME.pem:/root/keys/KEYNAME.pem -v /web_app:/root/web_app -e "conf_os_family=debian" -e "conf_cloud_provider=aws" -e "conf_resource=ssn" -e "aws_ssn_instance_size=t2.medium" -e "aws_region=us-west-2" -e "aws_vpc_id=vpc-111111" -e "aws_subnet_id=subnet-111111" -e "aws_security_groups_ids=sg-11111,sg-22222,sg-33333" -e "conf_key_name=KEYNAME" -e "conf_service_base_name=dlab_test" -e "aws_access_key=Access_Key_ID" -e "aws_secret_access_key=Secret_Access_Key" -e "conf_tag_resource_id=dlab" docker.dlab-ssn --action create ;
- Docker executes the entrypoint.py script with the action create. Entrypoint.py sets the environment variables provided by Docker and executes the general/api/create.py script:
elif args.action == 'create':
    with hide('running'):
        local("/bin/create.py")
- general/api/create.py executes the Fabric command with the run action:
try:
    local('cd /root; fab run')
- The run() function in the file ssn/fabfile.py is executed. It runs two scripts, general/scripts/aws/ssn_prepare.py and general/scripts/aws/ssn_configure.py:
try:
    local("~/scripts/{}.py".format('ssn_prepare'))
except Exception as err:
    traceback.print_exc()
    append_result("Failed preparing SSN node. Exception: " + str(err))
    sys.exit(1)
try:
    local("~/scripts/{}.py".format('ssn_configure'))
except Exception as err:
    traceback.print_exc()
    append_result("Failed configuring SSN node. Exception: " + str(err))
    sys.exit(1)
- The scripts general/scripts/<cloud_provider>/ssn_prepare.py and general/scripts/<cloud_provider>/ssn_configure.py execute other Python scripts/functions for:
- ssn_prepare.py: 1. Creating configuration file (for AWS) 2. Creating Cloud resources.
- ssn_configure.py: 1. Installing prerequisites 2. Installing required packages 3. Configuring Docker 4. Configuring DLab Web UI
- If all scripts/functions are executed successfully, the Docker container stops and the SSN node is created.
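The control flow of the steps above (prepare first, then configure, stop on the first failure) can be simulated without Docker or Fabric. The sketch below is an illustration, not DLab code: `run_script` and `report` are hypothetical callables standing in for Fabric's `local` and for `append_result`.

```python
def run_ssn_create(run_script, report):
    """Simulate the `create` flow and return a process-style exit code.

    run_script(name) stands in for executing ~/scripts/<name>.py;
    report(message) stands in for recording the failure.
    """
    steps = (
        ("ssn_prepare", "Failed preparing SSN node."),
        ("ssn_configure", "Failed configuring SSN node."),
    )
    for script, message in steps:
        try:
            run_script(script)
        except Exception as err:
            report("{} Exception: {}".format(message, err))
            return 1  # mirrors sys.exit(1) in the fabfile
    return 0
```

Passing `executed.append` as `run_script` records the order in which the scripts would run, which makes it easy to see that ssn_configure is never reached when ssn_prepare fails.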
SSN:
docker run -i -v <key_path><key_name>.pem:/root/keys/<key_name>.pem -e "region=<region>" -e "conf_service_base_name=<Infrastructure_Tag>" -e "conf_resource=ssn" -e "aws_access_key=<Access_Key_ID>" -e "aws_secret_access_key=<Secret_Access_Key>" docker.dlab-ssn --action <action>
All parameters are listed in the "Self-Service Node" section.
Other images:
docker run -i -v /home/<user>/keys:/root/keys -v /opt/dlab/tmp/result:/response -v /var/opt/dlab/log/<image>:/logs/<image> -e <variable1> -e <variable2> docker.dlab-<image> --action <action>
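The command line for the other images can be assembled from a dictionary of variables. The sketch below only formats the command documented above; the function name is hypothetical, and values are inserted as-is, so it assumes they contain no quotes or shell metacharacters.

```python
def docker_run_command(image, action, user, env):
    """Assemble the documented `docker run` line for a non-SSN image."""
    parts = [
        "docker run -i",
        "-v /home/{}/keys:/root/keys".format(user),
        "-v /opt/dlab/tmp/result:/response",
        "-v /var/opt/dlab/log/{0}:/logs/{0}".format(image),
    ]
    # Sort the variables so that the generated command is reproducible.
    for name in sorted(env):
        parts.append('-e "{}={}"'.format(name, env[name]))
    parts.append("docker.dlab-{} --action {}".format(image, action))
    return " ".join(parts)
```

Which variables each image expects is described in the parameter tables of the corresponding sections; this helper does not check them.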
First of all, a new directory should be created in infrastructure-provisioning/src/.
For example: infrastructure-provisioning/src/my-tool/
The following scripts/directories are required to be created in the template directory:
my-tool
├───scripts
└───fabfile.py
fabfile.py – the main script, which contains main functions for this template such as run, stop, terminate, etc.
Here is an example of the run() function for a Jupyter Notebook node:
Path: infrastructure-provisioning/src/jupyter/fabfile.py
def run():
    local_log_filename = "{}_{}_{}.log".format(os.environ['conf_resource'], os.environ['edge_user_name'], os.environ['request_id'])
    local_log_filepath = "/logs/" + os.environ['conf_resource'] + "/" + local_log_filename
    logging.basicConfig(format='%(levelname)-8s [%(asctime)s] %(message)s',
                        level=logging.DEBUG,
                        filename=local_log_filepath)
    notebook_config = dict()
    notebook_config['uuid'] = str(uuid.uuid4())[:5]
    try:
        params = "--uuid {}".format(notebook_config['uuid'])
        local("~/scripts/{}.py {}".format('common_prepare_notebook', params))
    except Exception as err:
        traceback.print_exc()
        append_result("Failed preparing Notebook node.", str(err))
        sys.exit(1)
    try:
        params = "--uuid {}".format(notebook_config['uuid'])
        local("~/scripts/{}.py {}".format('jupyter_configure', params))
    except Exception as err:
        traceback.print_exc()
        append_result("Failed configuring Notebook node.", str(err))
        sys.exit(1)
This function describes the process of creating a Jupyter node. It is divided into two parts: prepare and configure. The prepare part is common to all notebook templates and is responsible for creating the necessary cloud resources, such as EC2 instances. The configure part describes how the appropriate services will be installed.
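For a new template such as my-tool, the same prepare/configure pattern could be written with a small helper that removes the duplicated try/except blocks. This is a sketch under several assumptions: it uses `subprocess` instead of Fabric, the `scripts/` path and the script name `my-tool_configure` are hypothetical (chosen by analogy with jupyter_configure), and errors are returned rather than passed to `append_result`.

```python
import subprocess


def run_step(script, params, failure_message, runner=subprocess.check_call):
    """Run one template script; return an error string on failure, else None."""
    command = ["python", "scripts/{}.py".format(script)] + list(params)
    try:
        runner(command)
    except Exception as err:
        return "{} {}".format(failure_message, err)
    return None


def run(uuid, runner=subprocess.check_call):
    """Prepare, then configure; stop at the first failing step."""
    steps = (
        ("common_prepare_notebook", "Failed preparing Notebook node."),
        ("my-tool_configure", "Failed configuring Notebook node."),
    )
    for script, message in steps:
        error = run_step(script, ["--uuid", uuid], message, runner)
        if error:
            return error
    return None
```

The `runner` parameter exists only so that the flow can be exercised without executing real scripts, for example by passing a function that records the commands it receives.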
To configure Jupyter node, the script jupyter_configure.py is executed. This script describes steps for configuring Jupyter node. In each step, the appropriate Python script is executed.
For example:
Path: infrastructure-provisioning/src/general/scripts/aws/jupyter_configure.py
try:
    logging.info('[CONFIGURE JUPYTER NOTEBOOK INSTANCE]')
    print('[CONFIGURE JUPYTER NOTEBOOK INSTANCE]')
    params = "--hostname {} --keyfile {} --region {} --spark_version {} --hadoop_version {} --os_user {} --scala_version {}".\
        format(instance_hostname, keyfile_name, os.environ['aws_region'], os.environ['notebook_spark_version'],
               os.environ['notebook_hadoop_version'], os.environ['conf_os_user'],
               os.environ['notebook_scala_version'])
    try:
        local("~/scripts/{}.py {}".format('configure_jupyter_node', params))
In this step, the script infrastructure-provisioning/src/jupyter/scripts/configure_jupyter_node.py will be executed.
Example of script infrastructure-provisioning/src/jupyter/scripts/configure_jupyter_node.py:
if __name__ == "__main__":
    print("Configure connections")
    env['connection_attempts'] = 100
    env.key_filename = [args.keyfile]
    env.host_string = args.os_user + '@' + args.hostname
    print("Configuring notebook server.")
    try:
        if not exists('/home/' + args.os_user + '/.ensure_dir'):
            sudo('mkdir /home/' + args.os_user + '/.ensure_dir')
    except:
        sys.exit(1)
    print("Mount additional volume")
    prepare_disk(args.os_user)
    print("Install Java")
    ensure_jre_jdk(args.os_user)
This script calls functions for configuring the Jupyter node. If a function is OS dependent, it is placed in infrastructure-provisioning/src/general/lib/os/<OS_family>/notebook_lib.py (for example, infrastructure-provisioning/src/general/lib/os/debian/notebook_lib.py).
All functions in template directory (e.g. infrastructure-provisioning/src/my-tool/) should be OS and cloud independent.
All OS or cloud dependent functions should be placed in infrastructure-provisioning/src/general/lib/ directory.
The following steps are required for each Notebook node:
- Configure proxy on Notebook instance – the script infrastructure-provisioning/src/general/scripts/os/notebook_configure_proxy.py
- Installing user’s key – the script infrastructure-provisioning/src/base/scripts/install_user_key.py
Other scripts responsible for configuring the Jupyter node are placed in infrastructure-provisioning/src/jupyter/scripts/
- scripts directory – contains all required configuration scripts.
- infrastructure-provisioning/src/general/files/<cloud_provider>/my-tool_Dockerfile – used for building the template Docker image; it describes which files, scripts and templates are required and will be copied to the template Docker image.
- infrastructure-provisioning/src/general/files/<cloud_provider>/my-tool_description.json – JSON file for DLab Web UI. In this file you can specify:
  - exploratory_environment_shapes – list of EC2 shapes
  - exploratory_environment_versions – description of template
Example of this file for a Jupyter node (the shapes and vendor shown here are for the Azure cloud):
{
  "exploratory_environment_shapes" :
  {
    "For testing" : [
      {"Size": "S", "Description": "Standard_DS1_v2", "Type": "Standard_DS1_v2","Ram": "3.5 GB","Cpu": "1", "Spot": "true", "SpotPctPrice": "70"}
    ],
    "Memory optimized" : [
      {"Size": "S", "Description": "Standard_E4s_v3", "Type": "Standard_E4s_v3","Ram": "32 GB","Cpu": "4"},
      {"Size": "M", "Description": "Standard_E16s_v3", "Type": "Standard_E16s_v3","Ram": "128 GB","Cpu": "16"},
      {"Size": "L", "Description": "Standard_E32s_v3", "Type": "Standard_E32s_v3","Ram": "256 GB","Cpu": "32"}
    ],
    "Compute optimized": [
      {"Size": "S", "Description": "Standard_F2s", "Type": "Standard_F2s","Ram": "4 GB","Cpu": "2"},
      {"Size": "M", "Description": "Standard_F8s", "Type": "Standard_F8s","Ram": "16.0 GB","Cpu": "8"},
      {"Size": "L", "Description": "Standard_F16s", "Type": "Standard_F16s","Ram": "32.0 GB","Cpu": "16"}
    ]
  },
  "exploratory_environment_versions" :
  [
    {
      "template_name": "Jupyter notebook 5.2.0",
      "description": "Base image with jupyter node creation routines",
      "environment_type": "exploratory",
      "version": "jupyter_notebook-5.2.0",
      "vendor": "Azure"
    }
  ]
}
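Before adding a description file for a new template, it can help to check that it has the same shape as the example above. The following is a minimal sketch, not part of DLab: the function name validate_description is hypothetical, and the sets of required keys are an assumption inferred from the example rather than taken from DLab code.

```python
import json

# Assumption: these key sets are inferred from the Jupyter example above,
# not from any schema shipped with DLab.
REQUIRED_SHAPE_KEYS = {"Size", "Description", "Type", "Ram", "Cpu"}
REQUIRED_VERSION_KEYS = {"template_name", "description", "environment_type",
                         "version", "vendor"}


def validate_description(text):
    """Return a list of human-readable problems found in a description file."""
    problems = []
    data = json.loads(text)

    shapes = data.get("exploratory_environment_shapes", {})
    if not shapes:
        problems.append("exploratory_environment_shapes is missing or empty")
    for group, entries in shapes.items():
        for entry in entries:
            missing = REQUIRED_SHAPE_KEYS - set(entry)
            if missing:
                problems.append("shape in '%s' lacks %s" % (group, sorted(missing)))

    versions = data.get("exploratory_environment_versions", [])
    if not versions:
        problems.append("exploratory_environment_versions is missing or empty")
    for version in versions:
        missing = REQUIRED_VERSION_KEYS - set(version)
        if missing:
            problems.append("version entry lacks %s" % sorted(missing))
    return problems


# A trimmed copy of the example above, used as sample input.
sample = """
{
  "exploratory_environment_shapes": {
    "For testing": [
      {"Size": "S", "Description": "Standard_DS1_v2", "Type": "Standard_DS1_v2",
       "Ram": "3.5 GB", "Cpu": "1", "Spot": "true", "SpotPctPrice": "70"}
    ]
  },
  "exploratory_environment_versions": [
    {"template_name": "Jupyter notebook 5.2.0",
     "description": "Base image with jupyter node creation routines",
     "environment_type": "exploratory",
     "version": "jupyter_notebook-5.2.0",
     "vendor": "Azure"}
  ]
}
"""

print(validate_description(sample))
```

An empty list means the file has all the keys used in the example; extra keys such as Spot are ignored.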
Additionally, the following directories can be created:
- templates – directory for new templates;
- files – directory for files used only by newly added templates.
All Docker images are built when the SSN node is created. To add a newly created template, add it to the list of images in the following script:
Path: infrastructure-provisioning/src/general/scripts/aws/ssn_configure.py
try:
    logging.info('[CONFIGURING DOCKER AT SSN INSTANCE]')
    print('[CONFIGURING DOCKER AT SSN INSTANCE]')
    additional_config = [{"name": "base", "tag": "latest"},
                         {"name": "edge", "tag": "latest"},
                         {"name": "jupyter", "tag": "latest"},
                         {"name": "rstudio", "tag": "latest"},
                         {"name": "zeppelin", "tag": "latest"},
                         {"name": "tensor", "tag": "latest"},
                         {"name": "emr", "tag": "latest"}]
For example:
...
{"name": "my-tool", "tag": "latest"},
...
There are a few popular LDAP distributions on the market, such as Active Directory and OpenLDAP, and their configuration differs. Depending on customization, the attribute configuration may differ as well. For example, the DN (distinguished name) may be built from different attributes:
- DN=CN=Name Surname,OU=groups,OU=EPAM,DC=Company,DC=Cloud
- DN=UID=UID#53,OU=groups,OU=Company,DC=Company,DC=Cloud
The first DN is based on CN, the second on UID.
The relation between users and groups also varies from vendor to vendor.
For example, in OpenLDAP the group object may contain a set (from 0 to many) of "memberuid" attributes with values equal to the user's "uid" attribute.
In Active Directory, however, the mapping is based on other attributes. On the group side there is a "member" attribute (from 0 to many values), and its value is the user's DN (distinguished name).
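The two membership conventions can be illustrated with a short sketch. It is not DLab code: the function user_in_group is hypothetical, entries are modelled as plain dicts instead of real LDAP objects, and DNs are compared case-insensitively as a simplification.

```python
def user_in_group(user, group):
    """Return True if the group entry lists the user as a member.

    `user` and `group` are plain dicts mapping attribute names to values,
    a simplified stand-in for LDAP entries.
    """
    # OpenLDAP style: the group holds "memberuid" values equal to the user's uid.
    uid = user.get("uid")
    if uid is not None and uid in group.get("memberuid", []):
        return True

    # Active Directory style: the group holds "member" values equal to the user's DN.
    # Simplification: plain case-insensitive string comparison of DNs.
    user_dn = user.get("dn", "").lower()
    return bool(user_dn) and any(user_dn == m.lower() for m in group.get("member", []))


openldap_user = {"uid": "jdoe", "dn": "uid=jdoe,ou=People,dc=example,dc=com"}
openldap_group = {"memberuid": ["jdoe", "asmith"]}

ad_user = {"dn": "CN=Name Surname,OU=groups,OU=EPAM,DC=Company,DC=Cloud"}
ad_group = {"member": ["cn=Name Surname,ou=groups,ou=EPAM,dc=Company,dc=Cloud"]}

print(user_in_group(openldap_user, openldap_group))
print(user_in_group(ad_user, ad_group))
```

Both calls report membership, even though the first match goes through "memberuid" and the second through "member".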
To unify LDAP usage, we introduced a configuration file with a set of properties and customized scripts (Python and JavaScript based). On the backend side, all relevant attributes are collected and passed to these scripts. To apply a customization, update a few properties in security.yml and customize the scripts.
Customization is based on just a few properties:
- ldapBindTemplate: uid=%s,ou=People,dc=example,dc=com
- ldapBindAttribute: uid
- ldapSearchAttribute: uid
Where:
- ldapBindTemplate is the user's DN template, which should be filled with a custom value. The template can be changed, for example: uid=%s,ou=People,dc=example,dc=com -> cn=%s,ou=People,dc=example,dc=com.
- ldapBindAttribute is the major attribute on which the DN is based. Usually it is one of uid, cn, or email.
- ldapSearchAttribute is another attribute, based on which users are looked up in LDAP.
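As an illustration of how such a template turns into a DN, here is a minimal sketch. The helper names escape_dn_value and build_bind_dn are hypothetical, and the escaping step (special characters from RFC 4514) is an addition for safety that the text above does not describe.

```python
def escape_dn_value(value):
    """Escape characters that have a special meaning inside a DN (RFC 4514)."""
    out = []
    last = len(value) - 1
    for i, ch in enumerate(value):
        special = ch in ',+"\\<>;='
        leading = i == 0 and ch in "# "
        trailing = i == last and ch == " "
        out.append("\\" + ch if (special or leading or trailing) else ch)
    return "".join(out)


def build_bind_dn(bind_template, value):
    """Fill the single %s placeholder of an ldapBindTemplate-style string."""
    return bind_template % (escape_dn_value(value),)


print(build_bind_dn("uid=%s,ou=People,dc=example,dc=com", "jdoe"))
print(build_bind_dn("cn=%s,ou=People,dc=example,dc=com", "Doe, John"))
```

The second call shows why escaping matters: without it, the comma in the value would be read as a separator between DN components.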
There are 3 scripts in security.yml:
- userLookUp (Python based) - responsible for user lookup in LDAP; returns additional user attributes;
- userInfo (Python based) - enriches the user with additional data;
- groupInfo (JavaScript based) – responsible for mapping between users and groups.
The scripts above were created to manage the user's security configuration flexibly. They are all part of the security.yml configuration. All scripts have the following structure:
- name
- cache
- expirationTimeMsec
- scope
- attributes
- timeLimit
- base
- filter
- searchResultProcessor:
  - language
  - code
Major properties are:
- attributes - list of attributes that will be retrieved from LDAP (name, cn, uid, member, etc.);
- filter - the filter, based on which the object will be retrieved from LDAP;
- searchResultProcessor - optional. If only LDAP object attributes need to be retrieved, this property should be empty. For example, the "userLookUp" script only retrieves the list of "attributes". Otherwise, code customization (such as user enrichment or user-to-group matching) should be added in the sub-properties below:
  - language - the script language - "python" or "JavaScript"
  - code - the script code.
Configuration properties:
- ldapBindTemplate: 'cn=%s,ou=users,ou=alxn,dc=alexion,dc=cloud'
- ldapBindAttribute: cn
- ldapSearchAttribute: mail
Script code:
name: userLookUp
cache: true
expirationTimeMsec: 600000
scope: SUBTREE
attributes:
- cn
- gidNumber
- mail
- memberOf
timeLimit: 0
base: ou=users,ou=alxn,dc=alexion,dc=cloud
filter: "(&(objectCategory=person)(objectClass=user)(mail=%mail%))"
In the example above, the user login passed from the GUI is an email address (ldapSearchAttribute: mail). Based on the filter (filter: "(&(objectCategory=person)(objectClass=user)(mail=%mail%))"), the service searches for the user by "mail". If a matching user is found, the script returns additional user attributes:
- cn
- gidNumber
- memberOf
The user is then authenticated in LDAP with a DN built from the template ldapBindTemplate: 'cn=%s,ou=users,ou=alxn,dc=alexion,dc=cloud', where CN is the attribute retrieved by the "userLookUp" script.
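The two steps described above (search by mail, then bind with a DN built from cn) can be sketched as follows. This is an illustration under stated assumptions, not the Security Service implementation: fill_filter and bind_dn_for are hypothetical names, the directory is an in-memory list of dicts, the sample user is invented, and the rule that the placeholder is the search attribute wrapped in % signs is inferred from the %mail% example.

```python
# RFC 4515 escapes for characters that are special inside a search filter.
FILTER_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}

FILTER = "(&(objectCategory=person)(objectClass=user)(mail=%mail%))"
BIND_TEMPLATE = "cn=%s,ou=users,ou=alxn,dc=alexion,dc=cloud"

# Invented sample data standing in for the LDAP directory.
DIRECTORY = [
    {"cn": "John Doe", "mail": "j.doe@example.com", "gidNumber": "1001",
     "memberOf": ["cn=analysts,ou=groups,dc=alexion,dc=cloud"]},
]


def escape_filter_value(value):
    """Escape a value before placing it into an LDAP search filter."""
    return "".join(FILTER_ESCAPES.get(ch, ch) for ch in value)


def fill_filter(filter_template, search_attribute, login):
    """Replace the %<search_attribute>% placeholder with the escaped login."""
    placeholder = "%" + search_attribute + "%"
    return filter_template.replace(placeholder, escape_filter_value(login))


def bind_dn_for(login, entries, bind_template, bind_attribute, search_attribute):
    """Find the entry matching the login and build the DN used for the bind."""
    for entry in entries:
        if entry.get(search_attribute, "").lower() == login.lower():
            return bind_template % (entry[bind_attribute],)
    return None


print(fill_filter(FILTER, "mail", "j.doe@example.com"))
print(bind_dn_for("j.doe@example.com", DIRECTORY, BIND_TEMPLATE, "cn", "mail"))
```

The login typed by the user never appears in the bind DN directly; it is only used to find the entry whose cn fills the template.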
DLab supports OAuth2 authentication, which is configured automatically in Security Service and Self Service after DLab deployment. The configuration parameters for Self Service and Security Service are explained below. DLab supports client credentials (username + password) and authorization code flow for authentication.
azureLoginConfiguration:
    useLdap: false
    tenant: xxxx-xxxx-xxxx-xxxx
    authority: https://login.microsoftonline.com/
    clientId: xxxx-xxxx-xxxx-xxxx
    redirectUrl: https://dlab.azure.cloudapp.azure.com/
    responseMode: query
    prompt: consent
    silent: true
    loginPage: https://dlab.azure.cloudapp.azure.com/
    maxSessionDurabilityMilliseconds: 288000000
where:
- useLdap - defines whether LDAP authentication is enabled (true/false). If false, Azure OAuth2 is used with the configuration properties below
- tenant - tenant id of your company
- authority - Microsoft login endpoint
- clientId - id of the application that users log in through
- redirectUrl - URL to redirect back to the DLab application after a login attempt to Azure using OAuth2
- responseMode - defines how Azure sends the authorization code or error information to DLab during the login procedure
- prompt - defines the kind of prompt shown during OAuth2 login
- silent - defines whether DLab tries to log in the user without interaction (true/false). If false, DLab logs in the user with the configured prompt
- loginPage - start page of the DLab application
- maxSessionDurabilityMilliseconds - maximum duration of a user session. After this period, the user is asked to log in again when creating or starting a notebook/cluster. This is needed to update the refresh_token that notebooks use to access Data Lake Store
For more information about the responseMode and prompt parameters, see "Authorize access to web applications using OAuth 2.0 and Azure Active Directory".
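To show how these parameters relate to the authorization code flow, the sketch below assembles an authorization request URL and checks session age. It is not how DLab builds the request internally: build_authorize_url and session_expired are hypothetical helpers, the state value is an assumption, and the /oauth2/authorize path is the Azure AD v1 endpoint, assumed here to be the one in use.

```python
from urllib.parse import urlencode

# Values copied from the configuration example above.
config = {
    "tenant": "xxxx-xxxx-xxxx-xxxx",
    "authority": "https://login.microsoftonline.com/",
    "clientId": "xxxx-xxxx-xxxx-xxxx",
    "redirectUrl": "https://dlab.azure.cloudapp.azure.com/",
    "responseMode": "query",
    "prompt": "consent",
    "maxSessionDurabilityMilliseconds": 288000000,
}


def build_authorize_url(cfg, state):
    """Build an authorization request URL for the authorization code flow."""
    base = cfg["authority"].rstrip("/") + "/" + cfg["tenant"] + "/oauth2/authorize"
    params = {
        "client_id": cfg["clientId"],
        "response_type": "code",          # ask for an authorization code
        "redirect_uri": cfg["redirectUrl"],
        "response_mode": cfg["responseMode"],
        "prompt": cfg["prompt"],
        "state": state,                   # opaque value echoed back by Azure
    }
    return base + "?" + urlencode(params)


def session_expired(login_time_ms, now_ms, max_durability_ms):
    """Return True once the session is older than the configured maximum."""
    return now_ms - login_time_ms > max_durability_ms


print(build_authorize_url(config, "hypothetical-state"))
print(session_expired(0, 288000001, config["maxSessionDurabilityMilliseconds"]))
```

With the sample value, 288000000 milliseconds corresponds to 80 hours.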
azureLoginConfiguration:
    useLdap: false
    tenant: xxxx-xxxx-xxxx-xxxx
    authority: https://login.microsoftonline.com/
    clientId: xxxx-xxxx-xxxx-xxxx
    redirectUrl: https://dlab.azure.cloudapp.azure.com/
    validatePermissionScope: true
    permissionScope: subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/xxxx-xxxx/providers/Microsoft.DataLakeStore/accounts/xxxx/providers/Microsoft.Authorization/
    managementApiAuthFile: /dlab/keys/azure_authentication.json
where:
- useLdap - defines whether LDAP authentication is enabled (true/false). If false, Azure OAuth2 is used with the configuration properties below
- tenant - tenant id of your company
- authority - Microsoft login endpoint
- clientId - id of the application that users log in through
- redirectUrl - URL to redirect back to the DLab application after a login attempt to Azure using OAuth2
- validatePermissionScope - defines (true/false) whether the user's permissions should be validated against the resource given in the permissionScope parameter. The user is logged in only if he/she has any role in the IAM of the resource described by permissionScope
- permissionScope - describes the Azure resource in which the user must have any role to pass authentication. If the user has no role in the resource IAM, he/she will not be logged in
- managementApiAuthFile - authentication file that is used to query Microsoft Graph API to check user roles in the resource described in permissionScope
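The effect of validatePermissionScope can be summarised in a few lines. This is a sketch of the decision only, assuming the role assignments for the permissionScope resource have already been fetched. The function is_login_allowed is hypothetical, and the dict shape with a principalId key is an assumption modelled on Azure role assignment objects.

```python
def is_login_allowed(user_id, validate_permission_scope, role_assignments):
    """Decide whether a user passes the permission scope check.

    role_assignments: list of dicts, one per role assignment on the resource
    named by permissionScope (assumed shape: {"principalId": ..., "role": ...}).
    """
    if not validate_permission_scope:
        # Validation is switched off: no role is required.
        return True
    # Any role on the resource is enough; the kind of role does not matter.
    return any(a.get("principalId") == user_id for a in role_assignments)


assignments = [
    {"principalId": "user-1", "role": "Reader"},
    {"principalId": "user-2", "role": "Owner"},
]

print(is_login_allowed("user-1", True, assignments))
print(is_login_allowed("user-3", True, assignments))
print(is_login_allowed("user-3", False, assignments))
```

A user without any assignment is rejected only when validation is enabled.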