This project houses statistics data, as well as the software that fetches that data from Prometheus on clusters provided as part of Operate First.
You have to sign up to Operate First as a member of our group:
- Go to operate-first/apps and fork the repository.
- In the forked repository, go to the file `cluster-scope/base/user.openshift.io/groups/prometheus-ai/group.yaml` and add your GitHub username under `users` (no capital letters, no extra spaces). It is recommended that you do this step in GitHub (without cloning locally) to avoid unnecessary complications (dos2unix differences and the like).
- In your forked repository you should have a button to open a pull request to the original 'apps' repository. Open a pull request and explain who you are and why you are signing up; here's an example of our pull request.
Notes:
- Your PR should have only one commit. If you make more than one, squash them, or close the PR and open a new one encompassing all your changes in a single commit.
- The PR that created the group in the first place (in case you want to make your own group) can be found here.
- The PR that added the project inside the group can be found here.
- The PR that asked for more storage can be found here.
- An issue on operate-first/apps that contains much of how this process was discovered can be found here.
So you want to ask more questions? Good!
- Join the Slack workspace here. The link can also be found on Operate First's website, in the top right corner as of this writing.
- Join the channel `prometheus-ai`.
To use some of our software you need to acquire a 'personal access token'. An official tutorial on how to do that can be found here; however, we found it unclear, so we made our own:
- Go to Operate First's console (this link is specific to smaug).
- Click on `operate-first` when asked what to log in with.
- Click on your name in the top right corner.
- Click on `Copy login command`.
- Click on `operate-first` when asked what to log in with.
- Click on `Display Token`.
- Copy what's under `Your API token is`; it should look something like `sha256~nIHsUTlKA2QnDQOmgyzWjaEx5-2xav2e4EsXit-dJFk` (a short sketch of using this token follows these steps).
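As a hedged illustration of how the token can be used (the exact client our scripts use is not shown here), the sketch below assumes the `prometheus-api-client` Python package and a placeholder Prometheus URL:

```python
# Minimal sketch, assuming the prometheus-api-client package.
# PROM_URL and TOKEN are placeholders, not values taken from this repository.
from prometheus_api_client import PrometheusConnect

PROM_URL = "https://prometheus.example.operate-first.cloud"  # hypothetical URL
TOKEN = "sha256~..."  # paste your personal access token here

prom = PrometheusConnect(
    url=PROM_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    disable_ssl=True,
)

# Quick sanity check: list a few of the metric names Prometheus exposes.
print(prom.all_metrics()[:5])
```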
Here is a repository from which much of this information is taken:
Change directory to `src/` and then run `01_fetch_data.py` like so:

```bash
cd src/
python 01_fetch_data.py
```
Install packages when required.
Every time the script runs, it begins fetching data from the previous hour, going backwards.
For example, if you ran the script on 09/06/2022 at 12:52, it would fetch data in these time slots:
- 09/06/2022 at 11:00 - 09/06/2022 at 12:00
- 09/06/2022 at 10:00 - 09/06/2022 at 11:00
- 09/06/2022 at 9:00 - 09/06/2022 at 10:00
- ...
And so on for about 10 days back.
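The snippet below is a small sketch of that slot logic (not the repository's exact code), using Python's standard datetime module:

```python
# Sketch of the backwards hourly time slots described above.
from datetime import datetime, timedelta

now = datetime(2022, 6, 9, 12, 52)                     # e.g. 09/06/2022 at 12:52
end = now.replace(minute=0, second=0, microsecond=0)   # most recent full hour: 12:00

slots = []
for _ in range(10 * 24):                               # roughly 10 days of hourly slots
    start = end - timedelta(hours=1)
    slots.append((start, end))                         # (11:00, 12:00), (10:00, 11:00), ...
    end = start

print(slots[0])   # (2022-06-09 11:00, 2022-06-09 12:00)
print(slots[-1])  # about 10 days earlier
```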
What data are we pulling?
- Memory-usage data for each container, using this Prometheus query: `sum(container_memory_working_set_bytes{name!~".*prometheus.*", image!="", container!="POD", cluster="moc/smaug"}) by (container, pod, namespace, node)`. Notice that the query filters for containers in the "smaug" cluster that are not part of Prometheus, have non-empty images, and are not the pod's own "POD" container. The query then groups containers with the same name, pod, namespace and node (a sketch of running this query appears after this list).
- CPU-usage data for each container, using this Prometheus query: `sum(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", container!="POD", cluster="moc/smaug"}[5m])) by (container, pod, namespace, node)`. The same filters apply here. `container_cpu_usage_seconds_total` is a metric that counts how many CPU seconds a container has used in total; taking `rate` over it gives the usage over a time interval. The query then groups containers with the same name, pod, namespace and node.
- Memory-usage percentage data for each node, using this Prometheus query: `node_memory_Active_bytes / node_memory_MemTotal_bytes * 100`.
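For illustration only, here is a sketch of running the memory query over a single one-hour slot, assuming the `prometheus-api-client` connection (`prom`) from the token example above and an arbitrary 30-second step:

```python
# Sketch: fetch one hour of the memory query; the 30s step is an illustrative choice.
from datetime import datetime

memory_query = (
    'sum(container_memory_working_set_bytes{name!~".*prometheus.*", image!="", '
    'container!="POD", cluster="moc/smaug"}) by (container, pod, namespace, node)'
)

result = prom.custom_query_range(
    query=memory_query,
    start_time=datetime(2022, 6, 9, 11, 0),
    end_time=datetime(2022, 6, 9, 12, 0),
    step="30s",
)

# `result` is a list of series; each has `metric` labels (container, pod,
# namespace, node) and `values` as (timestamp, value) pairs.
```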
After running the previous step for months to fetch data from Prometheus, we now have thousands of CSV files for each metric, where each file corresponds to one hour in that time period. This step merges all of that data into a single CSV file.
This script takes the data in the format above and merges it. The merging process takes into account that some hours may have been missed over the months of fetching, and produces CSV files that cover continuous hours only.
Those CSV files may still have missing data in them and require further processing.
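The sketch below illustrates the merging idea only; the file layout and column names are assumptions, not the repository's actual code:

```python
# Illustrative merge: concatenate hourly CSVs in chronological order and stop at
# the first missing hour, so the merged file covers continuous hours only.
import glob
import pandas as pd

files = sorted(glob.glob("data/memory/*.csv"))         # hypothetical directory layout
frames, previous_hour = [], None

for path in files:
    df = pd.read_csv(path, parse_dates=["timestamp"])  # assumed timestamp column
    hour = df["timestamp"].min().floor("H")
    if previous_hour is not None and hour != previous_hour + pd.Timedelta(hours=1):
        break                                          # gap found: keep only the continuous prefix
    frames.append(df)
    previous_hour = hour

pd.concat(frames).to_csv("memory_merged.csv", index=False)
```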
This script takes a merged data file from the previous step and builds a dataset out of that data. The dataset can then be used with deep learning models.
A script that imports and uses the data is available in the repository mentioned below.
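As a rough sketch of what building such a dataset can look like (the column names and window length here are assumptions, not the script's actual choices), one common layout is sliding windows over the merged series:

```python
# Sketch: turn the merged CSV into (history window, next value) pairs for a
# time-series model. Column names and the window length are assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("memory_merged.csv")
series = df["value"].to_numpy(dtype=np.float32)  # assumed value column

window = 24                                      # e.g. 24 hours of history per sample
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]                              # target: the following hour

print(X.shape, y.shape)
```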
This project is part of a Project in Intelligent Systems at the "Technion - Israel Institute of Technology". You can get more information about the project in the project's repository: sirandreww/236754_project_in_intelligent_systems
Here is a link to where the data can be viewed: