Standardize JSON format #57

PeterSulcs · 2021-05-20T05:25:42Z

❯ curl https://machinedatahub.ai/datasets.json | jq ".[0] | ."
{
  "id": "1",
  "Rank": 1,
  "Name": "Combined Cycle Power Plant",
  "Owner": "University of California Irvine",
  "URL": [
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip"
  ],
  "Short Summary": "The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant. A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has effect on the Steam Turbine, he other three of the ambient variables effect the GT performance. For comparability with our baseline studies, and to allow 5x2 fold statistical tests be carried out, we provide the data shuffled five times. For each shuffling 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.",
  "One Line": "Data collected from a Combined Power Plant working full load over 6 years.",
  "File Type": "???",
  "Sector": "Power",
  "ML Type": [
    "Regression"
  ],
  "Labeled": "Yes",
  "Time Series": "No",
  "Simulation (Yes/No)": "N/A",
  "Attributes": 4,
  "Instances": 9568,
  "Downloads": 191037,
  "Likes": 0,
  "File Size": "3.7 MB",
  "img_link": "https://www.miga.org/sites/default/files/2018-06/power-plant-bright-blue-sky.jpg",
  "Datasets": [
    {
      "Name": "Dataset 1",
      "URL": "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip",
      "Likes": 0,
      "Downloads": 191037,
      "File Size": "3.7 MB"
    }
  ]
}

I would recommend the following refactor. Note that it will have to be done across both the web app and the Python library, because the schema is deeply embedded in the Python code.

Make JSON keys camelCase without spaces, unless the key is an abbreviation (id is fine, URL is fine).
JSON keys should not include special characters (e.g. Simulation (Yes/No)).
Make URL a plain string in all cases. Currently it is an array, except for one dataset where it is a string. It is usually an array of length 1, but sometimes it holds the URL of each file. Since the files are now tracked separately in the nested format, I would recommend that this URL link to the human-readable webpage where the data is accessible.
When a value is N/A, Unknown, or blank, the normal way of representing it in JSON is null.
One of the ids is represented as a string rather than a number.
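
As a rough illustration of the rules above, here is a minimal Python sketch of a normalizer. The function names (to_camel_case, normalize_value, normalize_record) and the set of null markers are assumptions made for this example, not part of the existing library. Multi-item URL arrays are deliberately left untouched, since choosing the human-readable landing page needs manual curation.

```python
import re

# Assumption: these strings (case-insensitive) stand for "no value".
NULL_MARKERS = {"n/a", "unknown", ""}
# Abbreviations that keep their current spelling.
ABBREVIATIONS = {"id", "URL"}


def to_camel_case(key):
    """Convert a key such as 'Short Summary' or 'img_link' to camelCase."""
    if key in ABBREVIATIONS:
        return key
    # Drop parenthesised qualifiers such as "(Yes/No)", then split into words.
    words = re.findall(r"[A-Za-z0-9]+", re.sub(r"\(.*?\)", "", key))
    if not words:
        return key
    head, *tail = words
    return head.lower() + "".join(w[:1].upper() + w[1:] for w in tail)


def normalize_value(value):
    """Map null markers to None and recurse into lists and nested objects."""
    if isinstance(value, str) and value.strip().lower() in NULL_MARKERS:
        return None
    if isinstance(value, list):
        return [normalize_value(v) for v in value]
    if isinstance(value, dict):
        return normalize_record(value)
    return value


def normalize_record(record):
    """Return a copy of one dataset record following the proposed rules."""
    out = {}
    for key, value in record.items():
        new_key = to_camel_case(key)
        value = normalize_value(value)
        # ids should be numbers, not numeric strings.
        if new_key == "id" and isinstance(value, str) and value.isdigit():
            value = int(value)
        # Collapse a single-item URL array to a plain string.
        if new_key == "URL" and isinstance(value, list) and len(value) == 1:
            value = value[0]
        out[new_key] = value
    return out
```

Applied to the record shown earlier, this would turn "Short Summary" into shortSummary, "Simulation (Yes/No)": "N/A" into simulation: null, and "id": "1" into id: 1.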

 curl https://machinedatahub.ai/datasets.json -s | jq ".[] | .URL | select(length > 1) | ."
[
  "http://archive.ics.uci.edu/ml/machine-learning-databases/00198/Faults.NNA",
  "http://archive.ics.uci.edu/ml/machine-learning-databases/00198/Faults27x7_var"
]
[
  "https://ti.arc.nasa.gov/c/5/, https://ti.arc.nasa.gov/c/9/",
  "https://ti.arc.nasa.gov/c/14/",
  "https://ti.arc.nasa.gov/c/15/",
  "https://ti.arc.nasa.gov/c/16/",
  "https://ti.arc.nasa.gov/c/17/"
]
[
  "https://ti.arc.nasa.gov/c/21/",
  "https://ti.arc.nasa.gov/c/20/",
  "https://ti.arc.nasa.gov/c/19/"
]
[
  "https://ti.arc.nasa.gov/c/6/",
  "https://ti.arc.nasa.gov/c/47/"
]
[
  "https://ti.arc.nasa.gov/c/25/",
  "https://ti.arc.nasa.gov/c/26/",
  "https://ti.arc.nasa.gov/c/27/",
  "https://ti.arc.nasa.gov/c/28/",
  "https://ti.arc.nasa.gov/c/29/",
  "https://ti.arc.nasa.gov/c/30/"
]
[
  "https://ti.arc.nasa.gov/c/33/",
  "https://ti.arc.nasa.gov/c/34/",
  "https://ti.arc.nasa.gov/c/35/"
]
[
  "https://ti.arc.nasa.gov/c/38/",
  "https://ti.arc.nasa.gov/c/39/",
  "https://ti.arc.nasa.gov/c/40/",
  "https://ti.arc.nasa.gov/c/41/",
  "https://ti.arc.nasa.gov/c/43/",
  "https://ti.arc.nasa.gov/c/44/"
]
[
  "https://ti.arc.nasa.gov/c/45/",
  "https://ti.arc.nasa.gov/c/46/"
]
"http://archive.ics.uci.edu/ml/machine-learning-databases/secom/"
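
Note that the last line of the output above is the one string-valued URL: jq's length on a string counts characters, so it also passes select(length > 1). To separate the cases explicitly, a small Python audit along these lines could help. This is only a sketch; url_shape and audit_url_shapes are hypothetical helper names, and the records are assumed to be the already-parsed contents of datasets.json.

```python
from collections import Counter


def url_shape(value):
    """Classify the shape of a record's URL field."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "single-item list" if len(value) == 1 else "multi-item list"
    return "other"


def audit_url_shapes(records):
    """Count how many records use each URL shape."""
    return Counter(url_shape(record.get("URL")) for record in records)
```

The input would typically come from json.load on a downloaded copy of datasets.json; any count outside "string" marks records that still need migrating.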


PeterSulcs added the invalid label May 20, 2021

PeterSulcs assigned cbarnes7, mattsul and pjw901015 May 20, 2021


PeterSulcs commented May 20, 2021 • edited by mattsul
