Standardize JSON format #57

Open
4 of 5 tasks
PeterSulcs opened this issue May 20, 2021 · 0 comments
Assignees
Labels
invalid This doesn't seem right

Comments

@PeterSulcs
Contributor

PeterSulcs commented May 20, 2021

❯ curl https://machinedatahub.ai/datasets.json | jq ".[0] | ."
{
  "id": "1",
  "Rank": 1,
  "Name": "Combined Cycle Power Plant",
  "Owner": "University of California Irvine",
  "URL": [
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip"
  ],
  "Short Summary": "The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant. A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has effect on the Steam Turbine, he other three of the ambient variables effect the GT performance. For comparability with our baseline studies, and to allow 5x2 fold statistical tests be carried out, we provide the data shuffled five times. For each shuffling 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.",
  "One Line": "Data collected from a Combined Power Plant working full load over 6 years.",
  "File Type": "???",
  "Sector": "Power",
  "ML Type": [
    "Regression"
  ],
  "Labeled": "Yes",
  "Time Series": "No",
  "Simulation (Yes/No)": "N/A",
  "Attributes": 4,
  "Instances": 9568,
  "Downloads": 191037,
  "Likes": 0,
  "File Size": "3.7 MB",
  "img_link": "https://www.miga.org/sites/default/files/2018-06/power-plant-bright-blue-sky.jpg",
  "Datasets": [
    {
      "Name": "Dataset 1",
      "URL": "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip",
      "Likes": 0,
      "Downloads": 191037,
      "File Size": "3.7 MB"
    }
  ]
}

I would recommend the following refactor. Note that it will have to be done across both the web app and the Python lib, because the schema is deeply embedded in the Python code.

  • Make JSON keys camelCase, without spaces, unless the key is an abbreviation (id is fine, URL is fine).
  • JSON keys should not include special characters (e.g. Simulation (Yes/No)).
  • Make URL a plain string in all cases. Currently it is an array, except for one dataset where it is a string. It is mostly an array of length 1, but sometimes it holds the URL for each of the files. Since the files are now tracked separately in the nested format, I would recommend that this URL be a link to the human-readable webpage where the data is accessible.
  • When a value is N/A, Unknown, or blank, the normal way of representing this in JSON is null.
  • One of the ids is represented as a string rather than a number.
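As a rough illustration of the first two points, the key renaming could be sketched in Python along these lines. This is only a sketch under assumptions: the helper names `to_camel_case` and `rename_keys` are hypothetical and not part of the existing Python lib, and dropping a parenthesised qualifier such as "(Yes/No)" is one possible choice, not a decided rule.

```python
import re

# Abbreviations the issue says may stay as they are.
KEEP_AS_IS = {"id", "URL"}


def to_camel_case(key: str) -> str:
    """Convert a key such as 'Short Summary' to 'shortSummary' (hypothetical helper)."""
    if key in KEEP_AS_IS:
        return key
    # Assumption: drop parenthesised qualifiers, e.g. 'Simulation (Yes/No)' -> 'Simulation'.
    cleaned = re.sub(r"\(.*?\)", " ", key)
    # Split on anything that is not a letter or digit (spaces, underscores, slashes).
    words = [w for w in re.split(r"[^A-Za-z0-9]+", cleaned) if w]
    if not words:
        raise ValueError(f"cannot derive a key name from {key!r}")
    first, rest = words[0], words[1:]
    return first.lower() + "".join(w[:1].upper() + w[1:] for w in rest)


def rename_keys(value):
    """Recursively rename dict keys, including those in nested lists such as 'Datasets'."""
    if isinstance(value, dict):
        return {to_camel_case(k): rename_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(v) for v in value]
    return value
```

Under this sketch, "Short Summary" becomes shortSummary, "img_link" becomes imgLink, and "Simulation (Yes/No)" becomes simulation, while id and URL are left untouched.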
The URL values that are multi-element arrays, plus the one plain string:

❯ curl https://machinedatahub.ai/datasets.json -s | jq ".[] | .URL | select(length > 1) | ."
[
  "http://archive.ics.uci.edu/ml/machine-learning-databases/00198/Faults.NNA",
  "http://archive.ics.uci.edu/ml/machine-learning-databases/00198/Faults27x7_var"
]
[
  "https://ti.arc.nasa.gov/c/5/, https://ti.arc.nasa.gov/c/9/",
  "https://ti.arc.nasa.gov/c/14/",
  "https://ti.arc.nasa.gov/c/15/",
  "https://ti.arc.nasa.gov/c/16/",
  "https://ti.arc.nasa.gov/c/17/"
]
[
  "https://ti.arc.nasa.gov/c/21/",
  "https://ti.arc.nasa.gov/c/20/",
  "https://ti.arc.nasa.gov/c/19/"
]
[
  "https://ti.arc.nasa.gov/c/6/",
  "https://ti.arc.nasa.gov/c/47/"
]
[
  "https://ti.arc.nasa.gov/c/25/",
  "https://ti.arc.nasa.gov/c/26/",
  "https://ti.arc.nasa.gov/c/27/",
  "https://ti.arc.nasa.gov/c/28/",
  "https://ti.arc.nasa.gov/c/29/",
  "https://ti.arc.nasa.gov/c/30/"
]
[
  "https://ti.arc.nasa.gov/c/33/",
  "https://ti.arc.nasa.gov/c/34/",
  "https://ti.arc.nasa.gov/c/35/"
]
[
  "https://ti.arc.nasa.gov/c/38/",
  "https://ti.arc.nasa.gov/c/39/",
  "https://ti.arc.nasa.gov/c/40/",
  "https://ti.arc.nasa.gov/c/41/",
  "https://ti.arc.nasa.gov/c/43/",
  "https://ti.arc.nasa.gov/c/44/"
]
[
  "https://ti.arc.nasa.gov/c/45/",
  "https://ti.arc.nasa.gov/c/46/"
]
"http://archive.ics.uci.edu/ml/machine-learning-databases/secom/"
PeterSulcs added the "invalid" label ("This doesn't seem right") on May 20, 2021
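The value-level points in the comment above (null for missing values, numeric ids, URL as a single string) could be approached with a small Python pass like the one below. It is a sketch under assumptions: `normalize_record` and `records_needing_url_review` are hypothetical names, treating "???" as a placeholder is an assumption based only on the sample record, and only top-level fields are handled. Because the recommended URL is a human-readable landing page that cannot be derived from the file links, the sketch only reports records for manual review instead of rewriting them.

```python
# Strings treated as "no value". "???" is an assumption taken from the
# sample record's "File Type" field; the issue itself names N/A, Unknown, blank.
PLACEHOLDERS = {"", "n/a", "unknown", "???"}


def normalize_record(record: dict) -> dict:
    """Return a copy with placeholder strings set to None and a numeric id."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            value = None  # serialised as null by json.dumps
        out[key] = value
    # Convert an id such as "1" to the number 1; leave anything else alone.
    if isinstance(out.get("id"), str) and out["id"].isdigit():
        out["id"] = int(out["id"])
    return out


def records_needing_url_review(records: list) -> list:
    """Return the ids of records whose URL is not already a single string."""
    return [r.get("id") for r in records if not isinstance(r.get("URL"), str)]
```

A record shaped like the sample would come back with id as the number 1 and with "File Type" and "Simulation (Yes/No)" set to None, and it would still be listed for URL review because its URL is an array.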