How to publish Open Data using GitHub
This article is translated from http://manzanamecanica.org/2013/04/como_publicar_open_data_con_github_una_guia_paso_a_paso.html
Many organizations have few resources for "extra" activities, such as creating their own Open Data portals. For example, small municipalities and towns do not have the time, personnel or budget to expose datasets that could be useful and relevant for the community. To alleviate this situation, I present a simple method to publish Open Data using GitHub.
##What is GitHub?
GitHub is a company that offers hosting for software projects, based on the Git version control system. The service is free for Open Source projects. In general, there is no problem for projects smaller than 1 GB in size, which is more than enough to host a large number of datasets.
##If it is for software, why use it for data?
For the sake of this example, data and software are the same :-) Git provides a versioning system for files that makes it easy to add datasets and keep track of changes to them. GitHub also provides a feature called forks, where other users can replicate an existing repository. This copy can be modified and is independent of the original version. Eventually, the changes can be applied to the original repository (e.g., a patch fixing a bug). GitHub makes this process easy by providing pull requests, i.e., a way to ask the owner of the original repository to accept and adopt the changes made in the fork. Since Git is a version control system, every change is recorded and can be seen in the history. All of the above makes Git a nice system to publish and keep track of Open Data.
##Ok, you convinced me, what's next?
First, go to http://github.com and create an account if you don't have one already. Second, it is highly recommended to download the GitHub client (only for Mac and Windows), which provides a nice GUI.
##Create a repository
In the client, we can create a new repository. For this example I will call it `github-open-data-portal`.
Inside the repository I'll create a new folder called `data`.
##Add datasets
For this example, I simply took some datasets from http://datos.gob.cl: Dataset 3877 (http://datos.gob.cl/datasets/ver/3877) and Dataset 3870 (http://datos.gob.cl/datasets/ver/3870). I took these files and put them in the `data` directory I just created.
In the GitHub client, we can see that the files have been detected.
To upload them, we need to make a commit, that is, store the current state of the repository. The new files are selected by default, so we only need to describe these additions to the repository.
Finally, we press commit & sync, and the client uploads the changes to GitHub.
##Modifying datasets
In the case of the file `Transporte.csv`, the last 3 lines are garbage and we want to remove them. We open this file with a text editor and remove the last 3 lines.
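When a dataset is too large to edit comfortably by hand, the same cleanup could be scripted. The following is a minimal sketch in JavaScript (Node.js), assuming the unwanted lines are always the last three. The helper name `dropLastLines` is hypothetical and is not part of the example repository.

```javascript
// Hypothetical helper: remove the last `count` lines from CSV text.
// A single trailing newline is preserved if the input ended with one.
function dropLastLines(text, count) {
  const endsWithNewline = text.endsWith("\n");
  const body = endsWithNewline ? text.slice(0, -1) : text;
  const lines = body.split("\n");
  const kept = lines.slice(0, Math.max(0, lines.length - count));
  if (kept.length === 0) return "";
  return kept.join("\n") + (endsWithNewline ? "\n" : "");
}

// Usage (assumes the script runs from the repository root):
// const fs = require("fs");
// const path = "data/Transporte.csv";
// fs.writeFileSync(path, dropLastLines(fs.readFileSync(path, "utf8"), 3));
```

After running such a script, the change still has to be committed and synced like any manual edit.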
After saving the file, we go back to the GitHub client and find that it has detected the changes made to `Transporte.csv`. Lines in red have been removed; lines in green have been added.
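To make the red and green markers more concrete, here is a simplified sketch of how a line-by-line comparison can label removed and added lines. It uses a textbook longest-common-subsequence approach and is only an illustration, not the implementation used by Git or the GitHub client. The function name `diffLines` is hypothetical.

```javascript
// Simplified line diff based on the longest common subsequence (LCS).
// Returns entries tagged "-" (removed), "+" (added), or " " (unchanged).
function diffLines(oldText, newText) {
  const a = oldText.split("\n");
  const b = newText.split("\n");
  // lcs[i][j] holds the LCS length of a[i..] and b[j..].
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk both files from the top, following the LCS table.
  const result = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result.push({ tag: " ", line: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      result.push({ tag: "-", line: a[i] });
      i++;
    } else {
      result.push({ tag: "+", line: b[j] });
      j++;
    }
  }
  while (i < a.length) result.push({ tag: "-", line: a[i++] });
  while (j < b.length) result.push({ tag: "+", line: b[j++] });
  return result;
}
```

Called with the old and new contents of a file, the entries tagged `-` correspond to the red lines and the entries tagged `+` to the green ones.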
We create a new commit, this time indicating that we have removed some lines from a dataset.
Every time we add or modify datasets, we need to create a new commit and sync with the GitHub repository to upload the changes. GitHub keeps the whole log of changes: files added or removed, who made each change, etc.
##License
It is important to make clear the legal terms under which we are publishing the data. A simple and reasonable measure is to include a LICENSE file (LICENCIA in Spanish) that describes these terms. In our example, the datasets published by datos.gob.cl are licensed under CC-3.0-BY, so we add that license to our repository.
##How do I access the data?
For a third party to access our data, we only need to provide the URL of the repository, in this case https://github.com/alangrafu/github-open-data-portal/tree/master/data. From there, each dataset can be downloaded individually. It is also possible to download the WHOLE repository through the copy of the master branch that GitHub generates, in our case https://github.com/alangrafu/github-open-data-portal/archive/master.zip. Finally, anyone can clone our repository, and thus obtain all the data and its previous versions, using the client provided by GitHub.
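The access URLs above follow a regular pattern, so they can be derived from the user name and repository name. The sketch below is a hypothetical helper (`repoUrls` is not part of the example repository) that assumes the datasets live in a `data` directory and the default branch is `master`, as in this article.

```javascript
// Hypothetical helper: build the access URLs for a GitHub data repository.
function repoUrls(userName, projectName, branch = "master") {
  const base = `https://github.com/${userName}/${projectName}`;
  return {
    browse: `${base}/tree/${branch}/data`,    // browse individual datasets
    archive: `${base}/archive/${branch}.zip`, // whole repository as a ZIP file
    clone: `${base}.git`,                     // URL to give to a Git client
  };
}

// Usage:
// const urls = repoUrls("alangrafu", "github-open-data-portal");
// console.log(urls.archive);
```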
To simplify showing the contents of the repository on other sites, I wrote a small application that uses the GitHub API to list all the files available in the `data` directory of a given repository. To use it, copy the following code into a web page:
    <div id="datasets"></div>
    <script src="https://raw.github.com/alangrafu/github-open-data-portal/master/app/js/jquery.min.js"></script>
    <script src='https://raw.github.com/alangrafu/github-open-data-portal/master/app/js/main.js'></script>
    <script>
    GDP.config.userName = 'alangrafu'; // GitHub user name
    GDP.config.projectName = 'github-open-data-portal'; // repository name
    GDP.render("#datasets");
    </script>
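The source of `main.js` is not shown here, so the following is only a guess at how such a listing could work, not the author's implementation. It assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/contents/{path}`, which returns a JSON array of entries with `name`, `type` and `download_url` fields. The helper names `contentsUrl`, `escapeHtml` and `renderList` are hypothetical.

```javascript
// Build the GitHub contents API URL for a directory in a repository.
function contentsUrl(userName, projectName, dir = "data") {
  return `https://api.github.com/repos/${userName}/${projectName}/contents/${dir}`;
}

// Escape text before placing it into HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn the API response (an array of entries) into an HTML list of files.
// Subdirectories and other non-file entries are skipped.
function renderList(entries) {
  const items = entries
    .filter((entry) => entry.type === "file")
    .map(
      (entry) =>
        `<li><a href="${escapeHtml(entry.download_url)}">${escapeHtml(entry.name)}</a></li>`
    );
  return `<ul>${items.join("")}</ul>`;
}

// Usage in a browser (needs network access):
// fetch(contentsUrl("alangrafu", "github-open-data-portal"))
//   .then((response) => response.json())
//   .then((entries) => {
//     document.querySelector("#datasets").innerHTML = renderList(entries);
//   });
```

Keeping the URL building and the rendering in separate functions means the rendering step can be checked without calling the API.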
##Demo
An example of the script in action can be seen at http://graves.cl/example-github-opendata
##In summary
Publishing data on the Web keeps getting easier. With this method it is possible to publish data in 10 minutes or less. The steps are:
- Create a GitHub account and download the client
- Create a repository (or fork the example one, github-open-data-portal)
- Add as many datasets as you want to the data directory
- Optionally, use the listing script from the example to list the available files on other web pages