DjangoDataSet #1371

ghost · 2022-03-25T15:21:39Z

Hi everybody
I'm trying to integrate Kedro with Django and needed a custom dataset that integrates nicely with the Django ORM.

I had to put Django's initialization code in one of the __init__.py-files to make it work with kedro ipython (don't know if it is the right location though).

Well anyways. Here is my first AbstractDataSet. I wanted to share it with you and am interested in your thoughts.

# my_kedro_project/src/my_kedro_project/__init__.py
import os
import sys
import django

sys.path.append("my/absolute/path/to/the/django/project")
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "my_django_project_name.settings")
django.setup()


# my_kedro_project/src/my_kedro_project/extras/datasets.py
import importlib
import pandas as pd
from kedro.io import AbstractDataSet, DataSetError

class DjangoDataSet(AbstractDataSet):
    def __init__(self, app_name: str, model_name: str, **kwargs):
        self._app_name = app_name.lower()
        # Caveat: capitalize() lowercases everything after the first character,
        # so a CamelCase model name such as "MyModel" becomes "Mymodel".
        self._model_name = model_name.capitalize()
        self._filepath = kwargs.get("filepath", None)
        self._filter_conditions = kwargs.get("filters", None)
        self._as_iterator = kwargs.get("as_iterator", None)

    def _load(self):
        # Returns a pandas.DataFrame, or a QuerySet iterator if as_iterator is set.
        module = importlib.import_module(f"{self._app_name}.models")
        model = getattr(module, self._model_name)

        if self._filter_conditions:
            queryset = model.objects.filter(**self._filter_conditions)
        else:
            queryset = model.objects.all()
        if self._as_iterator:
            return queryset.iterator()
        return pd.DataFrame(queryset.values())

    def _save(self, data: pd.DataFrame) -> None:
        # Exports to a CSV file only; the Django database is not modified.
        if self._filepath is None:
            raise DataSetError("DjangoDataSet needs a filepath to save to.")
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> dict:
        # Report only the configuration, not the imported module or queryset.
        return {
            "app_name": self._app_name,
            "model_name": self._model_name,
            "filepath": self._filepath,
            "filters": self._filter_conditions,
            "as_iterator": self._as_iterator,
        }

And the catalog.yml file looks like this:

my_django_dataset:
  type: my_kedro_project.extras.datasets.DjangoDataSet
  app_name: my_django_app_name
  model_name: my_django_model_name
  # filters:
  #   myfield__isnull: False
  filepath: data/01_raw/my_django_model.csv
  # as_iterator: True

The idea behind filters is that you can filter your Django dataset the "Django way" beforehand. as_iterator returns the loaded dataset as QuerySet.iterator(), which you can loop over in any Python script (e.g. run a pipeline on the iterator) if your dataset is too big for memory; otherwise it is loaded into memory as a pandas.DataFrame (as with a MemoryDataSet). The filepath is optional and only needed if you want to store any processed data in a .csv file afterwards.
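A minimal sketch of how the as_iterator result could be consumed in fixed-size chunks, so that only one chunk is held in memory at a time. The helper name in_chunks and the chunk size are assumptions for illustration and are not part of the dataset above. Note that without a values() call the iterator yields model instances rather than dicts.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_chunks(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical usage with the catalog entry above and as_iterator: True:
#
#     for chunk in in_chunks(catalog.load("my_django_dataset"), 10_000):
#         process(chunk)  # `process` stands in for your own node logic
```

The same helper works on any iterable, so it can be tried out without a Django project.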

What do you think?

datajoely (Collaborator) · 2022-03-25T15:41:55Z

@fantasticle this is really cool - I'm not super familiar with the Django ecosystem so would love to hear what other people think!

antonymilne · 2022-03-25T17:40:43Z

I'm not at all familiar with Django but this looks really cool. Could you say any more about how you use kedro and django together?

> I had to put Django's initialization code in one of the __init__.py-files to make it work with kedro ipython (don't know if it is the right location though).

Really this belongs in the DjangoDataSet definition itself. Not sure why that wouldn't be compatible with kedro ipython but hopefully #1355, which will be in the soon-to-be-released 0.18, will fix this so it can go in the right place.
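As a rough sketch of what moving the initialization next to the dataset could look like, assuming the same path and settings module as in the original post: the two function names below are hypothetical, and Django is imported lazily so the module itself can be imported without it. A dataset could call ensure_django_ready(...) at the top of its _load.

```python
import os
import sys


def configure_django_env(project_path: str, settings_module: str) -> None:
    """Make the Django project importable and point Django at its settings."""
    if project_path not in sys.path:
        sys.path.append(project_path)
    # setdefault keeps a value that was already exported in the environment.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)


def ensure_django_ready(project_path: str, settings_module: str) -> None:
    """Initialise Django once; intended to be safe to call before every load."""
    configure_django_env(project_path, settings_module)

    import django  # imported lazily, only when a dataset is actually used
    from django.apps import apps

    if not apps.ready:
        django.setup()
```

Whether this plays well with kedro ipython on a given Kedro version is not something this sketch verifies.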

1 reply

ghost Mar 28, 2022

Thanks for your replies!

I changed the __init__.py location and it works just as well. So far the DjangoDataSet works great for my needs. I hope someone else might use it too!

avan-sh · 2022-03-29T15:48:14Z

Hi, @fantasticle,
This looks cool. 😎
I've only briefly worked with Django, so wanted to confirm if I understood this correctly.

If I understood this right, the dataset gets data from a Django backend using the ORM and allows us to add filters. This also removes the effort of connecting to the DB using normal credentials and lets Django handle it for us. I feel like I'm definitely missing something here.

I'm confused about the save method though. It seems like it only writes data produced by a pipeline to a particular CSV file and doesn't do anything related to Django. So, in which situations would we prefer using DjangoDataSet over CSVDataSet?

1 reply

ghost Mar 30, 2022

Hi @avan-sh
Thank you for your insights!

Let me explain my idea behind the DjangoDataSet, which is of course only a first draft and totally up for any improvement! At the place where I am working Django is used for lots of projects supporting an SAP infrastructure (I work in the health sector). Decentralized data analysis is often not possible due to strict GDPR guidelines. That's why I am developing an ETL tool that will use Kedro and its pipelines to manage workflows but also has the possibility to connect to your data lakes via Django for data extraction.

Using Django together with Kedro should make it possible to develop a frontend, so that less tech-savvy folk (which is pretty much everyone in health care) could one day use the mapper via a browser, and to build quick APIs (using, for example, django-ninja) to connect to the rest of the infrastructure. Credential management is then handled in the base Django project.

You are right - the save method only implements the export to a .csv file and does not change the underlying dataset in the database, although I guess this could be easily implemented. A limitation with Django, as far as I can see (and I'm no expert in Django or databases), is that load times are horrendous, especially with large datasets. Just to give an idea of what I mean, I plotted different insert times here:

Some of the databases I am working with have close to 300,000,000 entries. So you see, Django is simply not feasible, and I think one has to fall back on standard .csv files so that the transformed data can be added to the target database via native SQL COPY/LOAD statements. I have heard about Dask and (Py)Spark but haven't had the time or resources to experiment with them. I'm sure they will provide a solution for large-scale datasets, but from my point of view the DjangoDataSet is somewhat of a poor man's solution for integrating Django with Kedro projects :-)

EDIT: One thing I forgot to mention is that Django's bulk_update() in my experience is not very fast either. This might be something to consider when saving to the original database.
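For smaller tables, writing back through the ORM could be sketched roughly as below, using bulk_create on the model's manager in batches. The function name save_rows and the default batch size are assumptions, and given the insert times described above this is unlikely to be a good fit for hundreds of millions of rows.

```python
from typing import Any, Dict, Iterable, List


def save_rows(model: Any, rows: Iterable[Dict[str, Any]], batch_size: int = 1000) -> int:
    """Insert `rows` via model.objects.bulk_create in batches.

    `model` is expected to be a Django model class; each row is a dict of
    field names to values. Returns the number of rows sent to the database.
    """
    batch: List[Any] = []
    total = 0
    for row in rows:
        batch.append(model(**row))
        if len(batch) >= batch_size:
            model.objects.bulk_create(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, possibly smaller batch
        model.objects.bulk_create(batch)
        total += len(batch)
    return total


# Hypothetical usage inside _save, with `data` being the pandas.DataFrame:
#
#     save_rows(MyModel, data.to_dict("records"))
```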

I will be happy for further feedback, your thoughts and comments!

afuetterer · 2022-05-11T12:16:48Z

Hi,

in case "all you need to do" is save a Django model's content to a csv file, might the django-import-export library be useful?
You could use the DjangoDataSet to configure a ModelResource that exports the model's data to a csv file.
https://django-import-export.readthedocs.io/en/latest/getting_started.html#exporting-data

You could also use the library to import data from csv into your database. Of course, if you operate on 300,000,000 records, this might not be feasible.

What do you export in the case of relations, i.e. the foreign keys? Are those useful in the exported csv file, or do you need to traverse the relations?
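One possible way to handle relations, sketched under assumptions: QuerySet.values() accepts field paths that follow foreign keys with the double-underscore syntax, so related columns can be flattened into the exported rows. The function name write_csv and the field names in the usage comment are hypothetical.

```python
import csv
from typing import Any, List, TextIO


def write_csv(queryset: Any, fields: List[str], handle: TextIO) -> int:
    """Write the given field paths of `queryset` to `handle` as CSV.

    Field paths may traverse relations, e.g. "author__name". Rows are
    streamed with iterator() so the full result is not held in memory.
    Returns the number of data rows written.
    """
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    count = 0
    for row in queryset.values(*fields).iterator():
        writer.writerow(row)
        count += 1
    return count


# Hypothetical usage:
#
#     with open("data/01_raw/books.csv", "w", newline="") as f:
#         write_csv(Book.objects.all(), ["id", "title", "author__name"], f)
```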
