Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize unrenderable content size query #12966

Open
bjester opened this issue Dec 20, 2024 · 3 comments
Open

Optimize unrenderable content size query #12966

bjester opened this issue Dec 20, 2024 · 3 comments
Labels
APP: Device Re: Device App (content import/export, facility-syncing, user permissions, etc.) DEV: backend Python, databases, networking, filesystem... TAG: cloud Issues specific to running Kolibri in a cloud environment TAG: performance User-facing performance

Comments

@bjester
Copy link
Member

bjester commented Dec 20, 2024

Observed behavior

When importing or managing a channel, a couple of queries are run to determine the total file content size. In order to achieve that, the calculation needs account for content nodes that have files with presets that are not supported. The resulting query runs poorly for a channel with large amounts of content, like one that has 30GB of resources. Below is a snippet of the resulting query.

SELECT SUM("__col1")
FROM (
    SELECT DISTINCT
        "content_localfile"."id"        AS Col1,
        "content_localfile"."extension" AS Col2,
        "content_localfile"."available" AS Col3,
        "content_localfile"."file_size" AS Col4,
        "content_localfile"."file_size" AS "__col1"
    FROM "content_localfile"
        INNER JOIN "content_file"
            ON ("content_localfile"."id" = "content_file"."local_file_id")
    WHERE "content_file"."contentnode_id" IN (
        SELECT V0."id"
        FROM "content_contentnode" V0
        WHERE (V0."channel_id" = '<CHANNEL_ID>'
            AND NOT (
                (
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'html5_zip' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'h5p' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'imscp_zip' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'epub' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'slideshow_manifest' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'exercise' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'audio' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'high_res_video' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'low_res_video' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    ) OR
                    EXISTS(
                        SELECT (1) AS "a"
                        FROM "content_file" U1
                        WHERE (U1."preset" = 'document' AND U1."contentnode_id" = V0."id")
                        LIMIT 1
                    )
                )
            )
        )
    )
) subquery

Query plan (partial)
Image

Expected behavior

The query should be optimized to reduce its complexity and streamline performance. For example, the following is a first pass at optimizing the query and reduces its complexity, but further optimization of the query and possibly how the total size is calculated could be further improved.

SELECT SUM("__col1")
FROM (
    SELECT DISTINCT
        "content_localfile"."id"        AS Col1,
        "content_localfile"."extension" AS Col2,
        "content_localfile"."available" AS Col3,
        "content_localfile"."file_size" AS Col4,
        "content_localfile"."file_size" AS "__col1"
    FROM "content_localfile"
        INNER JOIN "content_file"
            ON ("content_localfile"."id" = "content_file"."local_file_id")
        INNER JOIN "content_contentnode"
            ON ("content_file"."contentnode_id" = "content_contentnode"."id")
                AND "content_contentnode"."channel_id" = '<CHANNEL_ID>'
    WHERE NOT EXISTS(
        SELECT 1
        FROM "content_file" U1
        WHERE U1."contentnode_id" = "content_file"."contentnode_id"
            AND U1."preset" IN ('html5_zip', 'h5p', 'imscp_zip', 'epub', 'slideshow_manifest', 'exercise', 'audio', 'high_res_video', 'low_res_video')
    )
) AS subquery;

Care should be taken in regards to how the presets are used with the query. The origin of the presets comes from dynamic logic that determines what presets are supported by the Kolibri plugins that are enabled (see renderable_contentnodes_without_topics_q_filter). There could be other important queries that rely on how those are prepared as Django queryset filters.

User-facing consequences

Requests to import or manage a large channel can encounter timeouts, and the user encounters the screen "Sorry! Something went wrong". In cloud environments, timeouts are important in managing request load and attention should be given to optimizing the app prior to increasing timeouts.

Steps to reproduce

Attempt to import a large channel, like one with 30GB of resources, while using Kolibri configured with postgresql and minimal resources given to the database (1 CPU and 4GB of RAM)

Context

Kolibri 0.17.5
w/ PostgreSQL v13
(BCK)

@bjester bjester added TAG: performance User-facing performance DEV: backend Python, databases, networking, filesystem... APP: Device Re: Device App (content import/export, facility-syncing, user permissions, etc.) TAG: cloud Issues specific to running Kolibri in a cloud environment labels Dec 20, 2024
@Mamatha1718
Copy link

Hi @bjester , I would like to tackle this issue, I simplified the query by combining multiple EXISTS checks into a single NOT EXISTS clause with IN, ensuring unsupported preset values are handled efficiently. I also added database indexes on key columns and tested the optimized query with large datasets, significantly improving performance.

Let me know if there’s anything else you'd like me to address.
Thank You,

@akolson
Copy link
Member

akolson commented Dec 27, 2024

Hi @Mamatha1718!

Thank you for your interest in contributing to this issue! However, this issue is not open for contribution. Please be sure to look out for issues marked with the label "help wanted" that are more suitable for contribution. You can find them here.

Thanks again and we are looking forward to your contribution.

@Mamatha1718
Copy link

Thank you for your response. I will check and contribute issues mention as help wanted. Thank you for showing that type of issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APP: Device Re: Device App (content import/export, facility-syncing, user permissions, etc.) DEV: backend Python, databases, networking, filesystem... TAG: cloud Issues specific to running Kolibri in a cloud environment TAG: performance User-facing performance
Projects
None yet
Development

No branches or pull requests

3 participants