FluentBit is unable to recover from too many accumulated buffers with filesystem storage type #882

Open
mohannara opened this issue Dec 18, 2024 · 3 comments

mohannara commented Dec 18, 2024

Describe the question/issue

Fluent Bit sends data to S3 and works fine for some time, then it gets stuck and stops sending data to S3.

Configuration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-role
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
      - pods/logs
      - nodes
      - nodes/proxy
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-role
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: amazon-cloudwatch
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 1024M
        storage.max_chunks_up     1024
        storage.metrics           On

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       600MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}
        #storage.pause_on_chunks_overlimit   true
        

    [INPUT]
        Name                tail
        Tag                 fluentbit.*
        Path                /var/log/containers/fluent-bit*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                  multiline
        match                 application.*
        multiline.key_content log
        multiline.parser      java_multiline
        #emitter_mem_buf_limit 4048MB
        emitter_storage.type  filesystem
        buffer                On


    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Buffer_Size         0

    # [OUTPUT]
    #     Name                cloudwatch_logs
    #     Match               application.*
    #     region              ${AWS_REGION}
    #     log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
    #     log_stream_prefix   ${HOST_NAME}-
    #     auto_create_group   true
    #     extra_user_agent    container-insights

    [OUTPUT]
        Name s3
        Match application.*
        bucket dte-crm-subsystem-prod-stdout
        region ${AWS_REGION}
        store_dir /var/log/s3-dte
        store_dir_limit_size 300MB
        #total_file_size 5MB
        upload_timeout 30s


    # [OUTPUT]
    #     name                  stdout
    #     match                 application.*


  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME                   hostname
        Rename              _SYSTEMD_UNIT               systemd_unit
        Rename              MESSAGE                     message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v2

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Key                 message
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v2

    [OUTPUT]
        Name                cloudwatch_logs
        Match               host.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/host
        log_stream_prefix   ${HOST_NAME}.
        auto_create_group   true
        extra_user_agent    container-insights

  parsers.conf: |
    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [MULTILINE_PARSER]
        Name          java_multiline
        type          regex
        flush_timeout 5000
        
        # Regex rules for multiline parsing
        # ---------------------------------
        
        # configuration hints:
        
        #  - first state always has the name: start_state
        #  - every field in the rule must be inside double quotes
        
        # rules |   state name  | regex pattern                  | next state
        # ------|---------------|--------------------------------------------
        rule      "start_state"   "(^\[?\d{4}-\d{1,2}-\d{1,2}[T\s]\d{1,2}:\d{1,2}:\d{1,2}.*)$"  "cont"
        rule      "cont"          "/^(?!(\[?\d{4}-\d{1,2}-\d{1,2}).*)/"                     "cont"

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - name: fluent-bit
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.2.20241008
        imagePullPolicy: Always
        env:
            - name: AWS_REGION
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: logs.region
            - name: CLUSTER_NAME
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: cluster.name
            - name: HTTP_SERVER
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: http.server
            - name: HTTP_PORT
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: http.port
            - name: READ_FROM_HEAD
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: read.head
            - name: READ_FROM_TAIL
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: read.tail
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: CI_VERSION
              value: "k8s/1.3.26"
        resources:
            limits:
              cpu: 2000m
              memory: 6144Mi
            requests:
              cpu: 500m
              memory: 600Mi
        volumeMounts:
        # Please don't change below read-only permissions
        - name: fluentbitstate
          mountPath: /var/fluent-bit/state
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: runlogjournal
          mountPath: /run/log/journal
          readOnly: true
        - name: dmesg
          mountPath: /var/log/dmesg
          readOnly: true
      terminationGracePeriodSeconds: 10
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      volumes:
      - name: fluentbitstate
        hostPath:
          path: /var/fluent-bit/state
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: runlogjournal
        hostPath:
          path: /run/log/journal
      - name: dmesg
        hostPath:
          path: /var/log/dmesg
      serviceAccountName: fluent-bit
      nodeSelector:
        kubernetes.io/os: linux

Fluent Bit Log Output

[2024/12/18 12:33:43] [ info] [input:storage_backlog:storage_backlog.8] queueing tail.0:1-1734160287.631424024.flb
[2024/12/18 12:33:43] [ info] [input:storage_backlog:storage_backlog.8] queueing tail.0:1-1734160331.698451330.flb
[2024/12/18 12:33:43] [ info] [input:storage_backlog:storage_backlog.8] queueing tail.0:1-1734160364.698626403.flb
[2024/12/18 12:33:43] [ info] [input:storage_backlog:storage_backlog.8] queueing tail.0:1-1734160430.908663992.flb
[2024/12/18 12:33:43] [ info] [input:storage_backlog:storage_backlog.8] queueing tail.0:1-1734160462.846875919.flb
[2024/12/18 12:33:43] [ info] [input:storage_backlog:storage_backlog.8] queueing tail.0:1-1734160468.165929808.flb

Chunks are pending in the locations below.

/var/fluent-bit/state/flb-storage/tail.0
sh-4.2$ ls -ltr |wc
9902
sh-4.2$ ls -ltr| more
-rw------- 1 root root 2048984 Dec 12 18:46 1-1734028879.597843005.flb
-rw------- 1 root root 2048788 Dec 12 18:48 1-1734028951.455391458.flb
-rw------- 1 root root 2049120 Dec 12 18:48 1-1734028999.60364959.flb
-rw------- 1 root root 2048756 Dec 12 18:49 1-1734028949.226740813.flb
-rw------- 1 root root 2059566 Dec 12 18:50 1-1734029020.484330725.flb

/var/fluent-bit/state/flb-storage/emitter.9
sh-4.2$ ls -ltr|wc
1104
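The chunk file names in the listing above appear to embed their creation time, which makes it possible to estimate how old the backlog is. A minimal sketch, assuming the integer after `1-` is Unix epoch seconds (this matches the modification times in the listing, but it is an inference, not something the report confirms); `chunk_created_at` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def chunk_created_at(filename: str) -> datetime:
    """Parse '1-<epoch_seconds>.<fraction>.flb' into a UTC datetime.

    Assumption: the integer after the first '-' is Unix epoch seconds.
    """
    stem = filename.rsplit(".flb", 1)[0]        # '1-1734028879.597843005'
    epoch_part = stem.split("-", 1)[1]          # '1734028879.597843005'
    seconds = int(epoch_part.split(".", 1)[0])  # drop the fractional part
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# File names copied from the directory listing above.
names = [
    "1-1734028879.597843005.flb",
    "1-1734028951.455391458.flb",
    "1-1734028999.60364959.flb",
    "1-1734028949.226740813.flb",
    "1-1734029020.484330725.flb",
]

oldest = min(names, key=chunk_created_at)
print(oldest, chunk_created_at(oldest).isoformat())
```

Comparing the oldest timestamp with the current time gives a rough age for the backlog in `tail.0`.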

Fluent Bit Version Info

Fluent Bit is running as a Daemonset
public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.2.20241008
[fluent bit] version=1.9.10, commit=eba89f4660
EKS 1.30

Application Details

Fluent Bit should concatenate stack traces and application logs that are printed across multiple lines.

Metrics

sh-4.2$ curl -s http://127.0.0.1:2020/api/v1/storage | jq
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 12510,
      "mem_chunks": 9,
      "fs_chunks": 12501,
      "fs_chunks_up": 3158,
      "fs_chunks_down": 9343
    }
  },
  "input_chunks": {
    "tail.0": {
      "status": {
        "overlimit": true,
        "mem_size": "5.7G",
        "mem_limit": "572.2M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.1": {
      "status": {
        "overlimit": false,
        "mem_size": "38.5K",
        "mem_limit": "4.8M"
      },
      "chunks": {
        "total": 1,
        "up": 1,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.2": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "4.8M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "systemd.3": {
      "status": {
        "overlimit": false,
        "mem_size": "64.6K",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 5,
        "up": 5,
        "down": 0,
        "busy": 5,
        "busy_size": "64.6K"
      }
    },
    "tail.4": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "47.7M"
      },
      "chunks": {
        "total": 1,
        "up": 0,
        "down": 1,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.5": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "4.8M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.6": {
      "status": {
        "overlimit": false,
        "mem_size": "111.3K",
        "mem_limit": "4.8M"
      },
      "chunks": {
        "total": 3,
        "up": 3,
        "down": 0,
        "busy": 2,
        "busy_size": "75.0K"
      }
    },
    "tail.7": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "4.8M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "storage_backlog.8": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 3008,
        "up": 3008,
        "down": 0,
        "busy": 1890,
        "busy_size": "3.6G"
      }
    },
    "emitter_for_multiline.0": {
      "status": {
        "overlimit": false,
        "mem_size": "7.9M",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 200,
        "up": 150,
        "down": 50,
        "busy": 150,
        "busy_size": "7.9M"
      }
    }
  }
}
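To make the storage output above easier to scan, the overlimit inputs can be pulled out programmatically. The following is a rough sketch, not an official tool: `to_bytes` and `overlimit_inputs` are hypothetical helpers, and the size suffixes (`b`, `K`, `M`, `G`) are assumed to be 1024-based. That assumption is at least consistent with the configured `Mem_Buf_Limit 600MB` being reported as `572.2M`. The sample is trimmed to two inputs copied from the output above.

```python
import json

# Assumption: the endpoint reports sizes with 1024-based suffixes.
UNITS = {"b": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def to_bytes(text: str) -> float:
    """Convert a human-readable size such as '5.7G' or '0b' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def overlimit_inputs(storage: dict) -> dict:
    """Map each overlimit input name to its mem_size / mem_limit ratio."""
    result = {}
    for name, info in storage["input_chunks"].items():
        status = info["status"]
        if status["overlimit"]:
            used = to_bytes(status["mem_size"])
            limit = to_bytes(status["mem_limit"])
            result[name] = used / limit if limit else float("inf")
    return result


# Two inputs copied from the /api/v1/storage output above.
SAMPLE = json.loads("""
{
  "input_chunks": {
    "tail.0": {"status": {"overlimit": true, "mem_size": "5.7G", "mem_limit": "572.2M"}},
    "tail.1": {"status": {"overlimit": false, "mem_size": "38.5K", "mem_limit": "4.8M"}}
  }
}
""")

for name, ratio in overlimit_inputs(SAMPLE).items():
    print(f"{name}: using {ratio:.1f}x its memory limit")
```

On the data above, only `tail.0` is flagged, at roughly ten times its limit.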

@mohannara
Author

Can someone help with this case? It is a bit urgent.

@swapneils
Contributor

swapneils commented Jan 3, 2025

From only the information provided, I'm not sure what is happening in your application.

Can you follow all of the below recommendations, or let us know if any of them end up solving your issue?

  1. Review the full fluent bit logs to see if our debugging guide applies to anything there
  2. Check if there are lines saying "failed to flush chunk", and if so trace the specific plugin worker that saw that failure to see why the flush failed
  3. If this behavior occurs at some point after the output buffer is filled, provide the fluent bit logs a few minutes around the point where the buffer is filled
  4. Provide the fluent bit logs from a few minutes before to a few minutes after the point where the S3 plugin starts misbehaving
  5. Check (e.g. via utilization metrics) if the filesystem and memory buffers are static or steadily increasing (if the former, it's likely Fluent Bit is not receiving any logs in the first place, in which case the issue would likely be elsewhere)
  6. Clarify the meaning of "not sending data"; i.e. is there 0 data sent, is there a trickle of data but not at the same rate as the input, does Fluent Bit start sending data again if you cut off the input data and let it process its backlog?
  7. Verify that logs are being sent to the fluent bit process at all, or if your application itself is the source of the misbehavior
  8. Get reproduction steps from a clean start for the issue
  9. Try using other input plugins (e.g. tcp) to see if this issue occurs there as well
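For recommendation 5, one way to tell whether the filesystem buffer is static or steadily increasing is to poll the storage endpoint used earlier in this thread and compare the first and last readings. This is only a sketch under assumptions: `fetch_fs_chunks` and `classify_trend` are hypothetical helper names, the URL is the one from the `curl` command in the report, and the sample readings other than 12501 are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_fs_chunks(url: str = "http://127.0.0.1:2020/api/v1/storage") -> int:
    """Read the current fs_chunks count from the storage endpoint."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)["storage_layer"]["chunks"]["fs_chunks"]


def classify_trend(samples: list, tolerance: int = 0) -> str:
    """Compare the first and last readings of a series of chunk counts.

    Returns 'increasing', 'decreasing', or 'static'. Differences within
    `tolerance` chunks count as static.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    delta = samples[-1] - samples[0]
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "static"


# Hypothetical readings taken a minute apart (only 12501 is from the report).
print(classify_trend([12000, 12250, 12501]))
```

In practice the list would be built by calling `fetch_fs_chunks()` in a loop with a sleep between calls, from inside the pod or the node since the DaemonSet uses host networking.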

@mohannara
Author

Hi @swapneils,
I was able to pinpoint the issue. If we remove the multiline filter and use the multiline parser above (java_multiline) in the tail plugin, as shown below, filesystem buffers no longer pile up. The Fluent Bit pod's memory also stays under 100MB.
However, records with an incorrect format show up from time to time. It seems the cri parser is not applied before the multiline parser.
Could you please check the attached ConfigMap and send your recommendations?
fluent-bit-config-ticket.txt

    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri, java_multiline

Sample multiline log:
2025-01-05T06:27:21.661666737Z stdout F 2025-01-05 11:57:21.661 [scheduling-1] ERROR l.d.s.repository.impl.OmDaoImpl getAllSalesChannelSpreadOrderIds DWEmail_cb919ac6-3dac-42de-919a-c63dacf2de05 - StatementCallback; bad SQL grammar [SELECT T.ORDER_ID FROM SALES_APP_LOG T,ORDERS O,OFFER OFR WHERE T.ORDER_ID=O.ORDER_ID AND T.OFFER_CODE=OFR.OFFER_CODE AND O.ORDER_CHANNEL = 'ANDROID_APP' AND O.ORDER_STATUS = 'SUBMITTED_TO_POS' AND OFR.OFFER_TYPE IN ('SPREAD','SPREAD_HP') AND T.SPREAD_CREATION_STATUS IS NULL]
2025-01-05T06:27:21.661706419Z stdout F java.sql.SQLSyntaxErrorException: ORA-00904: "T"."SPREAD_CREATION_STATUS": invalid identifier
2025-01-05T06:27:21.661712069Z stdout F
2025-01-05T06:27:21.661717809Z stdout F at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:509)
2025-01-05T06:27:21.66172284Z stdout F at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:461)
2025-01-05T06:27:21.66172831Z stdout F at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1104)
2025-01-05T06:27:21.66173264Z stdout F at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:550)

Please note that Fluent Bit is running as a DaemonSet and that more than 200 pods are running on a worker node.
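The suspicion about parser ordering can be checked against the sample lines outside Fluent Bit. The sketch below applies the `start_state` pattern from `java_multiline` to the raw CRI lines and to the payload left after stripping the CRI prefix. It uses Python's `re`, not Fluent Bit's regex engine, and the `CRI` pattern is a simplified stand-in written for this example, so it says nothing about how Fluent Bit actually orders the parsers. It only shows what the rule would do on each form of the line.

```python
import re

# start_state pattern copied from the java_multiline parser in the ConfigMap.
START = re.compile(r"(^\[?\d{4}-\d{1,2}-\d{1,2}[T\s]\d{1,2}:\d{1,2}:\d{1,2}.*)$")

# Simplified CRI prefix: '<time> <stream> <F|P> <log>' (an assumption for this demo).
CRI = re.compile(r"^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<flag>[FP]) ?(?P<log>.*)$")

# Lines from the sample log above; the first one is shortened.
RAW = [
    "2025-01-05T06:27:21.661666737Z stdout F 2025-01-05 11:57:21.661 [scheduling-1] ERROR l.d.s.repository.impl.OmDaoImpl",
    '2025-01-05T06:27:21.661706419Z stdout F java.sql.SQLSyntaxErrorException: ORA-00904: "T"."SPREAD_CREATION_STATUS": invalid identifier',
    "2025-01-05T06:27:21.661712069Z stdout F",
    "2025-01-05T06:27:21.661717809Z stdout F at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:509)",
]


def is_start(line: str) -> bool:
    """True if the line would open a new multiline record under start_state."""
    return START.match(line) is not None


def cri_payload(line: str) -> str:
    """Strip the CRI prefix and return the application log text."""
    m = CRI.match(line)
    return m.group("log") if m else line


raw_starts = [is_start(line) for line in RAW]
payload_starts = [is_start(cri_payload(line)) for line in RAW]
print("raw CRI lines:", raw_starts)
print("CRI payloads: ", payload_starts)
```

Every raw CRI line begins with an ISO timestamp, so each one matches `start_state` and would be treated as a new record. Only the first payload matches once the CRI prefix is removed. If `java_multiline` ever sees raw lines, stack traces would not be concatenated, which would be consistent with the incorrectly formatted records described above.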
