解决内存使用上的一些问题 #34

Sunldon · 2024-09-19T07:04:24Z

master进程中调用了ngx_http_health_detect_add_or_update_node_on_local，并创建了内存池 temp_pool = ngx_create_pool(...); 当reload nginx时，这部分内存分配的node节点不会释放，会继续创建新的内存池，继续新的分配。解决办法是在ngx_http_health_detect_create_main_conf中调用ngx_http_health_detect_clear_peers_events进行释放内存
先打印相关信息，再进行free_node，避免内存释放后还继续使用

…了内存池 temp_pool = ngx_create_pool(...); 当reload nginx时，这部分内存分配的node节点不会释放，会继续创建新的内存池，继续新的分配。解决办法是在ngx_http_health_detect_create_main_conf中调用ngx_http_health_detect_clear_peers_events进行释放内存 2. 先打印相关信息，再进行free_node，避免内存释放后还继续使用

alexzzh · 2024-10-17T10:24:54Z

你好，想问下这个mr有经过测试吗?

从代码结构上看，ngx_http_health_detect_create_main_conf 一般只适合做一些配置相关的工作，像ngx_http_health_detect_clear_peers_events这种业务性的代码不适合放在ngx_http_health_detect_create_main_conf中，另外ngx_http_health_detect_create_main_conf中调用 ngx_http_health_detect_clear_peers_events 应该也没什么效果，因为拿不到上次的本地节点信息。

alexzzh · 2024-10-17T10:31:07Z

你说的这个问题可能存在，方便的话可以描述一下你那边看到的问题现象。

针对这个问题，目前的想法是:

针对静态节点信息：master直接使用全局cf->pool来存放节点信息，当reload时，老的cycle的cf->pool会被自动释放
针对动态节点信息: 因为只会在worker进程，所以还是沿用之前的方案，这样可以避免频繁增删改查探测节点时，不会持续占用cf->pool中的内存

相关修改见 #37 有时间可以review此PR，或者直接做出调整并提交PR。

Sunldon · 2024-10-31T08:33:39Z

我的那个mr经过自己的压测，并且是编译期加了sanitize去测试内存泄漏。
ngx_http_health_detect_create_main_conf中调用 ngx_http_health_detect_clear_peers_events 是有效果的，master进程是会持有node的信息，尽管reload，master也不会重启，每次reload就会清除之前的node内存部分，就不会发生泄漏，worker进程reload就无所谓了，是新的进程。
我看了修复主进程可能存在的内存问题 #37 的提交，测试下来是没有内存泄露的，不过有一些存在的问题。nginx第一次启动时（还未出现reload的情况），所有worker进程持有的node内存是从cf->pool分配的，因为这个健康检查模块是支持增删改查探测节点，当删除节点时，对应的内存是无法从cf->pool进行删除的，nginx的机制就是这样的。这种是否是问题也存在模糊的界定，假设频繁的增加删除节点，对应的内存肯定是一直上涨的，无法得到释放，但是实际需要增加很多节点才会出现无空闲内存的情况
我还是希望能够把ngx_http_health_detect_exit_process和ngx_http_health_detect_exit_master加上去，代码的结构也比较完整，虽然进程已经结束，无所谓内存是否释放，但是对于加了sanitize去测试内存泄漏的情况，可以减少报错。

alexzzh · 2024-11-01T08:54:37Z

我的那个mr经过自己的压测，并且是编译期加了sanitize去测试内存泄漏。

ngx_http_health_detect_create_main_conf中调用 ngx_http_health_detect_clear_peers_events 是有效果的，master进程是会持有node的信息，尽管reload，master也不会重启，每次reload就会清除之前的node内存部分，就不会发生泄漏，worker进程reload就无所谓了，是新的进程。

我看了修复主进程可能存在的内存问题 #37 的提交，测试下来是没有内存泄露的，不过有一些存在的问题。nginx第一次启动时（还未出现reload的情况），所有worker进程持有的node内存是从cf->pool分配的，因为这个健康检查模块是支持增删改查探测节点，当删除节点时，对应的内存是无法从cf->pool进行删除的，nginx的机制就是这样的。这种是否是问题也存在模糊的界定，假设频繁的增加删除节点，对应的内存肯定是一直上涨的，无法得到释放，但是实际需要增加很多节点才会出现无空闲内存的情况

我还是希望能够把ngx_http_health_detect_exit_process和ngx_http_health_detect_exit_master加上去，代码的结构也比较完整，虽然进程已经结束，无所谓内存是否释放，但是对于加了sanitize去测试内存泄漏的情况，可以减少报错。

hello 针对以上几点，我们进一步展开讨论:
针对第2点: ngx_http_health_detect_create_main_conf中调用 ngx_http_health_detect_clear_peers_events 应该也没什么效果，我这边测试了你的mr，reload之后会return出去，不清楚你那边测试为什么正常，可以贴一下你的配置文件。
另外在ngx_http_health_detect_create_main_conf中调用ngx_http_health_detect_clear_peers_events 感觉也有点违反直觉。

static void
ngx_http_health_detect_clear_peers_events()
{
    ngx_rbtree_node_t *node;
    ngx_rbtree_node_t *sentinel;
    ngx_health_detect_peers_t *peers;

    ngx_log_debug1(NGX_LOG_DEBUG_HTTP, ngx_cycle->log, 0,
        "clear all the events on %P ", ngx_pid);

    if (peers_manager_ctx == NULL || peers_manager_ctx->peers == NULL) {
        //goto here
        return;
    }

针对第3点: 只有静态节点才会使用cf->pool中的内存，会持久占用内存，但一般静态的节点不会频繁删除，而且数量有限，所以内存占用是可以接受的，如果是动态增删的节点，还是使用自己申请的内存pool，不会使用cf->pool。

针对第4点: 你说的很对，建议另起一个mr,提交相关修改，这样对sanitize这种内存管理工具更加友好。

最后，看到您使用了 sanitize 来检查，方便的话可以提交一个mr，修改编译流程以及README.md，增加sanitize编译选项支持，默认不使用sanitize编译，感谢！

Sunldon · 2024-11-04T07:10:35Z

第二点,我的配置是：

user root;
worker_processes 1;
worker_rlimit_nofile 65535;
error_log logs/error.log debug_http;
events {
    use epoll;
    worker_connections 65535;
}

http {
    log_format  main '"$remote_addr","$remote_port","$time_local","server_addr=$server_addr",'
'"$server_port","up=$upstream_addr","$server_name","$host",'
'"$scheme","$request_method","$request_uri","$server_protocol","$status","$body_bytes_sent",'
'"total=$request_time","$upstream_response_time","$upstream_connect_time","$upstream_header_time",'
'"$request_length","$http_referer",$http_user_agent","xff=$http_x_forwarded_for","$limit_rate","$pid",'
'"rtt=$tcpinfo_rtt","$tcpinfo_rttvar","$tcpinfo_snd_cwnd","$tcpinfo_rcv_space",'
'"$ssl_protocol","$ssl_server_name","$upstream_status"';
    access_log logs/access.log main;
    error_log logs/error.log debug_http;
	
    upstream waf_server {
        health_detect_check interval=2000 rise=3 fall=5 timeout=1000 type=tcp ;
        server 127.0.0.1:12443 weight=100;
    }
    server {
        listen 2443;
        location / {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Real-PORT $remote_port;
            proxy_pass HTTP://waf_server;
        }                  
    }
}

是否你是针对你新修改的代码进行测试的，有这一个条件导致无法进入!peer->using_cf_pool，否则新增的打印destroy pool是可以打印出来的。

    if (peer->temp_pool != NULL && !peer->using_cf_pool) {
        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, ngx_cycle->log, 0,
            "destroy pool %P ", peer->temp_pool);
        ngx_destroy_pool(peer->temp_pool);
    }

第三点，可能我没有描述清楚动态增加节点的情况，我是用这个健康检查的模块，是配合使用动态域名解析的模块的(https://github.com/Sunldon/nginx-http-upstream-dynamic-servers)
这个模块我看之前你也有提到，每次域名变化时会重新构建upstream peer的内存，域名解析的旧结果对应健康检查的node，是不会被删除的，我更新了这个模块存在的bug（https://github.com/Sunldon/nginx-http-upstream-dynamic-servers/blob/main/doc/nginx_dynamic_server.md）
合理的做法是当域名变化时，重新构建upstream peer的内存，并且调用ngx_http_health_detect_free_node删除对应的旧node，因为这部分代码需要配合多个模块使用，所以没有更新到https://github.com/Sunldon/nginx-http-upstream-dynamic-servers。
这个时候就会存在我说的第三点的情况，nginx第一次启动尚未reload，所有worker进程持有的node内存是从cf->pool分配的，尽管这时候域名变化时调用ngx_http_health_detect_free_node删除对应的旧node，内存也无法释放。如果存在多个域名，并且dns变化较为多次，内存泄露就会比较多了。

alexzzh · 2024-11-04T09:34:26Z

第二点,我的配置是：
user root;
worker_processes 1;
worker_rlimit_nofile 65535;
error_log logs/error.log debug_http;
events {
    use epoll;
    worker_connections 65535;
}

http {
    log_format  main '"$remote_addr","$remote_port","$time_local","server_addr=$server_addr",'
'"$server_port","up=$upstream_addr","$server_name","$host",'
'"$scheme","$request_method","$request_uri","$server_protocol","$status","$body_bytes_sent",'
'"total=$request_time","$upstream_response_time","$upstream_connect_time","$upstream_header_time",'
'"$request_length","$http_referer",$http_user_agent","xff=$http_x_forwarded_for","$limit_rate","$pid",'
'"rtt=$tcpinfo_rtt","$tcpinfo_rttvar","$tcpinfo_snd_cwnd","$tcpinfo_rcv_space",'
'"$ssl_protocol","$ssl_server_name","$upstream_status"';
    access_log logs/access.log main;
    error_log logs/error.log debug_http;
	
    upstream waf_server {
        health_detect_check interval=2000 rise=3 fall=5 timeout=1000 type=tcp ;
        server 127.0.0.1:12443 weight=100;
    }
    server {
        listen 2443;
        location / {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Real-PORT $remote_port;
            proxy_pass HTTP://waf_server;
        }                  
    }
}
是否你是针对你新修改的代码进行测试的，有这一个条件导致无法进入!peer->using_cf_pool，否则新增的打印destroy pool是可以打印出来的。
    if (peer->temp_pool != NULL && !peer->using_cf_pool) {
        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, ngx_cycle->log, 0,
            "destroy pool %P ", peer->temp_pool);
        ngx_destroy_pool(peer->temp_pool);
    }
第三点，可能我没有描述清楚动态增加节点的情况，我是用这个健康检查的模块，是配合使用动态域名解析的模块的(https://github.com/Sunldon/nginx-http-upstream-dynamic-servers) 这个模块我看之前你也有提到，每次域名变化时会重新构建upstream peer的内存，域名解析的旧结果对应健康检查的node，是不会被删除的，我更新了这个模块存在的bug（https://github.com/Sunldon/nginx-http-upstream-dynamic-servers/blob/main/doc/nginx_dynamic_server.md）合理的做法是当域名变化时，重新构建upstream peer的内存，并且调用ngx_http_health_detect_free_node删除对应的旧node，因为这部分代码需要配合多个模块使用，所以没有更新到https://github.com/Sunldon/nginx-http-upstream-dynamic-servers。这个时候就会存在我说的第三点的情况，nginx第一次启动尚未reload，所有worker进程持有的node内存是从cf->pool分配的，尽管这时候域名变化时调用ngx_http_health_detect_free_node删除对应的旧node，内存也无法释放。如果存在多个域名，并且dns变化较为多次，内存泄露就会比较多了。

hello,
域名变化的时候，三方模块会调用 ngx_http_health_detect_upstream_add_peer，此时 ngx_process == NGX_PROCESS_WORKER 成立，应该不会使用 cf->pool 中的内存池

  if (ngx_process == NGX_PROCESS_WORKER) {
        rc = ngx_http_health_detect_add_or_update_node(policy);
    } else {
        rc = ngx_http_health_detect_add_or_update_node_on_local(policy, 0);
    }

Sunldon · 2024-11-04T09:43:33Z

那这个方案也是可以的

alexzzh · 2024-11-04T10:10:20Z

hello, Sunldon!
你的这个mr核心调整是在ngx_http_health_detect_create_main_conf中调用ngx_http_health_detect_clear_peers_events进行释放内存, 我上面想说的是: reload过程，在 ngx_http_health_detect_create_main_conf中调用ngx_http_health_detect_clear_peers_events会直接return。比如: 修改代码直接return时，输出"ngx_http_health_detect_clear_peers_events1"，否则输出"ngx_http_health_detect_clear_peers_events2"

使用上述你贴出来的nginx.conf进行测试，每次reload都是打印"ngx_http_health_detect_clear_peers_events1"，其实没起到效果

Sunldon · 2024-11-05T03:16:06Z

这是我结合命令行的输出和error.log日志的打印，在图上进行了标注，并没有只看终端的打印，你试试看。

alexzzh added the Further confirmation is required Further confirmation is required label Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

解决内存使用上的一些问题 #34

解决内存使用上的一些问题 #34

Sunldon commented Sep 19, 2024

alexzzh commented Oct 17, 2024

alexzzh commented Oct 17, 2024 •

edited

Loading

Sunldon commented Oct 31, 2024

alexzzh commented Nov 1, 2024 •

edited

Loading

Sunldon commented Nov 4, 2024

alexzzh commented Nov 4, 2024

Sunldon commented Nov 4, 2024

alexzzh commented Nov 4, 2024

Sunldon commented Nov 5, 2024

解决内存使用上的一些问题 #34

Are you sure you want to change the base?

解决内存使用上的一些问题 #34

Conversation

Sunldon commented Sep 19, 2024

alexzzh commented Oct 17, 2024

alexzzh commented Oct 17, 2024 • edited Loading

Sunldon commented Oct 31, 2024

alexzzh commented Nov 1, 2024 • edited Loading

Sunldon commented Nov 4, 2024

alexzzh commented Nov 4, 2024

Sunldon commented Nov 4, 2024

alexzzh commented Nov 4, 2024

Sunldon commented Nov 5, 2024

alexzzh commented Oct 17, 2024 •

edited

Loading

alexzzh commented Nov 1, 2024 •

edited

Loading